All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH 0/5] Add migration support for VFIO device
@ 2018-11-20 20:39 Kirti Wankhede
  2018-11-20 20:39 ` [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface Kirti Wankhede
                   ` (5 more replies)
  0 siblings, 6 replies; 32+ messages in thread
From: Kirti Wankhede @ 2018-11-20 20:39 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	qemu-devel, Kirti Wankhede

Add migration support for VFIO device

This Patch set include patches as below:
- Define KABI for VFIO device for migration support.
- Added save and restore functions for PCI configuration space
- Generic migration functionality for VFIO device.
  * This patch set adds functionality only for PCI devices, but can be
    extended to other VFIO devices.
  * Added all the basic functions required for pre-copy, stop-and-copy and
    resume phases of migration.
  * Added state change notifier and from that notifier function, VFIO
    device's state changed is conveyed to VFIO device driver.
  * During save setup phase and resume/load setup phase, migration region
    is queried and is used to read/write VFIO device data.
  * .save_live_pending, .save_live_iterate and .is_active_iterate are
    implemented to use QEMU's functionality of iteration during pre-copy
    phase.
  * In .save_live_complete_precopy, that is in stop-and-copy phase,
    iteration to read data from VFIO device driver is implemented till pending
    bytes returned by driver are not zero.
  * .save_cleanup and .load_cleanup are implemented to unmap migration
    region that was setup duing setup phase.
  * Added function to get dirty pages bitmap for the pages which are used by
    driver.
- Add vfio_listerner_log_sync to mark dirty pages.
- Make VFIO PCI device migration capable. If migration region is not provided by
  driver, migration is blocked.

Below is the flow of state change for live migration where states in brackets
represent VM state, migration state and VFIO device state as:
    (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)

Live migration save path:
        QEMU normal running state
        (RUNNING, _NONE, _RUNNING)
                        |
    migrate_init spawns migration_thread.
    (RUNNING, _SETUP, _MIGRATION_SETUP)
    Migration thread then calls each device's .save_setup()
                        |
    (RUNNING, _ACTIVE, _MIGRATION_PRECOPY)
    If device is active, get pending bytes by .save_live_pending()
    if pending bytes >= threshold_size,  call save_live_iterate()
    Data of VFIO device for pre-copy phase is copied.
    Iterate till pending bytes converge and are less than threshold
                        |
    migration_completion() stops vCPUs and calls .save_live_complete_precopy
    for each active  device. VFIO device is then transitioned in
     _MIGRATION_STOPNCOPY state.
    (FINISH_MIGRATE, _DEVICE, _MIGRATION_STOPNCOPY)
    For VFIO device, iterate in  .save_live_complete_precopy  until
    pending data is 0. Change VFIO device state.
    (FINISH_MIGRATE, _DEVICE, _MIGRATION_SAVE_COMPLETED)
                        |
    (FINISH_MIGRATE, _COMPLETED, _MIGRATION_SAVE_COMPLETED)
    Migraton thread schedule cleanup bottom half and exit
                        |
    (POST_MIGRATE, _COMPLETED, _MIGRATION_SAVE_COMPLETED)
    For each device, call .save_cleanup(). Unmap migration region.


Live migration resume path:
    Incomming migration calls .load_setup for each device
    (RESTORE_VM, _ACTIVE, _MIGRATION_RESUME)
                        |
    For each device, .load_state is called for that device section data
                        |
    At the end, called .load_cleanup for each device and vCPUs are started.
    (RUNNING, _NONE, _RUNNING)

Note that:
- Migration post copy is not supported.
- VFIO device driver version compatibility is not taken care in this series.

v1 -> v2:
- Defined MIGRATION region type and sub-type which should be used with region
  type capability.
- Re-structured vfio_device_migration_info. This structure will be placed at 0th
  offset of migration region.
- Replaced ioctl with read/write for trapped part of migration region.
- Added both type of access support, trapped or mmapped, for data section of the
  region.
- Moved PCI device functions to pci file.
- Added iteration to get dirty page bitmap until bitmap for all requested pages
  are copied.

Thanks,
Kirti

Kirti Wankhede (5):
  VFIO KABI for migration interface
  Add save and load functions for VFIO PCI devices
  Add migration functions for VFIO devices
  Add vfio_listerner_log_sync to mark dirty pages
  Make vfio-pci device migration capable.

 hw/vfio/Makefile.objs         |   2 +-
 hw/vfio/common.c              |  32 ++
 hw/vfio/migration.c           | 729 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                 | 108 ++++++-
 hw/vfio/pci.h                 |  29 ++
 include/hw/vfio/vfio-common.h |  23 ++
 linux-headers/linux/vfio.h    | 130 ++++++++
 7 files changed, 1045 insertions(+), 8 deletions(-)
 create mode 100644 hw/vfio/migration.c

-- 
2.7.0

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface
  2018-11-20 20:39 [Qemu-devel] [PATCH 0/5] Add migration support for VFIO device Kirti Wankhede
@ 2018-11-20 20:39 ` Kirti Wankhede
  2018-11-21  0:26   ` Tian, Kevin
                     ` (4 more replies)
  2018-11-20 20:39 ` [Qemu-devel] [PATCH 2/5] Add save and load functions for VFIO PCI devices Kirti Wankhede
                   ` (4 subsequent siblings)
  5 siblings, 5 replies; 32+ messages in thread
From: Kirti Wankhede @ 2018-11-20 20:39 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	qemu-devel, Kirti Wankhede

- Defined MIGRATION region type and sub-type.
- Defined VFIO device states during migration process.
- Defined vfio_device_migration_info structure which will be placed at 0th
  offset of migration region to get/set VFIO device related information.
  Defined actions and members of structure usage for each action:
    * To convey VFIO device state to be transitioned to.
    * To get pending bytes yet to be migrated for VFIO device
    * To ask driver to write data to migration region and return number of bytes
      written in the region
    * In migration resume path, user space app writes to migration region and
      communicates it to vendor driver.
    * Get bitmap of dirty pages from vendor driver from given start address

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 linux-headers/linux/vfio.h | 130 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 130 insertions(+)

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 3615a269d378..a6e45cb2cae2 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -301,6 +301,10 @@ struct vfio_region_info_cap_type {
 #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
 #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
 
+/* Migration region type and sub-type */
+#define VFIO_REGION_TYPE_MIGRATION	        (1 << 30)
+#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
+
 /*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within
@@ -602,6 +606,132 @@ struct vfio_device_ioeventfd {
 
 #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 16)
 
+/**
+ * VFIO device states :
+ * VFIO User space application should set the device state to indicate vendor
+ * driver in which state the VFIO device should transitioned.
+ * - VFIO_DEVICE_STATE_NONE:
+ *   State when VFIO device is initialized but not yet running.
+ * - VFIO_DEVICE_STATE_RUNNING:
+ *   Transition VFIO device in running state, that is, user space application or
+ *   VM is active.
+ * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
+ *   Transition VFIO device in migration setup state. This is used to prepare
+ *   VFIO device for migration while application or VM and vCPUs are still in
+ *   running state.
+ * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
+ *   When VFIO user space application or VM is active and vCPUs are running,
+ *   transition VFIO device in pre-copy state.
+ * - VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY:
+ *   When VFIO user space application or VM is stopped and vCPUs are halted,
+ *   transition VFIO device in stop-and-copy state.
+ * - VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED:
+ *   When VFIO user space application has copied data provided by vendor driver.
+ *   This state is used by vendor driver to clean up all software state that was
+ *   setup during MIGRATION_SETUP state.
+ * - VFIO_DEVICE_STATE_MIGRATION_RESUME:
+ *   Transition VFIO device to resume state, that is, start resuming VFIO device
+ *   when user space application or VM is not running and vCPUs are halted.
+ * - VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED:
+ *   When user space application completes iterations of providing device state
+ *   data, transition device in resume completed state.
+ * - VFIO_DEVICE_STATE_MIGRATION_FAILED:
+ *   Migration process failed due to some reason, transition device to failed
+ *   state. If migration process fails while saving at source, resume device at
+ *   source. If migration process fails while resuming application or VM at
+ *   destination, stop restoration at destination and resume at source.
+ * - VFIO_DEVICE_STATE_MIGRATION_CANCELLED:
+ *   User space application has cancelled migration process either for some
+ *   known reason or due to user's intervention. Transition device to Cancelled
+ *   state, that is, resume device state as it was during running state at
+ *   source.
+ */
+
+enum {
+    VFIO_DEVICE_STATE_NONE,
+    VFIO_DEVICE_STATE_RUNNING,
+    VFIO_DEVICE_STATE_MIGRATION_SETUP,
+    VFIO_DEVICE_STATE_MIGRATION_PRECOPY,
+    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY,
+    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
+    VFIO_DEVICE_STATE_MIGRATION_RESUME,
+    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
+    VFIO_DEVICE_STATE_MIGRATION_FAILED,
+    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
+};
+
+/**
+ * Structure vfio_device_migration_info is placed at 0th offset of
+ * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
+ * information.
+ *
+ * Action Set state:
+ *      To tell vendor driver the state VFIO device should be transitioned to.
+ *      device_state [input] : User space app sends device state to vendor
+ *           driver on state change, the state to which VFIO device should be
+ *           transitioned to.
+ *
+ * Action Get pending bytes:
+ *      To get pending bytes yet to be migrated from vendor driver
+ *      pending.threshold_size [Input] : threshold of buffer in User space app.
+ *      pending.precopy_only [output] : pending data which must be migrated in
+ *          precopy phase or in stopped state, in other words - before target
+ *          user space application or VM start. In case of migration, this
+ *          indicates pending bytes to be transfered while application or VM or
+ *          vCPUs are active and running.
+ *      pending.compatible [output] : pending data which may be migrated any
+ *          time , either when application or VM is active and vCPUs are active
+ *          or when application or VM is halted and vCPUs are halted.
+ *      pending.postcopy_only [output] : pending data which must be migrated in
+ *           postcopy phase or in stopped state, in other words - after source
+ *           application or VM stopped and vCPUs are halted.
+ *      Sum of pending.precopy_only, pending.compatible and
+ *      pending.postcopy_only is the whole amount of pending data.
+ *
+ * Action Get buffer:
+ *      On this action, vendor driver should write data to migration region and
+ *      return number of bytes written in the region.
+ *      data.offset [output] : offset in the region from where data is written.
+ *      data.size [output] : number of bytes written in migration buffer by
+ *          vendor driver.
+ *
+ * Action Set buffer:
+ *      In migration resume path, user space app writes to migration region and
+ *      communicates it to vendor driver with this action.
+ *      data.offset [Input] : offset in the region from where data is written.
+ *      data.size [Input] : number of bytes written in migration buffer by
+ *          user space app.
+ *
+ * Action Get dirty pages bitmap:
+ *      Get bitmap of dirty pages from vendor driver from given start address.
+ *      dirty_pfns.start_addr [Input] : start address
+ *      dirty_pfns.total [Input] : Total pfn count from start_addr for which
+ *          dirty bitmap is requested
+ *      dirty_pfns.copied [Output] : pfn count for which dirty bitmap is copied
+ *          to migration region.
+ *      Vendor driver should copy the bitmap with bits set only for pages to be
+ *      marked dirty in migration region.
+ */
+
+struct vfio_device_migration_info {
+        __u32 device_state;         /* VFIO device state */
+        struct {
+            __u64 precopy_only;
+            __u64 compatible;
+            __u64 postcopy_only;
+            __u64 threshold_size;
+        } pending;
+        struct {
+            __u64 offset;           /* offset */
+            __u64 size;             /* size */
+        } data;
+        struct {
+            __u64 start_addr;
+            __u64 total;
+            __u64 copied;
+        } dirty_pfns;
+} __attribute__((packed));
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH 2/5] Add save and load functions for VFIO PCI devices
  2018-11-20 20:39 [Qemu-devel] [PATCH 0/5] Add migration support for VFIO device Kirti Wankhede
  2018-11-20 20:39 ` [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface Kirti Wankhede
@ 2018-11-20 20:39 ` Kirti Wankhede
  2018-11-21  5:32   ` Peter Xu
  2018-11-20 20:39 ` [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices Kirti Wankhede
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: Kirti Wankhede @ 2018-11-20 20:39 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	qemu-devel, Kirti Wankhede

Save and restore with MSIX type is not tested.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/pci.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.h | 29 ++++++++++++++++++
 2 files changed, 124 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 6cbb8fa0549d..72daf1a358a0 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1234,6 +1234,101 @@ void vfio_pci_write_config(PCIDevice *pdev,
     }
 }
 
+void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+    PCIDevice *pdev = &vdev->pdev;
+    int i;
+
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        uint32_t bar;
+
+        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
+        qemu_put_be32(f, bar);
+    }
+
+    qemu_put_be32(f, vdev->interrupt);
+    if (vdev->interrupt == VFIO_INT_MSI) {
+        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
+        bool msi_64bit;
+
+        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+                                            2);
+        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
+
+        msi_addr_lo = pci_default_read_config(pdev,
+                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
+        qemu_put_be32(f, msi_addr_lo);
+
+        if (msi_64bit) {
+            msi_addr_hi = pci_default_read_config(pdev,
+                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
+                                             4);
+        }
+        qemu_put_be32(f, msi_addr_hi);
+
+        msi_data = pci_default_read_config(pdev,
+                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
+                2);
+        qemu_put_be32(f, msi_data);
+    } else if (vdev->interrupt == VFIO_INT_MSIX) {
+        msix_save(pdev, f);
+    }
+}
+
+void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+    PCIDevice *pdev = &vdev->pdev;
+    uint32_t pci_cmd, interrupt_type;
+    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
+    bool msi_64bit;
+    int i;
+
+    /* retore pci bar configuration */
+    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
+    vfio_pci_write_config(pdev, PCI_COMMAND,
+                        pci_cmd & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        uint32_t bar = qemu_get_be32(f);
+
+        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
+    }
+    vfio_pci_write_config(pdev, PCI_COMMAND,
+                          pci_cmd | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
+
+    interrupt_type = qemu_get_be32(f);
+
+    if (interrupt_type == VFIO_INT_MSI) {
+        /* restore msi configuration */
+        msi_flags = pci_default_read_config(pdev,
+                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
+        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
+
+        vfio_pci_write_config(&vdev->pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+                              msi_flags & (!PCI_MSI_FLAGS_ENABLE), 2);
+
+        msi_addr_lo = qemu_get_be32(f);
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
+                              msi_addr_lo, 4);
+
+        msi_addr_hi = qemu_get_be32(f);
+        if (msi_64bit) {
+            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
+                                  msi_addr_hi, 4);
+        }
+        msi_data = qemu_get_be32(f);
+        vfio_pci_write_config(pdev,
+                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
+                msi_data, 2);
+
+        vfio_pci_write_config(&vdev->pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
+    } else if (vdev->interrupt == VFIO_INT_MSIX) {
+        msix_load(pdev, f);
+    }
+}
+
 /*
  * Interrupt setup
  */
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 52b065421a68..890d77d66a6b 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -20,6 +20,7 @@
 #include "qemu/queue.h"
 #include "qemu/timer.h"
 
+#ifdef CONFIG_LINUX
 #define PCI_ANY_ID (~0)
 
 struct VFIOPCIDevice;
@@ -198,4 +199,32 @@ void vfio_display_reset(VFIOPCIDevice *vdev);
 int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
 void vfio_display_finalize(VFIOPCIDevice *vdev);
 
+void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f);
+void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f);
+
+static inline Object *vfio_pci_get_object(VFIODevice *vbasedev)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+
+    return OBJECT(vdev);
+}
+
+#else
+static inline void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+    g_assert(false);
+}
+
+static inline void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+    g_assert(false);
+}
+
+static inline Object *vfio_pci_get_object(VFIODevice *vbasedev)
+{
+    return NULL;
+}
+
+#endif
+
 #endif /* HW_VFIO_VFIO_PCI_H */
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices
  2018-11-20 20:39 [Qemu-devel] [PATCH 0/5] Add migration support for VFIO device Kirti Wankhede
  2018-11-20 20:39 ` [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface Kirti Wankhede
  2018-11-20 20:39 ` [Qemu-devel] [PATCH 2/5] Add save and load functions for VFIO PCI devices Kirti Wankhede
@ 2018-11-20 20:39 ` Kirti Wankhede
  2018-11-21  7:39   ` Zhao, Yan Y
                     ` (4 more replies)
  2018-11-20 20:39 ` [Qemu-devel] [PATCH 4/5] Add vfio_listerner_log_sync to mark dirty pages Kirti Wankhede
                   ` (2 subsequent siblings)
  5 siblings, 5 replies; 32+ messages in thread
From: Kirti Wankhede @ 2018-11-20 20:39 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	qemu-devel, Kirti Wankhede

- Migration function are implemented for VFIO_DEVICE_TYPE_PCI device.
- Added SaveVMHandlers and implemented all basic functions required for live
  migration.
- Added VM state change handler to know running or stopped state of VM.
- Added migration state change notifier to get notification on migration state
  change. This state is translated to VFIO device state and conveyed to vendor
  driver.
- VFIO device supportd migration or not is decided based of migration region
  query. If migration region query is successful then migration is supported
  else migration is blocked.
- Structure vfio_device_migration_info is mapped at 0th offset of migration
  region and should always trapped by VFIO device's driver. Added both type of
  access support, trapped or mmapped, for data section of the region.
- To save device state, read data offset and size using structure
  vfio_device_migration_info.data, accordingly copy data from the region.
- To restore device state, write data offset and size in the structure and write
  data in the region.
- To get dirty page bitmap, write start address and pfn count then read count of
  pfns copied and accordingly read those from the rest of the region or mmaped
  part of the region. This copy is iterated till page bitmap for all requested
  pfns are copied.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/Makefile.objs         |   2 +-
 hw/vfio/migration.c           | 729 ++++++++++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h |  23 ++
 3 files changed, 753 insertions(+), 1 deletion(-)
 create mode 100644 hw/vfio/migration.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index a2e7a0a7cf02..2cf2ba1440f2 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,5 +1,5 @@
 ifeq ($(CONFIG_LINUX), y)
-obj-$(CONFIG_SOFTMMU) += common.o
+obj-$(CONFIG_SOFTMMU) += common.o migration.o
 obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
 obj-$(CONFIG_VFIO_CCW) += ccw.o
 obj-$(CONFIG_SOFTMMU) += platform.o
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
new file mode 100644
index 000000000000..717fb63e4f43
--- /dev/null
+++ b/hw/vfio/migration.c
@@ -0,0 +1,729 @@
+/*
+ * Migration support for VFIO devices
+ *
+ * Copyright NVIDIA, Inc. 2018
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <linux/vfio.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "cpu.h"
+#include "migration/migration.h"
+#include "migration/qemu-file.h"
+#include "migration/register.h"
+#include "migration/blocker.h"
+#include "migration/misc.h"
+#include "qapi/error.h"
+#include "exec/ramlist.h"
+#include "exec/ram_addr.h"
+#include "pci.h"
+
+/*
+ * Flags used as delimiter:
+ * 0xffffffff => MSB 32-bit all 1s
+ * 0xef10     => emulated (virtual) function IO
+ * 0x0000     => 16-bits reserved for flags
+ */
+#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
+#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
+#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
+
+static void vfio_migration_region_exit(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+
+    if (!migration) {
+        return;
+    }
+
+    if (migration->region.buffer.size) {
+        vfio_region_exit(&migration->region.buffer);
+        vfio_region_finalize(&migration->region.buffer);
+    }
+}
+
+static int vfio_migration_region_init(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    Object *obj = NULL;
+    int ret = -EINVAL;
+
+    if (!migration) {
+        return ret;
+    }
+
+    /* Migration support added for PCI device only */
+    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
+        obj = vfio_pci_get_object(vbasedev);
+    }
+
+    if (!obj) {
+        return ret;
+    }
+
+    ret = vfio_region_setup(obj, vbasedev, &migration->region.buffer,
+                            migration->region.index, "migration");
+    if (ret) {
+        error_report("Failed to setup VFIO migration region %d: %s",
+                      migration->region.index, strerror(-ret));
+        goto err;
+    }
+
+    if (!migration->region.buffer.size) {
+        ret = -EINVAL;
+        error_report("Invalid region size of VFIO migration region %d: %s",
+                     migration->region.index, strerror(-ret));
+        goto err;
+    }
+
+    if (migration->region.buffer.mmaps) {
+        ret = vfio_region_mmap(&migration->region.buffer);
+        if (ret) {
+            error_report("Failed to mmap VFIO migration region %d: %s",
+                         migration->region.index, strerror(-ret));
+            goto err;
+        }
+    }
+
+    return 0;
+
+err:
+    vfio_migration_region_exit(vbasedev);
+    return ret;
+}
+
+static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region.buffer;
+    int ret = 0;
+
+    if (vbasedev->device_state == state) {
+        return ret;
+    }
+
+    ret = pwrite(vbasedev->fd, &state, sizeof(state),
+                 region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                              device_state));
+    if (ret < 0) {
+        error_report("Failed to set migration state %d %s",
+                     ret, strerror(errno));
+        return ret;
+    }
+
+    vbasedev->device_state = state;
+    return ret;
+}
+
+void vfio_get_dirty_page_list(VFIODevice *vbasedev,
+                              uint64_t start_addr,
+                              uint64_t pfn_count)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region.buffer;
+    struct vfio_device_migration_info migration_info;
+    uint64_t count = 0;
+    int ret;
+
+    migration_info.dirty_pfns.start_addr = start_addr;
+    migration_info.dirty_pfns.total = pfn_count;
+
+    ret = pwrite(vbasedev->fd, &migration_info.dirty_pfns,
+                 sizeof(migration_info.dirty_pfns),
+                 region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                              dirty_pfns));
+    if (ret < 0) {
+        error_report("Failed to set dirty pages start address %d %s",
+                ret, strerror(errno));
+        return;
+    }
+
+    do {
+        /* Read dirty_pfns.copied */
+        ret = pread(vbasedev->fd, &migration_info.dirty_pfns,
+                sizeof(migration_info.dirty_pfns),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             dirty_pfns));
+        if (ret < 0) {
+            error_report("Failed to get dirty pages bitmap count %d %s",
+                    ret, strerror(errno));
+            return;
+        }
+
+        if (migration_info.dirty_pfns.copied) {
+            uint64_t bitmap_size;
+            void *buf = NULL;
+            bool buffer_mmaped = false;
+
+            bitmap_size = (BITS_TO_LONGS(migration_info.dirty_pfns.copied) + 1)
+                           * sizeof(unsigned long);
+
+            if (region->mmaps) {
+                int i;
+
+                for (i = 0; i < region->nr_mmaps; i++) {
+                    if (region->mmaps[i].size >= bitmap_size) {
+                        buf = region->mmaps[i].mmap;
+                        buffer_mmaped = true;
+                        break;
+                    }
+                }
+            }
+
+            if (!buffer_mmaped) {
+                buf = g_malloc0(bitmap_size);
+
+                ret = pread(vbasedev->fd, buf, bitmap_size,
+                            region->fd_offset + sizeof(migration_info) + 1);
+                if (ret != bitmap_size) {
+                    error_report("Failed to get migration data %d", ret);
+                    g_free(buf);
+                    return;
+                }
+            }
+
+            cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
+                                        start_addr + (count * TARGET_PAGE_SIZE),
+                                        migration_info.dirty_pfns.copied);
+            count +=  migration_info.dirty_pfns.copied;
+
+            if (!buffer_mmaped) {
+                g_free(buf);
+            }
+        }
+    } while (count < migration_info.dirty_pfns.total);
+}
+
+static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
+
+    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
+        vfio_pci_save_config(vbasedev, f);
+    }
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    return qemu_file_get_error(f);
+}
+
+static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
+        vfio_pci_load_config(vbasedev, f);
+    }
+
+    if (qemu_get_be64(f) != VFIO_MIG_FLAG_END_OF_STATE) {
+        error_report("Wrong end of block while loading device config space");
+        return -EINVAL;
+    }
+
+    return qemu_file_get_error(f);
+}
+
+/* ---------------------------------------------------------------------- */
+
+static bool vfio_is_active_iterate(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    if (vbasedev->vm_running && vbasedev->migration &&
+        (vbasedev->migration->pending_precopy_only != 0))
+        return true;
+
+    if (!vbasedev->vm_running && vbasedev->migration &&
+        (vbasedev->migration->pending_postcopy != 0))
+        return true;
+
+    return false;
+}
+
+static int vfio_save_setup(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    int ret;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
+
+    qemu_mutex_lock_iothread();
+    ret = vfio_migration_region_init(vbasedev);
+    qemu_mutex_unlock_iothread();
+    if (ret) {
+        return ret;
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    return 0;
+}
+
+static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region.buffer;
+    struct vfio_device_migration_info migration_info;
+    int ret;
+
+    ret = pread(vbasedev->fd, &migration_info.data,
+                sizeof(migration_info.data),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             data));
+    if (ret != sizeof(migration_info.data)) {
+        error_report("Failed to get migration buffer information %d",
+                     ret);
+        return -EINVAL;
+    }
+
+    if (migration_info.data.size) {
+        void *buf = NULL;
+        bool buffer_mmaped = false;
+
+        if (region->mmaps) {
+            int i;
+
+            for (i = 0; i < region->nr_mmaps; i++) {
+                if (region->mmaps[i].offset == migration_info.data.offset) {
+                    buf = region->mmaps[i].mmap;
+                    buffer_mmaped = true;
+                    break;
+                }
+            }
+        }
+
+        if (!buffer_mmaped) {
+            buf = g_malloc0(migration_info.data.size);
+            ret = pread(vbasedev->fd, buf, migration_info.data.size,
+                        region->fd_offset + migration_info.data.offset);
+            if (ret != migration_info.data.size) {
+                error_report("Failed to get migration data %d", ret);
+                return -EINVAL;
+            }
+        }
+
+        qemu_put_be64(f, migration_info.data.size);
+        qemu_put_buffer(f, buf, migration_info.data.size);
+
+        if (!buffer_mmaped) {
+            g_free(buf);
+        }
+
+    } else {
+        qemu_put_be64(f, migration_info.data.size);
+    }
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    return migration_info.data.size;
+}
+
+static int vfio_save_iterate(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    int ret;
+
+    ret = vfio_save_buffer(f, vbasedev);
+    if (ret < 0) {
+        error_report("vfio_save_buffer failed %s",
+                     strerror(errno));
+        return ret;
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    return ret;
+}
+
+static void vfio_update_pending(VFIODevice *vbasedev, uint64_t threshold_size)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region.buffer;
+    struct vfio_device_migration_info migration_info;
+    int ret;
+
+    ret = pwrite(vbasedev->fd, &threshold_size, sizeof(threshold_size),
+                 region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                              pending.threshold_size));
+    if (ret < 0) {
+        error_report("Failed to set threshold size %d %s",
+                     ret, strerror(errno));
+        return;
+    }
+
+    ret = pread(vbasedev->fd, &migration_info.pending,
+                sizeof(migration_info.pending),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             pending));
+    if (ret != sizeof(migration_info.pending)) {
+        error_report("Failed to get pending bytes %d", ret);
+        return;
+    }
+
+    migration->pending_precopy_only = migration_info.pending.precopy_only;
+    migration->pending_compatible = migration_info.pending.compatible;
+    migration->pending_postcopy = migration_info.pending.postcopy_only;
+
+    return;
+}
+
+static void vfio_save_pending(QEMUFile *f, void *opaque,
+                              uint64_t threshold_size,
+                              uint64_t *res_precopy_only,
+                              uint64_t *res_compatible,
+                              uint64_t *res_postcopy_only)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    vfio_update_pending(vbasedev, threshold_size);
+
+    *res_precopy_only += migration->pending_precopy_only;
+    *res_compatible += migration->pending_compatible;
+    *res_postcopy_only += migration->pending_postcopy;
+}
+
+static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    MigrationState *ms = migrate_get_current();
+    int ret;
+
+    if (vbasedev->vm_running) {
+        vbasedev->vm_running = 0;
+    }
+
+    ret = vfio_migration_set_state(vbasedev,
+                                   VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY);
+    if (ret) {
+        error_report("Failed to set state STOPNCOPY_ACTIVE");
+        return ret;
+    }
+
+    ret = vfio_save_device_config_state(f, opaque);
+    if (ret) {
+        return ret;
+    }
+
+    do {
+        vfio_update_pending(vbasedev, ms->threshold_size);
+
+        if (vfio_is_active_iterate(opaque)) {
+            ret = vfio_save_buffer(f, vbasedev);
+            if (ret < 0) {
+                error_report("Failed to save buffer");
+                break;
+            } else if (ret == 0) {
+                break;
+            }
+        }
+    } while ((migration->pending_compatible + migration->pending_postcopy) > 0);
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    ret = vfio_migration_set_state(vbasedev,
+                                   VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED);
+    if (ret) {
+        error_report("Failed to set state SAVE_COMPLETED");
+        return ret;
+    }
+    return ret;
+}
+
+static void vfio_save_cleanup(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    vfio_migration_region_exit(vbasedev);
+}
+
+static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
+{
+    VFIODevice *vbasedev = opaque;
+    int ret;
+    uint64_t data;
+
+    data = qemu_get_be64(f);
+    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
+        if (data == VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
+            ret = vfio_load_device_config_state(f, opaque);
+            if (ret) {
+                return ret;
+            }
+        } else if (data == VFIO_MIG_FLAG_DEV_SETUP_STATE) {
+            data = qemu_get_be64(f);
+            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
+                return 0;
+            } else {
+                error_report("SETUP STATE: EOS not found 0x%lx", data);
+                return -EINVAL;
+            }
+        } else if (data != 0) {
+            VFIOMigration *migration = vbasedev->migration;
+            VFIORegion *region = &migration->region.buffer;
+            struct vfio_device_migration_info migration_info;
+            void *buf = NULL;
+            bool buffer_mmaped = false;
+
+            migration_info.data.size = data;
+
+            if (region->mmaps) {
+                int i;
+
+                for (i = 0; i < region->nr_mmaps; i++) {
+                    if (region->mmaps[i].mmap &&
+                        (region->mmaps[i].size >= data)) {
+                        buf = region->mmaps[i].mmap;
+                        migration_info.data.offset = region->mmaps[i].offset;
+                        buffer_mmaped = true;
+                        break;
+                    }
+                }
+            }
+
+            if (!buffer_mmaped) {
+                buf = g_malloc0(migration_info.data.size);
+                migration_info.data.offset = sizeof(migration_info) + 1;
+            }
+
+            qemu_get_buffer(f, buf, data);
+
+            ret = pwrite(vbasedev->fd, &migration_info.data,
+                         sizeof(migration_info.data),
+                         region->fd_offset +
+                         offsetof(struct vfio_device_migration_info, data));
+            if (ret != sizeof(migration_info.data)) {
+                error_report("Failed to set migration buffer information %d",
+                        ret);
+                return -EINVAL;
+            }
+
+            if (!buffer_mmaped) {
+                ret = pwrite(vbasedev->fd, buf, migration_info.data.size,
+                             region->fd_offset + migration_info.data.offset);
+                g_free(buf);
+
+                if (ret != migration_info.data.size) {
+                    error_report("Failed to set migration buffer %d", ret);
+                    return -EINVAL;
+                }
+            }
+        }
+
+        ret = qemu_file_get_error(f);
+        if (ret) {
+            return ret;
+        }
+        data = qemu_get_be64(f);
+    }
+
+    return 0;
+}
+
+static int vfio_load_setup(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    int ret;
+
+    ret = vfio_migration_set_state(vbasedev,
+                                    VFIO_DEVICE_STATE_MIGRATION_RESUME);
+    if (ret) {
+        error_report("Failed to set state RESUME");
+    }
+
+    ret = vfio_migration_region_init(vbasedev);
+    if (ret) {
+        error_report("Failed to initialise migration region");
+        return ret;
+    }
+
+    return 0;
+}
+
+static int vfio_load_cleanup(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    int ret = 0;
+
+    ret = vfio_migration_set_state(vbasedev,
+                                 VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED);
+    if (ret) {
+        error_report("Failed to set state RESUME_COMPLETED");
+    }
+
+    vfio_migration_region_exit(vbasedev);
+    return ret;
+}
+
+static SaveVMHandlers savevm_vfio_handlers = {
+    .save_setup = vfio_save_setup,
+    .save_live_iterate = vfio_save_iterate,
+    .save_live_complete_precopy = vfio_save_complete_precopy,
+    .save_live_pending = vfio_save_pending,
+    .save_cleanup = vfio_save_cleanup,
+    .load_state = vfio_load_state,
+    .load_setup = vfio_load_setup,
+    .load_cleanup = vfio_load_cleanup,
+    .is_active_iterate = vfio_is_active_iterate,
+};
+
+static void vfio_vmstate_change(void *opaque, int running, RunState state)
+{
+    VFIODevice *vbasedev = opaque;
+
+    if ((vbasedev->vm_running != running) && running) {
+        int ret;
+
+        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING);
+        if (ret) {
+            error_report("Failed to set state RUNNING");
+        }
+    }
+
+    vbasedev->vm_running = running;
+}
+
+static void vfio_migration_state_notifier(Notifier *notifier, void *data)
+{
+    MigrationState *s = data;
+    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
+    int ret;
+
+    switch (s->state) {
+    case MIGRATION_STATUS_SETUP:
+        ret = vfio_migration_set_state(vbasedev,
+                                       VFIO_DEVICE_STATE_MIGRATION_SETUP);
+        if (ret) {
+            error_report("Failed to set state SETUP");
+        }
+        return;
+
+    case MIGRATION_STATUS_ACTIVE:
+        if (vbasedev->device_state == VFIO_DEVICE_STATE_MIGRATION_SETUP) {
+            if (vbasedev->vm_running) {
+                ret = vfio_migration_set_state(vbasedev,
+                                          VFIO_DEVICE_STATE_MIGRATION_PRECOPY);
+                if (ret) {
+                    error_report("Failed to set state PRECOPY_ACTIVE");
+                }
+            } else {
+                ret = vfio_migration_set_state(vbasedev,
+                                        VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY);
+                if (ret) {
+                    error_report("Failed to set state STOPNCOPY_ACTIVE");
+                }
+            }
+        } else {
+            ret = vfio_migration_set_state(vbasedev,
+                                           VFIO_DEVICE_STATE_MIGRATION_RESUME);
+            if (ret) {
+                error_report("Failed to set state RESUME");
+            }
+        }
+        return;
+
+    case MIGRATION_STATUS_CANCELLING:
+    case MIGRATION_STATUS_CANCELLED:
+        ret = vfio_migration_set_state(vbasedev,
+                                       VFIO_DEVICE_STATE_MIGRATION_CANCELLED);
+        if (ret) {
+            error_report("Failed to set state CANCELLED");
+        }
+        return;
+
+    case MIGRATION_STATUS_FAILED:
+        ret = vfio_migration_set_state(vbasedev,
+                                       VFIO_DEVICE_STATE_MIGRATION_FAILED);
+        if (ret) {
+            error_report("Failed to set state FAILED");
+        }
+        return;
+    }
+}
+
+static int vfio_migration_init(VFIODevice *vbasedev,
+                               struct vfio_region_info *info)
+{
+    vbasedev->migration = g_new0(VFIOMigration, 1);
+    vbasedev->migration->region.index = info->index;
+
+    register_savevm_live(NULL, "vfio", -1, 1, &savevm_vfio_handlers, vbasedev);
+    vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
+                                                          vbasedev);
+
+    vbasedev->migration_state.notify = vfio_migration_state_notifier;
+    add_migration_state_change_notifier(&vbasedev->migration_state);
+
+    return 0;
+}
+
+
+/* ---------------------------------------------------------------------- */
+
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
+{
+    struct vfio_region_info *info;
+    int ret;
+
+    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
+                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
+    if (ret) {
+        Error *local_err = NULL;
+
+        error_setg(&vbasedev->migration_blocker,
+                   "VFIO device doesn't support migration");
+        ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            error_free(vbasedev->migration_blocker);
+            return ret;
+        }
+    } else {
+        return vfio_migration_init(vbasedev, info);
+    }
+
+    return 0;
+}
+
+void vfio_migration_finalize(VFIODevice *vbasedev)
+{
+    if (!vbasedev->migration) {
+        return;
+    }
+
+    if (vbasedev->vm_state) {
+        qemu_del_vm_change_state_handler(vbasedev->vm_state);
+        remove_migration_state_change_notifier(&vbasedev->migration_state);
+    }
+
+    if (vbasedev->migration_blocker) {
+        migrate_del_blocker(vbasedev->migration_blocker);
+        error_free(vbasedev->migration_blocker);
+    }
+
+    g_free(vbasedev->migration);
+}
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index a9036929b220..ab8217c9e249 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -30,6 +30,8 @@
 #include <linux/vfio.h>
 #endif
 
+#include "sysemu/sysemu.h"
+
 #define ERR_PREFIX "vfio error: %s: "
 #define WARN_PREFIX "vfio warning: %s: "
 
@@ -57,6 +59,16 @@ typedef struct VFIORegion {
     uint8_t nr; /* cache the region number for debug */
 } VFIORegion;
 
+typedef struct VFIOMigration {
+    struct {
+        VFIORegion buffer;
+        uint32_t index;
+    } region;
+    uint64_t pending_precopy_only;
+    uint64_t pending_compatible;
+    uint64_t pending_postcopy;
+} VFIOMigration;
+
 typedef struct VFIOAddressSpace {
     AddressSpace *as;
     QLIST_HEAD(, VFIOContainer) containers;
@@ -116,6 +128,12 @@ typedef struct VFIODevice {
     unsigned int num_irqs;
     unsigned int num_regions;
     unsigned int flags;
+    uint32_t device_state;
+    VMChangeStateEntry *vm_state;
+    int vm_running;
+    Notifier migration_state;
+    VFIOMigration *migration;
+    Error *migration_blocker;
 } VFIODevice;
 
 struct VFIODeviceOps {
@@ -193,4 +211,9 @@ int vfio_spapr_create_window(VFIOContainer *container,
 int vfio_spapr_remove_window(VFIOContainer *container,
                              hwaddr offset_within_address_space);
 
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
+void vfio_migration_finalize(VFIODevice *vbasedev);
+void vfio_get_dirty_page_list(VFIODevice *vbasedev, uint64_t start_addr,
+                               uint64_t pfn_count);
+
 #endif /* HW_VFIO_VFIO_COMMON_H */
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH 4/5] Add vfio_listerner_log_sync to mark dirty pages
  2018-11-20 20:39 [Qemu-devel] [PATCH 0/5] Add migration support for VFIO device Kirti Wankhede
                   ` (2 preceding siblings ...)
  2018-11-20 20:39 ` [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices Kirti Wankhede
@ 2018-11-20 20:39 ` Kirti Wankhede
  2018-11-22 20:00   ` Dr. David Alan Gilbert
  2018-11-20 20:39 ` [Qemu-devel] [PATCH 5/5] Make vfio-pci device migration capable Kirti Wankhede
  2018-11-21  5:47 ` [Qemu-devel] [PATCH 0/5] Add migration support for VFIO device Peter Xu
  5 siblings, 1 reply; 32+ messages in thread
From: Kirti Wankhede @ 2018-11-20 20:39 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	qemu-devel, Kirti Wankhede

vfio_listerner_log_sync gets list of dirty pages from vendor driver and mark
those pages dirty.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/common.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index fb396cf00ac4..338aad7426f0 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -697,9 +697,41 @@ static void vfio_listener_region_del(MemoryListener *listener,
     }
 }
 
+static void vfio_listerner_log_sync(MemoryListener *listener,
+                                    MemoryRegionSection *section)
+{
+    uint64_t start_addr, size, pfn_count;
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            switch (vbasedev->device_state) {
+            case VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
+            case VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY:
+                    continue;
+
+            default:
+                    return;
+            }
+        }
+    }
+
+    start_addr = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+    size = int128_get64(section->size);
+    pfn_count = size >> TARGET_PAGE_BITS;
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            vfio_get_dirty_page_list(vbasedev, start_addr, pfn_count);
+        }
+    }
+}
+
 static const MemoryListener vfio_memory_listener = {
     .region_add = vfio_listener_region_add,
     .region_del = vfio_listener_region_del,
+    .log_sync = vfio_listerner_log_sync,
 };
 
 static void vfio_listener_release(VFIOContainer *container)
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH 5/5] Make vfio-pci device migration capable.
  2018-11-20 20:39 [Qemu-devel] [PATCH 0/5] Add migration support for VFIO device Kirti Wankhede
                   ` (3 preceding siblings ...)
  2018-11-20 20:39 ` [Qemu-devel] [PATCH 4/5] Add vfio_listerner_log_sync to mark dirty pages Kirti Wankhede
@ 2018-11-20 20:39 ` Kirti Wankhede
  2018-11-21  5:47 ` [Qemu-devel] [PATCH 0/5] Add migration support for VFIO device Peter Xu
  5 siblings, 0 replies; 32+ messages in thread
From: Kirti Wankhede @ 2018-11-20 20:39 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	qemu-devel, Kirti Wankhede

Call vfio_migration_probe() and vfio_migration_finalize() functions for
vfio-pci device to enable migration for vfio PCI device.
Removed vfio_pci_vmstate structure.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/pci.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 72daf1a358a0..0f9d06981b1b 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2930,6 +2930,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     vdev->vbasedev.ops = &vfio_pci_ops;
     vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
     vdev->vbasedev.dev = &vdev->pdev.qdev;
+    vdev->vbasedev.device_state = VFIO_DEVICE_STATE_NONE;
 
     tmp = g_strdup_printf("%s/iommu_group", vdev->vbasedev.sysfsdev);
     len = readlink(tmp, group_path, sizeof(group_path));
@@ -3141,10 +3142,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         }
     }
 
+    ret = vfio_migration_probe(&vdev->vbasedev, errp);
+
     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);
-
     return;
 
 out_teardown:
@@ -3180,6 +3182,8 @@ static void vfio_exitfn(PCIDevice *pdev)
 {
     VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
 
+    vdev->vbasedev.device_state = VFIO_DEVICE_STATE_NONE;
+
     vfio_unregister_req_notifier(vdev);
     vfio_unregister_err_notifier(vdev);
     pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
@@ -3189,6 +3193,7 @@ static void vfio_exitfn(PCIDevice *pdev)
     }
     vfio_teardown_msi(vdev);
     vfio_bars_exit(vdev);
+    vfio_migration_finalize(&vdev->vbasedev);
 }
 
 static void vfio_pci_reset(DeviceState *dev)
@@ -3294,11 +3299,6 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
-static const VMStateDescription vfio_pci_vmstate = {
-    .name = "vfio-pci",
-    .unmigratable = 1,
-};
-
 static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
@@ -3306,7 +3306,6 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 
     dc->reset = vfio_pci_reset;
     dc->props = vfio_pci_dev_properties;
-    dc->vmsd = &vfio_pci_vmstate;
     dc->desc = "VFIO-based PCI device assignment";
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
     pdc->realize = vfio_realize;
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface
  2018-11-20 20:39 ` [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface Kirti Wankhede
@ 2018-11-21  0:26   ` Tian, Kevin
  2018-11-21  4:24     ` Kirti Wankhede
  2018-11-21 17:26   ` Pierre Morel
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 32+ messages in thread
From: Tian, Kevin @ 2018-11-21  0:26 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, cjia
  Cc: Yang, Ziye, Liu, Changpeng, Liu, Yi L, mlevitsk, eskultet,
	cohuck, dgilbert, jonathan.davies, eauger, aik, pasic, felipe,
	Zhengxiao.zx, shuangtai.tst, Ken.Xue, Wang, Zhi A, qemu-devel

> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> Sent: Wednesday, November 21, 2018 4:40 AM
> 
> - Defined MIGRATION region type and sub-type.
> - Defined VFIO device states during migration process.
> - Defined vfio_device_migration_info structure which will be placed at 0th
>   offset of migration region to get/set VFIO device related information.
>   Defined actions and members of structure usage for each action:
>     * To convey VFIO device state to be transitioned to.
>     * To get pending bytes yet to be migrated for VFIO device
>     * To ask driver to write data to migration region and return number of
> bytes
>       written in the region
>     * In migration resume path, user space app writes to migration region
> and
>       communicates it to vendor driver.
>     * Get bitmap of dirty pages from vendor driver from given start address
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  linux-headers/linux/vfio.h | 130
> +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 130 insertions(+)
> 
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 3615a269d378..a6e45cb2cae2 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -301,6 +301,10 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
> 
> +/* Migration region type and sub-type */
> +#define VFIO_REGION_TYPE_MIGRATION	        (1 << 30)
> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be
> mmapped
>   * which allows direct access to non-MSIX registers which happened to be
> within
> @@ -602,6 +606,132 @@ struct vfio_device_ioeventfd {
> 
>  #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE
> + 16)
> 
> +/**
> + * VFIO device states :
> + * VFIO User space application should set the device state to indicate
> vendor
> + * driver in which state the VFIO device should transitioned.
> + * - VFIO_DEVICE_STATE_NONE:
> + *   State when VFIO device is initialized but not yet running.
> + * - VFIO_DEVICE_STATE_RUNNING:
> + *   Transition VFIO device in running state, that is, user space application
> or
> + *   VM is active.
> + * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
> + *   Transition VFIO device in migration setup state. This is used to prepare
> + *   VFIO device for migration while application or VM and vCPUs are still in
> + *   running state.
> + * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
> + *   When VFIO user space application or VM is active and vCPUs are
> running,
> + *   transition VFIO device in pre-copy state.
> + * - VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY:
> + *   When VFIO user space application or VM is stopped and vCPUs are
> halted,
> + *   transition VFIO device in stop-and-copy state.
> + * - VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED:
> + *   When VFIO user space application has copied data provided by vendor
> driver.
> + *   This state is used by vendor driver to clean up all software state that
> was
> + *   setup during MIGRATION_SETUP state.
> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME:
> + *   Transition VFIO device to resume state, that is, start resuming VFIO
> device
> + *   when user space application or VM is not running and vCPUs are
> halted.
> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED:
> + *   When user space application completes iterations of providing device
> state
> + *   data, transition device in resume completed state.
> + * - VFIO_DEVICE_STATE_MIGRATION_FAILED:
> + *   Migration process failed due to some reason, transition device to
> failed
> + *   state. If migration process fails while saving at source, resume device
> at
> + *   source. If migration process fails while resuming application or VM at
> + *   destination, stop restoration at destination and resume at source.
> + * - VFIO_DEVICE_STATE_MIGRATION_CANCELLED:
> + *   User space application has cancelled migration process either for some
> + *   known reason or due to user's intervention. Transition device to
> Cancelled
> + *   state, that is, resume device state as it was during running state at
> + *   source.
> + */
> +
> +enum {
> +    VFIO_DEVICE_STATE_NONE,
> +    VFIO_DEVICE_STATE_RUNNING,
> +    VFIO_DEVICE_STATE_MIGRATION_SETUP,
> +    VFIO_DEVICE_STATE_MIGRATION_PRECOPY,
> +    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY,
> +    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
> +    VFIO_DEVICE_STATE_MIGRATION_RESUME,
> +    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
> +    VFIO_DEVICE_STATE_MIGRATION_FAILED,
> +    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
> +};

We discussed in KVM forum to define the interfaces around the state
itself, instead of around live migration flow. Looks this version doesn't 
move that way?

quote the summary from Alex, which though high level but simple
enough to demonstrate the idea:

--
Here we would define "registers" for putting the device in various 
states through the migration process, for example enabling dirty logging, 
suspending the device, resuming the device, direction of data flow 
through the device state area, etc.
--

based on that we just need much fewer states, e.g. {RUNNING, 
RUNNING_DIRTYLOG, STOPPED}. data flow direction doesn't need
to be a state. could just a flag in the region. Those are sufficient to 
enable vGPU live migration on Intel platform. nvidia or other vendors
may have more requirements, which could lead to addition of new
states - but again, they should be defined in a way not tied to migration
flow.

Thanks
Kevin

> +
> +/**
> + * Structure vfio_device_migration_info is placed at 0th offset of
> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device
> related migration
> + * information.
> + *
> + * Action Set state:
> + *      To tell vendor driver the state VFIO device should be transitioned to.
> + *      device_state [input] : User space app sends device state to vendor
> + *           driver on state change, the state to which VFIO device should be
> + *           transitioned to.
> + *
> + * Action Get pending bytes:
> + *      To get pending bytes yet to be migrated from vendor driver
> + *      pending.threshold_size [Input] : threshold of buffer in User space
> app.
> + *      pending.precopy_only [output] : pending data which must be
> migrated in
> + *          precopy phase or in stopped state, in other words - before target
> + *          user space application or VM start. In case of migration, this
> + *          indicates pending bytes to be transfered while application or VM
> or
> + *          vCPUs are active and running.
> + *      pending.compatible [output] : pending data which may be migrated
> any
> + *          time , either when application or VM is active and vCPUs are active
> + *          or when application or VM is halted and vCPUs are halted.
> + *      pending.postcopy_only [output] : pending data which must be
> migrated in
> + *           postcopy phase or in stopped state, in other words - after source
> + *           application or VM stopped and vCPUs are halted.
> + *      Sum of pending.precopy_only, pending.compatible and
> + *      pending.postcopy_only is the whole amount of pending data.
> + *
> + * Action Get buffer:
> + *      On this action, vendor driver should write data to migration region
> and
> + *      return number of bytes written in the region.
> + *      data.offset [output] : offset in the region from where data is written.
> + *      data.size [output] : number of bytes written in migration buffer by
> + *          vendor driver.
> + *
> + * Action Set buffer:
> + *      In migration resume path, user space app writes to migration region
> and
> + *      communicates it to vendor driver with this action.
> + *      data.offset [Input] : offset in the region from where data is written.
> + *      data.size [Input] : number of bytes written in migration buffer by
> + *          user space app.
> + *
> + * Action Get dirty pages bitmap:
> + *      Get bitmap of dirty pages from vendor driver from given start
> address.
> + *      dirty_pfns.start_addr [Input] : start address
> + *      dirty_pfns.total [Input] : Total pfn count from start_addr for which
> + *          dirty bitmap is requested
> + *      dirty_pfns.copied [Output] : pfn count for which dirty bitmap is
> copied
> + *          to migration region.
> + *      Vendor driver should copy the bitmap with bits set only for pages to
> be
> + *      marked dirty in migration region.
> + */
> +
> +struct vfio_device_migration_info {
> +        __u32 device_state;         /* VFIO device state */
> +        struct {
> +            __u64 precopy_only;
> +            __u64 compatible;
> +            __u64 postcopy_only;
> +            __u64 threshold_size;
> +        } pending;
> +        struct {
> +            __u64 offset;           /* offset */
> +            __u64 size;             /* size */
> +        } data;
> +        struct {
> +            __u64 start_addr;
> +            __u64 total;
> +            __u64 copied;
> +        } dirty_pfns;
> +} __attribute__((packed));
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
> 
>  /**
> --
> 2.7.0

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface
  2018-11-21  0:26   ` Tian, Kevin
@ 2018-11-21  4:24     ` Kirti Wankhede
  2018-11-21  6:13       ` Tian, Kevin
  0 siblings, 1 reply; 32+ messages in thread
From: Kirti Wankhede @ 2018-11-21  4:24 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, cjia
  Cc: Yang, Ziye, Liu, Changpeng, Liu, Yi L, mlevitsk, eskultet,
	cohuck, dgilbert, jonathan.davies, eauger, aik, pasic, felipe,
	Zhengxiao.zx, shuangtai.tst, Ken.Xue, Wang, Zhi A, qemu-devel



On 11/21/2018 5:56 AM, Tian, Kevin wrote:
>> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
>> Sent: Wednesday, November 21, 2018 4:40 AM
>>
>> - Defined MIGRATION region type and sub-type.
>> - Defined VFIO device states during migration process.
>> - Defined vfio_device_migration_info structure which will be placed at 0th
>>   offset of migration region to get/set VFIO device related information.
>>   Defined actions and members of structure usage for each action:
>>     * To convey VFIO device state to be transitioned to.
>>     * To get pending bytes yet to be migrated for VFIO device
>>     * To ask driver to write data to migration region and return number of
>> bytes
>>       written in the region
>>     * In migration resume path, user space app writes to migration region
>> and
>>       communicates it to vendor driver.
>>     * Get bitmap of dirty pages from vendor driver from given start address
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  linux-headers/linux/vfio.h | 130
>> +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 130 insertions(+)
>>
>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>> index 3615a269d378..a6e45cb2cae2 100644
>> --- a/linux-headers/linux/vfio.h
>> +++ b/linux-headers/linux/vfio.h
>> @@ -301,6 +301,10 @@ struct vfio_region_info_cap_type {
>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>>
>> +/* Migration region type and sub-type */
>> +#define VFIO_REGION_TYPE_MIGRATION	        (1 << 30)
>> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
>> +
>>  /*
>>   * The MSIX mappable capability informs that MSIX data of a BAR can be
>> mmapped
>>   * which allows direct access to non-MSIX registers which happened to be
>> within
>> @@ -602,6 +606,132 @@ struct vfio_device_ioeventfd {
>>
>>  #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE
>> + 16)
>>
>> +/**
>> + * VFIO device states :
>> + * VFIO User space application should set the device state to indicate
>> vendor
>> + * driver in which state the VFIO device should transitioned.
>> + * - VFIO_DEVICE_STATE_NONE:
>> + *   State when VFIO device is initialized but not yet running.
>> + * - VFIO_DEVICE_STATE_RUNNING:
>> + *   Transition VFIO device in running state, that is, user space application
>> or
>> + *   VM is active.
>> + * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
>> + *   Transition VFIO device in migration setup state. This is used to prepare
>> + *   VFIO device for migration while application or VM and vCPUs are still in
>> + *   running state.
>> + * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
>> + *   When VFIO user space application or VM is active and vCPUs are
>> running,
>> + *   transition VFIO device in pre-copy state.
>> + * - VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY:
>> + *   When VFIO user space application or VM is stopped and vCPUs are
>> halted,
>> + *   transition VFIO device in stop-and-copy state.
>> + * - VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED:
>> + *   When VFIO user space application has copied data provided by vendor
>> driver.
>> + *   This state is used by vendor driver to clean up all software state that
>> was
>> + *   setup during MIGRATION_SETUP state.
>> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME:
>> + *   Transition VFIO device to resume state, that is, start resuming VFIO
>> device
>> + *   when user space application or VM is not running and vCPUs are
>> halted.
>> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED:
>> + *   When user space application completes iterations of providing device
>> state
>> + *   data, transition device in resume completed state.
>> + * - VFIO_DEVICE_STATE_MIGRATION_FAILED:
>> + *   Migration process failed due to some reason, transition device to
>> failed
>> + *   state. If migration process fails while saving at source, resume device
>> at
>> + *   source. If migration process fails while resuming application or VM at
>> + *   destination, stop restoration at destination and resume at source.
>> + * - VFIO_DEVICE_STATE_MIGRATION_CANCELLED:
>> + *   User space application has cancelled migration process either for some
>> + *   known reason or due to user's intervention. Transition device to
>> Cancelled
>> + *   state, that is, resume device state as it was during running state at
>> + *   source.
>> + */
>> +
>> +enum {
>> +    VFIO_DEVICE_STATE_NONE,
>> +    VFIO_DEVICE_STATE_RUNNING,
>> +    VFIO_DEVICE_STATE_MIGRATION_SETUP,
>> +    VFIO_DEVICE_STATE_MIGRATION_PRECOPY,
>> +    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY,
>> +    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
>> +    VFIO_DEVICE_STATE_MIGRATION_RESUME,
>> +    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
>> +    VFIO_DEVICE_STATE_MIGRATION_FAILED,
>> +    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
>> +};
> 
> We discussed in KVM forum to define the interfaces around the state
> itself, instead of around live migration flow. Looks this version doesn't 
> move that way?
> 

This is patch series is along the discussion we had at KVM forum.

> quote the summary from Alex, which though high level but simple
> enough to demonstrate the idea:
> 
> --
> Here we would define "registers" for putting the device in various 
> states through the migration process, for example enabling dirty logging, 
> suspending the device, resuming the device, direction of data flow 
> through the device state area, etc.
> --
> 

Defined a packed structure here to map it at 0th offset of migration
region so that offset can be calculated by offset_of(), you may call
same as register definitions.


> based on that we just need much fewer states, e.g. {RUNNING, 
> RUNNING_DIRTYLOG, STOPPED}. data flow direction doesn't need
> to be a state. could just a flag in the region.

Flag is not preferred here, multiple flags can be set at a time.
Here need finite states with its proper definition what that device
state means to driver and user space application.
For Intel or others who don't need other states can ignore the state in
driver by taking no action on pwrite on .device_state offset. For
example for Intel driver could only take action on state change to
VFIO_DEVICE_STATE_RUNNING and VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY

I think dirty page logging is not a VFIO device's state.
.log_sync of MemoryListener is called during both :
- PRECOPY phase i.e. while vCPUs are still running and
- during STOPNCOPY phase i.e. when vCPUs are stopped.


> Those are sufficient to 
> enable vGPU live migration on Intel platform. nvidia or other vendors
> may have more requirements, which could lead to addition of new
> states - but again, they should be defined in a way not tied to migration
> flow.
> 

I had tried to explain the intend of each state. Please go through the
comments above.
Also please take a look at other patches, mainly
0003-Add-migration-functions-for-VFIO-devices.patch to understand why
these states are required.

Thanks,
Kirti

> Thanks
> Kevin
> 
>> +
>> +/**
>> + * Structure vfio_device_migration_info is placed at 0th offset of
>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device
>> related migration
>> + * information.
>> + *
>> + * Action Set state:
>> + *      To tell vendor driver the state VFIO device should be transitioned to.
>> + *      device_state [input] : User space app sends device state to vendor
>> + *           driver on state change, the state to which VFIO device should be
>> + *           transitioned to.
>> + *
>> + * Action Get pending bytes:
>> + *      To get pending bytes yet to be migrated from vendor driver
>> + *      pending.threshold_size [Input] : threshold of buffer in User space
>> app.
>> + *      pending.precopy_only [output] : pending data which must be
>> migrated in
>> + *          precopy phase or in stopped state, in other words - before target
>> + *          user space application or VM start. In case of migration, this
>> + *          indicates pending bytes to be transfered while application or VM
>> or
>> + *          vCPUs are active and running.
>> + *      pending.compatible [output] : pending data which may be migrated
>> any
>> + *          time , either when application or VM is active and vCPUs are active
>> + *          or when application or VM is halted and vCPUs are halted.
>> + *      pending.postcopy_only [output] : pending data which must be
>> migrated in
>> + *           postcopy phase or in stopped state, in other words - after source
>> + *           application or VM stopped and vCPUs are halted.
>> + *      Sum of pending.precopy_only, pending.compatible and
>> + *      pending.postcopy_only is the whole amount of pending data.
>> + *
>> + * Action Get buffer:
>> + *      On this action, vendor driver should write data to migration region
>> and
>> + *      return number of bytes written in the region.
>> + *      data.offset [output] : offset in the region from where data is written.
>> + *      data.size [output] : number of bytes written in migration buffer by
>> + *          vendor driver.
>> + *
>> + * Action Set buffer:
>> + *      In migration resume path, user space app writes to migration region
>> and
>> + *      communicates it to vendor driver with this action.
>> + *      data.offset [Input] : offset in the region from where data is written.
>> + *      data.size [Input] : number of bytes written in migration buffer by
>> + *          user space app.
>> + *
>> + * Action Get dirty pages bitmap:
>> + *      Get bitmap of dirty pages from vendor driver from given start
>> address.
>> + *      dirty_pfns.start_addr [Input] : start address
>> + *      dirty_pfns.total [Input] : Total pfn count from start_addr for which
>> + *          dirty bitmap is requested
>> + *      dirty_pfns.copied [Output] : pfn count for which dirty bitmap is
>> copied
>> + *          to migration region.
>> + *      Vendor driver should copy the bitmap with bits set only for pages to
>> be
>> + *      marked dirty in migration region.
>> + */
>> +
>> +struct vfio_device_migration_info {
>> +        __u32 device_state;         /* VFIO device state */
>> +        struct {
>> +            __u64 precopy_only;
>> +            __u64 compatible;
>> +            __u64 postcopy_only;
>> +            __u64 threshold_size;
>> +        } pending;
>> +        struct {
>> +            __u64 offset;           /* offset */
>> +            __u64 size;             /* size */
>> +        } data;
>> +        struct {
>> +            __u64 start_addr;
>> +            __u64 total;
>> +            __u64 copied;
>> +        } dirty_pfns;
>> +} __attribute__((packed));
>> +
>>  /* -------- API for Type1 VFIO IOMMU -------- */
>>
>>  /**
>> --
>> 2.7.0
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 2/5] Add save and load functions for VFIO PCI devices
  2018-11-20 20:39 ` [Qemu-devel] [PATCH 2/5] Add save and load functions for VFIO PCI devices Kirti Wankhede
@ 2018-11-21  5:32   ` Peter Xu
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Xu @ 2018-11-21  5:32 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, cjia, Zhengxiao.zx, kevin.tian, yi.l.liu,
	eskultet, ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

On Wed, Nov 21, 2018 at 02:09:40AM +0530, Kirti Wankhede wrote:
> Save and restore with MSIX type is not tested.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/pci.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.h | 29 ++++++++++++++++++
>  2 files changed, 124 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 6cbb8fa0549d..72daf1a358a0 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1234,6 +1234,101 @@ void vfio_pci_write_config(PCIDevice *pdev,
>      }
>  }
>  
> +void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> +{
> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +    int i;
> +
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        uint32_t bar;
> +
> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> +        qemu_put_be32(f, bar);

Is it possible to avoid calling qemu_put_*() directly from vfio code?
E.g., using VMStateDescription and hooks like pre_save, post_load and
etc.  Then we update all the data into the data structure and leave
all the rest of IO operations to general migration framework.

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 0/5] Add migration support for VFIO device
  2018-11-20 20:39 [Qemu-devel] [PATCH 0/5] Add migration support for VFIO device Kirti Wankhede
                   ` (4 preceding siblings ...)
  2018-11-20 20:39 ` [Qemu-devel] [PATCH 5/5] Make vfio-pci device migration capable Kirti Wankhede
@ 2018-11-21  5:47 ` Peter Xu
  2018-11-22 21:01   ` Kirti Wankhede
  5 siblings, 1 reply; 32+ messages in thread
From: Peter Xu @ 2018-11-21  5:47 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, cjia, Zhengxiao.zx, kevin.tian, yi.l.liu,
	eskultet, ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

On Wed, Nov 21, 2018 at 02:09:38AM +0530, Kirti Wankhede wrote:
> Add migration support for VFIO device

Hi, Kirti,

I failed to apply the series cleanly onto master.  Could you push the
tree somewhere so that people might read the work easier?  Or would
you tell me the base commit, then I can apply it myself.

Thanks in advance,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface
  2018-11-21  4:24     ` Kirti Wankhede
@ 2018-11-21  6:13       ` Tian, Kevin
  2018-11-22 20:01         ` Kirti Wankhede
  0 siblings, 1 reply; 32+ messages in thread
From: Tian, Kevin @ 2018-11-21  6:13 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, cjia
  Cc: Yang, Ziye, Liu, Changpeng, Liu, Yi L, mlevitsk, eskultet,
	cohuck, dgilbert, jonathan.davies, eauger, aik, pasic, felipe,
	Zhengxiao.zx, shuangtai.tst, Ken.Xue, Wang, Zhi A, qemu-devel

> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> Sent: Wednesday, November 21, 2018 12:24 PM
> 
> 
> On 11/21/2018 5:56 AM, Tian, Kevin wrote:
> >> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> >> Sent: Wednesday, November 21, 2018 4:40 AM
> >>
> >> - Defined MIGRATION region type and sub-type.
> >> - Defined VFIO device states during migration process.
> >> - Defined vfio_device_migration_info structure which will be placed at
> 0th
> >>   offset of migration region to get/set VFIO device related information.
> >>   Defined actions and members of structure usage for each action:
> >>     * To convey VFIO device state to be transitioned to.
> >>     * To get pending bytes yet to be migrated for VFIO device
> >>     * To ask driver to write data to migration region and return number of
> >> bytes
> >>       written in the region
> >>     * In migration resume path, user space app writes to migration region
> >> and
> >>       communicates it to vendor driver.
> >>     * Get bitmap of dirty pages from vendor driver from given start
> address
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>  linux-headers/linux/vfio.h | 130
> >> +++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 130 insertions(+)
> >>
> >> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >> index 3615a269d378..a6e45cb2cae2 100644
> >> --- a/linux-headers/linux/vfio.h
> >> +++ b/linux-headers/linux/vfio.h
> >> @@ -301,6 +301,10 @@ struct vfio_region_info_cap_type {
> >>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
> >>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
> >>
> >> +/* Migration region type and sub-type */
> >> +#define VFIO_REGION_TYPE_MIGRATION	        (1 << 30)
> >> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> >> +
> >>  /*
> >>   * The MSIX mappable capability informs that MSIX data of a BAR can be
> >> mmapped
> >>   * which allows direct access to non-MSIX registers which happened to
> be
> >> within
> >> @@ -602,6 +606,132 @@ struct vfio_device_ioeventfd {
> >>
> >>  #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE
> >> + 16)
> >>
> >> +/**
> >> + * VFIO device states :
> >> + * VFIO User space application should set the device state to indicate
> >> vendor
> >> + * driver in which state the VFIO device should transitioned.
> >> + * - VFIO_DEVICE_STATE_NONE:
> >> + *   State when VFIO device is initialized but not yet running.
> >> + * - VFIO_DEVICE_STATE_RUNNING:
> >> + *   Transition VFIO device in running state, that is, user space
> application
> >> or
> >> + *   VM is active.
> >> + * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
> >> + *   Transition VFIO device in migration setup state. This is used to
> prepare
> >> + *   VFIO device for migration while application or VM and vCPUs are
> still in
> >> + *   running state.
> >> + * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
> >> + *   When VFIO user space application or VM is active and vCPUs are
> >> running,
> >> + *   transition VFIO device in pre-copy state.
> >> + * - VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY:
> >> + *   When VFIO user space application or VM is stopped and vCPUs are
> >> halted,
> >> + *   transition VFIO device in stop-and-copy state.
> >> + * - VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED:
> >> + *   When VFIO user space application has copied data provided by
> vendor
> >> driver.
> >> + *   This state is used by vendor driver to clean up all software state that
> >> was
> >> + *   setup during MIGRATION_SETUP state.
> >> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME:
> >> + *   Transition VFIO device to resume state, that is, start resuming VFIO
> >> device
> >> + *   when user space application or VM is not running and vCPUs are
> >> halted.
> >> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED:
> >> + *   When user space application completes iterations of providing
> device
> >> state
> >> + *   data, transition device in resume completed state.
> >> + * - VFIO_DEVICE_STATE_MIGRATION_FAILED:
> >> + *   Migration process failed due to some reason, transition device to
> >> failed
> >> + *   state. If migration process fails while saving at source, resume
> device
> >> at
> >> + *   source. If migration process fails while resuming application or VM
> at
> >> + *   destination, stop restoration at destination and resume at source.
> >> + * - VFIO_DEVICE_STATE_MIGRATION_CANCELLED:
> >> + *   User space application has cancelled migration process either for
> some
> >> + *   known reason or due to user's intervention. Transition device to
> >> Cancelled
> >> + *   state, that is, resume device state as it was during running state at
> >> + *   source.
> >> + */
> >> +
> >> +enum {
> >> +    VFIO_DEVICE_STATE_NONE,
> >> +    VFIO_DEVICE_STATE_RUNNING,
> >> +    VFIO_DEVICE_STATE_MIGRATION_SETUP,
> >> +    VFIO_DEVICE_STATE_MIGRATION_PRECOPY,
> >> +    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY,
> >> +    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
> >> +    VFIO_DEVICE_STATE_MIGRATION_RESUME,
> >> +    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
> >> +    VFIO_DEVICE_STATE_MIGRATION_FAILED,
> >> +    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
> >> +};
> >
> > We discussed in KVM forum to define the interfaces around the state
> > itself, instead of around live migration flow. Looks this version doesn't
> > move that way?
> >
> 
> This is patch series is along the discussion we had at KVM forum.
> 
> > quote the summary from Alex, which though high level but simple
> > enough to demonstrate the idea:
> >
> > --
> > Here we would define "registers" for putting the device in various
> > states through the migration process, for example enabling dirty logging,
> > suspending the device, resuming the device, direction of data flow
> > through the device state area, etc.
> > --
> >
> 
> Defined a packed structure here to map it at 0th offset of migration
> region so that offset can be calculated by offset_of(), you may call
> same as register definitions.

yes, this part is a good change. My comment was around state definition
itself.

> 
> 
> > based on that we just need much fewer states, e.g. {RUNNING,
> > RUNNING_DIRTYLOG, STOPPED}. data flow direction doesn't need
> > to be a state. could just a flag in the region.
> 
> Flag is not preferred here, multiple flags can be set at a time.
> Here need finite states with its proper definition what that device
> state means to driver and user space application.
> For Intel or others who don't need other states can ignore the state in
> driver by taking no action on pwrite on .device_state offset. For
> example for Intel driver could only take action on state change to
> VFIO_DEVICE_STATE_RUNNING and
> VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY
> 
> I think dirty page logging is not a VFIO device's state.
> .log_sync of MemoryListener is called during both :
> - PRECOPY phase i.e. while vCPUs are still running and
> - during STOPNCOPY phase i.e. when vCPUs are stopped.
> 
> 
> > Those are sufficient to
> > enable vGPU live migration on Intel platform. nvidia or other vendors
> > may have more requirements, which could lead to addition of new
> > states - but again, they should be defined in a way not tied to migration
> > flow.
> >
> 
> I had tried to explain the intend of each state. Please go through the
> comments above.
> Also please take a look at other patches, mainly
> 0003-Add-migration-functions-for-VFIO-devices.patch to understand why
> these states are required.
> 

I looked at the explanations in this patch, but still didn't get the intention, e.g.:

+ * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
+ *   Transition VFIO device in migration setup state. This is used to prepare
+ *   VFIO device for migration while application or VM and vCPUs are still in
+ *   running state.

what preparation is actually required? any example?

+ * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
+ *   When VFIO user space application or VM is active and vCPUs are running,
+ *   transition VFIO device in pre-copy state.

why does device driver need know this stage? in precopy phase, the VM
is still running. Just dirty page tracking is in progress. the dirty bitmap could
be retrieved through its own action interface.

you have code to demonstrate how those states are transitioned in Qemu,
but you didn't show evidence why those states are necessary in device side,
which leads to the puzzle whether the definition is over-killed and limiting.

the flow in my mind is like below:

1. an interface to turn on/off dirty page tracking on VFIO device:
	* vendor driver can do whatever required to enable device specific
dirty page tracking mechanism here
	* device state is not changed here. still in running state

2. an interface to get dirty page bitmap

3. an interface to start/stop device activity
	* the effect of stop is to stop and drain in-the-fly device activities and
make device state ready for dump-out. vendor driver can do specific preparation 
here
	* the effect of start is to check validity of device state and then resume
device activities. again, vendor driver can do specific cleanup/preparation here

4. an interface to save/restore device state
	* should happen when device is stopped
	* of course there is still an open how to check state compatibility as
Alex pointed earlier

this way above interfaces are not tied to migration. other usages which are
interested in device state could also work (e.g. snapshot). If it doesn't work
with your device, it's better that you can elaborate your requirement with more
concrete examples. Then people will judge the necessity of a more complex
interface as proposed in this series...

Thanks
Kevin



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices
  2018-11-20 20:39 ` [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices Kirti Wankhede
@ 2018-11-21  7:39   ` Zhao, Yan Y
  2018-11-22 21:21     ` Kirti Wankhede
  2018-11-22  8:22   ` Zhao, Yan Y
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 32+ messages in thread
From: Zhao, Yan Y @ 2018-11-21  7:39 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, cjia
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic



> -----Original Message-----
> From: Qemu-devel [mailto:qemu-devel-
> bounces+yan.y.zhao=intel.com@nongnu.org] On Behalf Of Kirti Wankhede
> Sent: Wednesday, November 21, 2018 4:40 AM
> To: alex.williamson@redhat.com; cjia@nvidia.com
> Cc: Zhengxiao.zx@Alibaba-inc.com; Tian, Kevin <kevin.tian@intel.com>; Liu, Yi L
> <yi.l.liu@intel.com>; eskultet@redhat.com; Yang, Ziye <ziye.yang@intel.com>;
> qemu-devel@nongnu.org; cohuck@redhat.com; shuangtai.tst@alibaba-inc.com;
> dgilbert@redhat.com; Wang, Zhi A <zhi.a.wang@intel.com>;
> mlevitsk@redhat.com; pasic@linux.ibm.com; aik@ozlabs.ru; Kirti Wankhede
> <kwankhede@nvidia.com>; eauger@redhat.com; felipe@nutanix.com;
> jonathan.davies@nutanix.com; Liu, Changpeng <changpeng.liu@intel.com>;
> Ken.Xue@amd.com
> Subject: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices
> 
> - Migration function are implemented for VFIO_DEVICE_TYPE_PCI device.
> - Added SaveVMHandlers and implemented all basic functions required for live
>   migration.
> - Added VM state change handler to know running or stopped state of VM.
> - Added migration state change notifier to get notification on migration state
>   change. This state is translated to VFIO device state and conveyed to vendor
>   driver.
> - VFIO device supportd migration or not is decided based of migration region
>   query. If migration region query is successful then migration is supported
>   else migration is blocked.
> - Structure vfio_device_migration_info is mapped at 0th offset of migration
>   region and should always trapped by VFIO device's driver. Added both type of
>   access support, trapped or mmapped, for data section of the region.
> - To save device state, read data offset and size using structure
>   vfio_device_migration_info.data, accordingly copy data from the region.
> - To restore device state, write data offset and size in the structure and write
>   data in the region.
> - To get dirty page bitmap, write start address and pfn count then read count of
>   pfns copied and accordingly read those from the rest of the region or mmaped
>   part of the region. This copy is iterated till page bitmap for all requested
>   pfns are copied.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/migration.c           | 729
> ++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |  23 ++
>  3 files changed, 753 insertions(+), 1 deletion(-)  create mode 100644
> hw/vfio/migration.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs index
> a2e7a0a7cf02..2cf2ba1440f2 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -1,5 +1,5 @@
>  ifeq ($(CONFIG_LINUX), y)
> -obj-$(CONFIG_SOFTMMU) += common.o
> +obj-$(CONFIG_SOFTMMU) += common.o migration.o
>  obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
>  obj-$(CONFIG_VFIO_CCW) += ccw.o
>  obj-$(CONFIG_SOFTMMU) += platform.o
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c new file mode 100644
> index 000000000000..717fb63e4f43
> --- /dev/null
> +++ b/hw/vfio/migration.c
> @@ -0,0 +1,729 @@
> +/*
> + * Migration support for VFIO devices
> + *
> + * Copyright NVIDIA, Inc. 2018
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include <linux/vfio.h>
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "cpu.h"
> +#include "migration/migration.h"
> +#include "migration/qemu-file.h"
> +#include "migration/register.h"
> +#include "migration/blocker.h"
> +#include "migration/misc.h"
> +#include "qapi/error.h"
> +#include "exec/ramlist.h"
> +#include "exec/ram_addr.h"
> +#include "pci.h"
> +
> +/*
> + * Flags used as delimiter:
> + * 0xffffffff => MSB 32-bit all 1s
> + * 0xef10     => emulated (virtual) function IO
> + * 0x0000     => 16-bits reserved for flags
> + */
> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> +
> +static void vfio_migration_region_exit(VFIODevice *vbasedev) {
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    if (!migration) {
> +        return;
> +    }
> +
> +    if (migration->region.buffer.size) {
> +        vfio_region_exit(&migration->region.buffer);
> +        vfio_region_finalize(&migration->region.buffer);
> +    }
> +}
> +
> +static int vfio_migration_region_init(VFIODevice *vbasedev) {
> +    VFIOMigration *migration = vbasedev->migration;
> +    Object *obj = NULL;
> +    int ret = -EINVAL;
> +
> +    if (!migration) {
> +        return ret;
> +    }
> +
> +    /* Migration support added for PCI device only */
> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +        obj = vfio_pci_get_object(vbasedev);
> +    }
> +
> +    if (!obj) {
> +        return ret;
> +    }
> +
> +    ret = vfio_region_setup(obj, vbasedev, &migration->region.buffer,
> +                            migration->region.index, "migration");
> +    if (ret) {
> +        error_report("Failed to setup VFIO migration region %d: %s",
> +                      migration->region.index, strerror(-ret));
> +        goto err;
> +    }
> +
> +    if (!migration->region.buffer.size) {
> +        ret = -EINVAL;
> +        error_report("Invalid region size of VFIO migration region %d: %s",
> +                     migration->region.index, strerror(-ret));
> +        goto err;
> +    }
> +
> +    if (migration->region.buffer.mmaps) {
> +        ret = vfio_region_mmap(&migration->region.buffer);
> +        if (ret) {
> +            error_report("Failed to mmap VFIO migration region %d: %s",
> +                         migration->region.index, strerror(-ret));
> +            goto err;
> +        }
> +    }
> +
> +    return 0;
> +
> +err:
> +    vfio_migration_region_exit(vbasedev);
> +    return ret;
> +}
> +
> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t
> +state) {
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    int ret = 0;
> +
> +    if (vbasedev->device_state == state) {
> +        return ret;
> +    }
> +
> +    ret = pwrite(vbasedev->fd, &state, sizeof(state),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              device_state));
> +    if (ret < 0) {
> +        error_report("Failed to set migration state %d %s",
> +                     ret, strerror(errno));
> +        return ret;
> +    }
> +
> +    vbasedev->device_state = state;
> +    return ret;
> +}
> +
> +void vfio_get_dirty_page_list(VFIODevice *vbasedev,
> +                              uint64_t start_addr,
> +                              uint64_t pfn_count) {
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    struct vfio_device_migration_info migration_info;
> +    uint64_t count = 0;
> +    int ret;
> +
> +    migration_info.dirty_pfns.start_addr = start_addr;
> +    migration_info.dirty_pfns.total = pfn_count;
> +
> +    ret = pwrite(vbasedev->fd, &migration_info.dirty_pfns,
> +                 sizeof(migration_info.dirty_pfns),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              dirty_pfns));
> +    if (ret < 0) {
> +        error_report("Failed to set dirty pages start address %d %s",
> +                ret, strerror(errno));
> +        return;
> +    }
> +
> +    do {
> +        /* Read dirty_pfns.copied */
> +        ret = pread(vbasedev->fd, &migration_info.dirty_pfns,
> +                sizeof(migration_info.dirty_pfns),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             dirty_pfns));
> +        if (ret < 0) {
> +            error_report("Failed to get dirty pages bitmap count %d %s",
> +                    ret, strerror(errno));
> +            return;
> +        }
> +
> +        if (migration_info.dirty_pfns.copied) {
> +            uint64_t bitmap_size;
> +            void *buf = NULL;
> +            bool buffer_mmaped = false;
> +
> +            bitmap_size = (BITS_TO_LONGS(migration_info.dirty_pfns.copied) + 1)
> +                           * sizeof(unsigned long);
> +
> +            if (region->mmaps) {
> +                int i;
> +
> +                for (i = 0; i < region->nr_mmaps; i++) {
> +                    if (region->mmaps[i].size >= bitmap_size) {
> +                        buf = region->mmaps[i].mmap;
> +                        buffer_mmaped = true;
> +                        break;
> +                    }
> +                }
> +            }
What if mmapped data area is in front of mmaped dirty bit area?
Maybe you need to record dirty bit region's index as what does for data region.

> +
> +            if (!buffer_mmaped) {
> +                buf = g_malloc0(bitmap_size);
> +
> +                ret = pread(vbasedev->fd, buf, bitmap_size,
> +                            region->fd_offset + sizeof(migration_info) + 1);
> +                if (ret != bitmap_size) {
> +                    error_report("Failed to get migration data %d", ret);
> +                    g_free(buf);
> +                    return;
> +                }
> +            }
> +
> +            cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
> +                                        start_addr + (count * TARGET_PAGE_SIZE),
> +                                        migration_info.dirty_pfns.copied);
> +            count +=  migration_info.dirty_pfns.copied;
> +
> +            if (!buffer_mmaped) {
> +                g_free(buf);
> +            }
> +        }
> +    } while (count < migration_info.dirty_pfns.total); }
> +
> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque) {
> +    VFIODevice *vbasedev = opaque;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
> +
> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +        vfio_pci_save_config(vbasedev, f);
> +    }
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    return qemu_file_get_error(f);
> +}
> +
> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque) {
> +    VFIODevice *vbasedev = opaque;
> +
> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +        vfio_pci_load_config(vbasedev, f);
> +    }
> +
> +    if (qemu_get_be64(f) != VFIO_MIG_FLAG_END_OF_STATE) {
> +        error_report("Wrong end of block while loading device config space");
> +        return -EINVAL;
> +    }
> +
> +    return qemu_file_get_error(f);
> +}
> +
What's the purpose to add a tailing VFIO_MIG_FLAG_END_OF_STATE for each section? 
For compatibility check?
Maybe a version id or magic in struct vfio_device_migration_info is more appropriate?


> +/*
> +----------------------------------------------------------------------
> +*/
> +
> +static bool vfio_is_active_iterate(void *opaque) {
> +    VFIODevice *vbasedev = opaque;
> +
> +    if (vbasedev->vm_running && vbasedev->migration &&
> +        (vbasedev->migration->pending_precopy_only != 0))
> +        return true;
> +
> +    if (!vbasedev->vm_running && vbasedev->migration &&
> +        (vbasedev->migration->pending_postcopy != 0))
> +        return true;
> +
> +    return false;
> +}
> +
> +static int vfio_save_setup(QEMUFile *f, void *opaque) {
> +    VFIODevice *vbasedev = opaque;
> +    int ret;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> +
> +    qemu_mutex_lock_iothread();
> +    ret = vfio_migration_region_init(vbasedev);
> +    qemu_mutex_unlock_iothread();
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev) {
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    struct vfio_device_migration_info migration_info;
> +    int ret;
> +
> +    ret = pread(vbasedev->fd, &migration_info.data,
> +                sizeof(migration_info.data),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data));
> +    if (ret != sizeof(migration_info.data)) {
> +        error_report("Failed to get migration buffer information %d",
> +                     ret);
> +        return -EINVAL;
> +    }
> +
> +    if (migration_info.data.size) {
> +        void *buf = NULL;
> +        bool buffer_mmaped = false;
> +
> +        if (region->mmaps) {
> +            int i;
> +
> +            for (i = 0; i < region->nr_mmaps; i++) {
> +                if (region->mmaps[i].offset == migration_info.data.offset) {
> +                    buf = region->mmaps[i].mmap;
> +                    buffer_mmaped = true;
> +                    break;
> +                }
> +            }
> +        }
> +
> +        if (!buffer_mmaped) {
> +            buf = g_malloc0(migration_info.data.size);
> +            ret = pread(vbasedev->fd, buf, migration_info.data.size,
> +                        region->fd_offset + migration_info.data.offset);
> +            if (ret != migration_info.data.size) {
> +                error_report("Failed to get migration data %d", ret);
> +                return -EINVAL;
> +            }
> +        }
> +
> +        qemu_put_be64(f, migration_info.data.size);
> +        qemu_put_buffer(f, buf, migration_info.data.size);
> +
> +        if (!buffer_mmaped) {
> +            g_free(buf);
> +        }
> +
> +    } else {
> +        qemu_put_be64(f, migration_info.data.size);
> +    }
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return migration_info.data.size;
> +}
> +
> +static int vfio_save_iterate(QEMUFile *f, void *opaque) {
> +    VFIODevice *vbasedev = opaque;
> +    int ret;
> +
> +    ret = vfio_save_buffer(f, vbasedev);
> +    if (ret < 0) {
> +        error_report("vfio_save_buffer failed %s",
> +                     strerror(errno));
> +        return ret;
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return ret;
> +}
> +
> +static void vfio_update_pending(VFIODevice *vbasedev, uint64_t
> +threshold_size) {
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    struct vfio_device_migration_info migration_info;
> +    int ret;
> +
> +    ret = pwrite(vbasedev->fd, &threshold_size, sizeof(threshold_size),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              pending.threshold_size));
> +    if (ret < 0) {
> +        error_report("Failed to set threshold size %d %s",
> +                     ret, strerror(errno));
> +        return;
> +    }
> +
> +    ret = pread(vbasedev->fd, &migration_info.pending,
> +                sizeof(migration_info.pending),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             pending));
> +    if (ret != sizeof(migration_info.pending)) {
> +        error_report("Failed to get pending bytes %d", ret);
> +        return;
> +    }
> +
> +    migration->pending_precopy_only = migration_info.pending.precopy_only;
> +    migration->pending_compatible = migration_info.pending.compatible;
> +    migration->pending_postcopy = migration_info.pending.postcopy_only;
> +
> +    return;
> +}
> +
> +static void vfio_save_pending(QEMUFile *f, void *opaque,
> +                              uint64_t threshold_size,
> +                              uint64_t *res_precopy_only,
> +                              uint64_t *res_compatible,
> +                              uint64_t *res_postcopy_only) {
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    vfio_update_pending(vbasedev, threshold_size);
> +
> +    *res_precopy_only += migration->pending_precopy_only;
> +    *res_compatible += migration->pending_compatible;
> +    *res_postcopy_only += migration->pending_postcopy; }
> +
> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) {
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    MigrationState *ms = migrate_get_current();
> +    int ret;
> +
> +    if (vbasedev->vm_running) {
> +        vbasedev->vm_running = 0;
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev,
> +                                   VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY);
> +    if (ret) {
> +        error_report("Failed to set state STOPNCOPY_ACTIVE");
> +        return ret;
> +    }
> +
> +    ret = vfio_save_device_config_state(f, opaque);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    do {
> +        vfio_update_pending(vbasedev, ms->threshold_size);
> +
> +        if (vfio_is_active_iterate(opaque)) {
> +            ret = vfio_save_buffer(f, vbasedev);
> +            if (ret < 0) {
> +                error_report("Failed to save buffer");
> +                break;
> +            } else if (ret == 0) {
> +                break;
> +            }
> +        }
> +    } while ((migration->pending_compatible +
> + migration->pending_postcopy) > 0);
> +

if migration->pending_postcopy is not 0, vfio_save_complete_precopy() cannot finish? Is that true?
But vfio_save_complete_precopy() does not need to copy post copy data.



> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev,
> +                                   VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED);
> +    if (ret) {
> +        error_report("Failed to set state SAVE_COMPLETED");
> +        return ret;
> +    }
> +    return ret;
> +}
> +
> +static void vfio_save_cleanup(void *opaque) {
> +    VFIODevice *vbasedev = opaque;
> +
> +    vfio_migration_region_exit(vbasedev);
> +}
> +
> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id) {
> +    VFIODevice *vbasedev = opaque;
> +    int ret;
> +    uint64_t data;
> +
Need to use version_id to check source and target version.

> +    data = qemu_get_be64(f);
> +    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
> +        if (data == VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
> +            ret = vfio_load_device_config_state(f, opaque);
> +            if (ret) {
> +                return ret;
> +            }
> +        } else if (data == VFIO_MIG_FLAG_DEV_SETUP_STATE) {
> +            data = qemu_get_be64(f);
> +            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
> +                return 0;
> +            } else {
> +                error_report("SETUP STATE: EOS not found 0x%lx", data);
> +                return -EINVAL;
> +            }
> +        } else if (data != 0) {
> +            VFIOMigration *migration = vbasedev->migration;
> +            VFIORegion *region = &migration->region.buffer;
> +            struct vfio_device_migration_info migration_info;
> +            void *buf = NULL;
> +            bool buffer_mmaped = false;
> +
> +            migration_info.data.size = data;
> +
> +            if (region->mmaps) {
> +                int i;
> +
> +                for (i = 0; i < region->nr_mmaps; i++) {
> +                    if (region->mmaps[i].mmap &&
> +                        (region->mmaps[i].size >= data)) {
> +                        buf = region->mmaps[i].mmap;
> +                        migration_info.data.offset = region->mmaps[i].offset;
> +                        buffer_mmaped = true;
> +                        break;
> +                    }
> +                }
> +            }
> +
> +            if (!buffer_mmaped) {
> +                buf = g_malloc0(migration_info.data.size);
> +                migration_info.data.offset = sizeof(migration_info) + 1;
> +            }
> +
> +            qemu_get_buffer(f, buf, data);
> +
> +            ret = pwrite(vbasedev->fd, &migration_info.data,
> +                         sizeof(migration_info.data),
> +                         region->fd_offset +
> +                         offsetof(struct vfio_device_migration_info, data));
> +            if (ret != sizeof(migration_info.data)) {
> +                error_report("Failed to set migration buffer information %d",
> +                        ret);
> +                return -EINVAL;
> +            }
> +
> +            if (!buffer_mmaped) {
> +                ret = pwrite(vbasedev->fd, buf, migration_info.data.size,
> +                             region->fd_offset + migration_info.data.offset);
> +                g_free(buf);
> +
> +                if (ret != migration_info.data.size) {
> +                    error_report("Failed to set migration buffer %d", ret);
> +                    return -EINVAL;
> +                }
> +            }
> +        }
> +
> +        ret = qemu_file_get_error(f);
> +        if (ret) {
> +            return ret;
> +        }
> +        data = qemu_get_be64(f);
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_load_setup(QEMUFile *f, void *opaque) {
> +    VFIODevice *vbasedev = opaque;
> +    int ret;
> +
> +    ret = vfio_migration_set_state(vbasedev,
> +                                    VFIO_DEVICE_STATE_MIGRATION_RESUME);
> +    if (ret) {
> +        error_report("Failed to set state RESUME");
> +    }
> +
> +    ret = vfio_migration_region_init(vbasedev);
> +    if (ret) {
> +        error_report("Failed to initialise migration region");
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
Why vfio_migration_set_state() is in front of  vfio_migration_region_init()?
So VFIO_DEVICE_STATE_MIGRATION_RESUME is really useful? :)


> +static int vfio_load_cleanup(void *opaque) {
> +    VFIODevice *vbasedev = opaque;
> +    int ret = 0;
> +
> +    ret = vfio_migration_set_state(vbasedev,
> +                                 VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED);
> +    if (ret) {
> +        error_report("Failed to set state RESUME_COMPLETED");
> +    }
> +
> +    vfio_migration_region_exit(vbasedev);
> +    return ret;
> +}
> +
> +static SaveVMHandlers savevm_vfio_handlers = {
> +    .save_setup = vfio_save_setup,
> +    .save_live_iterate = vfio_save_iterate,
> +    .save_live_complete_precopy = vfio_save_complete_precopy,
> +    .save_live_pending = vfio_save_pending,
> +    .save_cleanup = vfio_save_cleanup,
> +    .load_state = vfio_load_state,
> +    .load_setup = vfio_load_setup,
> +    .load_cleanup = vfio_load_cleanup,
> +    .is_active_iterate = vfio_is_active_iterate, };
> +
> +static void vfio_vmstate_change(void *opaque, int running, RunState
> +state) {
> +    VFIODevice *vbasedev = opaque;
> +
> +    if ((vbasedev->vm_running != running) && running) {
> +        int ret;
> +
> +        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING);
> +        if (ret) {
> +            error_report("Failed to set state RUNNING");
> +        }
> +    }
> +
> +    vbasedev->vm_running = running;
> +}
> +
vfio_vmstate_change() is registered at initialization, so for source vm 
vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING) will be called on vm start.
but vfio_migration_region_init() is called in save_setup() when migration starts,
as a result,  "vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING)" should not function here for source vm.
So, again, is this state really used? :)


> +static void vfio_migration_state_notifier(Notifier *notifier, void
> +*data) {
> +    MigrationState *s = data;
> +    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
> +    int ret;
> +
> +    switch (s->state) {
> +    case MIGRATION_STATUS_SETUP:
> +        ret = vfio_migration_set_state(vbasedev,
> +                                       VFIO_DEVICE_STATE_MIGRATION_SETUP);
> +        if (ret) {
> +            error_report("Failed to set state SETUP");
> +        }
> +        return;
> +
> +    case MIGRATION_STATUS_ACTIVE:
> +        if (vbasedev->device_state == VFIO_DEVICE_STATE_MIGRATION_SETUP) {
> +            if (vbasedev->vm_running) {
> +                ret = vfio_migration_set_state(vbasedev,
> +                                          VFIO_DEVICE_STATE_MIGRATION_PRECOPY);
> +                if (ret) {
> +                    error_report("Failed to set state PRECOPY_ACTIVE");
> +                }
> +            } else {
> +                ret = vfio_migration_set_state(vbasedev,
> +                                        VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY);
> +                if (ret) {
> +                    error_report("Failed to set state STOPNCOPY_ACTIVE");
> +                }
> +            }
> +        } else {
> +            ret = vfio_migration_set_state(vbasedev,
> +                                           VFIO_DEVICE_STATE_MIGRATION_RESUME);
> +            if (ret) {
> +                error_report("Failed to set state RESUME");
> +            }
> +        }
> +        return;
> +
> +    case MIGRATION_STATUS_CANCELLING:
> +    case MIGRATION_STATUS_CANCELLED:
> +        ret = vfio_migration_set_state(vbasedev,
> +                                       VFIO_DEVICE_STATE_MIGRATION_CANCELLED);
> +        if (ret) {
> +            error_report("Failed to set state CANCELLED");
> +        }
> +        return;
> +
> +    case MIGRATION_STATUS_FAILED:
> +        ret = vfio_migration_set_state(vbasedev,
> +                                       VFIO_DEVICE_STATE_MIGRATION_FAILED);
> +        if (ret) {
> +            error_report("Failed to set state FAILED");
> +        }
> +        return;
> +    }
> +}
> +
> +static int vfio_migration_init(VFIODevice *vbasedev,
> +                               struct vfio_region_info *info) {
> +    vbasedev->migration = g_new0(VFIOMigration, 1);
> +    vbasedev->migration->region.index = info->index;
> +
> +    register_savevm_live(NULL, "vfio", -1, 1, &savevm_vfio_handlers, vbasedev);
> +    vbasedev->vm_state =
> qemu_add_vm_change_state_handler(vfio_vmstate_change,
> +                                                          vbasedev);
> +
> +    vbasedev->migration_state.notify = vfio_migration_state_notifier;
> +    add_migration_state_change_notifier(&vbasedev->migration_state);
> +
> +    return 0;
> +}
> +
> +
> +/*
> +----------------------------------------------------------------------
> +*/
> +
> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp) {
> +    struct vfio_region_info *info;
> +    int ret;
> +
> +    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
> +                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
> +    if (ret) {
> +        Error *local_err = NULL;
> +
> +        error_setg(&vbasedev->migration_blocker,
> +                   "VFIO device doesn't support migration");
> +        ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            error_free(vbasedev->migration_blocker);
> +            return ret;
> +        }
> +    } else {
> +        return vfio_migration_init(vbasedev, info);
> +    }
> +
> +    return 0;
> +}
> +
> +void vfio_migration_finalize(VFIODevice *vbasedev) {
> +    if (!vbasedev->migration) {
> +        return;
> +    }
> +
> +    if (vbasedev->vm_state) {
> +        qemu_del_vm_change_state_handler(vbasedev->vm_state);
> +        remove_migration_state_change_notifier(&vbasedev->migration_state);
> +    }
> +
> +    if (vbasedev->migration_blocker) {
> +        migrate_del_blocker(vbasedev->migration_blocker);
> +        error_free(vbasedev->migration_blocker);
> +    }
> +
> +    g_free(vbasedev->migration);
> +}
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index a9036929b220..ab8217c9e249 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -30,6 +30,8 @@
>  #include <linux/vfio.h>
>  #endif
> 
> +#include "sysemu/sysemu.h"
> +
>  #define ERR_PREFIX "vfio error: %s: "
>  #define WARN_PREFIX "vfio warning: %s: "
> 
> @@ -57,6 +59,16 @@ typedef struct VFIORegion {
>      uint8_t nr; /* cache the region number for debug */  } VFIORegion;
> 
> +typedef struct VFIOMigration {
> +    struct {
> +        VFIORegion buffer;
> +        uint32_t index;
> +    } region;
> +    uint64_t pending_precopy_only;
> +    uint64_t pending_compatible;
> +    uint64_t pending_postcopy;
> +} VFIOMigration;
> +
>  typedef struct VFIOAddressSpace {
>      AddressSpace *as;
>      QLIST_HEAD(, VFIOContainer) containers; @@ -116,6 +128,12 @@ typedef
> struct VFIODevice {
>      unsigned int num_irqs;
>      unsigned int num_regions;
>      unsigned int flags;
> +    uint32_t device_state;
> +    VMChangeStateEntry *vm_state;
> +    int vm_running;
> +    Notifier migration_state;
> +    VFIOMigration *migration;
> +    Error *migration_blocker;
>  } VFIODevice;
> 
>  struct VFIODeviceOps {
> @@ -193,4 +211,9 @@ int vfio_spapr_create_window(VFIOContainer
> *container,  int vfio_spapr_remove_window(VFIOContainer *container,
>                               hwaddr offset_within_address_space);
> 
> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp); void
> +vfio_migration_finalize(VFIODevice *vbasedev); void
> +vfio_get_dirty_page_list(VFIODevice *vbasedev, uint64_t start_addr,
> +                               uint64_t pfn_count);
> +
>  #endif /* HW_VFIO_VFIO_COMMON_H */
> --
> 2.7.0
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface
  2018-11-20 20:39 ` [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface Kirti Wankhede
  2018-11-21  0:26   ` Tian, Kevin
@ 2018-11-21 17:26   ` Pierre Morel
  2018-11-22 18:54   ` Dr. David Alan Gilbert
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 32+ messages in thread
From: Pierre Morel @ 2018-11-21 17:26 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	changpeng.liu, Ken.Xue

On 20/11/2018 21:39, Kirti Wankhede wrote:
> - Defined MIGRATION region type and sub-type.
> - Defined VFIO device states during migration process.
> - Defined vfio_device_migration_info structure which will be placed at 0th
>    offset of migration region to get/set VFIO device related information.
>    Defined actions and members of structure usage for each action:
>      * To convey VFIO device state to be transitioned to.
>      * To get pending bytes yet to be migrated for VFIO device
>      * To ask driver to write data to migration region and return number of bytes
>        written in the region
>      * In migration resume path, user space app writes to migration region and
>        communicates it to vendor driver.
>      * Get bitmap of dirty pages from vendor driver from given start address
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>   linux-headers/linux/vfio.h | 130 +++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 130 insertions(+)
> 
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 3615a269d378..a6e45cb2cae2 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -301,6 +301,10 @@ struct vfio_region_info_cap_type {
>   #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>   #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
> 
> +/* Migration region type and sub-type */
> +#define VFIO_REGION_TYPE_MIGRATION	        (1 << 30)
> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> +
>   /*
>    * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>    * which allows direct access to non-MSIX registers which happened to be within
> @@ -602,6 +606,132 @@ struct vfio_device_ioeventfd {
> 
>   #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 16)
> 
> +/**
> + * VFIO device states :
> + * VFIO User space application should set the device state to indicate vendor
> + * driver in which state the VFIO device should transitioned.
> + * - VFIO_DEVICE_STATE_NONE:
> + *   State when VFIO device is initialized but not yet running.
> + * - VFIO_DEVICE_STATE_RUNNING:
> + *   Transition VFIO device in running state, that is, user space application or
> + *   VM is active.
> + * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
> + *   Transition VFIO device in migration setup state. This is used to prepare
> + *   VFIO device for migration while application or VM and vCPUs are still in
> + *   running state.
> + * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
> + *   When VFIO user space application or VM is active and vCPUs are running,
> + *   transition VFIO device in pre-copy state.
> + * - VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY:
> + *   When VFIO user space application or VM is stopped and vCPUs are halted,
> + *   transition VFIO device in stop-and-copy state.
> + * - VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED:
> + *   When VFIO user space application has copied data provided by vendor driver.
> + *   This state is used by vendor driver to clean up all software state that was
> + *   setup during MIGRATION_SETUP state.
> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME:
> + *   Transition VFIO device to resume state, that is, start resuming VFIO device
> + *   when user space application or VM is not running and vCPUs are halted.
> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED:
> + *   When user space application completes iterations of providing device state
> + *   data, transition device in resume completed state.
> + * - VFIO_DEVICE_STATE_MIGRATION_FAILED:
> + *   Migration process failed due to some reason, transition device to failed
> + *   state. If migration process fails while saving at source, resume device at
> + *   source. If migration process fails while resuming application or VM at
> + *   destination, stop restoration at destination and resume at source.
> + * - VFIO_DEVICE_STATE_MIGRATION_CANCELLED:
> + *   User space application has cancelled migration process either for some
> + *   known reason or due to user's intervention. Transition device to Cancelled
> + *   state, that is, resume device state as it was during running state at
> + *   source.
> + */
> +
> +enum {
> +    VFIO_DEVICE_STATE_NONE,
> +    VFIO_DEVICE_STATE_RUNNING,
> +    VFIO_DEVICE_STATE_MIGRATION_SETUP,
> +    VFIO_DEVICE_STATE_MIGRATION_PRECOPY,
> +    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY,
> +    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
> +    VFIO_DEVICE_STATE_MIGRATION_RESUME,
> +    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
> +    VFIO_DEVICE_STATE_MIGRATION_FAILED,
> +    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
> +};
> +
> +/**
> + * Structure vfio_device_migration_info is placed at 0th offset of
> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> + * information.
> + *
> + * Action Set state:
> + *      To tell vendor driver the state VFIO device should be transitioned to.
> + *      device_state [input] : User space app sends device state to vendor
> + *           driver on state change, the state to which VFIO device should be
> + *           transitioned to.
> + *
> + * Action Get pending bytes:
> + *      To get pending bytes yet to be migrated from vendor driver
> + *      pending.threshold_size [Input] : threshold of buffer in User space app.
> + *      pending.precopy_only [output] : pending data which must be migrated in
> + *          precopy phase or in stopped state, in other words - before target
> + *          user space application or VM start. In case of migration, this
> + *          indicates pending bytes to be transfered while application or VM or
> + *          vCPUs are active and running.
> + *      pending.compatible [output] : pending data which may be migrated any
> + *          time , either when application or VM is active and vCPUs are active
> + *          or when application or VM is halted and vCPUs are halted.
> + *      pending.postcopy_only [output] : pending data which must be migrated in
> + *           postcopy phase or in stopped state, in other words - after source
> + *           application or VM stopped and vCPUs are halted.
> + *      Sum of pending.precopy_only, pending.compatible and
> + *      pending.postcopy_only is the whole amount of pending data.
> + *
> + * Action Get buffer:
> + *      On this action, vendor driver should write data to migration region and
> + *      return number of bytes written in the region.
> + *      data.offset [output] : offset in the region from where data is written.
> + *      data.size [output] : number of bytes written in migration buffer by
> + *          vendor driver.
> + *
> + * Action Set buffer:
> + *      In migration resume path, user space app writes to migration region and
> + *      communicates it to vendor driver with this action.
> + *      data.offset [Input] : offset in the region from where data is written.
> + *      data.size [Input] : number of bytes written in migration buffer by
> + *          user space app.
> + *
> + * Action Get dirty pages bitmap:
> + *      Get bitmap of dirty pages from vendor driver from given start address.
> + *      dirty_pfns.start_addr [Input] : start address
> + *      dirty_pfns.total [Input] : Total pfn count from start_addr for which
> + *          dirty bitmap is requested
> + *      dirty_pfns.copied [Output] : pfn count for which dirty bitmap is copied
> + *          to migration region.
> + *      Vendor driver should copy the bitmap with bits set only for pages to be
> + *      marked dirty in migration region.
> + */
> +

Hi Kirti,

I am very interested in your work, thanks for it.
I just begin to look at it.

> +struct vfio_device_migration_info {
> +        __u32 device_state;         /* VFIO device state */

May be it is a little soon to care about this but wouldn't the __u32 
here cause a problem, even with packed (or due to packed), for different 
architectures?
Wouldn't it be better to use a __u64 for the state and keep all 
naturally aligned?

Regards,
Pierre


> +        struct {
> +            __u64 precopy_only;
> +            __u64 compatible;
> +            __u64 postcopy_only;
> +            __u64 threshold_size;
> +        } pending;
> +        struct {
> +            __u64 offset;           /* offset */
> +            __u64 size;             /* size */
> +        } data;
> +        struct {
> +            __u64 start_addr;
> +            __u64 total;
> +            __u64 copied;
> +        } dirty_pfns;
> +} __attribute__((packed));



> +
>   /* -------- API for Type1 VFIO IOMMU -------- */
> 
>   /**
> 


-- 
Pierre Morel
Linux/KVM/QEMU in Böblingen - Germany

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices
  2018-11-20 20:39 ` [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices Kirti Wankhede
  2018-11-21  7:39   ` Zhao, Yan Y
@ 2018-11-22  8:22   ` Zhao, Yan Y
  2018-11-22 19:57   ` Dr. David Alan Gilbert
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 32+ messages in thread
From: Zhao, Yan Y @ 2018-11-22  8:22 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, cjia
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies, Liu,
	Changpeng, Ken.Xue



> -----Original Message-----
> From: Qemu-devel [mailto:qemu-devel-
> bounces+yan.y.zhao=intel.com@nongnu.org] On Behalf Of Kirti Wankhede
> Sent: Wednesday, November 21, 2018 4:40 AM
> To: alex.williamson@redhat.com; cjia@nvidia.com
> Cc: Zhengxiao.zx@Alibaba-inc.com; Tian, Kevin <kevin.tian@intel.com>; Liu, Yi L
> <yi.l.liu@intel.com>; eskultet@redhat.com; Yang, Ziye <ziye.yang@intel.com>;
> qemu-devel@nongnu.org; cohuck@redhat.com; shuangtai.tst@alibaba-inc.com;
> dgilbert@redhat.com; Wang, Zhi A <zhi.a.wang@intel.com>;
> mlevitsk@redhat.com; pasic@linux.ibm.com; aik@ozlabs.ru; Kirti Wankhede
> <kwankhede@nvidia.com>; eauger@redhat.com; felipe@nutanix.com;
> jonathan.davies@nutanix.com; Liu, Changpeng <changpeng.liu@intel.com>;
> Ken.Xue@amd.com
> Subject: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices
> 
> - Migration function are implemented for VFIO_DEVICE_TYPE_PCI device.
> - Added SaveVMHandlers and implemented all basic functions required for live
>   migration.
> - Added VM state change handler to know running or stopped state of VM.
> - Added migration state change notifier to get notification on migration state
>   change. This state is translated to VFIO device state and conveyed to vendor
>   driver.
> - VFIO device supportd migration or not is decided based of migration region
>   query. If migration region query is successful then migration is supported
>   else migration is blocked.
> - Structure vfio_device_migration_info is mapped at 0th offset of migration
>   region and should always trapped by VFIO device's driver. Added both type of
>   access support, trapped or mmapped, for data section of the region.
> - To save device state, read data offset and size using structure
>   vfio_device_migration_info.data, accordingly copy data from the region.
> - To restore device state, write data offset and size in the structure and write
>   data in the region.
> - To get dirty page bitmap, write start address and pfn count then read count of
>   pfns copied and accordingly read those from the rest of the region or mmaped
>   part of the region. This copy is iterated till page bitmap for all requested
>   pfns are copied.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/migration.c           | 729
> ++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |  23 ++
>  3 files changed, 753 insertions(+), 1 deletion(-)  create mode 100644
> hw/vfio/migration.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs index
> a2e7a0a7cf02..2cf2ba1440f2 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -1,5 +1,5 @@
>  ifeq ($(CONFIG_LINUX), y)
> -obj-$(CONFIG_SOFTMMU) += common.o
> +obj-$(CONFIG_SOFTMMU) += common.o migration.o
>  obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
>  obj-$(CONFIG_VFIO_CCW) += ccw.o
>  obj-$(CONFIG_SOFTMMU) += platform.o
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c new file mode 100644
> index 000000000000..717fb63e4f43
> --- /dev/null
> +++ b/hw/vfio/migration.c
> @@ -0,0 +1,729 @@
> +/*
> + * Migration support for VFIO devices
> + *
> + * Copyright NVIDIA, Inc. 2018
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include <linux/vfio.h>
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "cpu.h"
> +#include "migration/migration.h"
> +#include "migration/qemu-file.h"
> +#include "migration/register.h"
> +#include "migration/blocker.h"
> +#include "migration/misc.h"
> +#include "qapi/error.h"
> +#include "exec/ramlist.h"
> +#include "exec/ram_addr.h"
> +#include "pci.h"
> +
> +/*
> + * Flags used as delimiter:
> + * 0xffffffff => MSB 32-bit all 1s
> + * 0xef10     => emulated (virtual) function IO
> + * 0x0000     => 16-bits reserved for flags
> + */
> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> +
> +static void vfio_migration_region_exit(VFIODevice *vbasedev) {
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    if (!migration) {
> +        return;
> +    }
> +
> +    if (migration->region.buffer.size) {
> +        vfio_region_exit(&migration->region.buffer);
> +        vfio_region_finalize(&migration->region.buffer);
> +    }
> +}
> +
> +static int vfio_migration_region_init(VFIODevice *vbasedev) {
> +    VFIOMigration *migration = vbasedev->migration;
> +    Object *obj = NULL;
> +    int ret = -EINVAL;
> +
> +    if (!migration) {
> +        return ret;
> +    }
> +
> +    /* Migration support added for PCI device only */
> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +        obj = vfio_pci_get_object(vbasedev);
> +    }
> +
> +    if (!obj) {
> +        return ret;
> +    }
> +
> +    ret = vfio_region_setup(obj, vbasedev, &migration->region.buffer,
> +                            migration->region.index, "migration");
> +    if (ret) {
> +        error_report("Failed to setup VFIO migration region %d: %s",
> +                      migration->region.index, strerror(-ret));
> +        goto err;
> +    }
> +
> +    if (!migration->region.buffer.size) {
> +        ret = -EINVAL;
> +        error_report("Invalid region size of VFIO migration region %d: %s",
> +                     migration->region.index, strerror(-ret));
> +        goto err;
> +    }
> +
> +    if (migration->region.buffer.mmaps) {
> +        ret = vfio_region_mmap(&migration->region.buffer);
> +        if (ret) {
> +            error_report("Failed to mmap VFIO migration region %d: %s",
> +                         migration->region.index, strerror(-ret));
> +            goto err;
> +        }
> +    }
> +
> +    return 0;
> +
> +err:
> +    vfio_migration_region_exit(vbasedev);
> +    return ret;
> +}
> +
> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t
> +state) {
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    int ret = 0;
> +
> +    if (vbasedev->device_state == state) {
> +        return ret;
> +    }
> +
> +    ret = pwrite(vbasedev->fd, &state, sizeof(state),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              device_state));
> +    if (ret < 0) {
> +        error_report("Failed to set migration state %d %s",
> +                     ret, strerror(errno));
> +        return ret;
> +    }
> +
> +    vbasedev->device_state = state;
> +    return ret;
> +}
> +
> +void vfio_get_dirty_page_list(VFIODevice *vbasedev,
> +                              uint64_t start_addr,
> +                              uint64_t pfn_count) {
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    struct vfio_device_migration_info migration_info;
> +    uint64_t count = 0;
> +    int ret;
> +
> +    migration_info.dirty_pfns.start_addr = start_addr;
> +    migration_info.dirty_pfns.total = pfn_count;
> +
> +    ret = pwrite(vbasedev->fd, &migration_info.dirty_pfns,
> +                 sizeof(migration_info.dirty_pfns),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              dirty_pfns));
> +    if (ret < 0) {
> +        error_report("Failed to set dirty pages start address %d %s",
> +                ret, strerror(errno));
> +        return;
> +    }
> +
> +    do {
> +        /* Read dirty_pfns.copied */
> +        ret = pread(vbasedev->fd, &migration_info.dirty_pfns,
> +                sizeof(migration_info.dirty_pfns),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             dirty_pfns));
> +        if (ret < 0) {
> +            error_report("Failed to get dirty pages bitmap count %d %s",
> +                    ret, strerror(errno));
> +            return;
> +        }
> +
> +        if (migration_info.dirty_pfns.copied) {
> +            uint64_t bitmap_size;
> +            void *buf = NULL;
> +            bool buffer_mmaped = false;
> +
> +            bitmap_size = (BITS_TO_LONGS(migration_info.dirty_pfns.copied) + 1)
> +                           * sizeof(unsigned long);
> +
> +            if (region->mmaps) {
> +                int i;
> +
> +                for (i = 0; i < region->nr_mmaps; i++) {
> +                    if (region->mmaps[i].size >= bitmap_size) {
> +                        buf = region->mmaps[i].mmap;
> +                        buffer_mmaped = true;
> +                        break;
> +                    }
> +                }
> +            }
> +
> +            if (!buffer_mmaped) {
> +                buf = g_malloc0(bitmap_size);
> +
> +                ret = pread(vbasedev->fd, buf, bitmap_size,
> +                            region->fd_offset + sizeof(migration_info) + 1);
> +                if (ret != bitmap_size) {
> +                    error_report("Failed to get migration data %d", ret);
> +                    g_free(buf);
> +                    return;
> +                }
> +            }
> +
> +            cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
> +                                        start_addr + (count * TARGET_PAGE_SIZE),
> +                                        migration_info.dirty_pfns.copied);
> +            count +=  migration_info.dirty_pfns.copied;
> +
> +            if (!buffer_mmaped) {
> +                g_free(buf);
> +            }
> +        }
> +    } while (count < migration_info.dirty_pfns.total); }
> +
> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque) {
> +    VFIODevice *vbasedev = opaque;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
> +
> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +        vfio_pci_save_config(vbasedev, f);
> +    }
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    return qemu_file_get_error(f);
> +}
> +
> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque) {
> +    VFIODevice *vbasedev = opaque;
> +
> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +        vfio_pci_load_config(vbasedev, f);
> +    }
> +
> +    if (qemu_get_be64(f) != VFIO_MIG_FLAG_END_OF_STATE) {
> +        error_report("Wrong end of block while loading device config space");
> +        return -EINVAL;
> +    }
> +
> +    return qemu_file_get_error(f);
> +}
> +
> +/*
> +----------------------------------------------------------------------
> +*/
> +
> +static bool vfio_is_active_iterate(void *opaque) {
> +    VFIODevice *vbasedev = opaque;
> +
> +    if (vbasedev->vm_running && vbasedev->migration &&
> +        (vbasedev->migration->pending_precopy_only != 0))
> +        return true;
> +
> +    if (!vbasedev->vm_running && vbasedev->migration &&
> +        (vbasedev->migration->pending_postcopy != 0))
> +        return true;
> +
> +    return false;
> +}
> +
> +static int vfio_save_setup(QEMUFile *f, void *opaque) {
> +    VFIODevice *vbasedev = opaque;
> +    int ret;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> +
> +    qemu_mutex_lock_iothread();
> +    ret = vfio_migration_region_init(vbasedev);
> +    qemu_mutex_unlock_iothread();
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev) {
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    struct vfio_device_migration_info migration_info;
> +    int ret;
> +
> +    ret = pread(vbasedev->fd, &migration_info.data,
> +                sizeof(migration_info.data),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data));
> +    if (ret != sizeof(migration_info.data)) {
> +        error_report("Failed to get migration buffer information %d",
> +                     ret);
> +        return -EINVAL;
> +    }
> +
> +    if (migration_info.data.size) {
> +        void *buf = NULL;
> +        bool buffer_mmaped = false;
> +
> +        if (region->mmaps) {
> +            int i;
> +
> +            for (i = 0; i < region->nr_mmaps; i++) {
> +                if (region->mmaps[i].offset == migration_info.data.offset) {
> +                    buf = region->mmaps[i].mmap;
> +                    buffer_mmaped = true;
> +                    break;
> +                }
> +            }
> +        }
> +
> +        if (!buffer_mmaped) {
> +            buf = g_malloc0(migration_info.data.size);
> +            ret = pread(vbasedev->fd, buf, migration_info.data.size,
> +                        region->fd_offset + migration_info.data.offset);
> +            if (ret != migration_info.data.size) {
> +                error_report("Failed to get migration data %d", ret);
> +                return -EINVAL;
> +            }
> +        }
> +
> +        qemu_put_be64(f, migration_info.data.size);
> +        qemu_put_buffer(f, buf, migration_info.data.size);
> +
> +        if (!buffer_mmaped) {
> +            g_free(buf);
> +        }
> +
> +    } else {
> +        qemu_put_be64(f, migration_info.data.size);
> +    }
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return migration_info.data.size;
> +}
> +
> +static int vfio_save_iterate(QEMUFile *f, void *opaque) {
> +    VFIODevice *vbasedev = opaque;
> +    int ret;
> +
> +    ret = vfio_save_buffer(f, vbasedev);
> +    if (ret < 0) {
> +        error_report("vfio_save_buffer failed %s",
> +                     strerror(errno));
> +        return ret;
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return ret;
> +}
> +
> +static void vfio_update_pending(VFIODevice *vbasedev, uint64_t
> +threshold_size) {
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    struct vfio_device_migration_info migration_info;
> +    int ret;
> +
> +    ret = pwrite(vbasedev->fd, &threshold_size, sizeof(threshold_size),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              pending.threshold_size));
> +    if (ret < 0) {
> +        error_report("Failed to set threshold size %d %s",
> +                     ret, strerror(errno));
> +        return;
> +    }
> +
> +    ret = pread(vbasedev->fd, &migration_info.pending,
> +                sizeof(migration_info.pending),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             pending));
> +    if (ret != sizeof(migration_info.pending)) {
> +        error_report("Failed to get pending bytes %d", ret);
> +        return;
> +    }
> +
> +    migration->pending_precopy_only = migration_info.pending.precopy_only;
> +    migration->pending_compatible = migration_info.pending.compatible;
> +    migration->pending_postcopy = migration_info.pending.postcopy_only;
> +
> +    return;
> +}
> +
> +static void vfio_save_pending(QEMUFile *f, void *opaque,
> +                              uint64_t threshold_size,
> +                              uint64_t *res_precopy_only,
> +                              uint64_t *res_compatible,
> +                              uint64_t *res_postcopy_only) {
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    vfio_update_pending(vbasedev, threshold_size);
> +
> +    *res_precopy_only += migration->pending_precopy_only;
> +    *res_compatible += migration->pending_compatible;
> +    *res_postcopy_only += migration->pending_postcopy; }
> +
> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) {
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    MigrationState *ms = migrate_get_current();
> +    int ret;
> +
> +    if (vbasedev->vm_running) {
> +        vbasedev->vm_running = 0;
> +    }
> +
vbasedev->vm_running should be already set to 0 before calling save_complete_precopy(), 
in vfio_vmstate_change(). Here, no need to reset vm_running state or only assertion is needed.


> +    ret = vfio_migration_set_state(vbasedev,
> +                                   VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY);
> +    if (ret) {
> +        error_report("Failed to set state STOPNCOPY_ACTIVE");
> +        return ret;
> +    }
> +
> +    ret = vfio_save_device_config_state(f, opaque);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    do {
> +        vfio_update_pending(vbasedev, ms->threshold_size);
> +
> +        if (vfio_is_active_iterate(opaque)) {
> +            ret = vfio_save_buffer(f, vbasedev);
> +            if (ret < 0) {
> +                error_report("Failed to save buffer");
> +                break;
> +            } else if (ret == 0) {
> +                break;
> +            }
> +        }
> +    } while ((migration->pending_compatible +
> + migration->pending_postcopy) > 0);
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev,
> +                                   VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED);
> +    if (ret) {
> +        error_report("Failed to set state SAVE_COMPLETED");
> +        return ret;
> +    }
> +    return ret;
> +}
> +
> +static void vfio_save_cleanup(void *opaque) {
> +    VFIODevice *vbasedev = opaque;
> +
> +    vfio_migration_region_exit(vbasedev);
> +}
> +
> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id) {
> +    VFIODevice *vbasedev = opaque;
> +    int ret;
> +    uint64_t data;
> +
> +    data = qemu_get_be64(f);
> +    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
> +        if (data == VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
> +            ret = vfio_load_device_config_state(f, opaque);
> +            if (ret) {
> +                return ret;
> +            }
> +        } else if (data == VFIO_MIG_FLAG_DEV_SETUP_STATE) {
> +            data = qemu_get_be64(f);
> +            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
> +                return 0;
> +            } else {
> +                error_report("SETUP STATE: EOS not found 0x%lx", data);
> +                return -EINVAL;
> +            }
> +        } else if (data != 0) {
> +            VFIOMigration *migration = vbasedev->migration;
> +            VFIORegion *region = &migration->region.buffer;
> +            struct vfio_device_migration_info migration_info;
> +            void *buf = NULL;
> +            bool buffer_mmaped = false;
> +
> +            migration_info.data.size = data;
> +
> +            if (region->mmaps) {
> +                int i;
> +
> +                for (i = 0; i < region->nr_mmaps; i++) {
> +                    if (region->mmaps[i].mmap &&
> +                        (region->mmaps[i].size >= data)) {
> +                        buf = region->mmaps[i].mmap;
> +                        migration_info.data.offset = region->mmaps[i].offset;
> +                        buffer_mmaped = true;
> +                        break;
> +                    }
> +                }
> +            }
> +
> +            if (!buffer_mmaped) {
> +                buf = g_malloc0(migration_info.data.size);
> +                migration_info.data.offset = sizeof(migration_info) + 1;
> +            }
> +
> +            qemu_get_buffer(f, buf, data);
> +
> +            ret = pwrite(vbasedev->fd, &migration_info.data,
> +                         sizeof(migration_info.data),
> +                         region->fd_offset +
> +                         offsetof(struct vfio_device_migration_info, data));
> +            if (ret != sizeof(migration_info.data)) {
> +                error_report("Failed to set migration buffer information %d",
> +                        ret);
> +                return -EINVAL;
> +            }
> +
> +            if (!buffer_mmaped) {
> +                ret = pwrite(vbasedev->fd, buf, migration_info.data.size,
> +                             region->fd_offset + migration_info.data.offset);
> +                g_free(buf);
> +
> +                if (ret != migration_info.data.size) {
> +                    error_report("Failed to set migration buffer %d", ret);
> +                    return -EINVAL;
> +                }
> +            }
> +        }
> +
> +        ret = qemu_file_get_error(f);
> +        if (ret) {
> +            return ret;
> +        }
> +        data = qemu_get_be64(f);
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_load_setup(QEMUFile *f, void *opaque) {
> +    VFIODevice *vbasedev = opaque;
> +    int ret;
> +
> +    ret = vfio_migration_set_state(vbasedev,
> +                                    VFIO_DEVICE_STATE_MIGRATION_RESUME);
> +    if (ret) {
> +        error_report("Failed to set state RESUME");
> +    }
> +
> +    ret = vfio_migration_region_init(vbasedev);
> +    if (ret) {
> +        error_report("Failed to initialise migration region");
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_load_cleanup(void *opaque) {
> +    VFIODevice *vbasedev = opaque;
> +    int ret = 0;
> +
> +    ret = vfio_migration_set_state(vbasedev,
> +                                 VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED);
> +    if (ret) {
> +        error_report("Failed to set state RESUME_COMPLETED");
> +    }
> +
> +    vfio_migration_region_exit(vbasedev);
> +    return ret;
> +}
> +
> +static SaveVMHandlers savevm_vfio_handlers = {
> +    .save_setup = vfio_save_setup,
> +    .save_live_iterate = vfio_save_iterate,
> +    .save_live_complete_precopy = vfio_save_complete_precopy,
> +    .save_live_pending = vfio_save_pending,
> +    .save_cleanup = vfio_save_cleanup,
> +    .load_state = vfio_load_state,
> +    .load_setup = vfio_load_setup,
> +    .load_cleanup = vfio_load_cleanup,
> +    .is_active_iterate = vfio_is_active_iterate, };
> +
> +static void vfio_vmstate_change(void *opaque, int running, RunState
> +state) {
> +    VFIODevice *vbasedev = opaque;
> +
> +    if ((vbasedev->vm_running != running) && running) {
> +        int ret;
> +
> +        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING);
> +        if (ret) {
> +            error_report("Failed to set state RUNNING");
> +        }
> +    }
> +
> +    vbasedev->vm_running = running;
> +}
> +
> +static void vfio_migration_state_notifier(Notifier *notifier, void
> +*data) {
> +    MigrationState *s = data;
> +    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
> +    int ret;
> +
> +    switch (s->state) {
> +    case MIGRATION_STATUS_SETUP:
> +        ret = vfio_migration_set_state(vbasedev,
> +                                       VFIO_DEVICE_STATE_MIGRATION_SETUP);
> +        if (ret) {
> +            error_report("Failed to set state SETUP");
> +        }
> +        return;
> +
> +    case MIGRATION_STATUS_ACTIVE:
> +        if (vbasedev->device_state == VFIO_DEVICE_STATE_MIGRATION_SETUP) {
> +            if (vbasedev->vm_running) {
> +                ret = vfio_migration_set_state(vbasedev,
> +                                          VFIO_DEVICE_STATE_MIGRATION_PRECOPY);
> +                if (ret) {
> +                    error_report("Failed to set state PRECOPY_ACTIVE");
> +                }
> +            } else {
> +                ret = vfio_migration_set_state(vbasedev,
> +                                        VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY);
> +                if (ret) {
> +                    error_report("Failed to set state STOPNCOPY_ACTIVE");
> +                }
> +            }
> +        } else {
> +            ret = vfio_migration_set_state(vbasedev,
> +                                           VFIO_DEVICE_STATE_MIGRATION_RESUME);
> +            if (ret) {
> +                error_report("Failed to set state RESUME");
> +            }
> +        }
> +        return;
> +
> +    case MIGRATION_STATUS_CANCELLING:
> +    case MIGRATION_STATUS_CANCELLED:
> +        ret = vfio_migration_set_state(vbasedev,
> +                                       VFIO_DEVICE_STATE_MIGRATION_CANCELLED);
> +        if (ret) {
> +            error_report("Failed to set state CANCELLED");
> +        }
> +        return;
> +
> +    case MIGRATION_STATUS_FAILED:
> +        ret = vfio_migration_set_state(vbasedev,
> +                                       VFIO_DEVICE_STATE_MIGRATION_FAILED);
> +        if (ret) {
> +            error_report("Failed to set state FAILED");
> +        }
> +        return;
> +    }
> +}
> +
> +static int vfio_migration_init(VFIODevice *vbasedev,
> +                               struct vfio_region_info *info) {
> +    vbasedev->migration = g_new0(VFIOMigration, 1);
> +    vbasedev->migration->region.index = info->index;
> +
> +    register_savevm_live(NULL, "vfio", -1, 1, &savevm_vfio_handlers, vbasedev);
> +    vbasedev->vm_state =
> qemu_add_vm_change_state_handler(vfio_vmstate_change,
> +                                                          vbasedev);
> +
> +    vbasedev->migration_state.notify = vfio_migration_state_notifier;
> +    add_migration_state_change_notifier(&vbasedev->migration_state);
> +
> +    return 0;
> +}
> +
> +
> +/*
> +----------------------------------------------------------------------
> +*/
> +
> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp) {
> +    struct vfio_region_info *info;
> +    int ret;
> +
> +    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
> +                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
> +    if (ret) {
> +        Error *local_err = NULL;
> +
> +        error_setg(&vbasedev->migration_blocker,
> +                   "VFIO device doesn't support migration");
> +        ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            error_free(vbasedev->migration_blocker);
> +            return ret;
> +        }
> +    } else {
> +        return vfio_migration_init(vbasedev, info);
> +    }
> +
> +    return 0;
> +}
> +
> +void vfio_migration_finalize(VFIODevice *vbasedev) {
> +    if (!vbasedev->migration) {
> +        return;
> +    }
> +
> +    if (vbasedev->vm_state) {
> +        qemu_del_vm_change_state_handler(vbasedev->vm_state);
> +        remove_migration_state_change_notifier(&vbasedev->migration_state);
> +    }
> +
> +    if (vbasedev->migration_blocker) {
> +        migrate_del_blocker(vbasedev->migration_blocker);
> +        error_free(vbasedev->migration_blocker);
> +    }
> +
> +    g_free(vbasedev->migration);
> +}
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index a9036929b220..ab8217c9e249 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -30,6 +30,8 @@
>  #include <linux/vfio.h>
>  #endif
> 
> +#include "sysemu/sysemu.h"
> +
>  #define ERR_PREFIX "vfio error: %s: "
>  #define WARN_PREFIX "vfio warning: %s: "
> 
> @@ -57,6 +59,16 @@ typedef struct VFIORegion {
>      uint8_t nr; /* cache the region number for debug */  } VFIORegion;
> 
> +typedef struct VFIOMigration {
> +    struct {
> +        VFIORegion buffer;
> +        uint32_t index;
> +    } region;
> +    uint64_t pending_precopy_only;
> +    uint64_t pending_compatible;
> +    uint64_t pending_postcopy;
> +} VFIOMigration;
> +
>  typedef struct VFIOAddressSpace {
>      AddressSpace *as;
>      QLIST_HEAD(, VFIOContainer) containers; @@ -116,6 +128,12 @@ typedef
> struct VFIODevice {
>      unsigned int num_irqs;
>      unsigned int num_regions;
>      unsigned int flags;
> +    uint32_t device_state;
> +    VMChangeStateEntry *vm_state;
> +    int vm_running;
> +    Notifier migration_state;
> +    VFIOMigration *migration;
> +    Error *migration_blocker;
>  } VFIODevice;
Maybe we can combine the  "device_state" and "vm_running" into one state in qemu, meaning that
don't record vm_running state in struct VFIODevice, maintain device_state with multiple migration states only in qemu's VFIODevice.
only simply and clear states like DeviceStart/Stop/Migrate in/Migrate out will be sent to underlying vendor driver.



>  struct VFIODeviceOps {
> @@ -193,4 +211,9 @@ int vfio_spapr_create_window(VFIOContainer
> *container,  int vfio_spapr_remove_window(VFIOContainer *container,
>                               hwaddr offset_within_address_space);
> 
> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp); void
> +vfio_migration_finalize(VFIODevice *vbasedev); void
> +vfio_get_dirty_page_list(VFIODevice *vbasedev, uint64_t start_addr,
> +                               uint64_t pfn_count);
> +
>  #endif /* HW_VFIO_VFIO_COMMON_H */
> --
> 2.7.0
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface
  2018-11-20 20:39 ` [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface Kirti Wankhede
  2018-11-21  0:26   ` Tian, Kevin
  2018-11-21 17:26   ` Pierre Morel
@ 2018-11-22 18:54   ` Dr. David Alan Gilbert
  2018-11-22 20:43     ` Kirti Wankhede
  2018-11-23  5:47   ` Zhao Yan
  2018-11-27 19:52   ` Alex Williamson
  4 siblings, 1 reply; 32+ messages in thread
From: Dr. David Alan Gilbert @ 2018-11-22 18:54 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, cjia, kevin.tian, ziye.yang, changpeng.liu,
	yi.l.liu, mlevitsk, eskultet, cohuck, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	zhi.a.wang, qemu-devel

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> - Defined MIGRATION region type and sub-type.
> - Defined VFIO device states during migration process.
> - Defined vfio_device_migration_info structure which will be placed at 0th
>   offset of migration region to get/set VFIO device related information.
>   Defined actions and members of structure usage for each action:
>     * To convey VFIO device state to be transitioned to.
>     * To get pending bytes yet to be migrated for VFIO device
>     * To ask driver to write data to migration region and return number of bytes
>       written in the region
>     * In migration resume path, user space app writes to migration region and
>       communicates it to vendor driver.
>     * Get bitmap of dirty pages from vendor driver from given start address
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>

<snip>

> + * Action Get buffer:
> + *      On this action, vendor driver should write data to migration region and
> + *      return number of bytes written in the region.
> + *      data.offset [output] : offset in the region from where data is written.
> + *      data.size [output] : number of bytes written in migration buffer by
> + *          vendor driver.

<snip>

> + */
> +
> +struct vfio_device_migration_info {
> +        __u32 device_state;         /* VFIO device state */
> +        struct {
> +            __u64 precopy_only;
> +            __u64 compatible;
> +            __u64 postcopy_only;
> +            __u64 threshold_size;
> +        } pending;
> +        struct {
> +            __u64 offset;           /* offset */
> +            __u64 size;             /* size */
> +        } data;

I'm curious how the offsets/size work; how does the 
kernel driver know the maximum size of state it's allowed to write?
Why would it pick a none-0 offset into the output region?

Without having dug further these feel like i/o rather than just output;
i.e. the calling process says 'put it at that offset and you've got size
bytes' and the kernel replies with 'I did put it at offset and I wrote
only this size bytes'

Dave

> +        struct {
> +            __u64 start_addr;
> +            __u64 total;
> +            __u64 copied;
> +        } dirty_pfns;
> +} __attribute__((packed));
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
>  
>  /**
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices
  2018-11-20 20:39 ` [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices Kirti Wankhede
  2018-11-21  7:39   ` Zhao, Yan Y
  2018-11-22  8:22   ` Zhao, Yan Y
@ 2018-11-22 19:57   ` Dr. David Alan Gilbert
  2018-11-29  8:04   ` Zhao Yan
  2018-12-17 11:19   ` Gonglei (Arei)
  4 siblings, 0 replies; 32+ messages in thread
From: Dr. David Alan Gilbert @ 2018-11-22 19:57 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, cjia, kevin.tian, ziye.yang, changpeng.liu,
	yi.l.liu, mlevitsk, eskultet, cohuck, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	zhi.a.wang, qemu-devel

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> - Migration function are implemented for VFIO_DEVICE_TYPE_PCI device.
> - Added SaveVMHandlers and implemented all basic functions required for live
>   migration.
> - Added VM state change handler to know running or stopped state of VM.
> - Added migration state change notifier to get notification on migration state
>   change. This state is translated to VFIO device state and conveyed to vendor
>   driver.
> - VFIO device supportd migration or not is decided based of migration region
>   query. If migration region query is successful then migration is supported
>   else migration is blocked.
> - Structure vfio_device_migration_info is mapped at 0th offset of migration
>   region and should always trapped by VFIO device's driver. Added both type of
>   access support, trapped or mmapped, for data section of the region.
> - To save device state, read data offset and size using structure
>   vfio_device_migration_info.data, accordingly copy data from the region.
> - To restore device state, write data offset and size in the structure and write
>   data in the region.
> - To get dirty page bitmap, write start address and pfn count then read count of
>   pfns copied and accordingly read those from the rest of the region or mmaped
>   part of the region. This copy is iterated till page bitmap for all requested
>   pfns are copied.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>

This does need something for device/data versioning.
Please consider adding some 'trace' calls to make it easier to
debug in situ.
Splitting the patch a bit more would also help; see some more comments
below.


> ---
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/migration.c           | 729 ++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |  23 ++
>  3 files changed, 753 insertions(+), 1 deletion(-)
>  create mode 100644 hw/vfio/migration.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index a2e7a0a7cf02..2cf2ba1440f2 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -1,5 +1,5 @@
>  ifeq ($(CONFIG_LINUX), y)
> -obj-$(CONFIG_SOFTMMU) += common.o
> +obj-$(CONFIG_SOFTMMU) += common.o migration.o
>  obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
>  obj-$(CONFIG_VFIO_CCW) += ccw.o
>  obj-$(CONFIG_SOFTMMU) += platform.o
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> new file mode 100644
> index 000000000000..717fb63e4f43
> --- /dev/null
> +++ b/hw/vfio/migration.c
> @@ -0,0 +1,729 @@
> +/*
> + * Migration support for VFIO devices
> + *
> + * Copyright NVIDIA, Inc. 2018
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include <linux/vfio.h>
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "cpu.h"
> +#include "migration/migration.h"
> +#include "migration/qemu-file.h"
> +#include "migration/register.h"
> +#include "migration/blocker.h"
> +#include "migration/misc.h"
> +#include "qapi/error.h"
> +#include "exec/ramlist.h"
> +#include "exec/ram_addr.h"
> +#include "pci.h"
> +
> +/*
> + * Flags used as delimiter:
> + * 0xffffffff => MSB 32-bit all 1s
> + * 0xef10     => emulated (virtual) function IO
> + * 0x0000     => 16-bits reserved for flags
> + */
> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> +
> +static void vfio_migration_region_exit(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    if (!migration) {
> +        return;
> +    }
> +
> +    if (migration->region.buffer.size) {
> +        vfio_region_exit(&migration->region.buffer);
> +        vfio_region_finalize(&migration->region.buffer);
> +    }
> +}
> +
> +static int vfio_migration_region_init(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    Object *obj = NULL;
> +    int ret = -EINVAL;
> +
> +    if (!migration) {
> +        return ret;

Here and ...
> +    }
> +
> +    /* Migration support added for PCI device only */
> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +        obj = vfio_pci_get_object(vbasedev);
> +    }
> +
> +    if (!obj) {
> +        return ret;

Here, you've failed the migration but not printed an error to say why;
please print something so we cna tell what happened.

> +    }
> +
> +    ret = vfio_region_setup(obj, vbasedev, &migration->region.buffer,
> +                            migration->region.index, "migration");
> +    if (ret) {
> +        error_report("Failed to setup VFIO migration region %d: %s",
> +                      migration->region.index, strerror(-ret));
> +        goto err;
> +    }
> +
> +    if (!migration->region.buffer.size) {
> +        ret = -EINVAL;
> +        error_report("Invalid region size of VFIO migration region %d: %s",
> +                     migration->region.index, strerror(-ret));
> +        goto err;
> +    }
> +
> +    if (migration->region.buffer.mmaps) {
> +        ret = vfio_region_mmap(&migration->region.buffer);
> +        if (ret) {
> +            error_report("Failed to mmap VFIO migration region %d: %s",
> +                         migration->region.index, strerror(-ret));
> +            goto err;
> +        }
> +    }
> +
> +    return 0;
> +
> +err:
> +    vfio_migration_region_exit(vbasedev);
> +    return ret;
> +}
> +
> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    int ret = 0;
> +
> +    if (vbasedev->device_state == state) {
> +        return ret;
> +    }
> +
> +    ret = pwrite(vbasedev->fd, &state, sizeof(state),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              device_state));
> +    if (ret < 0) {
> +        error_report("Failed to set migration state %d %s",
> +                     ret, strerror(errno));
> +        return ret;
> +    }
> +
> +    vbasedev->device_state = state;
> +    return ret;
> +}
> +
> +void vfio_get_dirty_page_list(VFIODevice *vbasedev,
> +                              uint64_t start_addr,
> +                              uint64_t pfn_count)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    struct vfio_device_migration_info migration_info;
> +    uint64_t count = 0;
> +    int ret;
> +
> +    migration_info.dirty_pfns.start_addr = start_addr;
> +    migration_info.dirty_pfns.total = pfn_count;
> +
> +    ret = pwrite(vbasedev->fd, &migration_info.dirty_pfns,
> +                 sizeof(migration_info.dirty_pfns),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              dirty_pfns));

Do you also have to check the return value is equal to the size?

> +    if (ret < 0) {
> +        error_report("Failed to set dirty pages start address %d %s",
> +                ret, strerror(errno));
> +        return;
> +    }
> +
> +    do {
> +        /* Read dirty_pfns.copied */
> +        ret = pread(vbasedev->fd, &migration_info.dirty_pfns,
> +                sizeof(migration_info.dirty_pfns),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             dirty_pfns));
> +        if (ret < 0) {
> +            error_report("Failed to get dirty pages bitmap count %d %s",
> +                    ret, strerror(errno));
> +            return;
> +        }
> +
> +        if (migration_info.dirty_pfns.copied) {
> +            uint64_t bitmap_size;
> +            void *buf = NULL;
> +            bool buffer_mmaped = false;
> +
> +            bitmap_size = (BITS_TO_LONGS(migration_info.dirty_pfns.copied) + 1)
> +                           * sizeof(unsigned long);
> +
> +            if (region->mmaps) {
> +                int i;
> +
> +                for (i = 0; i < region->nr_mmaps; i++) {
> +                    if (region->mmaps[i].size >= bitmap_size) {
> +                        buf = region->mmaps[i].mmap;
> +                        buffer_mmaped = true;
> +                        break;
> +                    }
> +                }
> +            }
> +
> +            if (!buffer_mmaped) {
> +                buf = g_malloc0(bitmap_size);
> +
> +                ret = pread(vbasedev->fd, buf, bitmap_size,
> +                            region->fd_offset + sizeof(migration_info) + 1);

Why the +1 ?

> +                if (ret != bitmap_size) {
> +                    error_report("Failed to get migration data %d", ret);
> +                    g_free(buf);
> +                    return;
> +                }
> +            }
> +
> +            cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
> +                                        start_addr + (count * TARGET_PAGE_SIZE),
> +                                        migration_info.dirty_pfns.copied);
> +            count +=  migration_info.dirty_pfns.copied;
> +
> +            if (!buffer_mmaped) {
> +                g_free(buf);
> +            }
> +        }
> +    } while (count < migration_info.dirty_pfns.total);
> +}
> +
> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
> +
> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +        vfio_pci_save_config(vbasedev, f);
> +    }
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    return qemu_file_get_error(f);
> +}
> +
> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +        vfio_pci_load_config(vbasedev, f);
> +    }
> +
> +    if (qemu_get_be64(f) != VFIO_MIG_FLAG_END_OF_STATE) {
> +        error_report("Wrong end of block while loading device config space");
> +        return -EINVAL;
> +    }
> +
> +    return qemu_file_get_error(f);
> +}
> +
> +/* ---------------------------------------------------------------------- */
> +
> +static bool vfio_is_active_iterate(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    if (vbasedev->vm_running && vbasedev->migration &&
> +        (vbasedev->migration->pending_precopy_only != 0))
> +        return true;
> +
> +    if (!vbasedev->vm_running && vbasedev->migration &&
> +        (vbasedev->migration->pending_postcopy != 0))
> +        return true;
> +
> +    return false;
> +}
> +
> +static int vfio_save_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    int ret;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> +
> +    qemu_mutex_lock_iothread();
> +    ret = vfio_migration_region_init(vbasedev);
> +    qemu_mutex_unlock_iothread();
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>
You will eventually regret using lots of get's and put's - they're
a pain to debug when it goes wrong.

> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    struct vfio_device_migration_info migration_info;
> +    int ret;
> +
> +    ret = pread(vbasedev->fd, &migration_info.data,
> +                sizeof(migration_info.data),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data));
> +    if (ret != sizeof(migration_info.data)) {
> +        error_report("Failed to get migration buffer information %d",
> +                     ret);
> +        return -EINVAL;
> +    }
> +
> +    if (migration_info.data.size) {
> +        void *buf = NULL;
> +        bool buffer_mmaped = false;
> +
> +        if (region->mmaps) {
> +            int i;
> +
> +            for (i = 0; i < region->nr_mmaps; i++) {
> +                if (region->mmaps[i].offset == migration_info.data.offset) {
> +                    buf = region->mmaps[i].mmap;
> +                    buffer_mmaped = true;
> +                    break;
> +                }
> +            }
> +        }
> +
> +        if (!buffer_mmaped) {
> +            buf = g_malloc0(migration_info.data.size);

How big are these chunks? If they're larger than about a page please
consider using g_try_malloc0 and checking for allocation failures.

> +            ret = pread(vbasedev->fd, buf, migration_info.data.size,
> +                        region->fd_offset + migration_info.data.offset);
> +            if (ret != migration_info.data.size) {
> +                error_report("Failed to get migration data %d", ret);
> +                return -EINVAL;
> +            }
> +        }
> +
> +        qemu_put_be64(f, migration_info.data.size);
> +        qemu_put_buffer(f, buf, migration_info.data.size);
> +
> +        if (!buffer_mmaped) {
> +            g_free(buf);
> +        }
> +
> +    } else {
> +        qemu_put_be64(f, migration_info.data.size);
> +    }
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return migration_info.data.size;
> +}
> +
> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    int ret;
> +
> +    ret = vfio_save_buffer(f, vbasedev);
> +    if (ret < 0) {
> +        error_report("vfio_save_buffer failed %s",
> +                     strerror(errno));
> +        return ret;
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return ret;
> +}
> +
> +static void vfio_update_pending(VFIODevice *vbasedev, uint64_t threshold_size)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    struct vfio_device_migration_info migration_info;
> +    int ret;
> +
> +    ret = pwrite(vbasedev->fd, &threshold_size, sizeof(threshold_size),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              pending.threshold_size));
> +    if (ret < 0) {
> +        error_report("Failed to set threshold size %d %s",
> +                     ret, strerror(errno));
> +        return;
> +    }
> +
> +    ret = pread(vbasedev->fd, &migration_info.pending,
> +                sizeof(migration_info.pending),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             pending));
> +    if (ret != sizeof(migration_info.pending)) {
> +        error_report("Failed to get pending bytes %d", ret);
> +        return;
> +    }
> +
> +    migration->pending_precopy_only = migration_info.pending.precopy_only;
> +    migration->pending_compatible = migration_info.pending.compatible;
> +    migration->pending_postcopy = migration_info.pending.postcopy_only;
> +
> +    return;
> +}
> +
> +static void vfio_save_pending(QEMUFile *f, void *opaque,
> +                              uint64_t threshold_size,
> +                              uint64_t *res_precopy_only,
> +                              uint64_t *res_compatible,
> +                              uint64_t *res_postcopy_only)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    vfio_update_pending(vbasedev, threshold_size);
> +
> +    *res_precopy_only += migration->pending_precopy_only;
> +    *res_compatible += migration->pending_compatible;
> +    *res_postcopy_only += migration->pending_postcopy;
> +}
> +
> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    MigrationState *ms = migrate_get_current();
> +    int ret;
> +
> +    if (vbasedev->vm_running) {
> +        vbasedev->vm_running = 0;
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev,
> +                                   VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY);
> +    if (ret) {
> +        error_report("Failed to set state STOPNCOPY_ACTIVE");
> +        return ret;
> +    }
> +
> +    ret = vfio_save_device_config_state(f, opaque);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    do {
> +        vfio_update_pending(vbasedev, ms->threshold_size);
> +
> +        if (vfio_is_active_iterate(opaque)) {
> +            ret = vfio_save_buffer(f, vbasedev);
> +            if (ret < 0) {
> +                error_report("Failed to save buffer");
> +                break;
> +            } else if (ret == 0) {
> +                break;
> +            }
> +        }
> +    } while ((migration->pending_compatible + migration->pending_postcopy) > 0);
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev,
> +                                   VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED);
> +    if (ret) {
> +        error_report("Failed to set state SAVE_COMPLETED");
> +        return ret;
> +    }
> +    return ret;
> +}
> +
> +static void vfio_save_cleanup(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    vfio_migration_region_exit(vbasedev);
> +}
> +
> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    int ret;
> +    uint64_t data;
> +
> +    data = qemu_get_be64(f);
> +    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
> +        if (data == VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
> +            ret = vfio_load_device_config_state(f, opaque);
> +            if (ret) {
> +                return ret;
> +            }
> +        } else if (data == VFIO_MIG_FLAG_DEV_SETUP_STATE) {
> +            data = qemu_get_be64(f);
> +            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
> +                return 0;
> +            } else {
> +                error_report("SETUP STATE: EOS not found 0x%lx", data);
> +                return -EINVAL;
> +            }
> +        } else if (data != 0) {
> +            VFIOMigration *migration = vbasedev->migration;
> +            VFIORegion *region = &migration->region.buffer;
> +            struct vfio_device_migration_info migration_info;
> +            void *buf = NULL;
> +            bool buffer_mmaped = false;
> +
> +            migration_info.data.size = data;
> +
> +            if (region->mmaps) {
> +                int i;
> +
> +                for (i = 0; i < region->nr_mmaps; i++) {
> +                    if (region->mmaps[i].mmap &&
> +                        (region->mmaps[i].size >= data)) {
> +                        buf = region->mmaps[i].mmap;
> +                        migration_info.data.offset = region->mmaps[i].offset;
> +                        buffer_mmaped = true;
> +                        break;
> +                    }
> +                }
> +            }
> +
> +            if (!buffer_mmaped) {
> +                buf = g_malloc0(migration_info.data.size);
> +                migration_info.data.offset = sizeof(migration_info) + 1;
> +            }
> +
> +            qemu_get_buffer(f, buf, data);
> +
> +            ret = pwrite(vbasedev->fd, &migration_info.data,
> +                         sizeof(migration_info.data),
> +                         region->fd_offset +
> +                         offsetof(struct vfio_device_migration_info, data));
> +            if (ret != sizeof(migration_info.data)) {
> +                error_report("Failed to set migration buffer information %d",
> +                        ret);
> +                return -EINVAL;
> +            }
> +
> +            if (!buffer_mmaped) {
> +                ret = pwrite(vbasedev->fd, buf, migration_info.data.size,
> +                             region->fd_offset + migration_info.data.offset);
> +                g_free(buf);
> +
> +                if (ret != migration_info.data.size) {
> +                    error_report("Failed to set migration buffer %d", ret);
> +                    return -EINVAL;
> +                }
> +            }
> +        }
> +
> +        ret = qemu_file_get_error(f);
> +        if (ret) {
> +            return ret;
> +        }
> +        data = qemu_get_be64(f);
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_load_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    int ret;
> +
> +    ret = vfio_migration_set_state(vbasedev,
> +                                    VFIO_DEVICE_STATE_MIGRATION_RESUME);
> +    if (ret) {
> +        error_report("Failed to set state RESUME");
> +    }
> +
> +    ret = vfio_migration_region_init(vbasedev);
> +    if (ret) {
> +        error_report("Failed to initialise migration region");
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_load_cleanup(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    int ret = 0;
> +
> +    ret = vfio_migration_set_state(vbasedev,
> +                                 VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED);
> +    if (ret) {
> +        error_report("Failed to set state RESUME_COMPLETED");
> +    }
> +
> +    vfio_migration_region_exit(vbasedev);
> +    return ret;
> +}
> +
> +static SaveVMHandlers savevm_vfio_handlers = {
> +    .save_setup = vfio_save_setup,
> +    .save_live_iterate = vfio_save_iterate,
> +    .save_live_complete_precopy = vfio_save_complete_precopy,
> +    .save_live_pending = vfio_save_pending,
> +    .save_cleanup = vfio_save_cleanup,
> +    .load_state = vfio_load_state,
> +    .load_setup = vfio_load_setup,
> +    .load_cleanup = vfio_load_cleanup,
> +    .is_active_iterate = vfio_is_active_iterate,
> +};
> +
> +static void vfio_vmstate_change(void *opaque, int running, RunState state)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    if ((vbasedev->vm_running != running) && running) {
> +        int ret;
> +
> +        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING);
> +        if (ret) {
> +            error_report("Failed to set state RUNNING");
> +        }
> +    }
> +
> +    vbasedev->vm_running = running;
> +}
> +
> +static void vfio_migration_state_notifier(Notifier *notifier, void *data)
> +{
> +    MigrationState *s = data;
> +    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
> +    int ret;
> +
> +    switch (s->state) {
> +    case MIGRATION_STATUS_SETUP:
> +        ret = vfio_migration_set_state(vbasedev,
> +                                       VFIO_DEVICE_STATE_MIGRATION_SETUP);
> +        if (ret) {
> +            error_report("Failed to set state SETUP");
> +        }
> +        return;
> +
> +    case MIGRATION_STATUS_ACTIVE:
> +        if (vbasedev->device_state == VFIO_DEVICE_STATE_MIGRATION_SETUP) {
> +            if (vbasedev->vm_running) {
> +                ret = vfio_migration_set_state(vbasedev,
> +                                          VFIO_DEVICE_STATE_MIGRATION_PRECOPY);
> +                if (ret) {
> +                    error_report("Failed to set state PRECOPY_ACTIVE");
> +                }
> +            } else {
> +                ret = vfio_migration_set_state(vbasedev,
> +                                        VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY);
> +                if (ret) {
> +                    error_report("Failed to set state STOPNCOPY_ACTIVE");
> +                }
> +            }
> +        } else {
> +            ret = vfio_migration_set_state(vbasedev,
> +                                           VFIO_DEVICE_STATE_MIGRATION_RESUME);
> +            if (ret) {
> +                error_report("Failed to set state RESUME");
> +            }
> +        }
> +        return;
> +
> +    case MIGRATION_STATUS_CANCELLING:
> +    case MIGRATION_STATUS_CANCELLED:
> +        ret = vfio_migration_set_state(vbasedev,
> +                                       VFIO_DEVICE_STATE_MIGRATION_CANCELLED);
> +        if (ret) {
> +            error_report("Failed to set state CANCELLED");
> +        }
> +        return;
> +
> +    case MIGRATION_STATUS_FAILED:
> +        ret = vfio_migration_set_state(vbasedev,
> +                                       VFIO_DEVICE_STATE_MIGRATION_FAILED);
> +        if (ret) {
> +            error_report("Failed to set state FAILED");
> +        }
> +        return;
> +    }

This lot looks like it would be easier to have a :

    newstate...

   switch (..) {
     case 
      newstate = ...
     case 
      newstate = ...
     case 
      newstate = ...
  }

  ret = vfio_migration_Set_state(newstate);
 
> +}
> +
> +static int vfio_migration_init(VFIODevice *vbasedev,
> +                               struct vfio_region_info *info)
> +{
> +    vbasedev->migration = g_new0(VFIOMigration, 1);
> +    vbasedev->migration->region.index = info->index;
> +
> +    register_savevm_live(NULL, "vfio", -1, 1, &savevm_vfio_handlers, vbasedev);
> +    vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
> +                                                          vbasedev);
> +
> +    vbasedev->migration_state.notify = vfio_migration_state_notifier;
> +    add_migration_state_change_notifier(&vbasedev->migration_state);
> +
> +    return 0;
> +}
> +
> +
> +/* ---------------------------------------------------------------------- */
> +
> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
> +{
> +    struct vfio_region_info *info;
> +    int ret;
> +
> +    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
> +                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
> +    if (ret) {
> +        Error *local_err = NULL;
> +
> +        error_setg(&vbasedev->migration_blocker,
> +                   "VFIO device doesn't support migration");
> +        ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            error_free(vbasedev->migration_blocker);
> +            return ret;
> +        }
> +    } else {
> +        return vfio_migration_init(vbasedev, info);
> +    }
> +
> +    return 0;
> +}
> +
> +void vfio_migration_finalize(VFIODevice *vbasedev)
> +{
> +    if (!vbasedev->migration) {
> +        return;
> +    }
> +
> +    if (vbasedev->vm_state) {
> +        qemu_del_vm_change_state_handler(vbasedev->vm_state);
> +        remove_migration_state_change_notifier(&vbasedev->migration_state);
> +    }
> +
> +    if (vbasedev->migration_blocker) {
> +        migrate_del_blocker(vbasedev->migration_blocker);
> +        error_free(vbasedev->migration_blocker);
> +    }
> +
> +    g_free(vbasedev->migration);
> +}
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index a9036929b220..ab8217c9e249 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -30,6 +30,8 @@
>  #include <linux/vfio.h>
>  #endif
>  
> +#include "sysemu/sysemu.h"
> +
>  #define ERR_PREFIX "vfio error: %s: "
>  #define WARN_PREFIX "vfio warning: %s: "
>  
> @@ -57,6 +59,16 @@ typedef struct VFIORegion {
>      uint8_t nr; /* cache the region number for debug */
>  } VFIORegion;
>  
> +typedef struct VFIOMigration {
> +    struct {
> +        VFIORegion buffer;
> +        uint32_t index;
> +    } region;
> +    uint64_t pending_precopy_only;
> +    uint64_t pending_compatible;
> +    uint64_t pending_postcopy;
> +} VFIOMigration;
> +
>  typedef struct VFIOAddressSpace {
>      AddressSpace *as;
>      QLIST_HEAD(, VFIOContainer) containers;
> @@ -116,6 +128,12 @@ typedef struct VFIODevice {
>      unsigned int num_irqs;
>      unsigned int num_regions;
>      unsigned int flags;
> +    uint32_t device_state;
> +    VMChangeStateEntry *vm_state;
> +    int vm_running;
> +    Notifier migration_state;
> +    VFIOMigration *migration;
> +    Error *migration_blocker;
>  } VFIODevice;
>  
>  struct VFIODeviceOps {
> @@ -193,4 +211,9 @@ int vfio_spapr_create_window(VFIOContainer *container,
>  int vfio_spapr_remove_window(VFIOContainer *container,
>                               hwaddr offset_within_address_space);
>  
> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
> +void vfio_migration_finalize(VFIODevice *vbasedev);
> +void vfio_get_dirty_page_list(VFIODevice *vbasedev, uint64_t start_addr,
> +                               uint64_t pfn_count);
> +
>  #endif /* HW_VFIO_VFIO_COMMON_H */
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 4/5] Add vfio_listerner_log_sync to mark dirty pages
  2018-11-20 20:39 ` [Qemu-devel] [PATCH 4/5] Add vfio_listerner_log_sync to mark dirty pages Kirti Wankhede
@ 2018-11-22 20:00   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 32+ messages in thread
From: Dr. David Alan Gilbert @ 2018-11-22 20:00 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, cjia, kevin.tian, ziye.yang, changpeng.liu,
	yi.l.liu, mlevitsk, eskultet, cohuck, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	zhi.a.wang, qemu-devel

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> vfio_listerner_log_sync gets list of dirty pages from vendor driver and mark
> those pages dirty.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/common.c | 32 ++++++++++++++++++++++++++++++++
>  1 file changed, 32 insertions(+)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index fb396cf00ac4..338aad7426f0 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -697,9 +697,41 @@ static void vfio_listener_region_del(MemoryListener *listener,
>      }
>  }
>  
> +static void vfio_listerner_log_sync(MemoryListener *listener,
> +                                    MemoryRegionSection *section)
> +{
> +    uint64_t start_addr, size, pfn_count;
> +    VFIOGroup *group;
> +    VFIODevice *vbasedev;
> +
> +    QLIST_FOREACH(group, &vfio_group_list, next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            switch (vbasedev->device_state) {
> +            case VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
> +            case VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY:
> +                    continue;
> +
> +            default:
> +                    return;
> +            }
> +        }
> +    }

Is that big loop just trying to find devices not in migration?
Some comments would be good.

Dave

> +    start_addr = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> +    size = int128_get64(section->size);
> +    pfn_count = size >> TARGET_PAGE_BITS;
> +
> +    QLIST_FOREACH(group, &vfio_group_list, next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            vfio_get_dirty_page_list(vbasedev, start_addr, pfn_count);
> +        }
> +    }
> +}
> +
>  static const MemoryListener vfio_memory_listener = {
>      .region_add = vfio_listener_region_add,
>      .region_del = vfio_listener_region_del,
> +    .log_sync = vfio_listerner_log_sync,
>  };
>  
>  static void vfio_listener_release(VFIOContainer *container)
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface
  2018-11-21  6:13       ` Tian, Kevin
@ 2018-11-22 20:01         ` Kirti Wankhede
  2018-11-26  7:14           ` Tian, Kevin
  0 siblings, 1 reply; 32+ messages in thread
From: Kirti Wankhede @ 2018-11-22 20:01 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, cjia
  Cc: Yang, Ziye, Liu, Changpeng, Liu, Yi L, mlevitsk, eskultet,
	cohuck, dgilbert, jonathan.davies, eauger, aik, pasic, felipe,
	Zhengxiao.zx, shuangtai.tst, Ken.Xue, Wang, Zhi A, qemu-devel



On 11/21/2018 11:43 AM, Tian, Kevin wrote:
>> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
>> Sent: Wednesday, November 21, 2018 12:24 PM
>>
>>
>> On 11/21/2018 5:56 AM, Tian, Kevin wrote:
>>>> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
>>>> Sent: Wednesday, November 21, 2018 4:40 AM
>>>>
>>>> - Defined MIGRATION region type and sub-type.
>>>> - Defined VFIO device states during migration process.
>>>> - Defined vfio_device_migration_info structure which will be placed at
>> 0th
>>>>   offset of migration region to get/set VFIO device related information.
>>>>   Defined actions and members of structure usage for each action:
>>>>     * To convey VFIO device state to be transitioned to.
>>>>     * To get pending bytes yet to be migrated for VFIO device
>>>>     * To ask driver to write data to migration region and return number of
>>>> bytes
>>>>       written in the region
>>>>     * In migration resume path, user space app writes to migration region
>>>> and
>>>>       communicates it to vendor driver.
>>>>     * Get bitmap of dirty pages from vendor driver from given start
>> address
>>>>
>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>>> ---
>>>>  linux-headers/linux/vfio.h | 130
>>>> +++++++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 130 insertions(+)
>>>>
>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>>>> index 3615a269d378..a6e45cb2cae2 100644
>>>> --- a/linux-headers/linux/vfio.h
>>>> +++ b/linux-headers/linux/vfio.h
>>>> @@ -301,6 +301,10 @@ struct vfio_region_info_cap_type {
>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>>>>
>>>> +/* Migration region type and sub-type */
>>>> +#define VFIO_REGION_TYPE_MIGRATION	        (1 << 30)
>>>> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
>>>> +
>>>>  /*
>>>>   * The MSIX mappable capability informs that MSIX data of a BAR can be
>>>> mmapped
>>>>   * which allows direct access to non-MSIX registers which happened to
>> be
>>>> within
>>>> @@ -602,6 +606,132 @@ struct vfio_device_ioeventfd {
>>>>
>>>>  #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE
>>>> + 16)
>>>>
>>>> +/**
>>>> + * VFIO device states :
>>>> + * VFIO User space application should set the device state to indicate
>>>> vendor
>>>> + * driver in which state the VFIO device should transitioned.
>>>> + * - VFIO_DEVICE_STATE_NONE:
>>>> + *   State when VFIO device is initialized but not yet running.
>>>> + * - VFIO_DEVICE_STATE_RUNNING:
>>>> + *   Transition VFIO device in running state, that is, user space
>> application
>>>> or
>>>> + *   VM is active.
>>>> + * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
>>>> + *   Transition VFIO device in migration setup state. This is used to
>> prepare
>>>> + *   VFIO device for migration while application or VM and vCPUs are
>> still in
>>>> + *   running state.
>>>> + * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
>>>> + *   When VFIO user space application or VM is active and vCPUs are
>>>> running,
>>>> + *   transition VFIO device in pre-copy state.
>>>> + * - VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY:
>>>> + *   When VFIO user space application or VM is stopped and vCPUs are
>>>> halted,
>>>> + *   transition VFIO device in stop-and-copy state.
>>>> + * - VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED:
>>>> + *   When VFIO user space application has copied data provided by
>> vendor
>>>> driver.
>>>> + *   This state is used by vendor driver to clean up all software state that
>>>> was
>>>> + *   setup during MIGRATION_SETUP state.
>>>> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME:
>>>> + *   Transition VFIO device to resume state, that is, start resuming VFIO
>>>> device
>>>> + *   when user space application or VM is not running and vCPUs are
>>>> halted.
>>>> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED:
>>>> + *   When user space application completes iterations of providing
>> device
>>>> state
>>>> + *   data, transition device in resume completed state.
>>>> + * - VFIO_DEVICE_STATE_MIGRATION_FAILED:
>>>> + *   Migration process failed due to some reason, transition device to
>>>> failed
>>>> + *   state. If migration process fails while saving at source, resume
>> device
>>>> at
>>>> + *   source. If migration process fails while resuming application or VM
>> at
>>>> + *   destination, stop restoration at destination and resume at source.
>>>> + * - VFIO_DEVICE_STATE_MIGRATION_CANCELLED:
>>>> + *   User space application has cancelled migration process either for
>> some
>>>> + *   known reason or due to user's intervention. Transition device to
>>>> Cancelled
>>>> + *   state, that is, resume device state as it was during running state at
>>>> + *   source.
>>>> + */
>>>> +
>>>> +enum {
>>>> +    VFIO_DEVICE_STATE_NONE,
>>>> +    VFIO_DEVICE_STATE_RUNNING,
>>>> +    VFIO_DEVICE_STATE_MIGRATION_SETUP,
>>>> +    VFIO_DEVICE_STATE_MIGRATION_PRECOPY,
>>>> +    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY,
>>>> +    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
>>>> +    VFIO_DEVICE_STATE_MIGRATION_RESUME,
>>>> +    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
>>>> +    VFIO_DEVICE_STATE_MIGRATION_FAILED,
>>>> +    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
>>>> +};
>>>
>>> We discussed in KVM forum to define the interfaces around the state
>>> itself, instead of around live migration flow. Looks this version doesn't
>>> move that way?
>>>
>>
>> This is patch series is along the discussion we had at KVM forum.
>>
>>> quote the summary from Alex, which though high level but simple
>>> enough to demonstrate the idea:
>>>
>>> --
>>> Here we would define "registers" for putting the device in various
>>> states through the migration process, for example enabling dirty logging,
>>> suspending the device, resuming the device, direction of data flow
>>> through the device state area, etc.
>>> --
>>>
>>
>> Defined a packed structure here to map it at 0th offset of migration
>> region so that offset can be calculated by offset_of(), you may call
>> same as register definitions.
> 
> yes, this part is a good change. My comment was around state definition
> itself.
> 
>>
>>
>>> based on that we just need much fewer states, e.g. {RUNNING,
>>> RUNNING_DIRTYLOG, STOPPED}. data flow direction doesn't need
>>> to be a state. could just a flag in the region.
>>
>> Flag is not preferred here, multiple flags can be set at a time.
>> Here need finite states with its proper definition what that device
>> state means to driver and user space application.
>> For Intel or others who don't need other states can ignore the state in
>> driver by taking no action on pwrite on .device_state offset. For
>> example for Intel driver could only take action on state change to
>> VFIO_DEVICE_STATE_RUNNING and
>> VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY
>>
>> I think dirty page logging is not a VFIO device's state.
>> .log_sync of MemoryListener is called during both :
>> - PRECOPY phase i.e. while vCPUs are still running and
>> - during STOPNCOPY phase i.e. when vCPUs are stopped.
>>
>>
>>> Those are sufficient to
>>> enable vGPU live migration on Intel platform. nvidia or other vendors
>>> may have more requirements, which could lead to addition of new
>>> states - but again, they should be defined in a way not tied to migration
>>> flow.
>>>
>>
>> I had tried to explain the intend of each state. Please go through the
>> comments above.
>> Also please take a look at other patches, mainly
>> 0003-Add-migration-functions-for-VFIO-devices.patch to understand why
>> these states are required.
>>
> 
> I looked at the explanations in this patch, but still didn't get the intention, e.g.:
> 
> + * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
> + *   Transition VFIO device in migration setup state. This is used to prepare
> + *   VFIO device for migration while application or VM and vCPUs are still in
> + *   running state.
> 
> what preparation is actually required? any example?

Each vendor driver can have different requirements as to how to prepare
for migration. For example, this phase can be used to allocate buffer
which can be mapped to MIGRATION region's data part, and allocating
staging buffer. Driver might need to spawn thread which would start
collecting data that need to be send during pre-copy phase.

> 
> + * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
> + *   When VFIO user space application or VM is active and vCPUs are running,
> + *   transition VFIO device in pre-copy state.
> 
> why does device driver need know this stage? in precopy phase, the VM
> is still running. Just dirty page tracking is in progress. the dirty bitmap could
> be retrieved through its own action interface.
> 

All mdev devices are not similar. Pre-copy phase is not just about dirty
page tracking. For devices which have memory on device could transfer
data from that memory during pre-copy phase. For example, NVIDIA GPU has
its own FB, so need to start sending FB data during pre-copy phase and
then during stop and copy phase send data from FB which is marked dirty
after that was copied in pre-copy phase. That helps to reduce total down
time.

> you have code to demonstrate how those states are transitioned in Qemu,
> but you didn't show evidence why those states are necessary in device side,
> which leads to the puzzle whether the definition is over-killed and limiting.
> 

I'm trying to keep these interfaces generic for VFIO and mdev devices.
Its difficult to define what vendor driver should do for each state,
each vendor driver have their own requirements. Vendor drivers should
decide whether to take any action on state transition or not.

> the flow in my mind is like below:
> 
> 1. an interface to turn on/off dirty page tracking on VFIO device:
> 	* vendor driver can do whatever required to enable device specific
> dirty page tracking mechanism here
> 	* device state is not changed here. still in running state
> 
> 2. an interface to get dirty page bitmap
>

I don't think there should be on/off interface for dirty page tracking.
If there is a write access on dirty_pfns.start_addr and dirty_pfns.total
and device_state >=VFIO_DEVICE_STATE_MIGRATION_SETUP && device_state <=
VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY then dirty page tracking has
started, so return dirty page bitmap in data part of migration region.


> 3. an interface to start/stop device activity
> 	* the effect of stop is to stop and drain in-the-fly device activities and
> make device state ready for dump-out. vendor driver can do specific preparation 
> here

VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY is to stop the device, but as I
mentioned above some vendor driver might have to do preparation before
pre-copy phase starts.

> 	* the effect of start is to check validity of device state and then resume
> device activities. again, vendor driver can do specific cleanup/preparation here
>

That is VFIO_DEVICE_STATE_MIGRATION_RESUME.

Defined VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED and
VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED states to cleanup all that
which was allocated/mmapped/started thread during setup phase. This can
be moved to transition to _RUNNING state. So if all agrees these states
can be removed.


> 4. an interface to save/restore device state
> 	* should happen when device is stopped
> 	* of course there is still an open how to check state compatibility as
> Alex pointed earlier
>

I hope above explains why other states are required.

Thanks,
Kirti


> this way above interfaces are not tied to migration. other usages which are
> interested in device state could also work (e.g. snapshot). If it doesn't work
> with your device, it's better that you can elaborate your requirement with more
> concrete examples. Then people will judge the necessity of a more complex
> interface as proposed in this series...
> 
> Thanks
> Kevin
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface
  2018-11-22 18:54   ` Dr. David Alan Gilbert
@ 2018-11-22 20:43     ` Kirti Wankhede
  2018-11-23 11:44       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 32+ messages in thread
From: Kirti Wankhede @ 2018-11-22 20:43 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: alex.williamson, cjia, kevin.tian, ziye.yang, changpeng.liu,
	yi.l.liu, mlevitsk, eskultet, cohuck, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	zhi.a.wang, qemu-devel



On 11/23/2018 12:24 AM, Dr. David Alan Gilbert wrote:
> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
>> - Defined MIGRATION region type and sub-type.
>> - Defined VFIO device states during migration process.
>> - Defined vfio_device_migration_info structure which will be placed at 0th
>>   offset of migration region to get/set VFIO device related information.
>>   Defined actions and members of structure usage for each action:
>>     * To convey VFIO device state to be transitioned to.
>>     * To get pending bytes yet to be migrated for VFIO device
>>     * To ask driver to write data to migration region and return number of bytes
>>       written in the region
>>     * In migration resume path, user space app writes to migration region and
>>       communicates it to vendor driver.
>>     * Get bitmap of dirty pages from vendor driver from given start address
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> 
> <snip>
> 
>> + * Action Get buffer:
>> + *      On this action, vendor driver should write data to migration region and
>> + *      return number of bytes written in the region.
>> + *      data.offset [output] : offset in the region from where data is written.
>> + *      data.size [output] : number of bytes written in migration buffer by
>> + *          vendor driver.
> 
> <snip>
> 
>> + */
>> +
>> +struct vfio_device_migration_info {
>> +        __u32 device_state;         /* VFIO device state */
>> +        struct {
>> +            __u64 precopy_only;
>> +            __u64 compatible;
>> +            __u64 postcopy_only;
>> +            __u64 threshold_size;
>> +        } pending;
>> +        struct {
>> +            __u64 offset;           /* offset */
>> +            __u64 size;             /* size */
>> +        } data;
> 
> I'm curious how the offsets/size work; how does the 
> kernel driver know the maximum size of state it's allowed to write?


Migration region looks like:
 ----------------------------------------------------------------------
|vfio_device_migration_info|    data section			      |	
|                          |     ///////////////////////////////////  |
 ----------------------------------------------------------------------
 ^			         ^                                 ^
 offset 0-trapped part         data.offset                     data.size


Kernel driver defines the size of migration region and tells VFIO user
space application (QEMU here) through VFIO_DEVICE_GET_REGION_INFO ioctl.
So kernel driver can calculate the size of data section. Then kernel
driver can have (data.size >= data section size) or (data.size < data
section size), hence VFIO user space application need to know data.size
to copy only relevant data.

> Why would it pick a none-0 offset into the output region?

Data section is always followed by vfio_device_migration_info structure
in the region, so data.offset will always be none-0.
Offset from where data is copied is decided by kernel driver, data
section can be trapped or mapped depending on how kernel driver defines
data section. If mmapped, then data.offset should be page aligned, where
as initial section which contain vfio_device_migration_info structure
might not end at offset which is page aligned.

Thanks,
Kirti

> Without having dug further these feel like i/o rather than just output;
> i.e. the calling process says 'put it at that offset and you've got size
> bytes' and the kernel replies with 'I did put it at offset and I wrote
> only this size bytes'
> 
> Dave
> 
>> +        struct {
>> +            __u64 start_addr;
>> +            __u64 total;
>> +            __u64 copied;
>> +        } dirty_pfns;
>> +} __attribute__((packed));
>> +
>>  /* -------- API for Type1 VFIO IOMMU -------- */
>>  
>>  /**
>> -- 
>> 2.7.0
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 0/5] Add migration support for VFIO device
  2018-11-21  5:47 ` [Qemu-devel] [PATCH 0/5] Add migration support for VFIO device Peter Xu
@ 2018-11-22 21:01   ` Kirti Wankhede
  0 siblings, 0 replies; 32+ messages in thread
From: Kirti Wankhede @ 2018-11-22 21:01 UTC (permalink / raw)
  To: Peter Xu
  Cc: alex.williamson, cjia, Zhengxiao.zx, kevin.tian, yi.l.liu,
	eskultet, ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue



On 11/21/2018 11:17 AM, Peter Xu wrote:
> On Wed, Nov 21, 2018 at 02:09:38AM +0530, Kirti Wankhede wrote:
>> Add migration support for VFIO device
> 
> Hi, Kirti,
> 
> I failed to apply the series cleanly onto master.  Could you push the
> tree somewhere so that people might read the work easier?  Or would
> you tell me the base commit, then I can apply it myself.
> 

Sorry of inconvenience.
These patches are on top of v3.0.0 release (tag: v3.0.0)

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices
  2018-11-21  7:39   ` Zhao, Yan Y
@ 2018-11-22 21:21     ` Kirti Wankhede
  2018-11-23  5:29       ` Zhao Yan
  0 siblings, 1 reply; 32+ messages in thread
From: Kirti Wankhede @ 2018-11-22 21:21 UTC (permalink / raw)
  To: Zhao, Yan Y, alex.williamson, cjia
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic



On 11/21/2018 1:09 PM, Zhao, Yan Y wrote:
> 
> 
>> -----Original Message-----
>> From: Qemu-devel [mailto:qemu-devel-
>> bounces+yan.y.zhao=intel.com@nongnu.org] On Behalf Of Kirti Wankhede
>> Sent: Wednesday, November 21, 2018 4:40 AM
>> To: alex.williamson@redhat.com; cjia@nvidia.com
>> Cc: Zhengxiao.zx@Alibaba-inc.com; Tian, Kevin <kevin.tian@intel.com>; Liu, Yi L
>> <yi.l.liu@intel.com>; eskultet@redhat.com; Yang, Ziye <ziye.yang@intel.com>;
>> qemu-devel@nongnu.org; cohuck@redhat.com; shuangtai.tst@alibaba-inc.com;
>> dgilbert@redhat.com; Wang, Zhi A <zhi.a.wang@intel.com>;
>> mlevitsk@redhat.com; pasic@linux.ibm.com; aik@ozlabs.ru; Kirti Wankhede
>> <kwankhede@nvidia.com>; eauger@redhat.com; felipe@nutanix.com;
>> jonathan.davies@nutanix.com; Liu, Changpeng <changpeng.liu@intel.com>;
>> Ken.Xue@amd.com
>> Subject: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices
>>
>> - Migration function are implemented for VFIO_DEVICE_TYPE_PCI device.
>> - Added SaveVMHandlers and implemented all basic functions required for live
>>   migration.
>> - Added VM state change handler to know running or stopped state of VM.
>> - Added migration state change notifier to get notification on migration state
>>   change. This state is translated to VFIO device state and conveyed to vendor
>>   driver.
>> - VFIO device supportd migration or not is decided based of migration region
>>   query. If migration region query is successful then migration is supported
>>   else migration is blocked.
>> - Structure vfio_device_migration_info is mapped at 0th offset of migration
>>   region and should always trapped by VFIO device's driver. Added both type of
>>   access support, trapped or mmapped, for data section of the region.
>> - To save device state, read data offset and size using structure
>>   vfio_device_migration_info.data, accordingly copy data from the region.
>> - To restore device state, write data offset and size in the structure and write
>>   data in the region.
>> - To get dirty page bitmap, write start address and pfn count then read count of
>>   pfns copied and accordingly read those from the rest of the region or mmaped
>>   part of the region. This copy is iterated till page bitmap for all requested
>>   pfns are copied.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  hw/vfio/Makefile.objs         |   2 +-
>>  hw/vfio/migration.c           | 729
>> ++++++++++++++++++++++++++++++++++++++++++
>>  include/hw/vfio/vfio-common.h |  23 ++
>>  3 files changed, 753 insertions(+), 1 deletion(-)  create mode 100644
>> hw/vfio/migration.c
>>
>> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs index
>> a2e7a0a7cf02..2cf2ba1440f2 100644
>> --- a/hw/vfio/Makefile.objs
>> +++ b/hw/vfio/Makefile.objs
>> @@ -1,5 +1,5 @@
>>  ifeq ($(CONFIG_LINUX), y)
>> -obj-$(CONFIG_SOFTMMU) += common.o
>> +obj-$(CONFIG_SOFTMMU) += common.o migration.o
>>  obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
>>  obj-$(CONFIG_VFIO_CCW) += ccw.o
>>  obj-$(CONFIG_SOFTMMU) += platform.o
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c new file mode 100644
>> index 000000000000..717fb63e4f43
>> --- /dev/null
>> +++ b/hw/vfio/migration.c
>> @@ -0,0 +1,729 @@
>> +/*
>> + * Migration support for VFIO devices
>> + *
>> + * Copyright NVIDIA, Inc. 2018
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2. See
>> + * the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include <linux/vfio.h>
>> +
>> +#include "hw/vfio/vfio-common.h"
>> +#include "cpu.h"
>> +#include "migration/migration.h"
>> +#include "migration/qemu-file.h"
>> +#include "migration/register.h"
>> +#include "migration/blocker.h"
>> +#include "migration/misc.h"
>> +#include "qapi/error.h"
>> +#include "exec/ramlist.h"
>> +#include "exec/ram_addr.h"
>> +#include "pci.h"
>> +
>> +/*
>> + * Flags used as delimiter:
>> + * 0xffffffff => MSB 32-bit all 1s
>> + * 0xef10     => emulated (virtual) function IO
>> + * 0x0000     => 16-bits reserved for flags
>> + */
>> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
>> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
>> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
>> +
>> +static void vfio_migration_region_exit(VFIODevice *vbasedev) {
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    if (!migration) {
>> +        return;
>> +    }
>> +
>> +    if (migration->region.buffer.size) {
>> +        vfio_region_exit(&migration->region.buffer);
>> +        vfio_region_finalize(&migration->region.buffer);
>> +    }
>> +}
>> +
>> +static int vfio_migration_region_init(VFIODevice *vbasedev) {
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    Object *obj = NULL;
>> +    int ret = -EINVAL;
>> +
>> +    if (!migration) {
>> +        return ret;
>> +    }
>> +
>> +    /* Migration support added for PCI device only */
>> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
>> +        obj = vfio_pci_get_object(vbasedev);
>> +    }
>> +
>> +    if (!obj) {
>> +        return ret;
>> +    }
>> +
>> +    ret = vfio_region_setup(obj, vbasedev, &migration->region.buffer,
>> +                            migration->region.index, "migration");
>> +    if (ret) {
>> +        error_report("Failed to setup VFIO migration region %d: %s",
>> +                      migration->region.index, strerror(-ret));
>> +        goto err;
>> +    }
>> +
>> +    if (!migration->region.buffer.size) {
>> +        ret = -EINVAL;
>> +        error_report("Invalid region size of VFIO migration region %d: %s",
>> +                     migration->region.index, strerror(-ret));
>> +        goto err;
>> +    }
>> +
>> +    if (migration->region.buffer.mmaps) {
>> +        ret = vfio_region_mmap(&migration->region.buffer);
>> +        if (ret) {
>> +            error_report("Failed to mmap VFIO migration region %d: %s",
>> +                         migration->region.index, strerror(-ret));
>> +            goto err;
>> +        }
>> +    }
>> +
>> +    return 0;
>> +
>> +err:
>> +    vfio_migration_region_exit(vbasedev);
>> +    return ret;
>> +}
>> +
>> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t
>> +state) {
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIORegion *region = &migration->region.buffer;
>> +    int ret = 0;
>> +
>> +    if (vbasedev->device_state == state) {
>> +        return ret;
>> +    }
>> +
>> +    ret = pwrite(vbasedev->fd, &state, sizeof(state),
>> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                              device_state));
>> +    if (ret < 0) {
>> +        error_report("Failed to set migration state %d %s",
>> +                     ret, strerror(errno));
>> +        return ret;
>> +    }
>> +
>> +    vbasedev->device_state = state;
>> +    return ret;
>> +}
>> +
>> +void vfio_get_dirty_page_list(VFIODevice *vbasedev,
>> +                              uint64_t start_addr,
>> +                              uint64_t pfn_count) {
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIORegion *region = &migration->region.buffer;
>> +    struct vfio_device_migration_info migration_info;
>> +    uint64_t count = 0;
>> +    int ret;
>> +
>> +    migration_info.dirty_pfns.start_addr = start_addr;
>> +    migration_info.dirty_pfns.total = pfn_count;
>> +
>> +    ret = pwrite(vbasedev->fd, &migration_info.dirty_pfns,
>> +                 sizeof(migration_info.dirty_pfns),
>> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                              dirty_pfns));
>> +    if (ret < 0) {
>> +        error_report("Failed to set dirty pages start address %d %s",
>> +                ret, strerror(errno));
>> +        return;
>> +    }
>> +
>> +    do {
>> +        /* Read dirty_pfns.copied */
>> +        ret = pread(vbasedev->fd, &migration_info.dirty_pfns,
>> +                sizeof(migration_info.dirty_pfns),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             dirty_pfns));
>> +        if (ret < 0) {
>> +            error_report("Failed to get dirty pages bitmap count %d %s",
>> +                    ret, strerror(errno));
>> +            return;
>> +        }
>> +
>> +        if (migration_info.dirty_pfns.copied) {
>> +            uint64_t bitmap_size;
>> +            void *buf = NULL;
>> +            bool buffer_mmaped = false;
>> +
>> +            bitmap_size = (BITS_TO_LONGS(migration_info.dirty_pfns.copied) + 1)
>> +                           * sizeof(unsigned long);
>> +
>> +            if (region->mmaps) {
>> +                int i;
>> +
>> +                for (i = 0; i < region->nr_mmaps; i++) {
>> +                    if (region->mmaps[i].size >= bitmap_size) {
>> +                        buf = region->mmaps[i].mmap;
>> +                        buffer_mmaped = true;
>> +                        break;
>> +                    }
>> +                }
>> +            }
> What if mmapped data area is in front of mmaped dirty bit area?
> Maybe you need to record dirty bit region's index as what does for data region.
>

No, data section is not different than dirty bit area.
Migration region is like:
 ----------------------------------------------------------------------
|vfio_device_migration_info|    data section			      |	
|                          |     ///////////////////////////////////  |
 ----------------------------------------------------------------------
 ^			         ^                                 ^
 offset 0-trapped part         data.offset                     data.size

Same data section is used to copy vendor driver data during save and
during dirty page bitmap query.

>> +
>> +            if (!buffer_mmaped) {
>> +                buf = g_malloc0(bitmap_size);
>> +
>> +                ret = pread(vbasedev->fd, buf, bitmap_size,
>> +                            region->fd_offset + sizeof(migration_info) + 1);
>> +                if (ret != bitmap_size) {
>> +                    error_report("Failed to get migration data %d", ret);
>> +                    g_free(buf);
>> +                    return;
>> +                }
>> +            }
>> +
>> +            cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
>> +                                        start_addr + (count * TARGET_PAGE_SIZE),
>> +                                        migration_info.dirty_pfns.copied);
>> +            count +=  migration_info.dirty_pfns.copied;
>> +
>> +            if (!buffer_mmaped) {
>> +                g_free(buf);
>> +            }
>> +        }
>> +    } while (count < migration_info.dirty_pfns.total); }
>> +
>> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque) {
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
>> +
>> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
>> +        vfio_pci_save_config(vbasedev, f);
>> +    }
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    return qemu_file_get_error(f);
>> +}
>> +
>> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque) {
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
>> +        vfio_pci_load_config(vbasedev, f);
>> +    }
>> +
>> +    if (qemu_get_be64(f) != VFIO_MIG_FLAG_END_OF_STATE) {
>> +        error_report("Wrong end of block while loading device config space");
>> +        return -EINVAL;
>> +    }
>> +
>> +    return qemu_file_get_error(f);
>> +}
>> +
> What's the purpose to add a tailing VFIO_MIG_FLAG_END_OF_STATE for each section? 
> For compatibility check?
> Maybe a version id or magic in struct vfio_device_migration_info is more appropriate?
> 

No, this is to identify the section end during resume, see
vfio_load_state().

> 
>> +/*
>> +----------------------------------------------------------------------
>> +*/
>> +
>> +static bool vfio_is_active_iterate(void *opaque) {
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    if (vbasedev->vm_running && vbasedev->migration &&
>> +        (vbasedev->migration->pending_precopy_only != 0))
>> +        return true;
>> +
>> +    if (!vbasedev->vm_running && vbasedev->migration &&
>> +        (vbasedev->migration->pending_postcopy != 0))
>> +        return true;
>> +
>> +    return false;
>> +}
>> +
>> +static int vfio_save_setup(QEMUFile *f, void *opaque) {
>> +    VFIODevice *vbasedev = opaque;
>> +    int ret;
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>> +
>> +    qemu_mutex_lock_iothread();
>> +    ret = vfio_migration_region_init(vbasedev);
>> +    qemu_mutex_unlock_iothread();
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev) {
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIORegion *region = &migration->region.buffer;
>> +    struct vfio_device_migration_info migration_info;
>> +    int ret;
>> +
>> +    ret = pread(vbasedev->fd, &migration_info.data,
>> +                sizeof(migration_info.data),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             data));
>> +    if (ret != sizeof(migration_info.data)) {
>> +        error_report("Failed to get migration buffer information %d",
>> +                     ret);
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (migration_info.data.size) {
>> +        void *buf = NULL;
>> +        bool buffer_mmaped = false;
>> +
>> +        if (region->mmaps) {
>> +            int i;
>> +
>> +            for (i = 0; i < region->nr_mmaps; i++) {
>> +                if (region->mmaps[i].offset == migration_info.data.offset) {
>> +                    buf = region->mmaps[i].mmap;
>> +                    buffer_mmaped = true;
>> +                    break;
>> +                }
>> +            }
>> +        }
>> +
>> +        if (!buffer_mmaped) {
>> +            buf = g_malloc0(migration_info.data.size);
>> +            ret = pread(vbasedev->fd, buf, migration_info.data.size,
>> +                        region->fd_offset + migration_info.data.offset);
>> +            if (ret != migration_info.data.size) {
>> +                error_report("Failed to get migration data %d", ret);
>> +                return -EINVAL;
>> +            }
>> +        }
>> +
>> +        qemu_put_be64(f, migration_info.data.size);
>> +        qemu_put_buffer(f, buf, migration_info.data.size);
>> +
>> +        if (!buffer_mmaped) {
>> +            g_free(buf);
>> +        }
>> +
>> +    } else {
>> +        qemu_put_be64(f, migration_info.data.size);
>> +    }
>> +
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    return migration_info.data.size;
>> +}
>> +
>> +static int vfio_save_iterate(QEMUFile *f, void *opaque) {
>> +    VFIODevice *vbasedev = opaque;
>> +    int ret;
>> +
>> +    ret = vfio_save_buffer(f, vbasedev);
>> +    if (ret < 0) {
>> +        error_report("vfio_save_buffer failed %s",
>> +                     strerror(errno));
>> +        return ret;
>> +    }
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static void vfio_update_pending(VFIODevice *vbasedev, uint64_t
>> +threshold_size) {
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIORegion *region = &migration->region.buffer;
>> +    struct vfio_device_migration_info migration_info;
>> +    int ret;
>> +
>> +    ret = pwrite(vbasedev->fd, &threshold_size, sizeof(threshold_size),
>> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                              pending.threshold_size));
>> +    if (ret < 0) {
>> +        error_report("Failed to set threshold size %d %s",
>> +                     ret, strerror(errno));
>> +        return;
>> +    }
>> +
>> +    ret = pread(vbasedev->fd, &migration_info.pending,
>> +                sizeof(migration_info.pending),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             pending));
>> +    if (ret != sizeof(migration_info.pending)) {
>> +        error_report("Failed to get pending bytes %d", ret);
>> +        return;
>> +    }
>> +
>> +    migration->pending_precopy_only = migration_info.pending.precopy_only;
>> +    migration->pending_compatible = migration_info.pending.compatible;
>> +    migration->pending_postcopy = migration_info.pending.postcopy_only;
>> +
>> +    return;
>> +}
>> +
>> +static void vfio_save_pending(QEMUFile *f, void *opaque,
>> +                              uint64_t threshold_size,
>> +                              uint64_t *res_precopy_only,
>> +                              uint64_t *res_compatible,
>> +                              uint64_t *res_postcopy_only) {
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    vfio_update_pending(vbasedev, threshold_size);
>> +
>> +    *res_precopy_only += migration->pending_precopy_only;
>> +    *res_compatible += migration->pending_compatible;
>> +    *res_postcopy_only += migration->pending_postcopy; }
>> +
>> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) {
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    MigrationState *ms = migrate_get_current();
>> +    int ret;
>> +
>> +    if (vbasedev->vm_running) {
>> +        vbasedev->vm_running = 0;
>> +    }
>> +
>> +    ret = vfio_migration_set_state(vbasedev,
>> +                                   VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY);
>> +    if (ret) {
>> +        error_report("Failed to set state STOPNCOPY_ACTIVE");
>> +        return ret;
>> +    }
>> +
>> +    ret = vfio_save_device_config_state(f, opaque);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    do {
>> +        vfio_update_pending(vbasedev, ms->threshold_size);
>> +
>> +        if (vfio_is_active_iterate(opaque)) {
>> +            ret = vfio_save_buffer(f, vbasedev);
>> +            if (ret < 0) {
>> +                error_report("Failed to save buffer");
>> +                break;
>> +            } else if (ret == 0) {
>> +                break;
>> +            }
>> +        }
>> +    } while ((migration->pending_compatible +
>> + migration->pending_postcopy) > 0);
>> +
> 
> if migration->pending_postcopy is not 0, vfio_save_complete_precopy() cannot finish? Is that true?
> But vfio_save_complete_precopy() does not need to copy post copy data.
> 
>

This is stop and copy phase, that is pre-copy phase ended and vCPUs are
stopped. So all data for device should be copied in this callback.

> 
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    ret = vfio_migration_set_state(vbasedev,
>> +                                   VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED);
>> +    if (ret) {
>> +        error_report("Failed to set state SAVE_COMPLETED");
>> +        return ret;
>> +    }
>> +    return ret;
>> +}
>> +
>> +static void vfio_save_cleanup(void *opaque) {
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    vfio_migration_region_exit(vbasedev);
>> +}
>> +
>> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id) {
>> +    VFIODevice *vbasedev = opaque;
>> +    int ret;
>> +    uint64_t data;
>> +
> Need to use version_id to check source and target version.
> 

Good point. I'll add that.

>> +    data = qemu_get_be64(f);
>> +    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
>> +        if (data == VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
>> +            ret = vfio_load_device_config_state(f, opaque);
>> +            if (ret) {
>> +                return ret;
>> +            }
>> +        } else if (data == VFIO_MIG_FLAG_DEV_SETUP_STATE) {
>> +            data = qemu_get_be64(f);
>> +            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
>> +                return 0;
>> +            } else {
>> +                error_report("SETUP STATE: EOS not found 0x%lx", data);
>> +                return -EINVAL;
>> +            }
>> +        } else if (data != 0) {
>> +            VFIOMigration *migration = vbasedev->migration;
>> +            VFIORegion *region = &migration->region.buffer;
>> +            struct vfio_device_migration_info migration_info;
>> +            void *buf = NULL;
>> +            bool buffer_mmaped = false;
>> +
>> +            migration_info.data.size = data;
>> +
>> +            if (region->mmaps) {
>> +                int i;
>> +
>> +                for (i = 0; i < region->nr_mmaps; i++) {
>> +                    if (region->mmaps[i].mmap &&
>> +                        (region->mmaps[i].size >= data)) {
>> +                        buf = region->mmaps[i].mmap;
>> +                        migration_info.data.offset = region->mmaps[i].offset;
>> +                        buffer_mmaped = true;
>> +                        break;
>> +                    }
>> +                }
>> +            }
>> +
>> +            if (!buffer_mmaped) {
>> +                buf = g_malloc0(migration_info.data.size);
>> +                migration_info.data.offset = sizeof(migration_info) + 1;
>> +            }
>> +
>> +            qemu_get_buffer(f, buf, data);
>> +
>> +            ret = pwrite(vbasedev->fd, &migration_info.data,
>> +                         sizeof(migration_info.data),
>> +                         region->fd_offset +
>> +                         offsetof(struct vfio_device_migration_info, data));
>> +            if (ret != sizeof(migration_info.data)) {
>> +                error_report("Failed to set migration buffer information %d",
>> +                        ret);
>> +                return -EINVAL;
>> +            }
>> +
>> +            if (!buffer_mmaped) {
>> +                ret = pwrite(vbasedev->fd, buf, migration_info.data.size,
>> +                             region->fd_offset + migration_info.data.offset);
>> +                g_free(buf);
>> +
>> +                if (ret != migration_info.data.size) {
>> +                    error_report("Failed to set migration buffer %d", ret);
>> +                    return -EINVAL;
>> +                }
>> +            }
>> +        }
>> +
>> +        ret = qemu_file_get_error(f);
>> +        if (ret) {
>> +            return ret;
>> +        }
>> +        data = qemu_get_be64(f);
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static int vfio_load_setup(QEMUFile *f, void *opaque) {
>> +    VFIODevice *vbasedev = opaque;
>> +    int ret;
>> +
>> +    ret = vfio_migration_set_state(vbasedev,
>> +                                    VFIO_DEVICE_STATE_MIGRATION_RESUME);
>> +    if (ret) {
>> +        error_report("Failed to set state RESUME");
>> +    }
>> +
>> +    ret = vfio_migration_region_init(vbasedev);
>> +    if (ret) {
>> +        error_report("Failed to initialise migration region");
>> +        return ret;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
> Why vfio_migration_set_state() is in front of  vfio_migration_region_init()?
> So VFIO_DEVICE_STATE_MIGRATION_RESUME is really useful? :)
> 
> 

Yes, how will kernel driver know that resume has started?


>> +static int vfio_load_cleanup(void *opaque) {
>> +    VFIODevice *vbasedev = opaque;
>> +    int ret = 0;
>> +
>> +    ret = vfio_migration_set_state(vbasedev,
>> +                                 VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED);
>> +    if (ret) {
>> +        error_report("Failed to set state RESUME_COMPLETED");
>> +    }
>> +
>> +    vfio_migration_region_exit(vbasedev);
>> +    return ret;
>> +}
>> +
>> +static SaveVMHandlers savevm_vfio_handlers = {
>> +    .save_setup = vfio_save_setup,
>> +    .save_live_iterate = vfio_save_iterate,
>> +    .save_live_complete_precopy = vfio_save_complete_precopy,
>> +    .save_live_pending = vfio_save_pending,
>> +    .save_cleanup = vfio_save_cleanup,
>> +    .load_state = vfio_load_state,
>> +    .load_setup = vfio_load_setup,
>> +    .load_cleanup = vfio_load_cleanup,
>> +    .is_active_iterate = vfio_is_active_iterate, };
>> +
>> +static void vfio_vmstate_change(void *opaque, int running, RunState
>> +state) {
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    if ((vbasedev->vm_running != running) && running) {
>> +        int ret;
>> +
>> +        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING);
>> +        if (ret) {
>> +            error_report("Failed to set state RUNNING");
>> +        }
>> +    }
>> +
>> +    vbasedev->vm_running = running;
>> +}
>> +
> vfio_vmstate_change() is registered at initialization, so for source vm 
> vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING) will be called on vm start.
> but vfio_migration_region_init() is called in save_setup() when migration starts,
> as a result,  "vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING)" should not function here for source vm.
> So, again, is this state really used? :)
> 
> 

vfio_vmstate_change() is not only called at VM start, this also gets
called after resume is done. Kernel driver need to know after resume
that vCPUs are running.

Thanks,
Kirti

>> +static void vfio_migration_state_notifier(Notifier *notifier, void
>> +*data) {
>> +    MigrationState *s = data;
>> +    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
>> +    int ret;
>> +
>> +    switch (s->state) {
>> +    case MIGRATION_STATUS_SETUP:
>> +        ret = vfio_migration_set_state(vbasedev,
>> +                                       VFIO_DEVICE_STATE_MIGRATION_SETUP);
>> +        if (ret) {
>> +            error_report("Failed to set state SETUP");
>> +        }
>> +        return;
>> +
>> +    case MIGRATION_STATUS_ACTIVE:
>> +        if (vbasedev->device_state == VFIO_DEVICE_STATE_MIGRATION_SETUP) {
>> +            if (vbasedev->vm_running) {
>> +                ret = vfio_migration_set_state(vbasedev,
>> +                                          VFIO_DEVICE_STATE_MIGRATION_PRECOPY);
>> +                if (ret) {
>> +                    error_report("Failed to set state PRECOPY_ACTIVE");
>> +                }
>> +            } else {
>> +                ret = vfio_migration_set_state(vbasedev,
>> +                                        VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY);
>> +                if (ret) {
>> +                    error_report("Failed to set state STOPNCOPY_ACTIVE");
>> +                }
>> +            }
>> +        } else {
>> +            ret = vfio_migration_set_state(vbasedev,
>> +                                           VFIO_DEVICE_STATE_MIGRATION_RESUME);
>> +            if (ret) {
>> +                error_report("Failed to set state RESUME");
>> +            }
>> +        }
>> +        return;
>> +
>> +    case MIGRATION_STATUS_CANCELLING:
>> +    case MIGRATION_STATUS_CANCELLED:
>> +        ret = vfio_migration_set_state(vbasedev,
>> +                                       VFIO_DEVICE_STATE_MIGRATION_CANCELLED);
>> +        if (ret) {
>> +            error_report("Failed to set state CANCELLED");
>> +        }
>> +        return;
>> +
>> +    case MIGRATION_STATUS_FAILED:
>> +        ret = vfio_migration_set_state(vbasedev,
>> +                                       VFIO_DEVICE_STATE_MIGRATION_FAILED);
>> +        if (ret) {
>> +            error_report("Failed to set state FAILED");
>> +        }
>> +        return;
>> +    }
>> +}
>> +
>> +static int vfio_migration_init(VFIODevice *vbasedev,
>> +                               struct vfio_region_info *info) {
>> +    vbasedev->migration = g_new0(VFIOMigration, 1);
>> +    vbasedev->migration->region.index = info->index;
>> +
>> +    register_savevm_live(NULL, "vfio", -1, 1, &savevm_vfio_handlers, vbasedev);
>> +    vbasedev->vm_state =
>> qemu_add_vm_change_state_handler(vfio_vmstate_change,
>> +                                                          vbasedev);
>> +
>> +    vbasedev->migration_state.notify = vfio_migration_state_notifier;
>> +    add_migration_state_change_notifier(&vbasedev->migration_state);
>> +
>> +    return 0;
>> +}
>> +
>> +
>> +/*
>> +----------------------------------------------------------------------
>> +*/
>> +
>> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp) {
>> +    struct vfio_region_info *info;
>> +    int ret;
>> +
>> +    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
>> +                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
>> +    if (ret) {
>> +        Error *local_err = NULL;
>> +
>> +        error_setg(&vbasedev->migration_blocker,
>> +                   "VFIO device doesn't support migration");
>> +        ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
>> +        if (local_err) {
>> +            error_propagate(errp, local_err);
>> +            error_free(vbasedev->migration_blocker);
>> +            return ret;
>> +        }
>> +    } else {
>> +        return vfio_migration_init(vbasedev, info);
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +void vfio_migration_finalize(VFIODevice *vbasedev) {
>> +    if (!vbasedev->migration) {
>> +        return;
>> +    }
>> +
>> +    if (vbasedev->vm_state) {
>> +        qemu_del_vm_change_state_handler(vbasedev->vm_state);
>> +        remove_migration_state_change_notifier(&vbasedev->migration_state);
>> +    }
>> +
>> +    if (vbasedev->migration_blocker) {
>> +        migrate_del_blocker(vbasedev->migration_blocker);
>> +        error_free(vbasedev->migration_blocker);
>> +    }
>> +
>> +    g_free(vbasedev->migration);
>> +}
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index a9036929b220..ab8217c9e249 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -30,6 +30,8 @@
>>  #include <linux/vfio.h>
>>  #endif
>>
>> +#include "sysemu/sysemu.h"
>> +
>>  #define ERR_PREFIX "vfio error: %s: "
>>  #define WARN_PREFIX "vfio warning: %s: "
>>
>> @@ -57,6 +59,16 @@ typedef struct VFIORegion {
>>      uint8_t nr; /* cache the region number for debug */  } VFIORegion;
>>
>> +typedef struct VFIOMigration {
>> +    struct {
>> +        VFIORegion buffer;
>> +        uint32_t index;
>> +    } region;
>> +    uint64_t pending_precopy_only;
>> +    uint64_t pending_compatible;
>> +    uint64_t pending_postcopy;
>> +} VFIOMigration;
>> +
>>  typedef struct VFIOAddressSpace {
>>      AddressSpace *as;
>>      QLIST_HEAD(, VFIOContainer) containers; @@ -116,6 +128,12 @@ typedef
>> struct VFIODevice {
>>      unsigned int num_irqs;
>>      unsigned int num_regions;
>>      unsigned int flags;
>> +    uint32_t device_state;
>> +    VMChangeStateEntry *vm_state;
>> +    int vm_running;
>> +    Notifier migration_state;
>> +    VFIOMigration *migration;
>> +    Error *migration_blocker;
>>  } VFIODevice;
>>
>>  struct VFIODeviceOps {
>> @@ -193,4 +211,9 @@ int vfio_spapr_create_window(VFIOContainer
>> *container,  int vfio_spapr_remove_window(VFIOContainer *container,
>>                               hwaddr offset_within_address_space);
>>
>> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp); void
>> +vfio_migration_finalize(VFIODevice *vbasedev); void
>> +vfio_get_dirty_page_list(VFIODevice *vbasedev, uint64_t start_addr,
>> +                               uint64_t pfn_count);
>> +
>>  #endif /* HW_VFIO_VFIO_COMMON_H */
>> --
>> 2.7.0
>>
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices
  2018-11-22 21:21     ` Kirti Wankhede
@ 2018-11-23  5:29       ` Zhao Yan
  0 siblings, 0 replies; 32+ messages in thread
From: Zhao Yan @ 2018-11-23  5:29 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, Ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, alex.williamson, eauger, felipe, jonathan.davies,
	Changpeng.liu, Ken.Xue

On Fri, Nov 23, 2018 at 02:51:39AM +0530, Kirti Wankhede wrote:
> 
> 
> On 11/21/2018 1:09 PM, Zhao, Yan Y wrote:
> > 
> > 
> >> -----Original Message-----
> >> From: Qemu-devel [mailto:qemu-devel-
> >> bounces+yan.y.zhao=intel.com@nongnu.org] On Behalf Of Kirti Wankhede
> >> Sent: Wednesday, November 21, 2018 4:40 AM
> >> To: alex.williamson@redhat.com; cjia@nvidia.com
> >> Cc: Zhengxiao.zx@Alibaba-inc.com; Tian, Kevin <kevin.tian@intel.com>; Liu, Yi L
> >> <yi.l.liu@intel.com>; eskultet@redhat.com; Yang, Ziye <ziye.yang@intel.com>;
> >> qemu-devel@nongnu.org; cohuck@redhat.com; shuangtai.tst@alibaba-inc.com;
> >> dgilbert@redhat.com; Wang, Zhi A <zhi.a.wang@intel.com>;
> >> mlevitsk@redhat.com; pasic@linux.ibm.com; aik@ozlabs.ru; Kirti Wankhede
> >> <kwankhede@nvidia.com>; eauger@redhat.com; felipe@nutanix.com;
> >> jonathan.davies@nutanix.com; Liu, Changpeng <changpeng.liu@intel.com>;
> >> Ken.Xue@amd.com
> >> Subject: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices
> >>
> >> - Migration function are implemented for VFIO_DEVICE_TYPE_PCI device.
> >> - Added SaveVMHandlers and implemented all basic functions required for live
> >>   migration.
> >> - Added VM state change handler to know running or stopped state of VM.
> >> - Added migration state change notifier to get notification on migration state
> >>   change. This state is translated to VFIO device state and conveyed to vendor
> >>   driver.
> >> - VFIO device supportd migration or not is decided based of migration region
> >>   query. If migration region query is successful then migration is supported
> >>   else migration is blocked.
> >> - Structure vfio_device_migration_info is mapped at 0th offset of migration
> >>   region and should always trapped by VFIO device's driver. Added both type of
> >>   access support, trapped or mmapped, for data section of the region.
> >> - To save device state, read data offset and size using structure
> >>   vfio_device_migration_info.data, accordingly copy data from the region.
> >> - To restore device state, write data offset and size in the structure and write
> >>   data in the region.
> >> - To get dirty page bitmap, write start address and pfn count then read count of
> >>   pfns copied and accordingly read those from the rest of the region or mmaped
> >>   part of the region. This copy is iterated till page bitmap for all requested
> >>   pfns are copied.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>  hw/vfio/Makefile.objs         |   2 +-
> >>  hw/vfio/migration.c           | 729
> >> ++++++++++++++++++++++++++++++++++++++++++
> >>  include/hw/vfio/vfio-common.h |  23 ++
> >>  3 files changed, 753 insertions(+), 1 deletion(-)  create mode 100644
> >> hw/vfio/migration.c
> >>
> >> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs index
> >> a2e7a0a7cf02..2cf2ba1440f2 100644
> >> --- a/hw/vfio/Makefile.objs
> >> +++ b/hw/vfio/Makefile.objs
> >> @@ -1,5 +1,5 @@
> >>  ifeq ($(CONFIG_LINUX), y)
> >> -obj-$(CONFIG_SOFTMMU) += common.o
> >> +obj-$(CONFIG_SOFTMMU) += common.o migration.o
> >>  obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
> >>  obj-$(CONFIG_VFIO_CCW) += ccw.o
> >>  obj-$(CONFIG_SOFTMMU) += platform.o
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c new file mode 100644
> >> index 000000000000..717fb63e4f43
> >> --- /dev/null
> >> +++ b/hw/vfio/migration.c
> >> @@ -0,0 +1,729 @@
> >> +/*
> >> + * Migration support for VFIO devices
> >> + *
> >> + * Copyright NVIDIA, Inc. 2018
> >> + *
> >> + * This work is licensed under the terms of the GNU GPL, version 2. See
> >> + * the COPYING file in the top-level directory.
> >> + */
> >> +
> >> +#include "qemu/osdep.h"
> >> +#include <linux/vfio.h>
> >> +
> >> +#include "hw/vfio/vfio-common.h"
> >> +#include "cpu.h"
> >> +#include "migration/migration.h"
> >> +#include "migration/qemu-file.h"
> >> +#include "migration/register.h"
> >> +#include "migration/blocker.h"
> >> +#include "migration/misc.h"
> >> +#include "qapi/error.h"
> >> +#include "exec/ramlist.h"
> >> +#include "exec/ram_addr.h"
> >> +#include "pci.h"
> >> +
> >> +/*
> >> + * Flags used as delimiter:
> >> + * 0xffffffff => MSB 32-bit all 1s
> >> + * 0xef10     => emulated (virtual) function IO
> >> + * 0x0000     => 16-bits reserved for flags
> >> + */
> >> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> >> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> >> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> >> +
> >> +static void vfio_migration_region_exit(VFIODevice *vbasedev) {
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +
> >> +    if (!migration) {
> >> +        return;
> >> +    }
> >> +
> >> +    if (migration->region.buffer.size) {
> >> +        vfio_region_exit(&migration->region.buffer);
> >> +        vfio_region_finalize(&migration->region.buffer);
> >> +    }
> >> +}
> >> +
> >> +static int vfio_migration_region_init(VFIODevice *vbasedev) {
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    Object *obj = NULL;
> >> +    int ret = -EINVAL;
> >> +
> >> +    if (!migration) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    /* Migration support added for PCI device only */
> >> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> >> +        obj = vfio_pci_get_object(vbasedev);
> >> +    }
> >> +
> >> +    if (!obj) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    ret = vfio_region_setup(obj, vbasedev, &migration->region.buffer,
> >> +                            migration->region.index, "migration");
> >> +    if (ret) {
> >> +        error_report("Failed to setup VFIO migration region %d: %s",
> >> +                      migration->region.index, strerror(-ret));
> >> +        goto err;
> >> +    }
> >> +
> >> +    if (!migration->region.buffer.size) {
> >> +        ret = -EINVAL;
> >> +        error_report("Invalid region size of VFIO migration region %d: %s",
> >> +                     migration->region.index, strerror(-ret));
> >> +        goto err;
> >> +    }
> >> +
> >> +    if (migration->region.buffer.mmaps) {
> >> +        ret = vfio_region_mmap(&migration->region.buffer);
> >> +        if (ret) {
> >> +            error_report("Failed to mmap VFIO migration region %d: %s",
> >> +                         migration->region.index, strerror(-ret));
> >> +            goto err;
> >> +        }
> >> +    }
> >> +
> >> +    return 0;
> >> +
> >> +err:
> >> +    vfio_migration_region_exit(vbasedev);
> >> +    return ret;
> >> +}
> >> +
> >> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t
> >> +state) {
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    VFIORegion *region = &migration->region.buffer;
> >> +    int ret = 0;
> >> +
> >> +    if (vbasedev->device_state == state) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    ret = pwrite(vbasedev->fd, &state, sizeof(state),
> >> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> >> +                                              device_state));
> >> +    if (ret < 0) {
> >> +        error_report("Failed to set migration state %d %s",
> >> +                     ret, strerror(errno));
> >> +        return ret;
> >> +    }
> >> +
> >> +    vbasedev->device_state = state;
> >> +    return ret;
> >> +}
> >> +
> >> +void vfio_get_dirty_page_list(VFIODevice *vbasedev,
> >> +                              uint64_t start_addr,
> >> +                              uint64_t pfn_count) {
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    VFIORegion *region = &migration->region.buffer;
> >> +    struct vfio_device_migration_info migration_info;
> >> +    uint64_t count = 0;
> >> +    int ret;
> >> +
> >> +    migration_info.dirty_pfns.start_addr = start_addr;
> >> +    migration_info.dirty_pfns.total = pfn_count;
> >> +
> >> +    ret = pwrite(vbasedev->fd, &migration_info.dirty_pfns,
> >> +                 sizeof(migration_info.dirty_pfns),
> >> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> >> +                                              dirty_pfns));
> >> +    if (ret < 0) {
> >> +        error_report("Failed to set dirty pages start address %d %s",
> >> +                ret, strerror(errno));
> >> +        return;
> >> +    }
> >> +
> >> +    do {
> >> +        /* Read dirty_pfns.copied */
> >> +        ret = pread(vbasedev->fd, &migration_info.dirty_pfns,
> >> +                sizeof(migration_info.dirty_pfns),
> >> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> >> +                                             dirty_pfns));
> >> +        if (ret < 0) {
> >> +            error_report("Failed to get dirty pages bitmap count %d %s",
> >> +                    ret, strerror(errno));
> >> +            return;
> >> +        }
> >> +
> >> +        if (migration_info.dirty_pfns.copied) {
> >> +            uint64_t bitmap_size;
> >> +            void *buf = NULL;
> >> +            bool buffer_mmaped = false;
> >> +
> >> +            bitmap_size = (BITS_TO_LONGS(migration_info.dirty_pfns.copied) + 1)
> >> +                           * sizeof(unsigned long);
> >> +
> >> +            if (region->mmaps) {
> >> +                int i;
> >> +
> >> +                for (i = 0; i < region->nr_mmaps; i++) {
> >> +                    if (region->mmaps[i].size >= bitmap_size) {
> >> +                        buf = region->mmaps[i].mmap;
> >> +                        buffer_mmaped = true;
> >> +                        break;
> >> +                    }
> >> +                }
> >> +            }
> > What if mmapped data area is in front of mmaped dirty bit area?
> > Maybe you need to record dirty bit region's index as what does for data region.
> >
> 
> No, data section is not different than dirty bit area.
> Migration region is like:
>  ----------------------------------------------------------------------
> |vfio_device_migration_info|    data section			      |	
> |                          |     ///////////////////////////////////  |
>  ----------------------------------------------------------------------
>  ^			         ^                                 ^
>  offset 0-trapped part         data.offset                     data.size
> 
> Same data section is used to copy vendor driver data during save and
> during dirty page bitmap query.
Right, I know data section is the same as dirty bit section, as they are
both mmaped. but it's not right to assume dirty bit section to be the first
one to found from region->mmaps[i].mmap.

> 
> >> +
> >> +            if (!buffer_mmaped) {
> >> +                buf = g_malloc0(bitmap_size);
> >> +
> >> +                ret = pread(vbasedev->fd, buf, bitmap_size,
> >> +                            region->fd_offset + sizeof(migration_info) + 1);
Ditto. it's not right to assume dirty bit section starting right after
migration_info.

> >> +                if (ret != bitmap_size) {
> >> +                    error_report("Failed to get migration data %d", ret);
> >> +                    g_free(buf);
> >> +                    return;
> >> +                }
> >> +            }
> >> +
> >> +            cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
> >> +                                        start_addr + (count * TARGET_PAGE_SIZE),
> >> +                                        migration_info.dirty_pfns.copied);
> >> +            count +=  migration_info.dirty_pfns.copied;
> >> +
> >> +            if (!buffer_mmaped) {
> >> +                g_free(buf);
> >> +            }
> >> +        }
> >> +    } while (count < migration_info.dirty_pfns.total); }
> >> +
> >> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque) {
> >> +    VFIODevice *vbasedev = opaque;
> >> +
> >> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
> >> +
> >> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> >> +        vfio_pci_save_config(vbasedev, f);
> >> +    }
> >> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> >> +
> >> +    return qemu_file_get_error(f);
> >> +}
> >> +
> >> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque) {
> >> +    VFIODevice *vbasedev = opaque;
> >> +
> >> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> >> +        vfio_pci_load_config(vbasedev, f);
> >> +    }
> >> +
> >> +    if (qemu_get_be64(f) != VFIO_MIG_FLAG_END_OF_STATE) {
> >> +        error_report("Wrong end of block while loading device config space");
> >> +        return -EINVAL;
> >> +    }
> >> +
> >> +    return qemu_file_get_error(f);
> >> +}
> >> +
> > What's the purpose to add a tailing VFIO_MIG_FLAG_END_OF_STATE for each section? 
> > For compatibility check?
> > Maybe a version id or magic in struct vfio_device_migration_info is more appropriate?
> > 
> 
> No, this is to identify the section end during resume, see
> vfio_load_state().
> 
as long as save/load are of the same version, it's not necessary to add a
taling ending mark.

> > 
> >> +/*
> >> +----------------------------------------------------------------------
> >> +*/
> >> +
> >> +static bool vfio_is_active_iterate(void *opaque) {
> >> +    VFIODevice *vbasedev = opaque;
> >> +
> >> +    if (vbasedev->vm_running && vbasedev->migration &&
> >> +        (vbasedev->migration->pending_precopy_only != 0))
> >> +        return true;
> >> +
> >> +    if (!vbasedev->vm_running && vbasedev->migration &&
> >> +        (vbasedev->migration->pending_postcopy != 0))
> >> +        return true;
> >> +
> >> +    return false;
> >> +}
> >> +
> >> +static int vfio_save_setup(QEMUFile *f, void *opaque) {
> >> +    VFIODevice *vbasedev = opaque;
> >> +    int ret;
> >> +
> >> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> >> +
> >> +    qemu_mutex_lock_iothread();
> >> +    ret = vfio_migration_region_init(vbasedev);
> >> +    qemu_mutex_unlock_iothread();
> >> +    if (ret) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> >> +
> >> +    ret = qemu_file_get_error(f);
> >> +    if (ret) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev) {
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    VFIORegion *region = &migration->region.buffer;
> >> +    struct vfio_device_migration_info migration_info;
> >> +    int ret;
> >> +
> >> +    ret = pread(vbasedev->fd, &migration_info.data,
> >> +                sizeof(migration_info.data),
> >> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> >> +                                             data));
> >> +    if (ret != sizeof(migration_info.data)) {
> >> +        error_report("Failed to get migration buffer information %d",
> >> +                     ret);
> >> +        return -EINVAL;
> >> +    }
> >> +
> >> +    if (migration_info.data.size) {
> >> +        void *buf = NULL;
> >> +        bool buffer_mmaped = false;
> >> +
> >> +        if (region->mmaps) {
> >> +            int i;
> >> +
> >> +            for (i = 0; i < region->nr_mmaps; i++) {
> >> +                if (region->mmaps[i].offset == migration_info.data.offset) {
> >> +                    buf = region->mmaps[i].mmap;
> >> +                    buffer_mmaped = true;
> >> +                    break;
> >> +                }
> >> +            }
> >> +        }
> >> +
> >> +        if (!buffer_mmaped) {
> >> +            buf = g_malloc0(migration_info.data.size);
> >> +            ret = pread(vbasedev->fd, buf, migration_info.data.size,
> >> +                        region->fd_offset + migration_info.data.offset);
> >> +            if (ret != migration_info.data.size) {
> >> +                error_report("Failed to get migration data %d", ret);
> >> +                return -EINVAL;
> >> +            }
> >> +        }
> >> +
> >> +        qemu_put_be64(f, migration_info.data.size);
> >> +        qemu_put_buffer(f, buf, migration_info.data.size);
> >> +
> >> +        if (!buffer_mmaped) {
> >> +            g_free(buf);
> >> +        }
> >> +
> >> +    } else {
> >> +        qemu_put_be64(f, migration_info.data.size);
> >> +    }
> >> +
> >> +    ret = qemu_file_get_error(f);
> >> +    if (ret) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    return migration_info.data.size;
> >> +}
> >> +
> >> +static int vfio_save_iterate(QEMUFile *f, void *opaque) {
> >> +    VFIODevice *vbasedev = opaque;
> >> +    int ret;
> >> +
> >> +    ret = vfio_save_buffer(f, vbasedev);
> >> +    if (ret < 0) {
> >> +        error_report("vfio_save_buffer failed %s",
> >> +                     strerror(errno));
> >> +        return ret;
> >> +    }
> >> +
> >> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> >> +
> >> +    ret = qemu_file_get_error(f);
> >> +    if (ret) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static void vfio_update_pending(VFIODevice *vbasedev, uint64_t
> >> +threshold_size) {
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    VFIORegion *region = &migration->region.buffer;
> >> +    struct vfio_device_migration_info migration_info;
> >> +    int ret;
> >> +
> >> +    ret = pwrite(vbasedev->fd, &threshold_size, sizeof(threshold_size),
> >> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> >> +                                              pending.threshold_size));
> >> +    if (ret < 0) {
> >> +        error_report("Failed to set threshold size %d %s",
> >> +                     ret, strerror(errno));
> >> +        return;
> >> +    }
> >> +
> >> +    ret = pread(vbasedev->fd, &migration_info.pending,
> >> +                sizeof(migration_info.pending),
> >> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> >> +                                             pending));
> >> +    if (ret != sizeof(migration_info.pending)) {
> >> +        error_report("Failed to get pending bytes %d", ret);
> >> +        return;
> >> +    }
> >> +
> >> +    migration->pending_precopy_only = migration_info.pending.precopy_only;
> >> +    migration->pending_compatible = migration_info.pending.compatible;
> >> +    migration->pending_postcopy = migration_info.pending.postcopy_only;
> >> +
> >> +    return;
> >> +}
> >> +
> >> +static void vfio_save_pending(QEMUFile *f, void *opaque,
> >> +                              uint64_t threshold_size,
> >> +                              uint64_t *res_precopy_only,
> >> +                              uint64_t *res_compatible,
> >> +                              uint64_t *res_postcopy_only) {
> >> +    VFIODevice *vbasedev = opaque;
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +
> >> +    vfio_update_pending(vbasedev, threshold_size);
> >> +
> >> +    *res_precopy_only += migration->pending_precopy_only;
> >> +    *res_compatible += migration->pending_compatible;
> >> +    *res_postcopy_only += migration->pending_postcopy; }
> >> +
> >> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) {
> >> +    VFIODevice *vbasedev = opaque;
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    MigrationState *ms = migrate_get_current();
> >> +    int ret;
> >> +
> >> +    if (vbasedev->vm_running) {
> >> +        vbasedev->vm_running = 0;
> >> +    }
> >> +
> >> +    ret = vfio_migration_set_state(vbasedev,
> >> +                                   VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY);
> >> +    if (ret) {
> >> +        error_report("Failed to set state STOPNCOPY_ACTIVE");
> >> +        return ret;
> >> +    }
> >> +
> >> +    ret = vfio_save_device_config_state(f, opaque);
> >> +    if (ret) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    do {
> >> +        vfio_update_pending(vbasedev, ms->threshold_size);
> >> +
> >> +        if (vfio_is_active_iterate(opaque)) {
> >> +            ret = vfio_save_buffer(f, vbasedev);
> >> +            if (ret < 0) {
> >> +                error_report("Failed to save buffer");
> >> +                break;
> >> +            } else if (ret == 0) {
> >> +                break;
> >> +            }
> >> +        }
> >> +    } while ((migration->pending_compatible +
> >> + migration->pending_postcopy) > 0);
> >> +
> > 
> > if migration->pending_postcopy is not 0, vfio_save_complete_precopy() cannot finish? Is that true?
> > But vfio_save_complete_precopy() does not need to copy post copy data.
> > 
> >
> 
> This is stop and copy phase, that is pre-copy phase ended and vCPUs are
> stopped. So all data for device should be copied in this callback.
postcopy data is not requried to be copied in stop-and-copy phase.

> 
> > 
> >> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> >> +
> >> +    ret = qemu_file_get_error(f);
> >> +    if (ret) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    ret = vfio_migration_set_state(vbasedev,
> >> +                                   VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED);
> >> +    if (ret) {
> >> +        error_report("Failed to set state SAVE_COMPLETED");
> >> +        return ret;
> >> +    }
> >> +    return ret;
> >> +}
> >> +
> >> +static void vfio_save_cleanup(void *opaque) {
> >> +    VFIODevice *vbasedev = opaque;
> >> +
> >> +    vfio_migration_region_exit(vbasedev);
> >> +}
> >> +
> >> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id) {
> >> +    VFIODevice *vbasedev = opaque;
> >> +    int ret;
> >> +    uint64_t data;
> >> +
> > Need to use version_id to check source and target version.
> > 
> 
> Good point. I'll add that.
> 
> >> +    data = qemu_get_be64(f);
> >> +    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
> >> +        if (data == VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
> >> +            ret = vfio_load_device_config_state(f, opaque);
> >> +            if (ret) {
> >> +                return ret;
> >> +            }
> >> +        } else if (data == VFIO_MIG_FLAG_DEV_SETUP_STATE) {
> >> +            data = qemu_get_be64(f);
> >> +            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
> >> +                return 0;
> >> +            } else {
> >> +                error_report("SETUP STATE: EOS not found 0x%lx", data);
> >> +                return -EINVAL;
> >> +            }
> >> +        } else if (data != 0) {
> >> +            VFIOMigration *migration = vbasedev->migration;
> >> +            VFIORegion *region = &migration->region.buffer;
> >> +            struct vfio_device_migration_info migration_info;
> >> +            void *buf = NULL;
> >> +            bool buffer_mmaped = false;
> >> +
> >> +            migration_info.data.size = data;
> >> +
> >> +            if (region->mmaps) {
> >> +                int i;
> >> +
> >> +                for (i = 0; i < region->nr_mmaps; i++) {
> >> +                    if (region->mmaps[i].mmap &&
> >> +                        (region->mmaps[i].size >= data)) {
> >> +                        buf = region->mmaps[i].mmap;
> >> +                        migration_info.data.offset = region->mmaps[i].offset;
> >> +                        buffer_mmaped = true;
> >> +                        break;
> >> +                    }
> >> +                }
> >> +            }
> >> +
> >> +            if (!buffer_mmaped) {
> >> +                buf = g_malloc0(migration_info.data.size);
> >> +                migration_info.data.offset = sizeof(migration_info) + 1;
> >> +            }
> >> +
> >> +            qemu_get_buffer(f, buf, data);
> >> +
> >> +            ret = pwrite(vbasedev->fd, &migration_info.data,
> >> +                         sizeof(migration_info.data),
> >> +                         region->fd_offset +
> >> +                         offsetof(struct vfio_device_migration_info, data));
> >> +            if (ret != sizeof(migration_info.data)) {
> >> +                error_report("Failed to set migration buffer information %d",
> >> +                        ret);
> >> +                return -EINVAL;
> >> +            }
> >> +
> >> +            if (!buffer_mmaped) {
> >> +                ret = pwrite(vbasedev->fd, buf, migration_info.data.size,
> >> +                             region->fd_offset + migration_info.data.offset);
> >> +                g_free(buf);
> >> +
> >> +                if (ret != migration_info.data.size) {
> >> +                    error_report("Failed to set migration buffer %d", ret);
> >> +                    return -EINVAL;
> >> +                }
> >> +            }
> >> +        }
> >> +
> >> +        ret = qemu_file_get_error(f);
> >> +        if (ret) {
> >> +            return ret;
> >> +        }
> >> +        data = qemu_get_be64(f);
> >> +    }
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +static int vfio_load_setup(QEMUFile *f, void *opaque) {
> >> +    VFIODevice *vbasedev = opaque;
> >> +    int ret;
> >> +
> >> +    ret = vfio_migration_set_state(vbasedev,
> >> +                                    VFIO_DEVICE_STATE_MIGRATION_RESUME);
> >> +    if (ret) {
> >> +        error_report("Failed to set state RESUME");
> >> +    }
> >> +
> >> +    ret = vfio_migration_region_init(vbasedev);
> >> +    if (ret) {
> >> +        error_report("Failed to initialise migration region");
> >> +        return ret;
> >> +    }
> >> +
> >> +    return 0;
> >> +}
> >> +
> > Why vfio_migration_set_state() is in front of  vfio_migration_region_init()?
> > So VFIO_DEVICE_STATE_MIGRATION_RESUME is really useful? :)
> > 
> > 
> 
> Yes, how will kernel driver know that resume has started?
> 
> 
My point is that the migration region is setup in
vfio_migration_region_init(), while vfio_migration_set_state() depends on
the migration region to pass commands into the vendor driver.
so, before vfio_migration_region_init() is called, the calling of
vfio_migration_set_state() cannot succeed.


> >> +static int vfio_load_cleanup(void *opaque) {
> >> +    VFIODevice *vbasedev = opaque;
> >> +    int ret = 0;
> >> +
> >> +    ret = vfio_migration_set_state(vbasedev,
> >> +                                 VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED);
> >> +    if (ret) {
> >> +        error_report("Failed to set state RESUME_COMPLETED");
> >> +    }
> >> +
> >> +    vfio_migration_region_exit(vbasedev);
> >> +    return ret;
> >> +}
> >> +
> >> +static SaveVMHandlers savevm_vfio_handlers = {
> >> +    .save_setup = vfio_save_setup,
> >> +    .save_live_iterate = vfio_save_iterate,
> >> +    .save_live_complete_precopy = vfio_save_complete_precopy,
> >> +    .save_live_pending = vfio_save_pending,
> >> +    .save_cleanup = vfio_save_cleanup,
> >> +    .load_state = vfio_load_state,
> >> +    .load_setup = vfio_load_setup,
> >> +    .load_cleanup = vfio_load_cleanup,
> >> +    .is_active_iterate = vfio_is_active_iterate, };
> >> +
> >> +static void vfio_vmstate_change(void *opaque, int running, RunState
> >> +state) {
> >> +    VFIODevice *vbasedev = opaque;
> >> +
> >> +    if ((vbasedev->vm_running != running) && running) {
> >> +        int ret;
> >> +
> >> +        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING);
> >> +        if (ret) {
> >> +            error_report("Failed to set state RUNNING");
> >> +        }
> >> +    }
> >> +
> >> +    vbasedev->vm_running = running;
> >> +}
> >> +
> > vfio_vmstate_change() is registered at initialization, so for source vm 
> > vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING) will be called on vm start.
> > but vfio_migration_region_init() is called in save_setup() when migration starts,
> > as a result,  "vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING)" should not function here for source vm.
> > So, again, is this state really used? :)
> > 
> > 
> 
> vfio_vmstate_change() is not only called at VM start, this also gets
> called after resume is done. Kernel driver need to know after resume
> that vCPUs are running.
> 
so at the source vm side, on vm start, vfio_vmstate_change() and
vfio_migration_set_state(VFIO_DEVICE_STATE_RUNNING) are called.
However, at that stage, the vfio_migration_set_state() cannot succeed,
because migration regions are not setup at that stage. Migration regions
are setup in save_setup() when migration starts.



> Thanks,
> Kirti
> 
> >> +static void vfio_migration_state_notifier(Notifier *notifier, void
> >> +*data) {
> >> +    MigrationState *s = data;
> >> +    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
> >> +    int ret;
> >> +
> >> +    switch (s->state) {
> >> +    case MIGRATION_STATUS_SETUP:
> >> +        ret = vfio_migration_set_state(vbasedev,
> >> +                                       VFIO_DEVICE_STATE_MIGRATION_SETUP);
> >> +        if (ret) {
> >> +            error_report("Failed to set state SETUP");
> >> +        }
> >> +        return;
> >> +
> >> +    case MIGRATION_STATUS_ACTIVE:
> >> +        if (vbasedev->device_state == VFIO_DEVICE_STATE_MIGRATION_SETUP) {
> >> +            if (vbasedev->vm_running) {
> >> +                ret = vfio_migration_set_state(vbasedev,
> >> +                                          VFIO_DEVICE_STATE_MIGRATION_PRECOPY);
> >> +                if (ret) {
> >> +                    error_report("Failed to set state PRECOPY_ACTIVE");
> >> +                }
> >> +            } else {
> >> +                ret = vfio_migration_set_state(vbasedev,
> >> +                                        VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY);
> >> +                if (ret) {
> >> +                    error_report("Failed to set state STOPNCOPY_ACTIVE");
> >> +                }
> >> +            }
> >> +        } else {
> >> +            ret = vfio_migration_set_state(vbasedev,
> >> +                                           VFIO_DEVICE_STATE_MIGRATION_RESUME);
> >> +            if (ret) {
> >> +                error_report("Failed to set state RESUME");
> >> +            }
> >> +        }
> >> +        return;
> >> +
> >> +    case MIGRATION_STATUS_CANCELLING:
> >> +    case MIGRATION_STATUS_CANCELLED:
> >> +        ret = vfio_migration_set_state(vbasedev,
> >> +                                       VFIO_DEVICE_STATE_MIGRATION_CANCELLED);
> >> +        if (ret) {
> >> +            error_report("Failed to set state CANCELLED");
> >> +        }
> >> +        return;
> >> +
> >> +    case MIGRATION_STATUS_FAILED:
> >> +        ret = vfio_migration_set_state(vbasedev,
> >> +                                       VFIO_DEVICE_STATE_MIGRATION_FAILED);
> >> +        if (ret) {
> >> +            error_report("Failed to set state FAILED");
> >> +        }
> >> +        return;
> >> +    }
> >> +}
> >> +
> >> +static int vfio_migration_init(VFIODevice *vbasedev,
> >> +                               struct vfio_region_info *info) {
> >> +    vbasedev->migration = g_new0(VFIOMigration, 1);
> >> +    vbasedev->migration->region.index = info->index;
> >> +
> >> +    register_savevm_live(NULL, "vfio", -1, 1, &savevm_vfio_handlers, vbasedev);
> >> +    vbasedev->vm_state =
> >> qemu_add_vm_change_state_handler(vfio_vmstate_change,
> >> +                                                          vbasedev);
> >> +
> >> +    vbasedev->migration_state.notify = vfio_migration_state_notifier;
> >> +    add_migration_state_change_notifier(&vbasedev->migration_state);
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +
> >> +/*
> >> +----------------------------------------------------------------------
> >> +*/
> >> +
> >> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp) {
> >> +    struct vfio_region_info *info;
> >> +    int ret;
> >> +
> >> +    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
> >> +                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
> >> +    if (ret) {
> >> +        Error *local_err = NULL;
> >> +
> >> +        error_setg(&vbasedev->migration_blocker,
> >> +                   "VFIO device doesn't support migration");
> >> +        ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
> >> +        if (local_err) {
> >> +            error_propagate(errp, local_err);
> >> +            error_free(vbasedev->migration_blocker);
> >> +            return ret;
> >> +        }
> >> +    } else {
> >> +        return vfio_migration_init(vbasedev, info);
> >> +    }
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +void vfio_migration_finalize(VFIODevice *vbasedev) {
> >> +    if (!vbasedev->migration) {
> >> +        return;
> >> +    }
> >> +
> >> +    if (vbasedev->vm_state) {
> >> +        qemu_del_vm_change_state_handler(vbasedev->vm_state);
> >> +        remove_migration_state_change_notifier(&vbasedev->migration_state);
> >> +    }
> >> +
> >> +    if (vbasedev->migration_blocker) {
> >> +        migrate_del_blocker(vbasedev->migration_blocker);
> >> +        error_free(vbasedev->migration_blocker);
> >> +    }
> >> +
> >> +    g_free(vbasedev->migration);
> >> +}
> >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >> index a9036929b220..ab8217c9e249 100644
> >> --- a/include/hw/vfio/vfio-common.h
> >> +++ b/include/hw/vfio/vfio-common.h
> >> @@ -30,6 +30,8 @@
> >>  #include <linux/vfio.h>
> >>  #endif
> >>
> >> +#include "sysemu/sysemu.h"
> >> +
> >>  #define ERR_PREFIX "vfio error: %s: "
> >>  #define WARN_PREFIX "vfio warning: %s: "
> >>
> >> @@ -57,6 +59,16 @@ typedef struct VFIORegion {
> >>      uint8_t nr; /* cache the region number for debug */  } VFIORegion;
> >>
> >> +typedef struct VFIOMigration {
> >> +    struct {
> >> +        VFIORegion buffer;
> >> +        uint32_t index;
> >> +    } region;
> >> +    uint64_t pending_precopy_only;
> >> +    uint64_t pending_compatible;
> >> +    uint64_t pending_postcopy;
> >> +} VFIOMigration;
> >> +
> >>  typedef struct VFIOAddressSpace {
> >>      AddressSpace *as;
> >>      QLIST_HEAD(, VFIOContainer) containers; @@ -116,6 +128,12 @@ typedef
> >> struct VFIODevice {
> >>      unsigned int num_irqs;
> >>      unsigned int num_regions;
> >>      unsigned int flags;
> >> +    uint32_t device_state;
> >> +    VMChangeStateEntry *vm_state;
> >> +    int vm_running;
> >> +    Notifier migration_state;
> >> +    VFIOMigration *migration;
> >> +    Error *migration_blocker;
> >>  } VFIODevice;
> >>
> >>  struct VFIODeviceOps {
> >> @@ -193,4 +211,9 @@ int vfio_spapr_create_window(VFIOContainer
> >> *container,  int vfio_spapr_remove_window(VFIOContainer *container,
> >>                               hwaddr offset_within_address_space);
> >>
> >> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp); void
> >> +vfio_migration_finalize(VFIODevice *vbasedev); void
> >> +vfio_get_dirty_page_list(VFIODevice *vbasedev, uint64_t start_addr,
> >> +                               uint64_t pfn_count);
> >> +
> >>  #endif /* HW_VFIO_VFIO_COMMON_H */
> >> --
> >> 2.7.0
> >>
> > 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface
  2018-11-20 20:39 ` [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface Kirti Wankhede
                     ` (2 preceding siblings ...)
  2018-11-22 18:54   ` Dr. David Alan Gilbert
@ 2018-11-23  5:47   ` Zhao Yan
  2018-11-27 19:52   ` Alex Williamson
  4 siblings, 0 replies; 32+ messages in thread
From: Zhao Yan @ 2018-11-23  5:47 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, Ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, alex.williamson, eauger, felipe, jonathan.davies,
	Changpeng.liu, Ken.Xue

On Wed, Nov 21, 2018 at 04:39:39AM +0800, Kirti Wankhede wrote:
> - Defined MIGRATION region type and sub-type.
> - Defined VFIO device states during migration process.
> - Defined vfio_device_migration_info structure which will be placed at 0th
>   offset of migration region to get/set VFIO device related information.
>   Defined actions and members of structure usage for each action:
>     * To convey VFIO device state to be transitioned to.
>     * To get pending bytes yet to be migrated for VFIO device
>     * To ask driver to write data to migration region and return number of bytes
>       written in the region
>     * In migration resume path, user space app writes to migration region and
>       communicates it to vendor driver.
>     * Get bitmap of dirty pages from vendor driver from given start address
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  linux-headers/linux/vfio.h | 130 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 130 insertions(+)
> 
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 3615a269d378..a6e45cb2cae2 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -301,6 +301,10 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG (2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG  (3)
> 
> +/* Migration region type and sub-type */
> +#define VFIO_REGION_TYPE_MIGRATION             (1 << 30)
> +#define VFIO_REGION_SUBTYPE_MIGRATION          (1)
> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within
> @@ -602,6 +606,132 @@ struct vfio_device_ioeventfd {
> 
>  #define VFIO_DEVICE_IOEVENTFD          _IO(VFIO_TYPE, VFIO_BASE + 16)
> 
> +/**
> + * VFIO device states :
> + * VFIO User space application should set the device state to indicate vendor
> + * driver in which state the VFIO device should transitioned.
> + * - VFIO_DEVICE_STATE_NONE:
> + *   State when VFIO device is initialized but not yet running.
> + * - VFIO_DEVICE_STATE_RUNNING:
> + *   Transition VFIO device in running state, that is, user space application or
> + *   VM is active.
> + * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
> + *   Transition VFIO device in migration setup state. This is used to prepare
> + *   VFIO device for migration while application or VM and vCPUs are still in
> + *   running state.
> + * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
> + *   When VFIO user space application or VM is active and vCPUs are running,
> + *   transition VFIO device in pre-copy state.
> + * - VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY:
> + *   When VFIO user space application or VM is stopped and vCPUs are halted,
> + *   transition VFIO device in stop-and-copy state.
> + * - VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED:
> + *   When VFIO user space application has copied data provided by vendor driver.
> + *   This state is used by vendor driver to clean up all software state that was
> + *   setup during MIGRATION_SETUP state.
> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME:
> + *   Transition VFIO device to resume state, that is, start resuming VFIO device
> + *   when user space application or VM is not running and vCPUs are halted.
> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED:
> + *   When user space application completes iterations of providing device state
> + *   data, transition device in resume completed state.
> + * - VFIO_DEVICE_STATE_MIGRATION_FAILED:
> + *   Migration process failed due to some reason, transition device to failed
> + *   state. If migration process fails while saving at source, resume device at
> + *   source. If migration process fails while resuming application or VM at
> + *   destination, stop restoration at destination and resume at source.
> + * - VFIO_DEVICE_STATE_MIGRATION_CANCELLED:
> + *   User space application has cancelled migration process either for some
> + *   known reason or due to user's intervention. Transition device to Cancelled
> + *   state, that is, resume device state as it was during running state at
> + *   source.
> + */
> +
> +enum {
> +    VFIO_DEVICE_STATE_NONE,
> +    VFIO_DEVICE_STATE_RUNNING,
> +    VFIO_DEVICE_STATE_MIGRATION_SETUP,
> +    VFIO_DEVICE_STATE_MIGRATION_PRECOPY,
> +    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY,
> +    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
> +    VFIO_DEVICE_STATE_MIGRATION_RESUME,
> +    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
> +    VFIO_DEVICE_STATE_MIGRATION_FAILED,
> +    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
> +};
> +
> +/**
> + * Structure vfio_device_migration_info is placed at 0th offset of
> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> + * information.
> + *
> + * Action Set state:
> + *      To tell vendor driver the state VFIO device should be transitioned to.
> + *      device_state [input] : User space app sends device state to vendor
> + *           driver on state change, the state to which VFIO device should be
> + *           transitioned to.
> + *
> + * Action Get pending bytes:
> + *      To get pending bytes yet to be migrated from vendor driver
> + *      pending.threshold_size [Input] : threshold of buffer in User space app.
> + *      pending.precopy_only [output] : pending data which must be migrated in
> + *          precopy phase or in stopped state, in other words - before target
> + *          user space application or VM start. In case of migration, this
> + *          indicates pending bytes to be transfered while application or VM or
> + *          vCPUs are active and running.
> + *      pending.compatible [output] : pending data which may be migrated any
> + *          time , either when application or VM is active and vCPUs are active
> + *          or when application or VM is halted and vCPUs are halted.
> + *      pending.postcopy_only [output] : pending data which must be migrated in
> + *           postcopy phase or in stopped state, in other words - after source
> + *           application or VM stopped and vCPUs are halted.
> + *      Sum of pending.precopy_only, pending.compatible and
> + *      pending.postcopy_only is the whole amount of pending data.
> + *
> + * Action Get buffer:
> + *      On this action, vendor driver should write data to migration region and
> + *      return number of bytes written in the region.
> + *      data.offset [output] : offset in the region from where data is written.
> + *      data.size [output] : number of bytes written in migration buffer by
> + *          vendor driver.
suggest to add flag like restore-iteration/restore-complete to GET_BUFFER action.
Avoid to let vendor driver keep various qemu migration states


> + * Action Set buffer:
> + *      In migration resume path, user space app writes to migration region and
> + *      communicates it to vendor driver with this action.
> + *      data.offset [Input] : offset in the region from where data is written.
> + *      data.size [Input] : number of bytes written in migration buffer by
> + *          user space app.
suggest to add flag like precopy/stop-and-copy to SET_BUFFER action.
Avoid to let vendor driver keep various qemu migration states

> + *
> + * Action Get dirty pages bitmap:
> + *      Get bitmap of dirty pages from vendor driver from given start address.
> + *      dirty_pfns.start_addr [Input] : start address
> + *      dirty_pfns.total [Input] : Total pfn count from start_addr for which
> + *          dirty bitmap is requested
> + *      dirty_pfns.copied [Output] : pfn count for which dirty bitmap is copied
> + *          to migration region.
> + *      Vendor driver should copy the bitmap with bits set only for pages to be
> + *      marked dirty in migration region.
> + */
> +
> +struct vfio_device_migration_info {
> +        __u32 device_state;         /* VFIO device state */
> +        struct {
> +            __u64 precopy_only;
> +            __u64 compatible;
> +            __u64 postcopy_only;
> +            __u64 threshold_size;
> +        } pending;
> +        struct {
> +            __u64 offset;           /* offset */
> +            __u64 size;             /* size */
> +        } data;
> +        struct {
> +            __u64 start_addr;
> +            __u64 total;
> +            __u64 copied;
> +        } dirty_pfns;
> +} __attribute__((packed));
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
> 
>  /**
> --
> 2.7.0
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface
  2018-11-22 20:43     ` Kirti Wankhede
@ 2018-11-23 11:44       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 32+ messages in thread
From: Dr. David Alan Gilbert @ 2018-11-23 11:44 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, cjia, kevin.tian, ziye.yang, changpeng.liu,
	yi.l.liu, mlevitsk, eskultet, cohuck, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	zhi.a.wang, qemu-devel

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> 
> 
> On 11/23/2018 12:24 AM, Dr. David Alan Gilbert wrote:
> > * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> >> - Defined MIGRATION region type and sub-type.
> >> - Defined VFIO device states during migration process.
> >> - Defined vfio_device_migration_info structure which will be placed at 0th
> >>   offset of migration region to get/set VFIO device related information.
> >>   Defined actions and members of structure usage for each action:
> >>     * To convey VFIO device state to be transitioned to.
> >>     * To get pending bytes yet to be migrated for VFIO device
> >>     * To ask driver to write data to migration region and return number of bytes
> >>       written in the region
> >>     * In migration resume path, user space app writes to migration region and
> >>       communicates it to vendor driver.
> >>     * Get bitmap of dirty pages from vendor driver from given start address
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> > 
> > <snip>
> > 
> >> + * Action Get buffer:
> >> + *      On this action, vendor driver should write data to migration region and
> >> + *      return number of bytes written in the region.
> >> + *      data.offset [output] : offset in the region from where data is written.
> >> + *      data.size [output] : number of bytes written in migration buffer by
> >> + *          vendor driver.
> > 
> > <snip>
> > 
> >> + */
> >> +
> >> +struct vfio_device_migration_info {
> >> +        __u32 device_state;         /* VFIO device state */
> >> +        struct {
> >> +            __u64 precopy_only;
> >> +            __u64 compatible;
> >> +            __u64 postcopy_only;
> >> +            __u64 threshold_size;
> >> +        } pending;
> >> +        struct {
> >> +            __u64 offset;           /* offset */
> >> +            __u64 size;             /* size */
> >> +        } data;
> > 
> > I'm curious how the offsets/size work; how does the 
> > kernel driver know the maximum size of state it's allowed to write?
> 
> 
> Migration region looks like:
>  ----------------------------------------------------------------------
> |vfio_device_migration_info|    data section			      |	
> |                          |     ///////////////////////////////////  |
>  ----------------------------------------------------------------------
>  ^			         ^                                 ^
>  offset 0-trapped part         data.offset                     data.size
> 
> 
> Kernel driver defines the size of migration region and tells VFIO user
> space application (QEMU here) through VFIO_DEVICE_GET_REGION_INFO ioctl.
> So kernel driver can calculate the size of data section. Then kernel
> driver can have (data.size >= data section size) or (data.size < data
> section size), hence VFIO user space application need to know data.size
> to copy only relevant data.
> 
> > Why would it pick a none-0 offset into the output region?
> 
> Data section is always followed by vfio_device_migration_info structure
> in the region, so data.offset will always be none-0.
> Offset from where data is copied is decided by kernel driver, data
> section can be trapped or mapped depending on how kernel driver defines
> data section. If mmapped, then data.offset should be page aligned, where
> as initial section which contain vfio_device_migration_info structure
> might not end at offset which is page aligned.

Ah OK; I see - it wasn't clear to me which buffer we were talking about
here; so yes it makes sense if it's one the kernel had the control of.

Dave

> Thanks,
> Kirti
> 
> > Without having dug further these feel like i/o rather than just output;
> > i.e. the calling process says 'put it at that offset and you've got size
> > bytes' and the kernel replies with 'I did put it at offset and I wrote
> > only this size bytes'
> > 
> > Dave
> > 
> >> +        struct {
> >> +            __u64 start_addr;
> >> +            __u64 total;
> >> +            __u64 copied;
> >> +        } dirty_pfns;
> >> +} __attribute__((packed));
> >> +
> >>  /* -------- API for Type1 VFIO IOMMU -------- */
> >>  
> >>  /**
> >> -- 
> >> 2.7.0
> >>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface
  2018-11-22 20:01         ` Kirti Wankhede
@ 2018-11-26  7:14           ` Tian, Kevin
  0 siblings, 0 replies; 32+ messages in thread
From: Tian, Kevin @ 2018-11-26  7:14 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, cjia
  Cc: Yang, Ziye, Liu, Changpeng, Liu, Yi L, mlevitsk, eskultet,
	cohuck, dgilbert, jonathan.davies, eauger, aik, pasic, felipe,
	Zhengxiao.zx, shuangtai.tst, Ken.Xue, Wang, Zhi A, qemu-devel

> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> Sent: Friday, November 23, 2018 4:02 AM
> 
[...]
> >
> > I looked at the explanations in this patch, but still didn't get the intention,
> e.g.:
> >
> > + * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
> > + *   Transition VFIO device in migration setup state. This is used to
> prepare
> > + *   VFIO device for migration while application or VM and vCPUs are still
> in
> > + *   running state.
> >
> > what preparation is actually required? any example?
> 
> Each vendor driver can have different requirements as to how to prepare
> for migration. For example, this phase can be used to allocate buffer
> which can be mapped to MIGRATION region's data part, and allocating
> staging buffer. Driver might need to spawn thread which would start
> collecting data that need to be send during pre-copy phase.
> 
> >
> > + * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
> > + *   When VFIO user space application or VM is active and vCPUs are
> running,
> > + *   transition VFIO device in pre-copy state.
> >
> > why does device driver need know this stage? in precopy phase, the VM
> > is still running. Just dirty page tracking is in progress. the dirty bitmap
> could
> > be retrieved through its own action interface.
> >
> 
> All mdev devices are not similar. Pre-copy phase is not just about dirty
> page tracking. For devices which have memory on device could transfer
> data from that memory during pre-copy phase. For example, NVIDIA GPU
> has
> its own FB, so need to start sending FB data during pre-copy phase and
> then during stop and copy phase send data from FB which is marked dirty
> after that was copied in pre-copy phase. That helps to reduce total down
> time.

yes it makes sense, otherwise copying whole big FB at stop time is time
consuming. Curious, does Qemu already support pre-copy of device state
today, or is this series the 1st example to do that?

> 
> > you have code to demonstrate how those states are transitioned in Qemu,
> > but you didn't show evidence why those states are necessary in device
> side,
> > which leads to the puzzle whether the definition is over-killed and limiting.
> >
> 
> I'm trying to keep these interfaces generic for VFIO and mdev devices.
> Its difficult to define what vendor driver should do for each state,
> each vendor driver have their own requirements. Vendor drivers should
> decide whether to take any action on state transition or not.
> 
> > the flow in my mind is like below:
> >
> > 1. an interface to turn on/off dirty page tracking on VFIO device:
> > 	* vendor driver can do whatever required to enable device specific
> > dirty page tracking mechanism here
> > 	* device state is not changed here. still in running state
> >
> > 2. an interface to get dirty page bitmap
> >
> 
> I don't think there should be on/off interface for dirty page tracking.
> If there is a write access on dirty_pfns.start_addr and dirty_pfns.total
> and device_state >=VFIO_DEVICE_STATE_MIGRATION_SETUP &&
> device_state <=
> VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY then dirty page tracking has
> started, so return dirty page bitmap in data part of migration region.

dirty page tracking might be useful for other purposes, e.g. if people want
to just draw memory access pattern of a given VM. binding dirty tracking
to migration flow is limiting...

> 
> 
> > 3. an interface to start/stop device activity
> > 	* the effect of stop is to stop and drain in-the-fly device activities
> and
> > make device state ready for dump-out. vendor driver can do specific
> preparation
> > here
> 
> VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY is to stop the device, but as
> I
> mentioned above some vendor driver might have to do preparation before
> pre-copy phase starts.
> 
> > 	* the effect of start is to check validity of device state and then
> resume
> > device activities. again, vendor driver can do specific cleanup/preparation
> here
> >
> 
> That is VFIO_DEVICE_STATE_MIGRATION_RESUME.
> 
> Defined VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED and
> VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED states to cleanup
> all that
> which was allocated/mmapped/started thread during setup phase. This
> can
> be moved to transition to _RUNNING state. So if all agrees these states
> can be removed.
> 
> 
> > 4. an interface to save/restore device state
> > 	* should happen when device is stopped
> > 	* of course there is still an open how to check state compatibility as
> > Alex pointed earlier
> >
> 
> I hope above explains why other states are required.
> 

yes, above makes the whole picture much clearer. Thanks a lot!

Accordingly I'm thinking about whether below state definition could be
more general and extensible:

_STATE_NONE, indicates initial state
_STATE_RUNNING, indicates normal state
_STATE_STOPPED, indicates that device activities are fully stopped
_STATE_IN_TRACKING, indicates that device state can be r/w by user space.
this state can be ORed to RUNNING or STOPPED.

live migration could be implemented in below flow:

(at src side)
1. RUNNING -> {RUNNING | IN_TRACKING}
	* this switch does vendor specific preparation to make device
state accessible to user space (as covered by MIGRATION_SETUP)
	* vendor driver may let iterative read get incremental changes 
since last read (as covered by MIGRATION_PRECOPY). 	*open*, do we 
need an explicit flag to indicate such capability?
	* dirty page bitmap is also made available upon this change

2. (RUNNING | IN_TRACKING) -> (STOPPED | IN_TRACKING)
	* device is stopped thus device state is finalized
	* user space can read full device state, as defined for
MIGRATION_STOPNCOPY

3. (STOPPED | IN_TRACKING) -> (STOPPED)
	* device state tracking and dirty page tracking are cancelled. 
cleanup is done for resources setup in step 1. similar to MIGRATION_
SAVE_COMPLETED

4. STOPPED -> NONE, when device is reset later

(at dest side)

1. NONE -> (STOPPED | IN_TRACKING)
	* prepare device state region so user space can write
	* map to MIGRATION_RESUME
	* open: do we need both NONE and STOPPED, or just STOPPED?
2. (STOPPED | IN_TRACKING) -> STOPPED
	* clean up resources allocated in step 1
	* map to MIGRATION_RESUME_COMPLETED
3. STOPPED -> RUNNING
	* resume the device activities

compare to original definition, I think all important steps are covered:
+enum {
+    VFIO_DEVICE_STATE_NONE,
+    VFIO_DEVICE_STATE_RUNNING,
+    VFIO_DEVICE_STATE_MIGRATION_SETUP,
+    VFIO_DEVICE_STATE_MIGRATION_PRECOPY,
+    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY,
+    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
+    VFIO_DEVICE_STATE_MIGRATION_RESUME,
+    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
+    VFIO_DEVICE_STATE_MIGRATION_FAILED,
+    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
+};

FAILED is not a device state. It should be indicated in return value of set
state action.

CANCELLED can be achieved any time by clearing IN_TRACKING state.

with this new definition, above states can be also selectively used for
other purposes, e.g.:
 
1. user space can do RUNNING->STOPPED->RUNNING for any control reason,
w/o touching device state at all.

2. if someone wants to draw memory access pattern of a VM, it could
be done by RUNNING->(RUNNING | IN_TRACKING)->RUNNING, by reading
dirty bitmap when IN_TRACKING is active. Device state is ready but not 
accessed here, hope it is not a big burden.

Thoughts?

Thanks
Kevin


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface
  2018-11-20 20:39 ` [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface Kirti Wankhede
                     ` (3 preceding siblings ...)
  2018-11-23  5:47   ` Zhao Yan
@ 2018-11-27 19:52   ` Alex Williamson
  2018-12-04 10:53     ` Cornelia Huck
  4 siblings, 1 reply; 32+ messages in thread
From: Alex Williamson @ 2018-11-27 19:52 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	qemu-devel

On Wed, 21 Nov 2018 02:09:39 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> - Defined MIGRATION region type and sub-type.
> - Defined VFIO device states during migration process.
> - Defined vfio_device_migration_info structure which will be placed at 0th
>   offset of migration region to get/set VFIO device related information.
>   Defined actions and members of structure usage for each action:
>     * To convey VFIO device state to be transitioned to.
>     * To get pending bytes yet to be migrated for VFIO device
>     * To ask driver to write data to migration region and return number of bytes
>       written in the region
>     * In migration resume path, user space app writes to migration region and
>       communicates it to vendor driver.
>     * Get bitmap of dirty pages from vendor driver from given start address
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  linux-headers/linux/vfio.h | 130 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 130 insertions(+)
> 
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 3615a269d378..a6e45cb2cae2 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -301,6 +301,10 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>  
> +/* Migration region type and sub-type */
> +#define VFIO_REGION_TYPE_MIGRATION	        (1 << 30)
> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> +

I think this is copied from the vendor type, but I don't think it makes
sense here.  We reserve the top bit of the type to indicate a PCI
vendor type where the lower 16 bits are then the PCI vendor ID.  This
gives each vendor their own sub-type address space.  With the graphics
type we began our first type with (1).  I would expect migration to
then be type (2), not (1 << 30).

>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within
> @@ -602,6 +606,132 @@ struct vfio_device_ioeventfd {
>  
>  #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 16)

Reading ahead in the thread, I'm going to mention a lot of what Kevin
has already said here...
 
> +/**
> + * VFIO device states :
> + * VFIO User space application should set the device state to indicate vendor
> + * driver in which state the VFIO device should transitioned.
> + * - VFIO_DEVICE_STATE_NONE:
> + *   State when VFIO device is initialized but not yet running.
> + * - VFIO_DEVICE_STATE_RUNNING:
> + *   Transition VFIO device in running state, that is, user space application or
> + *   VM is active.

Is this backwards compatible?  A new device that supports migration
must be backwards compatible with old userspace that doesn't interact
with the migration region.  What happens if userspace never moves the
state to running?

> + * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
> + *   Transition VFIO device in migration setup state. This is used to prepare
> + *   VFIO device for migration while application or VM and vCPUs are still in
> + *   running state.

What does this imply to the device?  I thought we were going to
redefine these states in terms of what we expect the device to do.
These still seem like just a copy of the QEMU states which we discussed
are an internal reference that can change at any time.

> + * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
> + *   When VFIO user space application or VM is active and vCPUs are running,
> + *   transition VFIO device in pre-copy state.

Which means what?

> + * - VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY:
> + *   When VFIO user space application or VM is stopped and vCPUs are halted,
> + *   transition VFIO device in stop-and-copy state.

Is the device still running?  What happens to in-flight DMA?  Does it
wait?  We need a clear definition in terms of the device, not the VM.

> + * - VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED:
> + *   When VFIO user space application has copied data provided by vendor driver.
> + *   This state is used by vendor driver to clean up all software state that was
> + *   setup during MIGRATION_SETUP state.

When was the MIGRATION_SAVE_STARTED?

> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME:
> + *   Transition VFIO device to resume state, that is, start resuming VFIO device
> + *   when user space application or VM is not running and vCPUs are halted.

Are we simply restoring the we copied from the device after it was
stopped?

> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED:
> + *   When user space application completes iterations of providing device state
> + *   data, transition device in resume completed state.

So the device should start running here?

> + * - VFIO_DEVICE_STATE_MIGRATION_FAILED:
> + *   Migration process failed due to some reason, transition device to failed
> + *   state. If migration process fails while saving at source, resume device at
> + *   source. If migration process fails while resuming application or VM at
> + *   destination, stop restoration at destination and resume at source.

What does a failed device do?  Can't we simply not move to the
completed state?

> + * - VFIO_DEVICE_STATE_MIGRATION_CANCELLED:
> + *   User space application has cancelled migration process either for some
> + *   known reason or due to user's intervention. Transition device to Cancelled
> + *   state, that is, resume device state as it was during running state at
> + *   source.

Why do we need to tell the device this, can't we simply go back to
normal running state?

It seems to me that the default state needs to be RUNNING in order to
be backwards compatible.

I also agree with Kevin that it seems that dirty tracking is just an
augmented running state that we can turn on and off.  Shouldn't we also
define that dirty tracking can be optional?  For instance if a device
doesn't support dirty tracking, couldn't we stop the device first, then
save all memory, then retrieve the device state?  This is the other
problem with mirroring QEMU migration states, we don't account for
things that QEMU currently doesn't do.

Actually I'm wondering if we can distill everything down to two bits,
STOPPED and LOGGING.

We start at RUNNING, the user can optionally enable LOGGING when
supported by the device to cover the SETUP and PRECOPY states
proposed.  The device stays running, but activates any sort of
dirtly page tracking that's necessary to activate those interfaces.
LOGGING can also be cleared to handle the CANCELLED state.  The user
would set STOPPED which should quiesce the device and make the full
device state available through the device data section.  Clearing
STOPPED and LOGGING would handle the FAILED state below.  Likewise on
the migration target, QEMU would set the device top STOPPED in order to
write the incoming data via the data section and clear STOPPED to
indicate the device returns to RUNNING (aka RESUME/RESUME_COMPLETED).

> + */
> +
> +enum {
> +    VFIO_DEVICE_STATE_NONE,
> +    VFIO_DEVICE_STATE_RUNNING,
> +    VFIO_DEVICE_STATE_MIGRATION_SETUP,
> +    VFIO_DEVICE_STATE_MIGRATION_PRECOPY,
> +    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY,
> +    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
> +    VFIO_DEVICE_STATE_MIGRATION_RESUME,
> +    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
> +    VFIO_DEVICE_STATE_MIGRATION_FAILED,
> +    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
> +};
> +
> +/**
> + * Structure vfio_device_migration_info is placed at 0th offset of
> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> + * information.

Should include a note that field accesses are only supported at their
native width and alignment, anything else is unsupported and should
generate an error.  I don't see a good reason to bloat the supporting
code to handle anything else.

> + *
> + * Action Set state:
> + *      To tell vendor driver the state VFIO device should be transitioned to.
> + *      device_state [input] : User space app sends device state to vendor
> + *           driver on state change, the state to which VFIO device should be
> + *           transitioned to.

It should be noted that the write will return error on a transition
failure.  The write is also synchronous, ie. when the device is asked
to stop, any in-flight DMA will be completed before the write returns.

> + *
> + * Action Get pending bytes:
> + *      To get pending bytes yet to be migrated from vendor driver
> + *      pending.threshold_size [Input] : threshold of buffer in User space app.

I still don't see why the kernel needs to be concerned with the size of
the user buffer.  The vendor driver will already know the size of the
user read, won't the user try to fill their buffer in a single read?
Infer the size.  Maybe you really want a minimum read size if you want
to package your vendor data in some way?  ie. what minimum size must the
user use to avoid getting -ENOSPC return.

> + *      pending.precopy_only [output] : pending data which must be migrated in
> + *          precopy phase or in stopped state, in other words - before target
> + *          user space application or VM start. In case of migration, this
> + *          indicates pending bytes to be transfered while application or VM or
> + *          vCPUs are active and running.

What sort of data is included here?  Is this mostly a compatibility
check?  I think that needs to be an explicit part of the interface, not
something we simply assume the vendor driver handles (though it's
welcome to do additional checking).  What other device state is valid
to be saved prior to stopping the device?

> + *      pending.compatible [output] : pending data which may be migrated any
> + *          time , either when application or VM is active and vCPUs are active
> + *          or when application or VM is halted and vCPUs are halted.

Again, what sort of data is included here?  If it's live device data,
like a framebuffer, shouldn't it be handled by the dirty page tracking
interface?  Is this meant to do dirty tracking within device memory?
Should we formalize that?

> + *      pending.postcopy_only [output] : pending data which must be migrated in
> + *           postcopy phase or in stopped state, in other words - after source
> + *           application or VM stopped and vCPUs are halted.
> + *      Sum of pending.precopy_only, pending.compatible and
> + *      pending.postcopy_only is the whole amount of pending data.

It seems the user is able to stop the device at any point in time, what
do these values indicate then?  Shouldn't there be just one value then?

Can't we do all of this with just a save_bytes_available value?  When
the device is RUNNING this value could be dynamic (if the vendor driver
supports data in that phase... and we understand how to consume it),
when the device is stopped, it could update with each read.

> + *
> + * Action Get buffer:
> + *      On this action, vendor driver should write data to migration region and
> + *      return number of bytes written in the region.
> + *      data.offset [output] : offset in the region from where data is written.
> + *      data.size [output] : number of bytes written in migration buffer by
> + *          vendor driver.

If we know the pending bytes, why do we need this?  Isn't the read
itself the indication to prepare the data to be read?  Does the user
really ever need to start a read from anywhere other than the starting
offset of the data section?

> + *
> + * Action Set buffer:
> + *      In migration resume path, user space app writes to migration region and
> + *      communicates it to vendor driver with this action.
> + *      data.offset [Input] : offset in the region from where data is written.
> + *      data.size [Input] : number of bytes written in migration buffer by
> + *          user space app.

Again, isn't all the information contained in the write itself?
Shouldn't the packet of data the user writes include information that
makes the offset unnecessary?  Are all of these trying to support an
mmap capable data area and do we really need that?

> + *
> + * Action Get dirty pages bitmap:
> + *      Get bitmap of dirty pages from vendor driver from given start address.
> + *      dirty_pfns.start_addr [Input] : start address
> + *      dirty_pfns.total [Input] : Total pfn count from start_addr for which
> + *          dirty bitmap is requested
> + *      dirty_pfns.copied [Output] : pfn count for which dirty bitmap is copied
> + *          to migration region.
> + *      Vendor driver should copy the bitmap with bits set only for pages to be
> + *      marked dirty in migration region.
> + */

The protocol is not very clear here, the vendor driver never copies the
bitmap, from the user perspective the vendor driver handles the read(2)
from the region.  But is the data data.offset being used for this?
It's not clear where the user reads the bitmap.  Is the start_addr here
meant to address the segmentation that we discussed previously?  As
above, I don't see why the user needs all these input fields, they can
almost all be determined by the read itself.

The format I proposed was that much like the data section, the device
could expose a dirty pfn section with offset and size.  For example:

struct {
	__u64 offset;		// read-only (in bytes)
	__u64 size;		// read-only (in bytes)
	__u64 page_size;	// read-only (ex. 4k)
	__u64 segment;		// read-write
} dirty_pages;

So for example, the vendor driver would expose an offset within the
region much like for the data area.  The size might be something like
32MB and the page_size could be 4096.  The user can calculate from this
that the area exposes 1TB worth of pfns.  When segment is 0x0 the user
can directly read pfns for address 0x0 to 1TB - 1.  Setting segment to
0x1 allows access to 1TB to 2TB - 1, etc.  Therefore the user sets the
segment register and simply performs a read.

The thing we discussed that this interface lacks is an efficient
interface for handling reading from a range where no dirty bits are
set, which I think you're trying to handle with the additional 'copied'
field, but couldn't read(2) simply return zero to optimize that case?
We only need to ensure that the user won't continue to retry for that
return value.

> +
> +struct vfio_device_migration_info {
> +        __u32 device_state;         /* VFIO device state */
> +        struct {
> +            __u64 precopy_only;
> +            __u64 compatible;
> +            __u64 postcopy_only;
> +            __u64 threshold_size;
> +        } pending;
> +        struct {
> +            __u64 offset;           /* offset */
> +            __u64 size;             /* size */
> +        } data;
> +        struct {
> +            __u64 start_addr;
> +            __u64 total;
> +            __u64 copied;
> +        } dirty_pfns;
> +} __attribute__((packed));
> +

We're still missing explicit versioning and compatibility information.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices
  2018-11-20 20:39 ` [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices Kirti Wankhede
                     ` (2 preceding siblings ...)
  2018-11-22 19:57   ` Dr. David Alan Gilbert
@ 2018-11-29  8:04   ` Zhao Yan
  2018-12-17 11:19   ` Gonglei (Arei)
  4 siblings, 0 replies; 32+ messages in thread
From: Zhao Yan @ 2018-11-29  8:04 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, cjia, Zhengxiao.zx, Tian, Kevin, Liu, Yi L,
	eskultet, Yang, Ziye, qemu-devel, cohuck, shuangtai.tst,
	dgilbert, Wang, Zhi A, mlevitsk, pasic, aik, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Wed, Nov 21, 2018 at 04:39:41AM +0800, Kirti Wankhede wrote:
> - Migration function are implemented for VFIO_DEVICE_TYPE_PCI device.
> - Added SaveVMHandlers and implemented all basic functions required for live
>   migration.
> - Added VM state change handler to know running or stopped state of VM.
> - Added migration state change notifier to get notification on migration state
>   change. This state is translated to VFIO device state and conveyed to vendor
>   driver.
> - VFIO device supportd migration or not is decided based of migration region
>   query. If migration region query is successful then migration is supported
>   else migration is blocked.
> - Structure vfio_device_migration_info is mapped at 0th offset of migration
>   region and should always trapped by VFIO device's driver. Added both type of
>   access support, trapped or mmapped, for data section of the region.
> - To save device state, read data offset and size using structure
>   vfio_device_migration_info.data, accordingly copy data from the region.
> - To restore device state, write data offset and size in the structure and write
>   data in the region.
> - To get dirty page bitmap, write start address and pfn count then read count of
>   pfns copied and accordingly read those from the rest of the region or mmaped
>   part of the region. This copy is iterated till page bitmap for all requested
>   pfns are copied.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/migration.c           | 729 ++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |  23 ++
>  3 files changed, 753 insertions(+), 1 deletion(-)
>  create mode 100644 hw/vfio/migration.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index a2e7a0a7cf02..2cf2ba1440f2 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -1,5 +1,5 @@
>  ifeq ($(CONFIG_LINUX), y)
> -obj-$(CONFIG_SOFTMMU) += common.o
> +obj-$(CONFIG_SOFTMMU) += common.o migration.o
>  obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
>  obj-$(CONFIG_VFIO_CCW) += ccw.o
>  obj-$(CONFIG_SOFTMMU) += platform.o
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> new file mode 100644
> index 000000000000..717fb63e4f43
> --- /dev/null
> +++ b/hw/vfio/migration.c
> @@ -0,0 +1,729 @@
> +/*
> + * Migration support for VFIO devices
> + *
> + * Copyright NVIDIA, Inc. 2018
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include <linux/vfio.h>
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "cpu.h"
> +#include "migration/migration.h"
> +#include "migration/qemu-file.h"
> +#include "migration/register.h"
> +#include "migration/blocker.h"
> +#include "migration/misc.h"
> +#include "qapi/error.h"
> +#include "exec/ramlist.h"
> +#include "exec/ram_addr.h"
> +#include "pci.h"
> +
> +/*
> + * Flags used as delimiter:
> + * 0xffffffff => MSB 32-bit all 1s
> + * 0xef10     => emulated (virtual) function IO
> + * 0x0000     => 16-bits reserved for flags
> + */
> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> +
> +static void vfio_migration_region_exit(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    if (!migration) {
> +        return;
> +    }
> +
> +    if (migration->region.buffer.size) {
> +        vfio_region_exit(&migration->region.buffer);
> +        vfio_region_finalize(&migration->region.buffer);
> +    }
> +}
> +
> +static int vfio_migration_region_init(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    Object *obj = NULL;
> +    int ret = -EINVAL;
> +
> +    if (!migration) {
> +        return ret;
> +    }
> +
> +    /* Migration support added for PCI device only */
> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +        obj = vfio_pci_get_object(vbasedev);
> +    }
> +
> +    if (!obj) {
> +        return ret;
> +    }
> +
> +    ret = vfio_region_setup(obj, vbasedev, &migration->region.buffer,
> +                            migration->region.index, "migration");
> +    if (ret) {
> +        error_report("Failed to setup VFIO migration region %d: %s",
> +                      migration->region.index, strerror(-ret));
> +        goto err;
> +    }
> +
> +    if (!migration->region.buffer.size) {
> +        ret = -EINVAL;
> +        error_report("Invalid region size of VFIO migration region %d: %s",
> +                     migration->region.index, strerror(-ret));
> +        goto err;
> +    }
> +
> +    if (migration->region.buffer.mmaps) {
> +        ret = vfio_region_mmap(&migration->region.buffer);
> +        if (ret) {
> +            error_report("Failed to mmap VFIO migration region %d: %s",
> +                         migration->region.index, strerror(-ret));
> +            goto err;
> +        }
> +    }
> +
> +    return 0;
> +
> +err:
> +    vfio_migration_region_exit(vbasedev);
> +    return ret;
> +}
> +
> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    int ret = 0;
> +
> +    if (vbasedev->device_state == state) {
> +        return ret;
> +    }
> +
> +    ret = pwrite(vbasedev->fd, &state, sizeof(state),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              device_state));
> +    if (ret < 0) {
> +        error_report("Failed to set migration state %d %s",
> +                     ret, strerror(errno));
> +        return ret;
> +    }
> +
> +    vbasedev->device_state = state;
> +    return ret;
> +}
> +
> +void vfio_get_dirty_page_list(VFIODevice *vbasedev,
> +                              uint64_t start_addr,
> +                              uint64_t pfn_count)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    struct vfio_device_migration_info migration_info;
> +    uint64_t count = 0;
> +    int ret;
> +
> +    migration_info.dirty_pfns.start_addr = start_addr;
> +    migration_info.dirty_pfns.total = pfn_count;
> +
> +    ret = pwrite(vbasedev->fd, &migration_info.dirty_pfns,
> +                 sizeof(migration_info.dirty_pfns),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              dirty_pfns));
> +    if (ret < 0) {
> +        error_report("Failed to set dirty pages start address %d %s",
> +                ret, strerror(errno));
> +        return;
> +    }
> +
> +    do {
> +        /* Read dirty_pfns.copied */
> +        ret = pread(vbasedev->fd, &migration_info.dirty_pfns,
> +                sizeof(migration_info.dirty_pfns),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             dirty_pfns));
> +        if (ret < 0) {
> +            error_report("Failed to get dirty pages bitmap count %d %s",
> +                    ret, strerror(errno));
> +            return;
> +        }
> +
> +        if (migration_info.dirty_pfns.copied) {
> +            uint64_t bitmap_size;
> +            void *buf = NULL;
> +            bool buffer_mmaped = false;
> +
> +            bitmap_size = (BITS_TO_LONGS(migration_info.dirty_pfns.copied) + 1)
> +                           * sizeof(unsigned long);
> +
> +            if (region->mmaps) {
> +                int i;
> +
> +                for (i = 0; i < region->nr_mmaps; i++) {
> +                    if (region->mmaps[i].size >= bitmap_size) {
> +                        buf = region->mmaps[i].mmap;
> +                        buffer_mmaped = true;
> +                        break;
> +                    }
> +                }
> +            }
> +
> +            if (!buffer_mmaped) {
> +                buf = g_malloc0(bitmap_size);
> +
> +                ret = pread(vbasedev->fd, buf, bitmap_size,
> +                            region->fd_offset + sizeof(migration_info) + 1);
> +                if (ret != bitmap_size) {
> +                    error_report("Failed to get migration data %d", ret);
> +                    g_free(buf);
> +                    return;
> +                }
> +            }
> +
> +            cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
> +                                        start_addr + (count * TARGET_PAGE_SIZE),
> +                                        migration_info.dirty_pfns.copied);
Seems this part of dirty memory is marked dirty to ask qemu to copy it.
But in vfio_save_iterate(), vfio_save_buffer() is also called to ask vendor
driver to save device data and copy.
What's the relationship between the two parts of data? one in system memory
and the other in standalone device memory?


> +            count +=  migration_info.dirty_pfns.copied;
> +
> +            if (!buffer_mmaped) {
> +                g_free(buf);
> +            }
> +        }
> +    } while (count < migration_info.dirty_pfns.total);
> +}
> +
> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
> +
> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +        vfio_pci_save_config(vbasedev, f);
> +    }
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    return qemu_file_get_error(f);
> +}
> +
> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +        vfio_pci_load_config(vbasedev, f);
> +    }
> +
> +    if (qemu_get_be64(f) != VFIO_MIG_FLAG_END_OF_STATE) {
> +        error_report("Wrong end of block while loading device config space");
> +        return -EINVAL;
> +    }
> +
> +    return qemu_file_get_error(f);
> +}
> +
> +/* ---------------------------------------------------------------------- */
> +
> +static bool vfio_is_active_iterate(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    if (vbasedev->vm_running && vbasedev->migration &&
> +        (vbasedev->migration->pending_precopy_only != 0))
> +        return true;
> +
> +    if (!vbasedev->vm_running && vbasedev->migration &&
> +        (vbasedev->migration->pending_postcopy != 0))
> +        return true;
> +
> +    return false;
> +}
> +
> +static int vfio_save_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    int ret;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> +
> +    qemu_mutex_lock_iothread();
> +    ret = vfio_migration_region_init(vbasedev);
> +    qemu_mutex_unlock_iothread();
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    struct vfio_device_migration_info migration_info;
> +    int ret;
> +
> +    ret = pread(vbasedev->fd, &migration_info.data,
> +                sizeof(migration_info.data),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data));
> +    if (ret != sizeof(migration_info.data)) {
> +        error_report("Failed to get migration buffer information %d",
> +                     ret);
> +        return -EINVAL;
> +    }
> +
> +    if (migration_info.data.size) {
> +        void *buf = NULL;
> +        bool buffer_mmaped = false;
> +
> +        if (region->mmaps) {
> +            int i;
> +
> +            for (i = 0; i < region->nr_mmaps; i++) {
> +                if (region->mmaps[i].offset == migration_info.data.offset) {
> +                    buf = region->mmaps[i].mmap;
> +                    buffer_mmaped = true;
> +                    break;
> +                }
> +            }
> +        }
> +
> +        if (!buffer_mmaped) {
> +            buf = g_malloc0(migration_info.data.size);
> +            ret = pread(vbasedev->fd, buf, migration_info.data.size,
> +                        region->fd_offset + migration_info.data.offset);
> +            if (ret != migration_info.data.size) {
> +                error_report("Failed to get migration data %d", ret);
> +                return -EINVAL;
> +            }
> +        }
> +
> +        qemu_put_be64(f, migration_info.data.size);
> +        qemu_put_buffer(f, buf, migration_info.data.size);
> +
> +        if (!buffer_mmaped) {
> +            g_free(buf);
> +        }
> +
> +    } else {
> +        qemu_put_be64(f, migration_info.data.size);
> +    }
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return migration_info.data.size;
> +}
> +
> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    int ret;
> +
> +    ret = vfio_save_buffer(f, vbasedev);
> +    if (ret < 0) {
> +        error_report("vfio_save_buffer failed %s",
> +                     strerror(errno));
> +        return ret;
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return ret;
> +}
> +
> +static void vfio_update_pending(VFIODevice *vbasedev, uint64_t threshold_size)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    struct vfio_device_migration_info migration_info;
> +    int ret;
> +
> +    ret = pwrite(vbasedev->fd, &threshold_size, sizeof(threshold_size),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              pending.threshold_size));
> +    if (ret < 0) {
> +        error_report("Failed to set threshold size %d %s",
> +                     ret, strerror(errno));
> +        return;
> +    }
> +
> +    ret = pread(vbasedev->fd, &migration_info.pending,
> +                sizeof(migration_info.pending),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             pending));
> +    if (ret != sizeof(migration_info.pending)) {
> +        error_report("Failed to get pending bytes %d", ret);
> +        return;
> +    }
> +
> +    migration->pending_precopy_only = migration_info.pending.precopy_only;
> +    migration->pending_compatible = migration_info.pending.compatible;
> +    migration->pending_postcopy = migration_info.pending.postcopy_only;
> +
> +    return;
> +}
> +
> +static void vfio_save_pending(QEMUFile *f, void *opaque,
> +                              uint64_t threshold_size,
> +                              uint64_t *res_precopy_only,
> +                              uint64_t *res_compatible,
> +                              uint64_t *res_postcopy_only)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    vfio_update_pending(vbasedev, threshold_size);
> +
> +    *res_precopy_only += migration->pending_precopy_only;
> +    *res_compatible += migration->pending_compatible;
> +    *res_postcopy_only += migration->pending_postcopy;
> +}
> +
> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    MigrationState *ms = migrate_get_current();
> +    int ret;
> +
> +    if (vbasedev->vm_running) {
> +        vbasedev->vm_running = 0;
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev,
> +                                   VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY);
> +    if (ret) {
> +        error_report("Failed to set state STOPNCOPY_ACTIVE");
> +        return ret;
> +    }
> +
> +    ret = vfio_save_device_config_state(f, opaque);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    do {
> +        vfio_update_pending(vbasedev, ms->threshold_size);
> +
> +        if (vfio_is_active_iterate(opaque)) {
> +            ret = vfio_save_buffer(f, vbasedev);
> +            if (ret < 0) {
> +                error_report("Failed to save buffer");
> +                break;
> +            } else if (ret == 0) {
> +                break;
> +            }
> +        }
> +    } while ((migration->pending_compatible + migration->pending_postcopy) > 0);
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev,
> +                                   VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED);
> +    if (ret) {
> +        error_report("Failed to set state SAVE_COMPLETED");
> +        return ret;
> +    }
> +    return ret;
> +}
> +
> +static void vfio_save_cleanup(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    vfio_migration_region_exit(vbasedev);
> +}
> +
> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    int ret;
> +    uint64_t data;
> +
> +    data = qemu_get_be64(f);
> +    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
> +        if (data == VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
> +            ret = vfio_load_device_config_state(f, opaque);
> +            if (ret) {
> +                return ret;
> +            }
> +        } else if (data == VFIO_MIG_FLAG_DEV_SETUP_STATE) {
> +            data = qemu_get_be64(f);
> +            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
> +                return 0;
> +            } else {
> +                error_report("SETUP STATE: EOS not found 0x%lx", data);
> +                return -EINVAL;
> +            }
> +        } else if (data != 0) {
> +            VFIOMigration *migration = vbasedev->migration;
> +            VFIORegion *region = &migration->region.buffer;
> +            struct vfio_device_migration_info migration_info;
> +            void *buf = NULL;
> +            bool buffer_mmaped = false;
> +
> +            migration_info.data.size = data;
> +
> +            if (region->mmaps) {
> +                int i;
> +
> +                for (i = 0; i < region->nr_mmaps; i++) {
> +                    if (region->mmaps[i].mmap &&
> +                        (region->mmaps[i].size >= data)) {
> +                        buf = region->mmaps[i].mmap;
> +                        migration_info.data.offset = region->mmaps[i].offset;
> +                        buffer_mmaped = true;
> +                        break;
> +                    }
> +                }
> +            }
> +
> +            if (!buffer_mmaped) {
> +                buf = g_malloc0(migration_info.data.size);
> +                migration_info.data.offset = sizeof(migration_info) + 1;
> +            }
> +
> +            qemu_get_buffer(f, buf, data);
> +
> +            ret = pwrite(vbasedev->fd, &migration_info.data,
> +                         sizeof(migration_info.data),
> +                         region->fd_offset +
> +                         offsetof(struct vfio_device_migration_info, data));
> +            if (ret != sizeof(migration_info.data)) {
> +                error_report("Failed to set migration buffer information %d",
> +                        ret);
> +                return -EINVAL;
> +            }
> +
> +            if (!buffer_mmaped) {
> +                ret = pwrite(vbasedev->fd, buf, migration_info.data.size,
> +                             region->fd_offset + migration_info.data.offset);
> +                g_free(buf);
> +
> +                if (ret != migration_info.data.size) {
> +                    error_report("Failed to set migration buffer %d", ret);
> +                    return -EINVAL;
> +                }
> +            }
> +        }
> +
> +        ret = qemu_file_get_error(f);
> +        if (ret) {
> +            return ret;
> +        }
> +        data = qemu_get_be64(f);
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_load_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    int ret;
> +
> +    ret = vfio_migration_set_state(vbasedev,
> +                                    VFIO_DEVICE_STATE_MIGRATION_RESUME);
> +    if (ret) {
> +        error_report("Failed to set state RESUME");
> +    }
> +
> +    ret = vfio_migration_region_init(vbasedev);
> +    if (ret) {
> +        error_report("Failed to initialise migration region");
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_load_cleanup(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    int ret = 0;
> +
> +    ret = vfio_migration_set_state(vbasedev,
> +                                 VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED);
> +    if (ret) {
> +        error_report("Failed to set state RESUME_COMPLETED");
> +    }
> +
> +    vfio_migration_region_exit(vbasedev);
> +    return ret;
> +}
> +
> +static SaveVMHandlers savevm_vfio_handlers = {
> +    .save_setup = vfio_save_setup,
> +    .save_live_iterate = vfio_save_iterate,
> +    .save_live_complete_precopy = vfio_save_complete_precopy,
> +    .save_live_pending = vfio_save_pending,
> +    .save_cleanup = vfio_save_cleanup,
> +    .load_state = vfio_load_state,
> +    .load_setup = vfio_load_setup,
> +    .load_cleanup = vfio_load_cleanup,
> +    .is_active_iterate = vfio_is_active_iterate,
> +};
> +
> +static void vfio_vmstate_change(void *opaque, int running, RunState state)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    if ((vbasedev->vm_running != running) && running) {
> +        int ret;
> +
> +        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING);
> +        if (ret) {
> +            error_report("Failed to set state RUNNING");
> +        }
> +    }
> +
> +    vbasedev->vm_running = running;
> +}
> +
> +static void vfio_migration_state_notifier(Notifier *notifier, void *data)
> +{
> +    MigrationState *s = data;
> +    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
> +    int ret;
> +
> +    switch (s->state) {
> +    case MIGRATION_STATUS_SETUP:
> +        ret = vfio_migration_set_state(vbasedev,
> +                                       VFIO_DEVICE_STATE_MIGRATION_SETUP);
> +        if (ret) {
> +            error_report("Failed to set state SETUP");
> +        }
> +        return;
> +
> +    case MIGRATION_STATUS_ACTIVE:
> +        if (vbasedev->device_state == VFIO_DEVICE_STATE_MIGRATION_SETUP) {
> +            if (vbasedev->vm_running) {
> +                ret = vfio_migration_set_state(vbasedev,
> +                                          VFIO_DEVICE_STATE_MIGRATION_PRECOPY);
> +                if (ret) {
> +                    error_report("Failed to set state PRECOPY_ACTIVE");
> +                }
> +            } else {
> +                ret = vfio_migration_set_state(vbasedev,
> +                                        VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY);
> +                if (ret) {
> +                    error_report("Failed to set state STOPNCOPY_ACTIVE");
> +                }
> +            }
> +        } else {
> +            ret = vfio_migration_set_state(vbasedev,
> +                                           VFIO_DEVICE_STATE_MIGRATION_RESUME);
> +            if (ret) {
> +                error_report("Failed to set state RESUME");
> +            }
> +        }
> +        return;
> +
> +    case MIGRATION_STATUS_CANCELLING:
> +    case MIGRATION_STATUS_CANCELLED:
> +        ret = vfio_migration_set_state(vbasedev,
> +                                       VFIO_DEVICE_STATE_MIGRATION_CANCELLED);
> +        if (ret) {
> +            error_report("Failed to set state CANCELLED");
> +        }
> +        return;
> +
> +    case MIGRATION_STATUS_FAILED:
> +        ret = vfio_migration_set_state(vbasedev,
> +                                       VFIO_DEVICE_STATE_MIGRATION_FAILED);
> +        if (ret) {
> +            error_report("Failed to set state FAILED");
> +        }
> +        return;
> +    }
> +}
> +
> +static int vfio_migration_init(VFIODevice *vbasedev,
> +                               struct vfio_region_info *info)
> +{
> +    vbasedev->migration = g_new0(VFIOMigration, 1);
> +    vbasedev->migration->region.index = info->index;
> +
> +    register_savevm_live(NULL, "vfio", -1, 1, &savevm_vfio_handlers, vbasedev);
> +    vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
> +                                                          vbasedev);
> +
> +    vbasedev->migration_state.notify = vfio_migration_state_notifier;
> +    add_migration_state_change_notifier(&vbasedev->migration_state);
> +
> +    return 0;
> +}
> +
> +
> +/* ---------------------------------------------------------------------- */
> +
> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
> +{
> +    struct vfio_region_info *info;
> +    int ret;
> +
> +    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
> +                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
> +    if (ret) {
> +        Error *local_err = NULL;
> +
> +        error_setg(&vbasedev->migration_blocker,
> +                   "VFIO device doesn't support migration");
> +        ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            error_free(vbasedev->migration_blocker);
> +            return ret;
> +        }
> +    } else {
> +        return vfio_migration_init(vbasedev, info);
> +    }
> +
> +    return 0;
> +}
> +
> +void vfio_migration_finalize(VFIODevice *vbasedev)
> +{
> +    if (!vbasedev->migration) {
> +        return;
> +    }
> +
> +    if (vbasedev->vm_state) {
> +        qemu_del_vm_change_state_handler(vbasedev->vm_state);
> +        remove_migration_state_change_notifier(&vbasedev->migration_state);
> +    }
> +
> +    if (vbasedev->migration_blocker) {
> +        migrate_del_blocker(vbasedev->migration_blocker);
> +        error_free(vbasedev->migration_blocker);
> +    }
> +
> +    g_free(vbasedev->migration);
> +}
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index a9036929b220..ab8217c9e249 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -30,6 +30,8 @@
>  #include <linux/vfio.h>
>  #endif
> 
> +#include "sysemu/sysemu.h"
> +
>  #define ERR_PREFIX "vfio error: %s: "
>  #define WARN_PREFIX "vfio warning: %s: "
> 
> @@ -57,6 +59,16 @@ typedef struct VFIORegion {
>      uint8_t nr; /* cache the region number for debug */
>  } VFIORegion;
> 
> +typedef struct VFIOMigration {
> +    struct {
> +        VFIORegion buffer;
> +        uint32_t index;
> +    } region;
> +    uint64_t pending_precopy_only;
> +    uint64_t pending_compatible;
> +    uint64_t pending_postcopy;
> +} VFIOMigration;
> +
>  typedef struct VFIOAddressSpace {
>      AddressSpace *as;
>      QLIST_HEAD(, VFIOContainer) containers;
> @@ -116,6 +128,12 @@ typedef struct VFIODevice {
>      unsigned int num_irqs;
>      unsigned int num_regions;
>      unsigned int flags;
> +    uint32_t device_state;
> +    VMChangeStateEntry *vm_state;
> +    int vm_running;
> +    Notifier migration_state;
> +    VFIOMigration *migration;
> +    Error *migration_blocker;
>  } VFIODevice;
> 
>  struct VFIODeviceOps {
> @@ -193,4 +211,9 @@ int vfio_spapr_create_window(VFIOContainer *container,
>  int vfio_spapr_remove_window(VFIOContainer *container,
>                               hwaddr offset_within_address_space);
> 
> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
> +void vfio_migration_finalize(VFIODevice *vbasedev);
> +void vfio_get_dirty_page_list(VFIODevice *vbasedev, uint64_t start_addr,
> +                               uint64_t pfn_count);
> +
>  #endif /* HW_VFIO_VFIO_COMMON_H */
> --
> 2.7.0
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface
  2018-11-27 19:52   ` Alex Williamson
@ 2018-12-04 10:53     ` Cornelia Huck
  2018-12-04 17:14       ` Alex Williamson
  0 siblings, 1 reply; 32+ messages in thread
From: Cornelia Huck @ 2018-12-04 10:53 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, cjia, kevin.tian, ziye.yang, changpeng.liu,
	yi.l.liu, mlevitsk, eskultet, dgilbert, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	zhi.a.wang, qemu-devel

On Tue, 27 Nov 2018 12:52:48 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> Actually I'm wondering if we can distill everything down to two bits,
> STOPPED and LOGGING.
> 
> We start at RUNNING, the user can optionally enable LOGGING when
> supported by the device to cover the SETUP and PRECOPY states
> proposed.  The device stays running, but activates any sort of
> dirtly page tracking that's necessary to activate those interfaces.
> LOGGING can also be cleared to handle the CANCELLED state.  The user
> would set STOPPED which should quiesce the device and make the full
> device state available through the device data section.  Clearing
> STOPPED and LOGGING would handle the FAILED state below.  Likewise on
> the migration target, QEMU would set the device top STOPPED in order to
> write the incoming data via the data section and clear STOPPED to
> indicate the device returns to RUNNING (aka RESUME/RESUME_COMPLETED).

This idea sounds like something that can be more easily adapted to
other device types as well. The LOGGING bit is probably more flexible
if you reframe it as a PREPARATION bit: That would also cover devices
or device types that don't do dirty logging, but need some other kind
of preparation prior to moving over.

This model would also be simple enough to allow e.g. a vendor driver
for mdev to implement its own, specialized backend while still fitting
into the general framework. Non-pci mdevs are probably different enough
from pci devices so that many of the states proposed don't really match
their needs.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface
  2018-12-04 10:53     ` Cornelia Huck
@ 2018-12-04 17:14       ` Alex Williamson
  0 siblings, 0 replies; 32+ messages in thread
From: Alex Williamson @ 2018-12-04 17:14 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Kirti Wankhede, cjia, kevin.tian, ziye.yang, changpeng.liu,
	yi.l.liu, mlevitsk, eskultet, dgilbert, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	zhi.a.wang, qemu-devel

On Tue, 4 Dec 2018 11:53:33 +0100
Cornelia Huck <cohuck@redhat.com> wrote:

> On Tue, 27 Nov 2018 12:52:48 -0700
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > Actually I'm wondering if we can distill everything down to two bits,
> > STOPPED and LOGGING.
> > 
> > We start at RUNNING, the user can optionally enable LOGGING when
> > supported by the device to cover the SETUP and PRECOPY states
> > proposed.  The device stays running, but activates any sort of
> > dirtly page tracking that's necessary to activate those interfaces.
> > LOGGING can also be cleared to handle the CANCELLED state.  The user
> > would set STOPPED which should quiesce the device and make the full
> > device state available through the device data section.  Clearing
> > STOPPED and LOGGING would handle the FAILED state below.  Likewise on
> > the migration target, QEMU would set the device top STOPPED in order to
> > write the incoming data via the data section and clear STOPPED to
> > indicate the device returns to RUNNING (aka RESUME/RESUME_COMPLETED).  
> 
> This idea sounds like something that can be more easily adapted to
> other device types as well. The LOGGING bit is probably more flexible
> if you reframe it as a PREPARATION bit: That would also cover devices
> or device types that don't do dirty logging, but need some other kind
> of preparation prior to moving over.

Can you elaborate on what PREPARATION might do w/o dirty logging?
LOGGING is just a state, it's on or off, whereas PREPARATION implies
some sequential step in a process and then I'm afraid we slide back into
states that a really steps in a QEMU specific migration process.  So
I'm curious why PREPARATION wouldn't just be a vendor implementation
specific first step when RUNNING is turned off.  A device that doesn't
implement dirty logging could always just claim everything is dirty if
it wants that advanced warning that RUNNING might be turned off soon,
but there are probably more efficient ways to support that, ex. a flag
indicating the dirty logging granularity.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices
  2018-11-20 20:39 ` [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices Kirti Wankhede
                     ` (3 preceding siblings ...)
  2018-11-29  8:04   ` Zhao Yan
@ 2018-12-17 11:19   ` Gonglei (Arei)
  2018-12-19  2:12     ` Zhao Yan
  4 siblings, 1 reply; 32+ messages in thread
From: Gonglei (Arei) @ 2018-12-17 11:19 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	changpeng.liu, Ken.Xue, Zhao, Yan Y, Hu, Robert, Huangzhichao,
	Liujinsong (Paul)

Hi,

It's great to see this patch series, which is a very important step, although 
currently only consider GPU mdev devices to support hot migration. 

However, this is based on the VFIO framework after all, so we expect 
that we can make this live migration framework more general.

For example, the vfio_save_pending() callback is used to obtain device
memory (such as GPU memory), but if the device (such as network card) 
has no special proprietary memory, but only system memory? 
It is too much to perform a null operation for this kind of device by writing
memory to the vendor driver of kernel space. 

I think we can acquire the capability from the vendor driver before using this. 
If there is device memory that needs iterative copying, the vendor driver return
ture, otherwise return false. Then QEMU implement the specific logic, 
otherwise return directly. Just like getting the capability list of KVM
module, can we?


Regards,
-Gonglei


> -----Original Message-----
> From: Qemu-devel
> [mailto:qemu-devel-bounces+arei.gonglei=huawei.com@nongnu.org] On
> Behalf Of Kirti Wankhede
> Sent: Wednesday, November 21, 2018 4:40 AM
> To: alex.williamson@redhat.com; cjia@nvidia.com
> Cc: Zhengxiao.zx@Alibaba-inc.com; kevin.tian@intel.com; yi.l.liu@intel.com;
> eskultet@redhat.com; ziye.yang@intel.com; qemu-devel@nongnu.org;
> cohuck@redhat.com; shuangtai.tst@alibaba-inc.com; dgilbert@redhat.com;
> zhi.a.wang@intel.com; mlevitsk@redhat.com; pasic@linux.ibm.com;
> aik@ozlabs.ru; Kirti Wankhede <kwankhede@nvidia.com>;
> eauger@redhat.com; felipe@nutanix.com; jonathan.davies@nutanix.com;
> changpeng.liu@intel.com; Ken.Xue@amd.com
> Subject: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices
> 
> - Migration function are implemented for VFIO_DEVICE_TYPE_PCI device.
> - Added SaveVMHandlers and implemented all basic functions required for live
>   migration.
> - Added VM state change handler to know running or stopped state of VM.
> - Added migration state change notifier to get notification on migration state
>   change. This state is translated to VFIO device state and conveyed to vendor
>   driver.
> - VFIO device supportd migration or not is decided based of migration region
>   query. If migration region query is successful then migration is supported
>   else migration is blocked.
> - Structure vfio_device_migration_info is mapped at 0th offset of migration
>   region and should always trapped by VFIO device's driver. Added both type of
>   access support, trapped or mmapped, for data section of the region.
> - To save device state, read data offset and size using structure
>   vfio_device_migration_info.data, accordingly copy data from the region.
> - To restore device state, write data offset and size in the structure and write
>   data in the region.
> - To get dirty page bitmap, write start address and pfn count then read count of
>   pfns copied and accordingly read those from the rest of the region or
> mmaped
>   part of the region. This copy is iterated till page bitmap for all requested
>   pfns are copied.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/migration.c           | 729
> ++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |  23 ++
>  3 files changed, 753 insertions(+), 1 deletion(-)
>  create mode 100644 hw/vfio/migration.c
> 
[skip]

> +
> +static SaveVMHandlers savevm_vfio_handlers = {
> +    .save_setup = vfio_save_setup,
> +    .save_live_iterate = vfio_save_iterate,
> +    .save_live_complete_precopy = vfio_save_complete_precopy,
> +    .save_live_pending = vfio_save_pending,
> +    .save_cleanup = vfio_save_cleanup,
> +    .load_state = vfio_load_state,
> +    .load_setup = vfio_load_setup,
> +    .load_cleanup = vfio_load_cleanup,
> +    .is_active_iterate = vfio_is_active_iterate,
> +};
> +

 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices
  2018-12-17 11:19   ` Gonglei (Arei)
@ 2018-12-19  2:12     ` Zhao Yan
  2018-12-21  7:36       ` Zhi Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Zhao Yan @ 2018-12-19  2:12 UTC (permalink / raw)
  To: Gonglei (Arei)
  Cc: Kirti Wankhede, alex.williamson, cjia, Zhengxiao.zx, kevin.tian,
	yi.l.liu, eskultet, ziye.yang, qemu-devel, cohuck, shuangtai.tst,
	dgilbert, zhi.a.wang, mlevitsk, pasic, aik, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue, Hu, Robert,
	Huangzhichao, Liujinsong (Paul)

right, a capabilities field in struct vfio_device_migration_info can avoid
populating iteration APIs and migration states into every vendor drivers
who actually may not requires those APIs and simply do nothing or return
value 0 in response to those APIs.

struct vfio_device_migration_info {
        __u32 device_state;         /* VFIO device state */
+     __u32 capabilities;    /* VFIO device capabilities */
        struct {
            __u64 precopy_only;
            __u64 compatible;
            __u64 postcopy_only;
            __u64 threshold_size;
        } pending;	
     ...
};
 
So, only for devices who need iteration APIs, like GPU with standalone
video memory, can set flag VFIO_MIGRATION_HAS_ITERTATION to this
capabilities field. Then callbacks like save_live_iterate(),
is_active_iterate(), save_live_pending() will check the flag
VFIO_MIGRATION_HAS_ITERTATION in capabilities field and send requests
into vendor driver.

But, for simple devices who only use system memory, like IGD and NIC,
will not set the flag VFIO_MIGRATION_HAS_ITERTATION, and as a result, no
need to handle requests like "Get buffer", "Set buffer", "Get pending
bytes" triggered by QEMU iteration callbacks. And therefore, detailed
migration states are not cared for vendor drivers for these devices.

Thanks to Gonglei for providing this idea and details.
Free free to give your comments to the above description.


On Mon, Dec 17, 2018 at 11:19:49AM +0000, Gonglei (Arei) wrote:
> Hi,
> 
> It's great to see this patch series, which is a very important step, although 
> currently only consider GPU mdev devices to support hot migration. 
> 
> However, this is based on the VFIO framework after all, so we expect 
> that we can make this live migration framework more general.
> 
> For example, the vfio_save_pending() callback is used to obtain device
> memory (such as GPU memory), but if the device (such as network card) 
> has no special proprietary memory, but only system memory? 
> It is too much to perform a null operation for this kind of device by writing
> memory to the vendor driver of kernel space. 
> 
> I think we can acquire the capability from the vendor driver before using this. 
> If there is device memory that needs iterative copying, the vendor driver return
> ture, otherwise return false. Then QEMU implement the specific logic, 
> otherwise return directly. Just like getting the capability list of KVM
> module, can we?
> 
> 
> Regards,
> -Gonglei
> 
> 
> > -----Original Message-----
> > From: Qemu-devel
> > [mailto:qemu-devel-bounces+arei.gonglei=huawei.com@nongnu.org] On
> > Behalf Of Kirti Wankhede
> > Sent: Wednesday, November 21, 2018 4:40 AM
> > To: alex.williamson@redhat.com; cjia@nvidia.com
> > Cc: Zhengxiao.zx@Alibaba-inc.com; kevin.tian@intel.com; yi.l.liu@intel.com;
> > eskultet@redhat.com; ziye.yang@intel.com; qemu-devel@nongnu.org;
> > cohuck@redhat.com; shuangtai.tst@alibaba-inc.com; dgilbert@redhat.com;
> > zhi.a.wang@intel.com; mlevitsk@redhat.com; pasic@linux.ibm.com;
> > aik@ozlabs.ru; Kirti Wankhede <kwankhede@nvidia.com>;
> > eauger@redhat.com; felipe@nutanix.com; jonathan.davies@nutanix.com;
> > changpeng.liu@intel.com; Ken.Xue@amd.com
> > Subject: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices
> > 
> > - Migration function are implemented for VFIO_DEVICE_TYPE_PCI device.
> > - Added SaveVMHandlers and implemented all basic functions required for live
> >   migration.
> > - Added VM state change handler to know running or stopped state of VM.
> > - Added migration state change notifier to get notification on migration state
> >   change. This state is translated to VFIO device state and conveyed to vendor
> >   driver.
> > - VFIO device supportd migration or not is decided based of migration region
> >   query. If migration region query is successful then migration is supported
> >   else migration is blocked.
> > - Structure vfio_device_migration_info is mapped at 0th offset of migration
> >   region and should always trapped by VFIO device's driver. Added both type of
> >   access support, trapped or mmapped, for data section of the region.
> > - To save device state, read data offset and size using structure
> >   vfio_device_migration_info.data, accordingly copy data from the region.
> > - To restore device state, write data offset and size in the structure and write
> >   data in the region.
> > - To get dirty page bitmap, write start address and pfn count then read count of
> >   pfns copied and accordingly read those from the rest of the region or
> > mmaped
> >   part of the region. This copy is iterated till page bitmap for all requested
> >   pfns are copied.
> > 
> > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > ---
> >  hw/vfio/Makefile.objs         |   2 +-
> >  hw/vfio/migration.c           | 729
> > ++++++++++++++++++++++++++++++++++++++++++
> >  include/hw/vfio/vfio-common.h |  23 ++
> >  3 files changed, 753 insertions(+), 1 deletion(-)
> >  create mode 100644 hw/vfio/migration.c
> > 
> [skip]
> 
> > +
> > +static SaveVMHandlers savevm_vfio_handlers = {
> > +    .save_setup = vfio_save_setup,
> > +    .save_live_iterate = vfio_save_iterate,
> > +    .save_live_complete_precopy = vfio_save_complete_precopy,
> > +    .save_live_pending = vfio_save_pending,
> > +    .save_cleanup = vfio_save_cleanup,
> > +    .load_state = vfio_load_state,
> > +    .load_setup = vfio_load_setup,
> > +    .load_cleanup = vfio_load_cleanup,
> > +    .is_active_iterate = vfio_is_active_iterate,
> > +};
> > +
> 
>  

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices
  2018-12-19  2:12     ` Zhao Yan
@ 2018-12-21  7:36       ` Zhi Wang
  0 siblings, 0 replies; 32+ messages in thread
From: Zhi Wang @ 2018-12-21  7:36 UTC (permalink / raw)
  To: Zhao Yan, Gonglei (Arei)
  Cc: Kirti Wankhede, alex.williamson, cjia, Zhengxiao.zx, kevin.tian,
	yi.l.liu, eskultet, ziye.yang, qemu-devel, cohuck, shuangtai.tst,
	dgilbert, mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	changpeng.liu, Ken.Xue, Hu, Robert, Huangzhichao,
	Liujinsong (Paul)

It's nice to see cloud vendors are also quite interested in VFIO 
migration interfaces and functions. From what Yan said and Huawei's 
requirements, there should be more devices which don't have private 
memory, maybe GPU is almost the only one which has the private memory.

As VFIO is a generic user-space device controlling interfaces nowadays 
in the kernel and perhaps becomes into an standard in future, I guess we 
also need to think more about a generic framework and how to let the 
non-GPU devices to step into VFIO easily.

 From perspective of the vendors of the devices and the cloud vendors 
who want to build their migration support on top of VFIO, it would be 
nice to have a simple and friendly path for them.

Thanks,
Zhi.

On 12/18/18 9:12 PM, Zhao Yan wrote:
> right, a capabilities field in struct vfio_device_migration_info can avoid
> populating iteration APIs and migration states into every vendor drivers
> who actually may not requires those APIs and simply do nothing or return
> value 0 in response to those APIs.
> 
> struct vfio_device_migration_info {
>          __u32 device_state;         /* VFIO device state */
> +     __u32 capabilities;    /* VFIO device capabilities */
>          struct {
>              __u64 precopy_only;
>              __u64 compatible;
>              __u64 postcopy_only;
>              __u64 threshold_size;
>          } pending;	
>       ...
> };
>   
> So, only for devices who need iteration APIs, like GPU with standalone
> video memory, can set flag VFIO_MIGRATION_HAS_ITERTATION to this
> capabilities field. Then callbacks like save_live_iterate(),
> is_active_iterate(), save_live_pending() will check the flag
> VFIO_MIGRATION_HAS_ITERTATION in capabilities field and send requests
> into vendor driver.
> 
> But, for simple devices who only use system memory, like IGD and NIC,
> will not set the flag VFIO_MIGRATION_HAS_ITERTATION, and as a result, no
> need to handle requests like "Get buffer", "Set buffer", "Get pending
> bytes" triggered by QEMU iteration callbacks. And therefore, detailed
> migration states are not cared for vendor drivers for these devices.
> 
> Thanks to Gonglei for providing this idea and details.
> Free free to give your comments to the above description.
> 
> 
> On Mon, Dec 17, 2018 at 11:19:49AM +0000, Gonglei (Arei) wrote:
>> Hi,
>>
>> It's great to see this patch series, which is a very important step, although
>> currently only consider GPU mdev devices to support hot migration.
>>
>> However, this is based on the VFIO framework after all, so we expect
>> that we can make this live migration framework more general.
>>
>> For example, the vfio_save_pending() callback is used to obtain device
>> memory (such as GPU memory), but if the device (such as network card)
>> has no special proprietary memory, but only system memory?
>> It is too much to perform a null operation for this kind of device by writing
>> memory to the vendor driver of kernel space.
>>
>> I think we can acquire the capability from the vendor driver before using this.
>> If there is device memory that needs iterative copying, the vendor driver return
>> ture, otherwise return false. Then QEMU implement the specific logic,
>> otherwise return directly. Just like getting the capability list of KVM
>> module, can we?
>>
>>
>> Regards,
>> -Gonglei
>>
>>
>>> -----Original Message-----
>>> From: Qemu-devel
>>> [mailto:qemu-devel-bounces+arei.gonglei=huawei.com@nongnu.org] On
>>> Behalf Of Kirti Wankhede
>>> Sent: Wednesday, November 21, 2018 4:40 AM
>>> To: alex.williamson@redhat.com; cjia@nvidia.com
>>> Cc: Zhengxiao.zx@Alibaba-inc.com; kevin.tian@intel.com; yi.l.liu@intel.com;
>>> eskultet@redhat.com; ziye.yang@intel.com; qemu-devel@nongnu.org;
>>> cohuck@redhat.com; shuangtai.tst@alibaba-inc.com; dgilbert@redhat.com;
>>> zhi.a.wang@intel.com; mlevitsk@redhat.com; pasic@linux.ibm.com;
>>> aik@ozlabs.ru; Kirti Wankhede <kwankhede@nvidia.com>;
>>> eauger@redhat.com; felipe@nutanix.com; jonathan.davies@nutanix.com;
>>> changpeng.liu@intel.com; Ken.Xue@amd.com
>>> Subject: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices
>>>
>>> - Migration function are implemented for VFIO_DEVICE_TYPE_PCI device.
>>> - Added SaveVMHandlers and implemented all basic functions required for live
>>>    migration.
>>> - Added VM state change handler to know running or stopped state of VM.
>>> - Added migration state change notifier to get notification on migration state
>>>    change. This state is translated to VFIO device state and conveyed to vendor
>>>    driver.
>>> - VFIO device supportd migration or not is decided based of migration region
>>>    query. If migration region query is successful then migration is supported
>>>    else migration is blocked.
>>> - Structure vfio_device_migration_info is mapped at 0th offset of migration
>>>    region and should always trapped by VFIO device's driver. Added both type of
>>>    access support, trapped or mmapped, for data section of the region.
>>> - To save device state, read data offset and size using structure
>>>    vfio_device_migration_info.data, accordingly copy data from the region.
>>> - To restore device state, write data offset and size in the structure and write
>>>    data in the region.
>>> - To get dirty page bitmap, write start address and pfn count then read count of
>>>    pfns copied and accordingly read those from the rest of the region or
>>> mmaped
>>>    part of the region. This copy is iterated till page bitmap for all requested
>>>    pfns are copied.
>>>
>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>> ---
>>>   hw/vfio/Makefile.objs         |   2 +-
>>>   hw/vfio/migration.c           | 729
>>> ++++++++++++++++++++++++++++++++++++++++++
>>>   include/hw/vfio/vfio-common.h |  23 ++
>>>   3 files changed, 753 insertions(+), 1 deletion(-)
>>>   create mode 100644 hw/vfio/migration.c
>>>
>> [skip]
>>
>>> +
>>> +static SaveVMHandlers savevm_vfio_handlers = {
>>> +    .save_setup = vfio_save_setup,
>>> +    .save_live_iterate = vfio_save_iterate,
>>> +    .save_live_complete_precopy = vfio_save_complete_precopy,
>>> +    .save_live_pending = vfio_save_pending,
>>> +    .save_cleanup = vfio_save_cleanup,
>>> +    .load_state = vfio_load_state,
>>> +    .load_setup = vfio_load_setup,
>>> +    .load_cleanup = vfio_load_cleanup,
>>> +    .is_active_iterate = vfio_is_active_iterate,
>>> +};
>>> +
>>
>>   

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2018-12-21  7:37 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-20 20:39 [Qemu-devel] [PATCH 0/5] Add migration support for VFIO device Kirti Wankhede
2018-11-20 20:39 ` [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface Kirti Wankhede
2018-11-21  0:26   ` Tian, Kevin
2018-11-21  4:24     ` Kirti Wankhede
2018-11-21  6:13       ` Tian, Kevin
2018-11-22 20:01         ` Kirti Wankhede
2018-11-26  7:14           ` Tian, Kevin
2018-11-21 17:26   ` Pierre Morel
2018-11-22 18:54   ` Dr. David Alan Gilbert
2018-11-22 20:43     ` Kirti Wankhede
2018-11-23 11:44       ` Dr. David Alan Gilbert
2018-11-23  5:47   ` Zhao Yan
2018-11-27 19:52   ` Alex Williamson
2018-12-04 10:53     ` Cornelia Huck
2018-12-04 17:14       ` Alex Williamson
2018-11-20 20:39 ` [Qemu-devel] [PATCH 2/5] Add save and load functions for VFIO PCI devices Kirti Wankhede
2018-11-21  5:32   ` Peter Xu
2018-11-20 20:39 ` [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices Kirti Wankhede
2018-11-21  7:39   ` Zhao, Yan Y
2018-11-22 21:21     ` Kirti Wankhede
2018-11-23  5:29       ` Zhao Yan
2018-11-22  8:22   ` Zhao, Yan Y
2018-11-22 19:57   ` Dr. David Alan Gilbert
2018-11-29  8:04   ` Zhao Yan
2018-12-17 11:19   ` Gonglei (Arei)
2018-12-19  2:12     ` Zhao Yan
2018-12-21  7:36       ` Zhi Wang
2018-11-20 20:39 ` [Qemu-devel] [PATCH 4/5] Add vfio_listerner_log_sync to mark dirty pages Kirti Wankhede
2018-11-22 20:00   ` Dr. David Alan Gilbert
2018-11-20 20:39 ` [Qemu-devel] [PATCH 5/5] Make vfio-pci device migration capable Kirti Wankhede
2018-11-21  5:47 ` [Qemu-devel] [PATCH 0/5] Add migration support for VFIO device Peter Xu
2018-11-22 21:01   ` Kirti Wankhede

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.