* [RFC PATCH v1 0/4] Add migration support for VFIO device
@ 2018-10-16 18:12 ` Kirti Wankhede
  0 siblings, 0 replies; 36+ messages in thread
From: Kirti Wankhede @ 2018-10-16 18:12 UTC (permalink / raw)
  To: alex.williamson, cjia; +Cc: Kirti Wankhede, qemu-devel, kvm

Add migration support for VFIO device

This patch set includes the following patches:
- Define KABI for VFIO device for migration support.
- Generic migration functionality for VFIO device.
  * This patch set adds functionality only for PCI devices, but can be
    extended to other VFIO devices.
  * Added all the basic functions required for pre-copy, stop-and-copy and
    resume phases of migration.
  * Added a state change notifier; from that notifier function, the VFIO
    device's state change is conveyed to the VFIO vendor driver.
  * During save setup phase and resume/load setup phase, migration region
    is queried from vendor driver and is mmaped by QEMU. This region is
    used to read/write data from and to vendor driver.
  * .save_live_pending, .save_live_iterate and .is_active_iterate are
    implemented to use QEMU's functionality of iteration during pre-copy
    phase.
  * .save_live_complete_precopy, which runs in the stop-and-copy phase,
    iterates reading data from the vendor driver until the pending bytes
    reported by the vendor driver reach zero.
  * .save_cleanup and .load_cleanup are implemented to unmap the migration
    region that was set up during the setup phase.
  * Added function to get dirty pages bitmap from vendor driver.
- Add vfio_listerner_log_sync to mark dirty pages.
- Make VFIO PCI device migration capable.

Thanks,
Kirti

Kirti Wankhede (4):
  VFIO KABI for migration interface
  Add migration functions for VFIO devices
  Add vfio_listerner_log_sync to mark dirty pages
  Make vfio-pci device migration capable.

 hw/vfio/Makefile.objs         |   2 +-
 hw/vfio/common.c              |  32 ++
 hw/vfio/migration.c           | 716 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                 |  13 +-
 include/hw/vfio/vfio-common.h |  23 ++
 linux-headers/linux/vfio.h    |  91 ++++++
 6 files changed, 869 insertions(+), 8 deletions(-)
 create mode 100644 hw/vfio/migration.c

-- 
2.7.0


* [RFC PATCH v1 1/4] VFIO KABI for migration interface
  2018-10-16 18:12 ` [Qemu-devel] " Kirti Wankhede
@ 2018-10-16 18:12   ` Kirti Wankhede
  -1 siblings, 0 replies; 36+ messages in thread
From: Kirti Wankhede @ 2018-10-16 18:12 UTC (permalink / raw)
  To: alex.williamson, cjia; +Cc: Kirti Wankhede, qemu-devel, kvm

- Added vfio_device_migration_info structure to interact with the vendor
  driver.
- Different flags are used to get or set migration related information
  from/to vendor driver.
Flag VFIO_MIGRATION_PROBE: To query if feature is supported
Flag VFIO_MIGRATION_GET_REGION: To get migration region info
Flag VFIO_MIGRATION_SET_STATE: To convey device state to the vendor driver
Flag VFIO_MIGRATION_GET_PENDING: To get pending bytes yet to be migrated
  from vendor driver
Flag VFIO_MIGRATION_GET_BUFFER: On this flag, vendor driver should write
  data to migration region and return number of bytes written in the region
Flag VFIO_MIGRATION_SET_BUFFER: In the migration resume path, the user space
  app writes to the migration region and communicates that to the vendor
  driver via this ioctl with this flag set.
Flag VFIO_MIGRATION_GET_DIRTY_PFNS: Get bitmap of dirty pages from vendor
  driver from given start address

- Added enum for possible device states.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 linux-headers/linux/vfio.h | 91 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 91 insertions(+)

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 3615a269d378..8e9045ed9aa8 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -602,6 +602,97 @@ struct vfio_device_ioeventfd {
 
 #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 16)
 
+/**
+ * VFIO_DEVICE_MIGRATION_INFO - _IOW(VFIO_TYPE, VFIO_BASE + 17,
+ *                              struct vfio_device_migration_info)
+ * Flag VFIO_MIGRATION_PROBE:
+ *      To query if feature is supported
+ *
+ * Flag VFIO_MIGRATION_GET_REGION:
+ *      To get migration region info
+ *      region_index [output] : region index to be used for migration region
+ *      size [output]: size of migration region
+ *
+ * Flag VFIO_MIGRATION_SET_STATE:
+ *      To set device state in vendor driver
+ *      device_state [input] : User space app sends device state to vendor
+ *           driver on state change
+ *
+ * Flag VFIO_MIGRATION_GET_PENDING:
+ *      To get pending bytes yet to be migrated from vendor driver
+ *      threshold_size [Input] : threshold of buffer in User space app.
+ *      pending_precopy_only [output] : pending data which must be migrated in
+ *          precopy phase or in stopped state, in other words - before target
+ *          vm start
+ *      pending_compatible [output] : pending data which may be migrated in any
+ *           phase
+ *      pending_postcopy_only [output] : pending data which must be migrated in
+ *           postcopy phase or in stopped state, in other words - after source
+ *           vm stop
+ *      Sum of pending_precopy_only, pending_compatible and
+ *      pending_postcopy_only is the whole amount of pending data.
+ *
+ * Flag VFIO_MIGRATION_GET_BUFFER:
+ *      On this flag, vendor driver should write data to migration region and
+ *      return number of bytes written in the region.
+ *      bytes_written [output] : number of bytes written in migration buffer by
+ *          vendor driver
+ *
+ * Flag VFIO_MIGRATION_SET_BUFFER:
+ *      In the migration resume path, user space writes to the migration
+ *      region and communicates that to the vendor driver via this flag.
+ *      bytes_written [Input] : number of bytes written in migration buffer by
+ *          user space app.
+ *
+ * Flag VFIO_MIGRATION_GET_DIRTY_PFNS:
+ *      Get bitmap of dirty pages from vendor driver from given start address.
+ *      start_addr [Input] : start address
+ *      pfn_count [Input] : Total pfn count from start_addr for which dirty
+ *          bitmap is requested
+ *      dirty_bitmap [Output] : bitmap memory allocated by user space
+ *           application, vendor driver should return the bitmap with bits set
+ *           only for pages to be marked dirty.
+ * Return: 0 on success, -errno on failure.
+ */
+
+struct vfio_device_migration_info {
+	__u32 argsz;
+	__u32 flags;
+#define VFIO_MIGRATION_PROBE            (1 << 0)
+#define VFIO_MIGRATION_GET_REGION       (1 << 1)
+#define VFIO_MIGRATION_SET_STATE        (1 << 2)
+#define VFIO_MIGRATION_GET_PENDING      (1 << 3)
+#define VFIO_MIGRATION_GET_BUFFER       (1 << 4)
+#define VFIO_MIGRATION_SET_BUFFER       (1 << 5)
+#define VFIO_MIGRATION_GET_DIRTY_PFNS   (1 << 6)
+        __u32 region_index;	            /* region index */
+        __u64 size;	                    /* size */
+        __u32 device_state;                 /* VFIO device state */
+        __u64 pending_precopy_only;
+        __u64 pending_compatible;
+        __u64 pending_postcopy_only;
+        __u64 threshold_size;
+        __u64 bytes_written;
+        __u64 start_addr;
+        __u64 pfn_count;
+	__u8  dirty_bitmap[];
+};
+
+enum {
+    VFIO_DEVICE_STATE_NONE,
+    VFIO_DEVICE_STATE_RUNNING,
+    VFIO_DEVICE_STATE_MIGRATION_SETUP,
+    VFIO_DEVICE_STATE_MIGRATION_PRECOPY_ACTIVE,
+    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY_ACTIVE,
+    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
+    VFIO_DEVICE_STATE_MIGRATION_RESUME,
+    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
+    VFIO_DEVICE_STATE_MIGRATION_FAILED,
+    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
+};
+
+#define VFIO_DEVICE_MIGRATION_INFO _IO(VFIO_TYPE, VFIO_BASE + 17)
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.7.0


* [RFC PATCH v1 2/4] Add migration functions for VFIO devices
  2018-10-16 18:12 ` [Qemu-devel] " Kirti Wankhede
@ 2018-10-16 18:12   ` Kirti Wankhede
  -1 siblings, 0 replies; 36+ messages in thread
From: Kirti Wankhede @ 2018-10-16 18:12 UTC (permalink / raw)
  To: alex.williamson, cjia; +Cc: Kirti Wankhede, qemu-devel, kvm

- Migration functions are implemented for VFIO_DEVICE_TYPE_PCI devices.
- Added SaveVMHandlers and implemented all basic functions required for live
  migration.
- Added VM state change handler to know running or stopped state of VM.
- Added migration state change notifier to get notification on migration state
  change. This state is translated to VFIO device state and conveyed to vendor
  driver.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/Makefile.objs         |   2 +-
 hw/vfio/migration.c           | 716 ++++++++++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h |  23 ++
 3 files changed, 740 insertions(+), 1 deletion(-)
 create mode 100644 hw/vfio/migration.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index a2e7a0a7cf02..6206ad47e90e 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,6 +1,6 @@
 ifeq ($(CONFIG_LINUX), y)
 obj-$(CONFIG_SOFTMMU) += common.o
-obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
+obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o migration.o
 obj-$(CONFIG_VFIO_CCW) += ccw.o
 obj-$(CONFIG_SOFTMMU) += platform.o
 obj-$(CONFIG_VFIO_XGMAC) += calxeda-xgmac.o
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
new file mode 100644
index 000000000000..8a4f515226e0
--- /dev/null
+++ b/hw/vfio/migration.c
@@ -0,0 +1,716 @@
+/*
+ * Migration support for VFIO devices
+ *
+ * Copyright NVIDIA, Inc. 2018
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <linux/vfio.h>
+#include <sys/ioctl.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "cpu.h"
+#include "migration/migration.h"
+#include "migration/qemu-file.h"
+#include "migration/register.h"
+#include "migration/blocker.h"
+#include "migration/misc.h"
+#include "qapi/error.h"
+#include "exec/ramlist.h"
+#include "exec/ram_addr.h"
+#include "pci.h"
+
+/*
+ * Flags used as delimiter:
+ * 0xffffffff => MSB 32-bit all 1s
+ * 0xef10     => emulated (virtual) function IO
+ * 0x0000     => 16-bits reserved for flags
+ */
+#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
+#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
+#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
+
+static void vfio_migration_region_exit(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+
+    if (!migration) {
+        return;
+    }
+
+    if (migration->region.buffer.size) {
+        vfio_region_exit(&migration->region.buffer);
+        vfio_region_finalize(&migration->region.buffer);
+    }
+    g_free(vbasedev->migration);
+}
+
+static int vfio_migration_region_init(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration;
+    Object *obj = NULL;
+    int ret;
+    struct vfio_device_migration_info migration_info = {
+        .argsz = sizeof(migration_info),
+        .flags = VFIO_MIGRATION_GET_REGION,
+    };
+
+    /* Migration support added for PCI device only */
+    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
+        VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+
+        obj = OBJECT(vdev);
+    } else {
+        return -EINVAL;
+    }
+
+    ret = ioctl(vbasedev->fd, VFIO_DEVICE_MIGRATION_INFO, &migration_info);
+    if (ret < 0) {
+        error_report("Failed to get migration region info: %s",
+                     strerror(errno));
+        return ret;
+    }
+
+    if (!migration_info.size || !migration_info.region_index) {
+        error_report("Incorrect migration region params index: %d, size: 0x%llx",
+                     migration_info.region_index, migration_info.size);
+        return -EINVAL;
+    }
+
+    vbasedev->migration = g_new0(VFIOMigration, 1);
+    migration = vbasedev->migration;
+
+    migration->region.index = migration_info.region_index;
+
+    ret = vfio_region_setup(obj, vbasedev,
+                            &migration->region.buffer,
+                            migration_info.region_index,
+                            "migration");
+    if (ret != 0) {
+        error_report("%s: vfio_region_setup(%d): %s",
+                __func__, migration_info.region_index, strerror(-ret));
+        goto err;
+    }
+
+    if (migration->region.buffer.mmaps == NULL) {
+        ret = -EINVAL;
+        error_report("%s: Migration region (%d) not mappable: %s",
+                __func__, migration_info.region_index, strerror(-ret));
+        goto err;
+    }
+
+    ret = vfio_region_mmap(&migration->region.buffer);
+    if (ret != 0) {
+        error_report("%s: vfio_region_mmap(%d): %s", __func__,
+                migration_info.region_index, strerror(-ret));
+        goto err;
+    }
+    assert(migration->region.buffer.mmaps[0].mmap != NULL);
+
+    return 0;
+
+err:
+    vfio_migration_region_exit(vbasedev);
+    return ret;
+}
+
+static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
+{
+    int ret = 0;
+    struct vfio_device_migration_info migration_info = {
+        .argsz = sizeof(migration_info),
+        .flags = VFIO_MIGRATION_SET_STATE,
+        .device_state = state,
+    };
+
+    if (vbasedev->device_state == state) {
+        return ret;
+    }
+
+    ret = ioctl(vbasedev->fd, VFIO_DEVICE_MIGRATION_INFO, &migration_info);
+    if (ret < 0) {
+        error_report("Failed to set migration state %d %s",
+                     ret, strerror(errno));
+        return ret;
+    }
+
+    vbasedev->device_state = state;
+    return ret;
+}
+
+void vfio_get_dirty_page_list(VFIODevice *vbasedev,
+                              uint64_t start_addr,
+                              uint64_t pfn_count)
+{
+    uint64_t count = 0;
+    int ret;
+    struct vfio_device_migration_info *migration_info;
+    uint64_t bitmap_size;
+
+    bitmap_size = (BITS_TO_LONGS(pfn_count) + 1) * sizeof(unsigned long);
+
+    /* g_malloc0() aborts on failure, so no NULL check or memset is needed */
+    migration_info = g_malloc0(sizeof(*migration_info) + bitmap_size);
+
+    migration_info->flags = VFIO_MIGRATION_GET_DIRTY_PFNS;
+    migration_info->start_addr = start_addr;
+    migration_info->pfn_count = pfn_count;
+    migration_info->argsz = sizeof(*migration_info) + bitmap_size;
+
+    ret = ioctl(vbasedev->fd, VFIO_DEVICE_MIGRATION_INFO, migration_info);
+    if (ret < 0) {
+        error_report("Failed to get dirty pages bitmap %d %s",
+                ret, strerror(errno));
+        g_free(migration_info);
+        return;
+    }
+
+    if (migration_info->pfn_count) {
+        cpu_physical_memory_set_dirty_lebitmap(
+                (unsigned long *)&migration_info->dirty_bitmap,
+                migration_info->start_addr, migration_info->pfn_count);
+        count +=  migration_info->pfn_count;
+    }
+    g_free(migration_info);
+}
+
+static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
+
+    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
+        VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+        PCIDevice *pdev = &vdev->pdev;
+        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
+        bool msi_64bit;
+        int i;
+
+        for (i = 0; i < PCI_ROM_SLOT; i++) {
+            uint32_t bar;
+
+            bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
+            qemu_put_be32(f, bar);
+        }
+
+        msi_flags = pci_default_read_config(pdev,
+                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
+        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
+
+        msi_addr_lo = pci_default_read_config(pdev,
+                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
+        qemu_put_be32(f, msi_addr_lo);
+
+        if (msi_64bit) {
+            msi_addr_hi = pci_default_read_config(pdev,
+                                            pdev->msi_cap + PCI_MSI_ADDRESS_HI,
+                                            4);
+        }
+        qemu_put_be32(f, msi_addr_hi);
+
+        msi_data = pci_default_read_config(pdev,
+                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
+                2);
+        qemu_put_be32(f, msi_data);
+    }
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    return qemu_file_get_error(f);
+}
+
+static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
+        VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+        PCIDevice *pdev = &vdev->pdev;
+        uint32_t pci_cmd;
+        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
+        bool msi_64bit;
+        int i;
+
+        /* restore PCI BAR configuration */
+        pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
+        vfio_pci_write_config(pdev, PCI_COMMAND,
+                         pci_cmd & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
+        for (i = 0; i < PCI_ROM_SLOT; i++) {
+            uint32_t bar = qemu_get_be32(f);
+
+            vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
+        }
+        vfio_pci_write_config(pdev, PCI_COMMAND,
+                              pci_cmd | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
+
+        /* restore msi configuration */
+        msi_flags = pci_default_read_config(pdev,
+                                            pdev->msi_cap + PCI_MSI_FLAGS,
+                                            2);
+        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
+
+        vfio_pci_write_config(&vdev->pdev,
+                              pdev->msi_cap + PCI_MSI_FLAGS,
+                              msi_flags & ~PCI_MSI_FLAGS_ENABLE,
+                              2);
+
+        msi_addr_lo = qemu_get_be32(f);
+        vfio_pci_write_config(pdev,
+                              pdev->msi_cap + PCI_MSI_ADDRESS_LO,
+                              msi_addr_lo,
+                              4);
+
+        msi_addr_hi = qemu_get_be32(f);
+        if (msi_64bit) {
+            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
+                                  msi_addr_hi, 4);
+        }
+        msi_data = qemu_get_be32(f);
+        vfio_pci_write_config(pdev,
+                              pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 :
+                                                           PCI_MSI_DATA_32),
+                              msi_data,
+                              2);
+
+        vfio_pci_write_config(&vdev->pdev,
+                              pdev->msi_cap + PCI_MSI_FLAGS,
+                              msi_flags | PCI_MSI_FLAGS_ENABLE,
+                              2);
+    }
+
+    if (qemu_get_be64(f) != VFIO_MIG_FLAG_END_OF_STATE) {
+        error_report("%s: Wrong end of block", __func__);
+        return -EINVAL;
+    }
+
+    return qemu_file_get_error(f);
+}
+
+/* ---------------------------------------------------------------------- */
+
+static bool vfio_is_active_iterate(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    if (vbasedev->vm_running && vbasedev->migration &&
+        (vbasedev->migration->pending_precopy_only != 0)) {
+        return true;
+    }
+
+    if (!vbasedev->vm_running && vbasedev->migration &&
+        (vbasedev->migration->pending_postcopy != 0)) {
+        return true;
+    }
+
+    return false;
+}
+
+static int vfio_save_setup(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    int ret;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
+
+    qemu_mutex_lock_iothread();
+    ret = vfio_migration_region_init(vbasedev);
+    qemu_mutex_unlock_iothread();
+    if (ret) {
+        return ret;
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    return 0;
+}
+
+static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    uint8_t *buf = (uint8_t *)migration->region.buffer.mmaps[0].mmap;
+    int ret;
+    struct vfio_device_migration_info migration_info = {
+        .argsz = sizeof(migration_info),
+        .flags = VFIO_MIGRATION_GET_BUFFER,
+    };
+
+    ret = ioctl(vbasedev->fd, VFIO_DEVICE_MIGRATION_INFO, &migration_info);
+    if (ret < 0) {
+        error_report("Failed to get migration buffer information %s",
+                     strerror(errno));
+        return ret;
+    }
+
+    qemu_put_be64(f, migration_info.bytes_written);
+
+    if (migration_info.bytes_written) {
+        qemu_put_buffer(f, buf, migration_info.bytes_written);
+    }
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    return migration_info.bytes_written;
+}
+
+static int vfio_save_iterate(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    int ret;
+
+    ret = vfio_save_buffer(f, vbasedev);
+    if (ret < 0) {
+        error_report("vfio_save_buffer failed %s",
+                     strerror(errno));
+        return ret;
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    return ret;
+}
+
+static void vfio_update_pending(VFIODevice *vbasedev, uint64_t threshold_size)
+{
+    struct vfio_device_migration_info migration_info;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    migration_info.argsz = sizeof(migration_info);
+    migration_info.flags = VFIO_MIGRATION_GET_PENDING;
+    migration_info.threshold_size = threshold_size;
+
+    ret = ioctl(vbasedev->fd, VFIO_DEVICE_MIGRATION_INFO, &migration_info);
+    if (ret < 0) {
+        error_report("Failed to get pending bytes %s",
+                     strerror(errno));
+        return;
+    }
+
+    migration->pending_precopy_only = migration_info.pending_precopy_only;
+    migration->pending_compatible = migration_info.pending_compatible;
+    migration->pending_postcopy = migration_info.pending_postcopy_only;
+
+    return;
+}
+
+static void vfio_save_pending(QEMUFile *f, void *opaque,
+                              uint64_t threshold_size,
+                              uint64_t *res_precopy_only,
+                              uint64_t *res_compatible,
+                              uint64_t *res_postcopy_only)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    vfio_update_pending(vbasedev, threshold_size);
+
+    *res_precopy_only += migration->pending_precopy_only;
+    *res_compatible += migration->pending_compatible;
+    *res_postcopy_only += migration->pending_postcopy;
+}
+
+static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    MigrationState *ms = migrate_get_current();
+    int ret;
+
+    if (vbasedev->vm_running) {
+        vbasedev->vm_running = 0;
+    }
+
+    ret = vfio_migration_set_state(vbasedev,
+                                 VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY_ACTIVE);
+    if (ret) {
+        error_report("Failed to set state STOPNCOPY_ACTIVE");
+        return ret;
+    }
+
+    ret = vfio_save_device_config_state(f, opaque);
+    if (ret) {
+        return ret;
+    }
+
+    /*
+     * Drain the remaining device state: keep reading from the vendor
+     * driver until it reports no pending stop-and-copy data.
+     */
+    do {
+        vfio_update_pending(vbasedev, ms->threshold_size);
+
+        if (vfio_is_active_iterate(opaque)) {
+            ret = vfio_save_buffer(f, vbasedev);
+            if (ret < 0) {
+                error_report("Failed to save buffer");
+                break;
+            } else if (ret == 0) {
+                break;
+            }
+        }
+    } while ((migration->pending_compatible + migration->pending_postcopy) > 0);
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    ret = vfio_migration_set_state(vbasedev,
+                                   VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED);
+    if (ret) {
+        error_report("Failed to set state SAVE_COMPLETED");
+        return ret;
+    }
+    return ret;
+}
+
+static void vfio_save_cleanup(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    vfio_migration_region_exit(vbasedev);
+}
+
+/*
+ * Parse the records emitted by the save side: a DEV_CONFIG_STATE or
+ * DEV_SETUP_STATE marker, or a be64 byte count followed by that much
+ * device data, until the END_OF_STATE delimiter is reached.
+ */
+static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    uint8_t *buf = (uint8_t *)migration->region.buffer.mmaps[0].mmap;
+    int ret;
+    uint64_t data;
+
+    data = qemu_get_be64(f);
+    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
+        if (data == VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
+            ret = vfio_load_device_config_state(f, opaque);
+            if (ret) {
+                return ret;
+            }
+        } else if (data == VFIO_MIG_FLAG_DEV_SETUP_STATE) {
+            data = qemu_get_be64(f);
+            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
+                return 0;
+            } else {
+                error_report("SETUP STATE: EOS not found 0x%"PRIx64, data);
+                return -EINVAL;
+            }
+        } else if (data != 0) {
+            struct vfio_device_migration_info migration_info = {
+                .argsz = sizeof(migration_info),
+                .flags = VFIO_MIGRATION_SET_BUFFER,
+            };
+
+            if (data > migration->region.buffer.mmaps[0].size) {
+                error_report("Migration buffer too small for 0x%"PRIx64
+                             " bytes of device state", data);
+                return -EINVAL;
+            }
+
+            qemu_get_buffer(f, buf, data);
+            migration_info.bytes_written = data;
+
+            ret = ioctl(vbasedev->fd,
+                        VFIO_DEVICE_MIGRATION_INFO,
+                        &migration_info);
+            if (ret < 0) {
+                error_report("Failed to set migration buffer information %s",
+                              strerror(errno));
+                return ret;
+            }
+        }
+
+        ret = qemu_file_get_error(f);
+        if (ret) {
+            return ret;
+        }
+        data = qemu_get_be64(f);
+    }
+
+    return 0;
+}
+
+static int vfio_load_setup(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    int ret;
+
+    ret = vfio_migration_set_state(vbasedev,
+                                    VFIO_DEVICE_STATE_MIGRATION_RESUME);
+    if (ret) {
+        error_report("Failed to set state RESUME");
+        return ret;
+    }
+
+    ret = vfio_migration_region_init(vbasedev);
+    if (ret) {
+        error_report("Failed to initialise migration region");
+        return ret;
+    }
+
+    return 0;
+}
+
+static int vfio_load_cleanup(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    int ret = 0;
+
+    ret = vfio_migration_set_state(vbasedev,
+                                 VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED);
+    if (ret) {
+        error_report("Failed to set state RESUME_COMPLETED");
+    }
+
+    vfio_migration_region_exit(vbasedev);
+    return ret;
+}
+
+static SaveVMHandlers savevm_vfio_handlers = {
+    .save_setup = vfio_save_setup,
+    .save_live_iterate = vfio_save_iterate,
+    .save_live_complete_precopy = vfio_save_complete_precopy,
+    .save_live_pending = vfio_save_pending,
+    .save_cleanup = vfio_save_cleanup,
+    .load_state = vfio_load_state,
+    .load_setup = vfio_load_setup,
+    .load_cleanup = vfio_load_cleanup,
+    .is_active_iterate = vfio_is_active_iterate,
+};
+
+static void vfio_vmstate_change(void *opaque, int running, RunState state)
+{
+    VFIODevice *vbasedev = opaque;
+
+    if ((vbasedev->vm_running != running) && running) {
+        int ret;
+
+        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING);
+        if (ret) {
+            error_report("Failed to set state RUNNING");
+        }
+    }
+
+    vbasedev->vm_running = running;
+}
+
+static void vfio_migration_state_notifier(Notifier *notifier, void *data)
+{
+    MigrationState *s = data;
+    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
+    int ret;
+
+    switch (s->state) {
+    case MIGRATION_STATUS_SETUP:
+        ret = vfio_migration_set_state(vbasedev,
+                                       VFIO_DEVICE_STATE_MIGRATION_SETUP);
+        if (ret) {
+            error_report("Failed to set state SETUP");
+        }
+        return;
+
+    case MIGRATION_STATUS_ACTIVE:
+        if (vbasedev->device_state == VFIO_DEVICE_STATE_MIGRATION_SETUP) {
+            if (vbasedev->vm_running) {
+                ret = vfio_migration_set_state(vbasedev,
+                                    VFIO_DEVICE_STATE_MIGRATION_PRECOPY_ACTIVE);
+                if (ret) {
+                    error_report("Failed to set state PRECOPY_ACTIVE");
+                }
+            } else {
+                ret = vfio_migration_set_state(vbasedev,
+                                 VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY_ACTIVE);
+                if (ret) {
+                    error_report("Failed to set state STOPNCOPY_ACTIVE");
+                }
+            }
+        } else {
+            ret = vfio_migration_set_state(vbasedev,
+                                           VFIO_DEVICE_STATE_MIGRATION_RESUME);
+            if (ret) {
+                error_report("Failed to set state RESUME");
+            }
+        }
+        return;
+
+    case MIGRATION_STATUS_CANCELLING:
+    case MIGRATION_STATUS_CANCELLED:
+        ret = vfio_migration_set_state(vbasedev,
+                                       VFIO_DEVICE_STATE_MIGRATION_CANCELLED);
+        if (ret) {
+            error_report("Failed to set state CANCELLED");
+        }
+        return;
+
+    case MIGRATION_STATUS_FAILED:
+        ret = vfio_migration_set_state(vbasedev,
+                                       VFIO_DEVICE_STATE_MIGRATION_FAILED);
+        if (ret) {
+            error_report("Failed to set state FAILED");
+        }
+        return;
+    }
+}
+
+static int vfio_migration_init(VFIODevice *vbasedev)
+{
+    register_savevm_live(NULL, "vfio", -1, 1, &savevm_vfio_handlers, vbasedev);
+    vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
+                                                          vbasedev);
+
+    vbasedev->migration_state.notify = vfio_migration_state_notifier;
+    add_migration_state_change_notifier(&vbasedev->migration_state);
+
+    return 0;
+}
+
+
+/* ---------------------------------------------------------------------- */
+
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
+{
+    struct vfio_device_migration_info probe;
+    Error *local_err = NULL;
+    int ret;
+
+    memset(&probe, 0, sizeof(probe));
+    probe.argsz = sizeof(probe);
+    probe.flags = VFIO_MIGRATION_PROBE;
+    ret = ioctl(vbasedev->fd, VFIO_DEVICE_MIGRATION_INFO, &probe);
+
+    if (ret == 0) {
+        return vfio_migration_init(vbasedev);
+    }
+
+    error_setg(&vbasedev->migration_blocker,
+               "VFIO device doesn't support migration");
+    ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        error_free(vbasedev->migration_blocker);
+        vbasedev->migration_blocker = NULL;
+        return ret;
+    }
+
+    return 0;
+}
+
+void vfio_migration_finalize(VFIODevice *vbasedev)
+{
+    if (vbasedev->vm_state) {
+        qemu_del_vm_change_state_handler(vbasedev->vm_state);
+        remove_migration_state_change_notifier(&vbasedev->migration_state);
+    }
+
+    if (vbasedev->migration_blocker) {
+        migrate_del_blocker(vbasedev->migration_blocker);
+        error_free(vbasedev->migration_blocker);
+    }
+}
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index a9036929b220..ab8217c9e249 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -30,6 +30,8 @@
 #include <linux/vfio.h>
 #endif
 
+#include "sysemu/sysemu.h"
+
 #define ERR_PREFIX "vfio error: %s: "
 #define WARN_PREFIX "vfio warning: %s: "
 
@@ -57,6 +59,16 @@ typedef struct VFIORegion {
     uint8_t nr; /* cache the region number for debug */
 } VFIORegion;
 
+typedef struct VFIOMigration {
+    struct {
+        VFIORegion buffer;
+        uint32_t index;
+    } region;
+    uint64_t pending_precopy_only;
+    uint64_t pending_compatible;
+    uint64_t pending_postcopy;
+} VFIOMigration;
+
 typedef struct VFIOAddressSpace {
     AddressSpace *as;
     QLIST_HEAD(, VFIOContainer) containers;
@@ -116,6 +128,12 @@ typedef struct VFIODevice {
     unsigned int num_irqs;
     unsigned int num_regions;
     unsigned int flags;
+    uint32_t device_state;
+    VMChangeStateEntry *vm_state;
+    int vm_running;
+    Notifier migration_state;
+    VFIOMigration *migration;
+    Error *migration_blocker;
 } VFIODevice;
 
 struct VFIODeviceOps {
@@ -193,4 +211,9 @@ int vfio_spapr_create_window(VFIOContainer *container,
 int vfio_spapr_remove_window(VFIOContainer *container,
                              hwaddr offset_within_address_space);
 
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
+void vfio_migration_finalize(VFIODevice *vbasedev);
+void vfio_get_dirty_page_list(VFIODevice *vbasedev, uint64_t start_addr,
+                               uint64_t pfn_count);
+
 #endif /* HW_VFIO_VFIO_COMMON_H */
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 36+ messages in thread

+++ b/include/hw/vfio/vfio-common.h
@@ -30,6 +30,8 @@
 #include <linux/vfio.h>
 #endif
 
+#include "sysemu/sysemu.h"
+
 #define ERR_PREFIX "vfio error: %s: "
 #define WARN_PREFIX "vfio warning: %s: "
 
@@ -57,6 +59,16 @@ typedef struct VFIORegion {
     uint8_t nr; /* cache the region number for debug */
 } VFIORegion;
 
+typedef struct VFIOMigration {
+    struct {
+        VFIORegion buffer;
+        uint32_t index;
+    } region;
+    uint64_t pending_precopy_only;
+    uint64_t pending_compatible;
+    uint64_t pending_postcopy;
+} VFIOMigration;
+
 typedef struct VFIOAddressSpace {
     AddressSpace *as;
     QLIST_HEAD(, VFIOContainer) containers;
@@ -116,6 +128,12 @@ typedef struct VFIODevice {
     unsigned int num_irqs;
     unsigned int num_regions;
     unsigned int flags;
+    uint32_t device_state;
+    VMChangeStateEntry *vm_state;
+    int vm_running;
+    Notifier migration_state;
+    VFIOMigration *migration;
+    Error *migration_blocker;
 } VFIODevice;
 
 struct VFIODeviceOps {
@@ -193,4 +211,9 @@ int vfio_spapr_create_window(VFIOContainer *container,
 int vfio_spapr_remove_window(VFIOContainer *container,
                              hwaddr offset_within_address_space);
 
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
+void vfio_migration_finalize(VFIODevice *vbasedev);
+void vfio_get_dirty_page_list(VFIODevice *vbasedev, uint64_t start_addr,
+                               uint64_t pfn_count);
+
 #endif /* HW_VFIO_VFIO_COMMON_H */
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC PATCH v1 3/4] Add vfio_listerner_log_sync to mark dirty pages
  2018-10-16 18:12 ` [Qemu-devel] " Kirti Wankhede
@ 2018-10-16 18:12   ` Kirti Wankhede
  -1 siblings, 0 replies; 36+ messages in thread
From: Kirti Wankhede @ 2018-10-16 18:12 UTC (permalink / raw)
  To: alex.williamson, cjia; +Cc: Kirti Wankhede, qemu-devel, kvm

vfio_listerner_log_sync gets the list of dirty pages from the vendor driver and
marks those pages dirty.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/common.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index fb396cf00ac4..817d93750337 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -697,9 +697,41 @@ static void vfio_listener_region_del(MemoryListener *listener,
     }
 }
 
+static void vfio_listerner_log_sync(MemoryListener *listener,
+                                    MemoryRegionSection *section)
+{
+    uint64_t start_addr, size, pfn_count;
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            switch (vbasedev->device_state) {
+            case VFIO_DEVICE_STATE_MIGRATION_PRECOPY_ACTIVE:
+            case VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY_ACTIVE:
+                    continue;
+
+            default:
+                    return;
+            }
+        }
+    }
+
+    start_addr = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+    size = int128_get64(section->size);
+    pfn_count = size >> TARGET_PAGE_BITS;
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            vfio_get_dirty_page_list(vbasedev, start_addr, pfn_count);
+        }
+    }
+}
+
 static const MemoryListener vfio_memory_listener = {
     .region_add = vfio_listener_region_add,
     .region_del = vfio_listener_region_del,
+    .log_sync = vfio_listerner_log_sync,
 };
 
 static void vfio_listener_release(VFIOContainer *container)
-- 
2.7.0


* [RFC PATCH v1 4/4] Make vfio-pci device migration capable.
  2018-10-16 18:12 ` [Qemu-devel] " Kirti Wankhede
@ 2018-10-16 18:12   ` Kirti Wankhede
  -1 siblings, 0 replies; 36+ messages in thread
From: Kirti Wankhede @ 2018-10-16 18:12 UTC (permalink / raw)
  To: alex.williamson, cjia; +Cc: Kirti Wankhede, qemu-devel, kvm

Call the vfio_migration_probe() and vfio_migration_finalize() functions for the
vfio-pci device to enable migration for VFIO PCI devices.
Remove the vfio_pci_vmstate structure.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/pci.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 6cbb8fa0549d..dd833f3f1830 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2835,6 +2835,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     vdev->vbasedev.ops = &vfio_pci_ops;
     vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
     vdev->vbasedev.dev = &vdev->pdev.qdev;
+    vdev->vbasedev.device_state = VFIO_DEVICE_STATE_NONE;
 
     tmp = g_strdup_printf("%s/iommu_group", vdev->vbasedev.sysfsdev);
     len = readlink(tmp, group_path, sizeof(group_path));
@@ -3046,10 +3047,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         }
     }
 
+    ret = vfio_migration_probe(&vdev->vbasedev, errp);
+
     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);
-
     return;
 
 out_teardown:
@@ -3085,6 +3087,8 @@ static void vfio_exitfn(PCIDevice *pdev)
 {
     VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
 
+    vdev->vbasedev.device_state = VFIO_DEVICE_STATE_NONE;
+
     vfio_unregister_req_notifier(vdev);
     vfio_unregister_err_notifier(vdev);
     pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
@@ -3094,6 +3098,7 @@ static void vfio_exitfn(PCIDevice *pdev)
     }
     vfio_teardown_msi(vdev);
     vfio_bars_exit(vdev);
+    vfio_migration_finalize(&vdev->vbasedev);
 }
 
 static void vfio_pci_reset(DeviceState *dev)
@@ -3199,11 +3204,6 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
-static const VMStateDescription vfio_pci_vmstate = {
-    .name = "vfio-pci",
-    .unmigratable = 1,
-};
-
 static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
@@ -3211,7 +3211,6 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 
     dc->reset = vfio_pci_reset;
     dc->props = vfio_pci_dev_properties;
-    dc->vmsd = &vfio_pci_vmstate;
     dc->desc = "VFIO-based PCI device assignment";
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
     pdc->realize = vfio_realize;
-- 
2.7.0


* Re: [RFC PATCH v1 1/4] VFIO KABI for migration interface
  2018-10-16 18:12   ` [Qemu-devel] " Kirti Wankhede
@ 2018-10-16 22:34     ` Alex Williamson
  -1 siblings, 0 replies; 36+ messages in thread
From: Alex Williamson @ 2018-10-16 22:34 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kvm, Juan Quintela, qemu-devel, Dr. David Alan Gilbert,
	Yulei Zhang, Wang, Zhi A

On Tue, 16 Oct 2018 23:42:35 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> - Added vfio_device_migration_info structure to use interact with vendor
>   driver.
> - Different flags are used to get or set migration related information
>   from/to vendor driver.
> Flag VFIO_MIGRATION_PROBE: To query if feature is supported
> Flag VFIO_MIGRATION_GET_REGION: To get migration region info
> Flag VFIO_MIGRATION_SET_STATE: To convey device state in vendor driver
> Flag VFIO_MIGRATION_GET_PENDING: To get pending bytes yet to be migrated
>   from vendor driver
> Flag VFIO_MIGRATION_GET_BUFFER: On this flag, vendor driver should write
>   data to migration region and return number of bytes written in the region
> Flag VFIO_MIGRATION_SET_BUFFER: In migration resume path, user space app
>   writes to migration region and communicates it to vendor driver with
>   this ioctl with this flag.
> Flag VFIO_MIGRATION_GET_DIRTY_PFNS: Get bitmap of dirty pages from vendor
>   driver from given start address
> 
> - Added enum for possible device states.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  linux-headers/linux/vfio.h | 91 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 91 insertions(+)
> 
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 3615a269d378..8e9045ed9aa8 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -602,6 +602,97 @@ struct vfio_device_ioeventfd {
>  
>  #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 16)
>  
> +/**
> + * VFIO_DEVICE_MIGRATION_INFO - _IOW(VFIO_TYPE, VFIO_BASE + 17,
> + *                              struct vfio_device_migration_info)

This is quite a bit more than an "INFO" ioctl.

> + * Flag VFIO_MIGRATION_PROBE:
> + *      To query if feature is supported
> + *
> + * Flag VFIO_MIGRATION_GET_REGION:
> + *      To get migration region info
> + *      region_index [output] : region index to be used for migration region
> + *      size [output]: size of migration region

Of course the migration region can describe itself as being used
for migration, so this is unnecessary.  The presence of that region
could also negate the need for a probe.

> + *
> + * Flag VFIO_MIGRATION_SET_STATE:
> + *      To set device state in vendor driver
> + *      device_state [input] : User space app sends device state to vendor
> + *           driver on state change

Valid states are the enum defined below, correct?

Does setting STOPNCOPY_ACTIVE stop any state change of the device or is
that expected to happen through other means?

What are the allowable state transitions?

How many bits in flags is a user allowed to set at once?

> + * Flag VFIO_MIGRATION_GET_PENDING:
> + *      To get pending bytes yet to be migrated from vendor driver
> + *      threshold_size [Input] : threshold of buffer in User space app.
> + *      pending_precopy_only [output] : pending data which must be migrated in
> + *          precopy phase or in stopped state, in other words - before target
> + *          vm start
> + *      pending_compatible [output] : pending data which may be migrated in any
> + *           phase
> + *      pending_postcopy_only [output] : pending data which must be migrated in
> + *           postcopy phase or in stopped state, in other words - after source
> + *           vm stop
> + *      Sum of pending_precopy_only, pending_compatible and
> + *      pending_postcopy_only is the whole amount of pending data.

What's the significance of the provided threshold, are the pending
bytes limited to threshold size?  It makes me nervous to define a
kernel API in terms of the internal API of a userspace program that can
change.  I wonder if it makes sense to define these in terms of the
state of the devices, pending initial data, runtime data, post-offline
data.

> + *
> + * Flag VFIO_MIGRATION_GET_BUFFER:
> + *      On this flag, vendor driver should write data to migration region and
> + *      return number of bytes written in the region.
> + *      bytes_written [output] : number of bytes written in migration buffer by
> + *          vendor driver

Does the data the user gets here depend on the device state set
earlier?  For example the user gets pending_precopy_only data while
PRECOPY_ACTIVE is the device state and pending_postcopy_only data
in STOPNCOPY_ACTIVE?  The user should continue to call GET_BUFFER
in a given state until the associated pending field reaches zero?
Jumping between the region and ioctl is rather awkward.

> + *
> + * Flag VFIO_MIGRATION_SET_BUFFER
> + *      In migration resume path, user space app writes to migration region and
> + *      communicates it to vendor driver with this ioctl with this flag.
> + *      bytes_written [Input] : number of bytes written in migration buffer by
> + *          user space app.
> + *
> + * Flag VFIO_MIGRATION_GET_DIRTY_PFNS
> + *      Get bitmap of dirty pages from vendor driver from given start address.
> + *      start_addr [Input] : start address
> + *      pfn_count [Input] : Total pfn count from start_addr for which dirty
> + *          bitmap is requested
> + *      dirty_bitmap [Output] : bitmap memory allocated by user space
> + *           application, vendor driver should return the bitmap with bits set
> + *           only for pages to be marked dirty.
> + * Return: 0 on success, -errno on failure.
> + */
> +
> +struct vfio_device_migration_info {
> +	__u32 argsz;
> +	__u32 flags;
> +#define VFIO_MIGRATION_PROBE            (1 << 0)
> +#define VFIO_MIGRATION_GET_REGION       (1 << 1)
> +#define VFIO_MIGRATION_SET_STATE        (1 << 2)
> +#define VFIO_MIGRATION_GET_PENDING      (1 << 3)
> +#define VFIO_MIGRATION_GET_BUFFER       (1 << 4)
> +#define VFIO_MIGRATION_SET_BUFFER       (1 << 5)
> +#define VFIO_MIGRATION_GET_DIRTY_PFNS   (1 << 6)
> +        __u32 region_index;	            /* region index */
> +        __u64 size;	                    /* size */
> +        __u32 device_state;                 /* VFIO device state */

We'd need to swap device_state and size for alignment or else this
struct could vary by compiler.

> +        __u64 pending_precopy_only;
> +        __u64 pending_compatible;
> +        __u64 pending_postcopy_only;
> +        __u64 threshold_size;
> +        __u64 bytes_written;
> +        __u64 start_addr;
> +        __u64 pfn_count;
> +	__u8  dirty_bitmap[];
> +};
> +
> +enum {
> +    VFIO_DEVICE_STATE_NONE,
> +    VFIO_DEVICE_STATE_RUNNING,
> +    VFIO_DEVICE_STATE_MIGRATION_SETUP,
> +    VFIO_DEVICE_STATE_MIGRATION_PRECOPY_ACTIVE,
> +    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY_ACTIVE,
> +    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
> +    VFIO_DEVICE_STATE_MIGRATION_RESUME,
> +    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
> +    VFIO_DEVICE_STATE_MIGRATION_FAILED,
> +    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
> +};
> +
> +#define VFIO_DEVICE_MIGRATION_INFO _IO(VFIO_TYPE, VFIO_BASE + 17)
> +

I'm not entirely sure what this ioctl buys us that we can't do with a
region alone.  For example we could define a migration region as:

struct vfio_region_migration {
	__u32 device_state;
	__u32 data_offset;
	__u64 data_size;
	__u64 pending_pre;
	__u64 pending_runtime;
	__u64 pending_post;
	__u64 bytes_available;
	__u64 bytes_written;
};

A sparse mmap capability in the region could define whether the data
area supports mmap, the above defined registers would only have
read/write access.  The user can write device_state the same as the
ioctl would set it.  The data_offset and size describe the offset in
the region where migration data is relayed and the size of that area.
Pending fields are as you have above, though I've omitted the threshold
because I don't understand it.  The user reading bytes_available is the
same as calling GET_BUFFER, bytes_written the same as SET_BUFFER.  I've
left out the dirty bitmap here, as it's a little less obvious to me how
it would work.  It could be a stand-alone ioctl, it could be
implemented as another region, or it could simply be a different offset
within this region.  The dirty bitmap start address could simply be
calculated as the offset into that region where a pread begins and the
extent is the size of that pread.  With a bit per page, we're only
looking at a 32MB region per 1TB of guest physical address space, so it
doesn't seem too terrible even if guest memory is sparse.  Maybe the
extremes are still problematic though, but if it were part of the same
region above we could solve that with another register written by the
user to indicate the base offset of the dirty bitmap window.  For
example if the window is 32MB and page size 4k then it can index 1TB
and the user would write 0 (default) to the register for pfns from
0 to (1TB - 1), 1 for pfns from 1TB to (2TB - 1), etc.

I don't see that this interface does anything to address the
compatibility concerns that were discussed in the previous iterations
from Intel though, nor is it documented exactly where and how a vendor
driver would indicate a migration failure if it had its own internal
consistency or compatibility checks fail.  Thanks,

Alex


* Re: [Qemu-devel] [RFC PATCH v1 1/4] VFIO KABI for migration interface
@ 2018-10-16 22:34     ` Alex Williamson
  0 siblings, 0 replies; 36+ messages in thread
From: Alex Williamson @ 2018-10-16 22:34 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, qemu-devel, kvm, Yulei Zhang, Dr. David Alan Gilbert,
	Juan Quintela, Wang, Zhi A

On Tue, 16 Oct 2018 23:42:35 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> - Added vfio_device_migration_info structure to use interact with vendor
>   driver.
> - Different flags are used to get or set migration related information
>   from/to vendor driver.
> Flag VFIO_MIGRATION_PROBE: To query if feature is supported
> Flag VFIO_MIGRATION_GET_REGION: To get migration region info
> Flag VFIO_MIGRATION_SET_STATE: To convey device state in vendor driver
> Flag VFIO_MIGRATION_GET_PENDING: To get pending bytes yet to be migrated
>   from vendor driver
> Flag VFIO_MIGRATION_GET_BUFFER: On this flag, vendor driver should write
>   data to migration region and return number of bytes written in the region
> Flag VFIO_MIGRATION_SET_BUFFER: In migration resume path, user space app
>   writes to migration region and communicates it to vendor driver with
>   this ioctl with this flag.
> Flag VFIO_MIGRATION_GET_DIRTY_PFNS: Get bitmap of dirty pages from vendor
>   driver from given start address
> 
> - Added enum for possible device states.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  linux-headers/linux/vfio.h | 91 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 91 insertions(+)
> 
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 3615a269d378..8e9045ed9aa8 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -602,6 +602,97 @@ struct vfio_device_ioeventfd {
>  
>  #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 16)
>  
> +/**
> + * VFIO_DEVICE_MIGRATION_INFO - _IOW(VFIO_TYPE, VFIO_BASE + 17,
> + *                              struct vfio_device_migration_info)

This is quite a bit more than an "INFO" ioctl.

> + * Flag VFIO_MIGRATION_PROBE:
> + *      To query if feature is supported
> + *
> + * Flag VFIO_MIGRATION_GET_REGION:
> + *      To get migration region info
> + *      region_index [output] : region index to be used for migration region
> + *      size [output]: size of migration region

Of course the region migration region can describe itself as being used
for migration, so this is unnecessary.  The presence of that region
could also negate the need for a probe.

> + *
> + * Flag VFIO_MIGRATION_SET_STATE:
> + *      To set device state in vendor driver
> + *      device_state [input] : User space app sends device state to vendor
> + *           driver on state change

Valid states are the enum defined below, correct?

Does setting STOPNCOPY_ACTIVE stop any state change of the device or is
that expected to happen through other means?

What are the allowable state transitions?

How many bits in flags is a user allowed to set at once?

> + * Flag VFIO_MIGRATION_GET_PENDING:
> + *      To get pending bytes yet to be migrated from vendor driver
> + *      threshold_size [Input] : threshold of buffer in User space app.
> + *      pending_precopy_only [output] : pending data which must be migrated in
> + *          precopy phase or in stopped state, in other words - before target
> + *          vm start
> + *      pending_compatible [output] : pending data which may be migrated in any
> + *           phase
> + *      pending_postcopy_only [output] : pending data which must be migrated in
> + *           postcopy phase or in stopped state, in other words - after source
> + *           vm stop
> + *      Sum of pending_precopy_only, pending_compatible and
> + *      pending_postcopy_only is the whole amount of pending data.

What's the significance of the provided threshold, are the pending
bytes limited to threshold size?  It makes me nervous to define a
kernel API in terms of the internal API of a userspace program that can
change.  I wonder if it makes sense to define these in terms of the
state of the devices, pending initial data, runtime data, post-offline
data.

> + *
> + * Flag VFIO_MIGRATION_GET_BUFFER:
> + *      On this flag, vendor driver should write data to migration
> region and
> + *      return number of bytes written in the region.
> + *      bytes_written [output] : number of bytes written in
> migration buffer by
> + *          vendor driver

Does the data the user gets here depend on the device state set
earlier?  For example the user gets pending_precopy_only data while
PRECOPY_ACTIVE is the device state and pending_postcopy_only data
in STOPNCOPY_ACTIVE?  The user should continue to call GET_BUFFER
in a given state until the associated pending field reaches zero?
Jumping between the region and ioctl is rather awkward.

> + *
> + * Flag VFIO_MIGRATION_SET_BUFFER
> + *      In migration resume path, user space app writes to migration
> region and
> + *      communicates it to vendor driver with this ioctl with this
> flag.
> + *      bytes_written [Input] : number of bytes written in migration
> buffer by
> + *          user space app.
> + *
> + * Flag VFIO_MIGRATION_GET_DIRTY_PFNS
> + *      Get bitmap of dirty pages from vendor driver from given
> start address.
> + *      start_addr [Input] : start address
> + *      pfn_count [Input] : Total pfn count from start_addr for
> which dirty
> + *          bitmap is requested
> + *      dirty_bitmap [Output] : bitmap memory allocated by user space
> + *           application, vendor driver should return the bitmap
> with bits set
> + *           only for pages to be marked dirty.
> + * Return: 0 on success, -errno on failure.
> + */
> +
> +struct vfio_device_migration_info {
> +	__u32 argsz;
> +	__u32 flags;
> +#define VFIO_MIGRATION_PROBE            (1 << 0)
> +#define VFIO_MIGRATION_GET_REGION       (1 << 1)
> +#define VFIO_MIGRATION_SET_STATE        (1 << 2)
> +#define VFIO_MIGRATION_GET_PENDING      (1 << 3)
> +#define VFIO_MIGRATION_GET_BUFFER       (1 << 4)
> +#define VFIO_MIGRATION_SET_BUFFER       (1 << 5)
> +#define VFIO_MIGRATION_GET_DIRTY_PFNS   (1 << 6)
> +        __u32 region_index;	            /* region index */
> +        __u64 size;	                    /* size */
> +        __u32 device_state;                 /* VFIO device state */

We'd need to swap device_state and size for alignment or else this
struct could vary by compiler.

> +        __u64 pending_precopy_only;
> +        __u64 pending_compatible;
> +        __u64 pending_postcopy_only;
> +        __u64 threshold_size;
> +        __u64 bytes_written;
> +        __u64 start_addr;
> +        __u64 pfn_count;
> +	__u8  dirty_bitmap[];
> +};
> +
> +enum {
> +    VFIO_DEVICE_STATE_NONE,
> +    VFIO_DEVICE_STATE_RUNNING,
> +    VFIO_DEVICE_STATE_MIGRATION_SETUP,
> +    VFIO_DEVICE_STATE_MIGRATION_PRECOPY_ACTIVE,
> +    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY_ACTIVE,
> +    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
> +    VFIO_DEVICE_STATE_MIGRATION_RESUME,
> +    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
> +    VFIO_DEVICE_STATE_MIGRATION_FAILED,
> +    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
> +};
> +
> +#define VFIO_DEVICE_MIGRATION_INFO _IO(VFIO_TYPE, VFIO_BASE + 17)
> +

I'm not entirely sure what this ioctl buys us that we can't do with a
region alone.  For example we could define a migration region as:

struct vfio_region_migration {
	__u32 device_state;
	__u32 data_offset;
	__u64 data_size;
	__u64 pending_pre;
	__u64 pending_runtime;
	__u64 pending_post;
	__u64 bytes_available;
	__u64 bytes_written;
};

A sparse mmap capability in the region could define whether the data
area supports mmap, the above defined registers would only have
read/write access.  The user can write device_state the same as the
ioctl would set it.  The data_offset and size describe the offset in
the region where migration data is relayed and the size of that area.
Pending fields are as you have above, though I've omitted the threshold
because I don't understand it.  The user reading bytes_available is the
same as calling GET_BUFFER, bytes_written the same as SET_BUFFER.  I've
left out the dirty bitmap here, as it's a little less obvious to me how
it would work.  It could be a stand-alone ioctl, it could be
implemented as another region, or it could simply be a different offset
within this region.  The dirty bitmap start address could simply be
calculated as the offset into that region where a pread begins and the
extent is the size of that pread.  With a bit per page, we're only
looking at a 32MB region per 1TB of guest physical address space, so it
doesn't seem too terrible even if guest memory is sparse.  Maybe the
extremes are still problematic though, but if it were part of the same
region above we could solve that with another register written by the
user to indicate the base offset of the dirty bitmap window.  For
example if the window is 32MB and page size 4k then it can index 1TB
and the user would write 0 (default) to the register for pfns from
0 to (1TB - 1), 1 for pfns from 1TB to (2TB - 1), etc.
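The windowing arithmetic above can be sketched as follows; the constants and helper names are illustrative only, not from any kernel header:

```c
#include <stdint.h>

/* Hypothetical constants for the scheme described above: a 32MB bitmap
 * window, one bit per 4KB page, so each window spans 1TB of guest
 * physical address space. */
#define WINDOW_BYTES     (32ULL << 20)
#define PFNS_PER_WINDOW  (WINDOW_BYTES * 8)   /* 2^28 pfns == 1TB span */

/* Value the user would write to the window-base register to select the
 * window containing this pfn. */
static inline uint64_t pfn_to_window(uint64_t pfn)
{
    return pfn / PFNS_PER_WINDOW;
}

/* Byte offset of the pfn's bit inside the selected window; a pread at
 * this offset covers the page's dirty bit. */
static inline uint64_t pfn_to_byte(uint64_t pfn)
{
    return (pfn % PFNS_PER_WINDOW) / 8;
}

/* Bit position within that byte. */
static inline unsigned pfn_to_bit(uint64_t pfn)
{
    return pfn % 8;
}
```

With these numbers, pfns 0 through (2^28 - 1) fall in window 0 and the next 1TB's worth fall in window 1, matching the register scheme suggested above.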

I don't see that this interface does anything to address the
compatibility concerns that were discussed in the previous iterations
from Intel though, nor is it documented exactly where and how a vendor
driver would indicate a migration failure if it had its own internal
consistency or compatibility checks fail.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v1 0/4] Add migration support for VFIO device
  2018-10-16 18:12 ` [Qemu-devel] " Kirti Wankhede
@ 2018-10-17  8:49   ` Cornelia Huck
  -1 siblings, 0 replies; 36+ messages in thread
From: Cornelia Huck @ 2018-10-17  8:49 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Tony Krowiak, linux-s390, cjia, Pierre Morel, kvm, qemu-devel,
	Halil Pasic, alex.williamson

On Tue, 16 Oct 2018 23:42:34 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Add migration support for VFIO device

I'd love to take a deeper look at this; but sadly, I'm currently low on
spare time, and therefore will only add some general remarks.

> 
> This Patch set include patches as below:
> - Define KABI for VFIO device for migration support.
> - Generic migration functionality for VFIO device.
>   * This patch set adds functionality only for PCI devices, but can be
>     extended to other VFIO devices.

I've been thinking about how best to add migration to vfio-ccw; it
might be quite different from what we do for vfio-pci, but I hope that
we can still use some common infrastructure.

Basically, we probably need something like the following:

- Wait until outstanding channel programs are finished, and don't allow
  new ones. Don't report back to userspace (or maybe do; we just don't
  want to fence off too many new requests.)
- Collect subchannel and related state, copy it over. That probably
  fits in with the interface you're proposing.
- Re-setup on the other side and get going again. The problem here
  might be state that the vfio-ccw driver can't be aware of (like
  replaying storage server configurations). Maybe we can send CRWs
  (notifications) to the guest and leverage the existing suspend/resume
  driver code for channel devices.

For vfio-ap, I frankly have no idea how migration will work, given the
setup with the matrix device and the interesting information in the SIE
control blocks (but maybe it can just replay from the info in the
matrix device?) Keys etc. are likely to be the bigger problem here.

Cc:ing some s390 folks for awareness.

>   * Added all the basic functions required for pre-copy, stop-and-copy and
>     resume phases of migration.
>   * Added state change notifier and from that notifier function, VFIO
>     device's state changed is conveyed to VFIO vendor driver.

I assume that this will, in general, be transparent for the guest; but
maybe there'll be a need for some interaction in special cases?

>   * During save setup phase and resume/load setup phase, migration region
>     is queried from vendor driver and is mmaped by QEMU. This region is
>     used to read/write data from and to vendor driver.
>   * .save_live_pending, .save_live_iterate and .is_active_iterate are
>     implemented to use QEMU's functionality of iteration during pre-copy
>     phase.
>   * In .save_live_complete_precopy, that is in stop-and-copy phase,
>     iteration to read data from vendor driver is implemented till pending
>     bytes returned by vendor driver are not zero.
>   * .save_cleanup and .load_cleanup are implemented to unmap migration
>     region that was setup duing setup phase.
>   * Added function to get dirty pages bitmap from vendor driver.
> - Add vfio_listerner_log_sync to mark dirty pages.
> - Make VFIO PCI device migration capable.
> 
> Thanks,
> Kirti
> 
> Kirti Wankhede (4):
>   VFIO KABI for migration interface
>   Add migration functions for VFIO devices
>   Add vfio_listerner_log_sync to mark dirty pages
>   Make vfio-pci device migration capable.
> 
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/common.c              |  32 ++
>  hw/vfio/migration.c           | 716 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.c                 |  13 +-
>  include/hw/vfio/vfio-common.h |  23 ++
>  linux-headers/linux/vfio.h    |  91 ++++++
>  6 files changed, 869 insertions(+), 8 deletions(-)
>  create mode 100644 hw/vfio/migration.c
> 


* Re: [RFC PATCH v1 2/4] Add migration functions for VFIO devices
  2018-10-16 18:12   ` [Qemu-devel] " Kirti Wankhede
@ 2018-10-17  9:00     ` Cornelia Huck
  -1 siblings, 0 replies; 36+ messages in thread
From: Cornelia Huck @ 2018-10-17  9:00 UTC (permalink / raw)
  To: Kirti Wankhede; +Cc: alex.williamson, cjia, qemu-devel, kvm

On Tue, 16 Oct 2018 23:42:36 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> - Migration function are implemented for VFIO_DEVICE_TYPE_PCI device.
> - Added SaveVMHandlers and implemented all basic functions required for live
>   migration.
> - Added VM state change handler to know running or stopped state of VM.
> - Added migration state change notifier to get notification on migration state
>   change. This state is translated to VFIO device state and conveyed to vendor
>   driver.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/migration.c           | 716 ++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |  23 ++
>  3 files changed, 740 insertions(+), 1 deletion(-)
>  create mode 100644 hw/vfio/migration.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index a2e7a0a7cf02..6206ad47e90e 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -1,6 +1,6 @@
>  ifeq ($(CONFIG_LINUX), y)
>  obj-$(CONFIG_SOFTMMU) += common.o
> -obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
> +obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o migration.o

I don't think you want to make this explicitly pci-specific, but build
in a proper split of common infrastructure and pci handling from the
beginning.

>  obj-$(CONFIG_VFIO_CCW) += ccw.o
>  obj-$(CONFIG_SOFTMMU) += platform.o
>  obj-$(CONFIG_VFIO_XGMAC) += calxeda-xgmac.o
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> new file mode 100644
> index 000000000000..8a4f515226e0
> --- /dev/null
> +++ b/hw/vfio/migration.c

(...)

> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
> +
> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +        VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +        PCIDevice *pdev = &vdev->pdev;
> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> +        bool msi_64bit;
> +        int i;
> +
> +        for (i = 0; i < PCI_ROM_SLOT; i++) {
> +            uint32_t bar;
> +
> +            bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> +            qemu_put_be32(f, bar);
> +        }
> +
> +        msi_flags = pci_default_read_config(pdev,
> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> +
> +        msi_addr_lo = pci_default_read_config(pdev,
> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> +        qemu_put_be32(f, msi_addr_lo);
> +
> +        if (msi_64bit) {
> +            msi_addr_hi = pci_default_read_config(pdev,
> +                                            pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                                            4);
> +        }
> +        qemu_put_be32(f, msi_addr_hi);
> +
> +        msi_data = pci_default_read_config(pdev,
> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +                2);
> +        qemu_put_be32(f, msi_data);
> +    }

This is an example of a function that should go into a pci-specific
file. Especially as config space is not a universal concept.

> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    return qemu_file_get_error(f);
> +}
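One way the suggested common/PCI split could be shaped, sketched here with hypothetical names that are not part of the actual QEMU VFIO code, is a per-device-type hook table that the common migration code dispatches through:

```c
#include <stddef.h>

/* Hypothetical per-device-type hook table; the PCI-specific
 * serialization (BARs, MSI registers, ...) would live in
 * hw/vfio/pci.c behind these callbacks. */
typedef struct VFIOMigrationOps {
    int (*save_config)(void *dev, void *file);
    int (*load_config)(void *dev, void *file);
} VFIOMigrationOps;

/* Common code only walks the ops table and never touches config
 * space directly, so non-PCI devices simply leave the hook unset. */
static int vfio_save_config(const VFIOMigrationOps *ops,
                            void *dev, void *file)
{
    if (ops && ops->save_config) {
        return ops->save_config(dev, file);
    }
    return 0; /* device type has no bus-specific config state */
}

/* Stub implementation used only to demonstrate the dispatch. */
static int stub_save(void *dev, void *file)
{
    (void)dev; (void)file;
    return 42; /* sentinel */
}

static const VFIOMigrationOps stub_ops = {
    .save_config = stub_save,
};
```

This keeps the config-space assumption confined to the PCI backend, as the comment above suggests.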


* Re: [RFC PATCH v1 1/4] VFIO KABI for migration interface
  2018-10-16 18:12   ` [Qemu-devel] " Kirti Wankhede
@ 2018-10-17  9:06     ` Christoph Hellwig
  -1 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2018-10-17  9:06 UTC (permalink / raw)
  To: Kirti Wankhede; +Cc: alex.williamson, cjia, qemu-devel, kvm

On Tue, Oct 16, 2018 at 11:42:35PM +0530, Kirti Wankhede wrote:
> - Added vfio_device_migration_info structure to use interact with vendor
>   driver.

There is no such thing as a 'vendor driver' in Linux - all drivers are
treated equally.  And I don't see any single driver supporting this yet,
so you really should submit your nouveau or whatever patches first.


* Re: [RFC PATCH v1 1/4] VFIO KABI for migration interface
  2018-10-16 18:12   ` [Qemu-devel] " Kirti Wankhede
@ 2018-10-17 10:09     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 36+ messages in thread
From: Dr. David Alan Gilbert @ 2018-10-17 10:09 UTC (permalink / raw)
  To: Kirti Wankhede; +Cc: alex.williamson, cjia, qemu-devel, kvm, quintela

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> - Added vfio_device_migration_info structure to use interact with vendor
>   driver.
> - Different flags are used to get or set migration related information
>   from/to vendor driver.
> Flag VFIO_MIGRATION_PROBE: To query if feature is supported
> Flag VFIO_MIGRATION_GET_REGION: To get migration region info
> Flag VFIO_MIGRATION_SET_STATE: To convey device state in vendor driver
> Flag VFIO_MIGRATION_GET_PENDING: To get pending bytes yet to be migrated
>   from vendor driver
> Flag VFIO_MIGRATION_GET_BUFFER: On this flag, vendor driver should write
>   data to migration region and return number of bytes written in the region
> Flag VFIO_MIGRATION_SET_BUFFER: In migration resume path, user space app
>   writes to migration region and communicates it to vendor driver with
>   this ioctl with this flag.
> Flag VFIO_MIGRATION_GET_DIRTY_PFNS: Get bitmap of dirty pages from vendor
>   driver from given start address
> 
> - Added enum for possible device states.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  linux-headers/linux/vfio.h | 91 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 91 insertions(+)
> 
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 3615a269d378..8e9045ed9aa8 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -602,6 +602,97 @@ struct vfio_device_ioeventfd {
>  
>  #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 16)
>  
> +/**
> + * VFIO_DEVICE_MIGRATION_INFO - _IOW(VFIO_TYPE, VFIO_BASE + 17,
> + *                              struct vfio_device_migration_info)

This is a little odd; what you really have here is 7 different
operations that can be performed; they're independent operations all
relating to part of migration; so instead of a 'flag' it's more of an
operation type, and there's no reason to make them individual bit flags
- for edxample there's no reason you'd want to OR together
MIGRATION_GET_REGION and MIGRATION_GET_PENDING - they're just
independent.
(Similarly you could just make them 7 ioctls)
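As a sketch of the alternative (values and names here are illustrative, not from any proposed header), the seven mutually exclusive actions could be plain operation codes rather than bit flags:

```c
#include <stdint.h>

/* Hypothetical op-code alternative to the bit-flag scheme: since the
 * seven actions are mutually exclusive, an enumerated operation type
 * is a more natural fit than OR-able flags. */
enum vfio_migration_op {
    VFIO_MIG_OP_PROBE = 1,
    VFIO_MIG_OP_GET_REGION,
    VFIO_MIG_OP_SET_STATE,
    VFIO_MIG_OP_GET_PENDING,
    VFIO_MIG_OP_GET_BUFFER,
    VFIO_MIG_OP_SET_BUFFER,
    VFIO_MIG_OP_GET_DIRTY_PFNS,
};

/* With an op code, the driver can cleanly reject anything it doesn't
 * recognize instead of having to define semantics for unexpected flag
 * combinations. */
static int vfio_migration_op_valid(uint32_t op)
{
    return op >= VFIO_MIG_OP_PROBE && op <= VFIO_MIG_OP_GET_DIRTY_PFNS;
}
```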

I guess what you're trying to do here is a direct 1-1 mapping of qemu's
struct SaveVMHandlers interface to devices.
That's not necessarily a bad idea, but remember it's not a stable
interface, so QEMU will change it over time and you'll have to then
figure out how to shim these interfaces to it, so it's worth keeping
that in mind, just in case you can make these interfaces more general.

> + * Flag VFIO_MIGRATION_PROBE:
> + *      To query if feature is supported
> + *
> + * Flag VFIO_MIGRATION_GET_REGION:
> + *      To get migration region info
> + *      region_index [output] : region index to be used for migration region
> + *      size [output]: size of migration region
> + *
> + * Flag VFIO_MIGRATION_SET_STATE:
> + *      To set device state in vendor driver
> + *      device_state [input] : User space app sends device state to vendor
> + *           driver on state change
> + *
> + * Flag VFIO_MIGRATION_GET_PENDING:
> + *      To get pending bytes yet to be migrated from vendor driver
> + *      threshold_size [Input] : threshold of buffer in User space app.
> + *      pending_precopy_only [output] : pending data which must be migrated in
> + *          precopy phase or in stopped state, in other words - before target
> + *          vm start
> + *      pending_compatible [output] : pending data which may be migrated in any
> + *           phase
> + *      pending_postcopy_only [output] : pending data which must be migrated in
> + *           postcopy phase or in stopped state, in other words - after source
> + *           vm stop
> + *      Sum of pending_precopy_only, pending_compatible and
> + *      pending_postcopy_only is the whole amount of pending data.
> + *
> + * Flag VFIO_MIGRATION_GET_BUFFER:
> + *      On this flag, vendor driver should write data to migration region and
> + *      return number of bytes written in the region.
> + *      bytes_written [output] : number of bytes written in migration buffer by
> + *          vendor driver
> + *
> + * Flag VFIO_MIGRATION_SET_BUFFER
> + *      In migration resume path, user space app writes to migration region and
> + *      communicates it to vendor driver with this ioctl with this flag.
> + *      bytes_written [Input] : number of bytes written in migration buffer by
> + *          user space app.

I guess we call getbuffer/setbuffer multiple times.  Is there anything
that needs to be passed to get the data to line up (e.g. offset into the
data)?

> + * Flag VFIO_MIGRATION_GET_DIRTY_PFNS
> + *      Get bitmap of dirty pages from vendor driver from given start address.
> + *      start_addr [Input] : start address
> + *      pfn_count [Input] : Total pfn count from start_addr for which dirty
> + *          bitmap is requested
> + *      dirty_bitmap [Output] : bitmap memory allocated by user space
> + *           application, vendor driver should return the bitmap with bits set
> + *           only for pages to be marked dirty.
> + * Return: 0 on success, -errno on failure.
> + */
> +
> +struct vfio_device_migration_info {
> +	__u32 argsz;
> +	__u32 flags;
> +#define VFIO_MIGRATION_PROBE            (1 << 0)
> +#define VFIO_MIGRATION_GET_REGION       (1 << 1)
> +#define VFIO_MIGRATION_SET_STATE        (1 << 2)
> +#define VFIO_MIGRATION_GET_PENDING      (1 << 3)
> +#define VFIO_MIGRATION_GET_BUFFER       (1 << 4)
> +#define VFIO_MIGRATION_SET_BUFFER       (1 << 5)
> +#define VFIO_MIGRATION_GET_DIRTY_PFNS   (1 << 6)
> +        __u32 region_index;	            /* region index */
> +        __u64 size;	                    /* size */
> +        __u32 device_state;                 /* VFIO device state */
> +        __u64 pending_precopy_only;
> +        __u64 pending_compatible;
> +        __u64 pending_postcopy_only;
> +        __u64 threshold_size;
> +        __u64 bytes_written;
> +        __u64 start_addr;
> +        __u64 pfn_count;
> +	__u8  dirty_bitmap[];
> +};

This feels like it should be separate types for the different calls.

Also, is there no state to tell the userspace what version of migration
data we have?  What happens if I try and load a migration state
from an older driver into a newer one, or the other way around,
or into a destination card that's not the same as the source?
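A minimal version guard addressing this, purely illustrative since none of these fields exist in the proposed interface, might be a small header the driver prepends to its migration stream:

```c
#include <stdint.h>

/* Hypothetical header a vendor driver could prepend to its migration
 * data so the destination can reject incompatible streams.  Field
 * names and values are assumptions for illustration only. */
struct mig_data_header {
    uint32_t magic;    /* identifies the driver's stream format */
    uint32_t version;  /* bumped on incompatible layout changes */
};

#define MIG_MAGIC   0x56464948u  /* arbitrary example value */
#define MIG_VERSION 1u

/* Destination-side check before accepting any device state. */
static int mig_header_compatible(const struct mig_data_header *h)
{
    return h->magic == MIG_MAGIC && h->version == MIG_VERSION;
}
```

An older-to-newer (or cross-card) load would then fail this check up front rather than producing an inconsistent device.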

> +enum {
> +    VFIO_DEVICE_STATE_NONE,
> +    VFIO_DEVICE_STATE_RUNNING,
> +    VFIO_DEVICE_STATE_MIGRATION_SETUP,
> +    VFIO_DEVICE_STATE_MIGRATION_PRECOPY_ACTIVE,
> +    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY_ACTIVE,
> +    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
> +    VFIO_DEVICE_STATE_MIGRATION_RESUME,
> +    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
> +    VFIO_DEVICE_STATE_MIGRATION_FAILED,
> +    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
> +};

That's an interesting merge of QEMU's run-state and its migration
state.

You probably need to define which transitions you take as legal.
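Such a legal-transition definition might look like the following sketch; the transition set shown is an assumption for illustration, since the patch defines none:

```c
/* Shorthand for the proposed device states. */
enum {
    ST_NONE, ST_RUNNING, ST_SETUP, ST_PRECOPY, ST_STOPNCOPY,
    ST_SAVE_DONE, ST_RESUME, ST_RESUME_DONE, ST_FAILED, ST_CANCELLED,
};

/* Illustrative legality check; which transitions are actually legal
 * would need to be specified by the interface itself. */
static int state_transition_legal(int from, int to)
{
    /* Assume failure and cancellation are reachable from any state. */
    if (to == ST_FAILED || to == ST_CANCELLED) {
        return 1;
    }
    switch (from) {
    case ST_RUNNING:     return to == ST_SETUP || to == ST_RESUME;
    case ST_SETUP:       return to == ST_PRECOPY || to == ST_STOPNCOPY;
    case ST_PRECOPY:     return to == ST_STOPNCOPY;
    case ST_STOPNCOPY:   return to == ST_SAVE_DONE;
    case ST_RESUME:      return to == ST_RESUME_DONE;
    case ST_RESUME_DONE:
    case ST_CANCELLED:   return to == ST_RUNNING;
    default:             return 0;
    }
}
```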

Dave

> +
> +#define VFIO_DEVICE_MIGRATION_INFO _IO(VFIO_TYPE, VFIO_BASE + 17)
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
>  
>  /**
> -- 
> 2.7.0
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [RFC PATCH v1 1/4] VFIO KABI for migration interface
@ 2018-10-17 10:09     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 36+ messages in thread
From: Dr. David Alan Gilbert @ 2018-10-17 10:09 UTC (permalink / raw)
  To: Kirti Wankhede; +Cc: alex.williamson, cjia, qemu-devel, kvm, quintela

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> - Added vfio_device_migration_info structure to use interact with vendor
>   driver.
> - Different flags are used to get or set migration related information
>   from/to vendor driver.
> Flag VFIO_MIGRATION_PROBE: To query if feature is supported
> Flag VFIO_MIGRATION_GET_REGION: To get migration region info
> Flag VFIO_MIGRATION_SET_STATE: To convey device state in vendor driver
> Flag VFIO_MIGRATION_GET_PENDING: To get pending bytes yet to be migrated
>   from vendor driver
> Flag VFIO_MIGRATION_GET_BUFFER: On this flag, vendor driver should write
>   data to migration region and return number of bytes written in the region
> Flag VFIO_MIGRATION_SET_BUFFER: In migration resume path, user space app
>   writes to migration region and communicates it to vendor driver with
>   this ioctl with this flag.
> Flag VFIO_MIGRATION_GET_DIRTY_PFNS: Get bitmap of dirty pages from vendor
>   driver from given start address
> 
> - Added enum for possible device states.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  linux-headers/linux/vfio.h | 91 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 91 insertions(+)
> 
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 3615a269d378..8e9045ed9aa8 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -602,6 +602,97 @@ struct vfio_device_ioeventfd {
>  
>  #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 16)
>  
> +/**
> + * VFIO_DEVICE_MIGRATION_INFO - _IOW(VFIO_TYPE, VFIO_BASE + 17,
> + *                              struct vfio_device_migration_info)

This is a little odd; what you really have here is 7 different
operations that can be performed; they're independent operations all
relating to part of migration; so instead of a 'flag' it's more of an
operation type, and there's no reason to make them individual bit flags
- for edxample there's no reason you'd want to OR together
MIGRATION_GET_REGION and MIGRATION_GET_PENDING - they're just
independent.
(Similarly you could just make them 7 ioctls)

I guess what you're trying to do here is a direct 1-1 mapping of qemu's
struct SaveVMHandlers interface to devices.
That's not necessarily a bad idea, but remember it's not a stable
interface, so QEMU will change it over time and you'll have to then
figure out how to shim these interfaces to it, so it's worth keeping
that in mind, just in case you can make these interfaces more general.

> + * Flag VFIO_MIGRATION_PROBE:
> + *      To query if feature is supported
> + *
> + * Flag VFIO_MIGRATION_GET_REGION:
> + *      To get migration region info
> + *      region_index [output] : region index to be used for migration region
> + *      size [output]: size of migration region
> + *
> + * Flag VFIO_MIGRATION_SET_STATE:
> + *      To set device state in vendor driver
> + *      device_state [input] : User space app sends device state to vendor
> + *           driver on state change
> + *
> + * Flag VFIO_MIGRATION_GET_PENDING:
> + *      To get pending bytes yet to be migrated from vendor driver
> + *      threshold_size [Input] : threshold of buffer in User space app.
> + *      pending_precopy_only [output] : pending data which must be migrated in
> + *          precopy phase or in stopped state, in other words - before target
> + *          vm start
> + *      pending_compatible [output] : pending data which may be migrated in any
> + *           phase
> + *      pending_postcopy_only [output] : pending data which must be migrated in
> + *           postcopy phase or in stopped state, in other words - after source
> + *           vm stop
> + *      Sum of pending_precopy_only, pending_compatible and
> + *      pending_postcopy_only is the whole amount of pending data.
> + *
> + * Flag VFIO_MIGRATION_GET_BUFFER:
> + *      On this flag, vendor driver should write data to migration region and
> + *      return number of bytes written in the region.
> + *      bytes_written [output] : number of bytes written in migration buffer by
> + *          vendor driver
> + *
> + * Flag VFIO_MIGRATION_SET_BUFFER
> + *      In migration resume path, user space app writes to migration region and
> + *      communicates it to vendor driver with this ioctl with this flag.
> + *      bytes_written [Input] : number of bytes written in migration buffer by
> + *          user space app.

I guess we call getbuffer/setbuffer multiple times.  Is there anything
that needs to be passed to get the data to line up (e.g. offset into the
data)

> + * Flag VFIO_MIGRATION_GET_DIRTY_PFNS
> + *      Get bitmap of dirty pages from vendor driver from given start address.
> + *      start_addr [Input] : start address
> + *      pfn_count [Input] : Total pfn count from start_addr for which dirty
> + *          bitmap is requested
> + *      dirty_bitmap [Output] : bitmap memory allocated by user space
> + *           application, vendor driver should return the bitmap with bits set
> + *           only for pages to be marked dirty.
> + * Return: 0 on success, -errno on failure.
> + */
> +
> +struct vfio_device_migration_info {
> +	__u32 argsz;
> +	__u32 flags;
> +#define VFIO_MIGRATION_PROBE            (1 << 0)
> +#define VFIO_MIGRATION_GET_REGION       (1 << 1)
> +#define VFIO_MIGRATION_SET_STATE        (1 << 2)
> +#define VFIO_MIGRATION_GET_PENDING      (1 << 3)
> +#define VFIO_MIGRATION_GET_BUFFER       (1 << 4)
> +#define VFIO_MIGRATION_SET_BUFFER       (1 << 5)
> +#define VFIO_MIGRATION_GET_DIRTY_PFNS   (1 << 6)
> +        __u32 region_index;	            /* region index */
> +        __u64 size;	                    /* size */
> +        __u32 device_state;                 /* VFIO device state */
> +        __u64 pending_precopy_only;
> +        __u64 pending_compatible;
> +        __u64 pending_postcopy_only;
> +        __u64 threshold_size;
> +        __u64 bytes_written;
> +        __u64 start_addr;
> +        __u64 pfn_count;
> +	__u8  dirty_bitmap[];
> +};

This feels like it should be separate types for the different calls.

Also, is there no state to tell the userspace what version of migration
data we have?  What happens if I try to load a migration state
from an older driver into a newer one, or the other way around,
or into a destination card that's not the same as the source?

> +enum {
> +    VFIO_DEVICE_STATE_NONE,
> +    VFIO_DEVICE_STATE_RUNNING,
> +    VFIO_DEVICE_STATE_MIGRATION_SETUP,
> +    VFIO_DEVICE_STATE_MIGRATION_PRECOPY_ACTIVE,
> +    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY_ACTIVE,
> +    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
> +    VFIO_DEVICE_STATE_MIGRATION_RESUME,
> +    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
> +    VFIO_DEVICE_STATE_MIGRATION_FAILED,
> +    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
> +};

That's an interesting merge of QEMU's run-state and its migration
state.

You probably need to define which transitions you take as legal.

Dave

> +
> +#define VFIO_DEVICE_MIGRATION_INFO _IO(VFIO_TYPE, VFIO_BASE + 17)
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
>  
>  /**
> -- 
> 2.7.0
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v1 1/4] VFIO KABI for migration interface
  2018-10-16 22:34     ` [Qemu-devel] " Alex Williamson
@ 2018-10-17 20:47       ` Kirti Wankhede
  -1 siblings, 0 replies; 36+ messages in thread
From: Kirti Wankhede @ 2018-10-17 20:47 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kvm, Juan Quintela, qemu-devel, Dr. David Alan Gilbert,
	Yulei Zhang, Wang, Zhi A



On 10/17/2018 4:04 AM, Alex Williamson wrote:
> On Tue, 16 Oct 2018 23:42:35 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> - Added vfio_device_migration_info structure to interact with vendor
>>   driver.
>> - Different flags are used to get or set migration related information
>>   from/to vendor driver.
>> Flag VFIO_MIGRATION_PROBE: To query if feature is supported
>> Flag VFIO_MIGRATION_GET_REGION: To get migration region info
>> Flag VFIO_MIGRATION_SET_STATE: To convey device state in vendor driver
>> Flag VFIO_MIGRATION_GET_PENDING: To get pending bytes yet to be migrated
>>   from vendor driver
>> Flag VFIO_MIGRATION_GET_BUFFER: On this flag, vendor driver should write
>>   data to migration region and return number of bytes written in the region
>> Flag VFIO_MIGRATION_SET_BUFFER: In migration resume path, user space app
>>   writes to migration region and communicates it to vendor driver with
>>   this ioctl with this flag.
>> Flag VFIO_MIGRATION_GET_DIRTY_PFNS: Get bitmap of dirty pages from vendor
>>   driver from given start address
>>
>> - Added enum for possible device states.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  linux-headers/linux/vfio.h | 91 ++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 91 insertions(+)
>>
>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>> index 3615a269d378..8e9045ed9aa8 100644
>> --- a/linux-headers/linux/vfio.h
>> +++ b/linux-headers/linux/vfio.h
>> @@ -602,6 +602,97 @@ struct vfio_device_ioeventfd {
>>  
>>  #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 16)
>>  
>> +/**
>> + * VFIO_DEVICE_MIGRATION_INFO - _IOW(VFIO_TYPE, VFIO_BASE + 17,
>> + *                              struct vfio_device_migration_info)
> 
> This is quite a bit more than an "INFO" ioctl.
> 
>> + * Flag VFIO_MIGRATION_PROBE:
>> + *      To query if feature is supported
>> + *
>> + * Flag VFIO_MIGRATION_GET_REGION:
>> + *      To get migration region info
>> + *      region_index [output] : region index to be used for migration region
>> + *      size [output]: size of migration region
> 
> Of course the migration region can describe itself as being used
> for migration, so this is unnecessary.  The presence of that region
> could also negate the need for a probe.
> 

Yes, that can be done.


>> + *
>> + * Flag VFIO_MIGRATION_SET_STATE:
>> + *      To set device state in vendor driver
>> + *      device_state [input] : User space app sends device state to vendor
>> + *           driver on state change
> 
> Valid states are the enum defined below, correct?
> 

Yes, that's correct.

> Does setting STOPNCOPY_ACTIVE stop any state change of the device or is
> that expected to happen through other means?
> 

_PRECOPY_ACTIVE means vCPUs are still running, so the VFIO device should
still remain active.
_STOPNCOPY_ACTIVE means vCPUs are not running; the device should also be
stopped and its state copied.

> What are the allowable state transitions?
> 

Normal VM running case:
_NONE -> _RUNNING

In case of live migration, at source:
_RUNNING -> _SETUP -> _PRECOPY_ACTIVE -> _STOPNCOPY_ACTIVE ->
_SAVE_COMPLETED

at destination:
_NONE -> _SETUP -> _RESUME -> _RESUME_COMPLETED -> _RUNNING

In save VM case:
_RUNNING -> _SETUP -> _STOPNCOPY_ACTIVE -> _SAVE_COMPLETED

In case of resuming VM from saved state:
_NONE -> _SETUP -> _RESUME -> _RESUME_COMPLETED -> _RUNNING

_FAILED or _CANCELLED can happen in any state.
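
The flows above amount to a small state machine. As a hedged sketch only
(this helper is invented here for illustration and is not part of the
patch; the enum values mirror the proposed header), the legal
transitions could be checked like this:

```c
#include <stdbool.h>

enum {
    VFIO_DEVICE_STATE_NONE,
    VFIO_DEVICE_STATE_RUNNING,
    VFIO_DEVICE_STATE_MIGRATION_SETUP,
    VFIO_DEVICE_STATE_MIGRATION_PRECOPY_ACTIVE,
    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY_ACTIVE,
    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
    VFIO_DEVICE_STATE_MIGRATION_RESUME,
    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
    VFIO_DEVICE_STATE_MIGRATION_FAILED,
    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
};

/* Hypothetical helper: is old_state -> new_state one of the flows above? */
static bool vfio_migration_state_valid(int old_state, int new_state)
{
    /* _FAILED or _CANCELLED can be entered from any state. */
    if (new_state == VFIO_DEVICE_STATE_MIGRATION_FAILED ||
        new_state == VFIO_DEVICE_STATE_MIGRATION_CANCELLED)
        return true;

    switch (old_state) {
    case VFIO_DEVICE_STATE_NONE:
        return new_state == VFIO_DEVICE_STATE_RUNNING ||
               new_state == VFIO_DEVICE_STATE_MIGRATION_SETUP;
    case VFIO_DEVICE_STATE_RUNNING:
        return new_state == VFIO_DEVICE_STATE_MIGRATION_SETUP;
    case VFIO_DEVICE_STATE_MIGRATION_SETUP:
        return new_state == VFIO_DEVICE_STATE_MIGRATION_PRECOPY_ACTIVE ||
               new_state == VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY_ACTIVE ||
               new_state == VFIO_DEVICE_STATE_MIGRATION_RESUME;
    case VFIO_DEVICE_STATE_MIGRATION_PRECOPY_ACTIVE:
        return new_state == VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY_ACTIVE;
    case VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY_ACTIVE:
        return new_state == VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED;
    case VFIO_DEVICE_STATE_MIGRATION_RESUME:
        return new_state == VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED;
    case VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED:
        return new_state == VFIO_DEVICE_STATE_RUNNING;
    default:
        return false;   /* e.g. nothing follows _SAVE_COMPLETED */
    }
}
```

Whether the check lives in QEMU or the vendor driver (or both) is a
separate question; the point is only that the transition table is small
enough to validate explicitly.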

> How many bits in flags is a user allowed to set at once?
> 

One bit at a time. Probably I should use an enum for flags rather than bits.

>> + * Flag VFIO_MIGRATION_GET_PENDING:
>> + *      To get pending bytes yet to be migrated from vendor driver
>> + *      threshold_size [Input] : threshold of buffer in User space app.
>> + *      pending_precopy_only [output] : pending data which must be migrated in
>> + *          precopy phase or in stopped state, in other words - before target
>> + *          vm start
>> + *      pending_compatible [output] : pending data which may be migrated in any
>> + *           phase
>> + *      pending_postcopy_only [output] : pending data which must be migrated in
>> + *           postcopy phase or in stopped state, in other words - after source
>> + *           vm stop
>> + *      Sum of pending_precopy_only, pending_compatible and
>> + *      pending_postcopy_only is the whole amount of pending data.
> 
> What's the significance of the provided threshold, are the pending
> bytes limited to threshold size?  It makes me nervous to define a
> kernel API in terms of the internal API of a userspace program that can
> change.  I wonder if it makes sense to define these in terms of the
> state of the devices, pending initial data, runtime data, post-offline
> data.
> 

Threshold is required because it tells the vendor driver how many bytes
the user space application's buffer can accommodate. The driver can copy
less data than the threshold, but copying more than the threshold
doesn't make sense because the user space application won't be able to
consume that extra data, and it might get overwritten or lost.


>> + *
>> + * Flag VFIO_MIGRATION_GET_BUFFER:
>> + *      On this flag, vendor driver should write data to migration
>> region and
>> + *      return number of bytes written in the region.
>> + *      bytes_written [output] : number of bytes written in
>> migration buffer by
>> + *          vendor driver
> 
> Does the data the user gets here depend on the device state set
> earlier?  For example the user gets pending_precopy_only data while
> PRECOPY_ACTIVE is the device state and pending_postcopy_only data
> in STOPNCOPY_ACTIVE?  The user should continue to call GET_BUFFER
> in a given state until the associated pending field reaches zero?
> Jumping between the region and ioctl is rather awkward.
> 

The user gets pending_precopy_only data when the device is in the
PRECOPY_ACTIVE state, but each time the user calls GET_BUFFER the
pending bytes might change.
The VFIO device's driver is the producer of the data and the user/QEMU
is the consumer. In the pre-copy phase, while vCPUs are still running,
the driver tries to accumulate as much data as possible; but because the
guest application is actively using the device, the driver might have
more data by the next GET_BUFFER iteration.
Even in the STOPNCOPY_ACTIVE state, the driver can start sending data in
parts while a thread in the vendor driver continues to generate data
after the device has halted, so producer and consumer can run in
parallel. The user therefore has to call GET_BUFFER until the returned
pending bytes reach 0.
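
The consumer side of that loop could look roughly like this. A hedged
sketch only: since the exact kernel interface is still under discussion,
the GET_PENDING/GET_BUFFER ioctls are stood in for by a hypothetical
callback, and all names here are invented for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical producer callback standing in for GET_BUFFER: it fills
 * up to 'max' bytes into 'buf', returns the number of bytes written,
 * and stores the bytes still pending in '*pending'.  Pending may grow
 * between calls while the device is still generating data.
 */
typedef size_t (*get_buffer_fn)(void *opaque, uint8_t *buf, size_t max,
                                uint64_t *pending);

/* Drain device state: keep calling the producer until pending hits 0. */
static uint64_t drain_device_state(get_buffer_fn get_buffer, void *opaque,
                                   uint8_t *buf, size_t buf_size)
{
    uint64_t total = 0;
    uint64_t pending;

    do {
        total += get_buffer(opaque, buf, buf_size, &pending);
        /* In QEMU this is where each chunk would be put on the
         * migration stream before the next iteration. */
    } while (pending != 0);

    return total;
}

/* Mock producer for illustration: serves 'remaining' bytes in chunks. */
struct mock { uint64_t remaining; };

static size_t mock_get_buffer(void *opaque, uint8_t *buf, size_t max,
                              uint64_t *pending)
{
    struct mock *m = opaque;
    size_t n = m->remaining < max ? (size_t)m->remaining : max;

    (void)buf;              /* a real driver would fill the buffer here */
    m->remaining -= n;
    *pending = m->remaining;
    return n;
}
```

The mock makes the shape concrete: with 10 bytes pending and a 4-byte
buffer, the loop runs three times and transfers all 10 bytes.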

>> + *
>> + * Flag VFIO_MIGRATION_SET_BUFFER
>> + *      In migration resume path, user space app writes to migration
>> region and
>> + *      communicates it to vendor driver with this ioctl with this
>> flag.
>> + *      bytes_written [Input] : number of bytes written in migration
>> buffer by
>> + *          user space app.
>> + *
>> + * Flag VFIO_MIGRATION_GET_DIRTY_PFNS
>> + *      Get bitmap of dirty pages from vendor driver from given
>> start address.
>> + *      start_addr [Input] : start address
>> + *      pfn_count [Input] : Total pfn count from start_addr for
>> which dirty
>> + *          bitmap is requested
>> + *      dirty_bitmap [Output] : bitmap memory allocated by user space
>> + *           application, vendor driver should return the bitmap
>> with bits set
>> + *           only for pages to be marked dirty.
>> + * Return: 0 on success, -errno on failure.
>> + */
>> +
>> +struct vfio_device_migration_info {
>> +	__u32 argsz;
>> +	__u32 flags;
>> +#define VFIO_MIGRATION_PROBE            (1 << 0)
>> +#define VFIO_MIGRATION_GET_REGION       (1 << 1)
>> +#define VFIO_MIGRATION_SET_STATE        (1 << 2)
>> +#define VFIO_MIGRATION_GET_PENDING      (1 << 3)
>> +#define VFIO_MIGRATION_GET_BUFFER       (1 << 4)
>> +#define VFIO_MIGRATION_SET_BUFFER       (1 << 5)
>> +#define VFIO_MIGRATION_GET_DIRTY_PFNS   (1 << 6)
>> +        __u32 region_index;	            /* region index */
>> +        __u64 size;	                    /* size */
>> +        __u32 device_state;                 /* VFIO device state */
> 
> We'd need to swap device_state and size for alignment or else this
> struct could vary by compiler.
> 

Ok. Thanks for pointing that out. I'll change that.

>> +        __u64 pending_precopy_only;
>> +        __u64 pending_compatible;
>> +        __u64 pending_postcopy_only;
>> +        __u64 threshold_size;
>> +        __u64 bytes_written;
>> +        __u64 start_addr;
>> +        __u64 pfn_count;
>> +	__u8  dirty_bitmap[];
>> +};
>> +
>> +enum {
>> +    VFIO_DEVICE_STATE_NONE,
>> +    VFIO_DEVICE_STATE_RUNNING,
>> +    VFIO_DEVICE_STATE_MIGRATION_SETUP,
>> +    VFIO_DEVICE_STATE_MIGRATION_PRECOPY_ACTIVE,
>> +    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY_ACTIVE,
>> +    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
>> +    VFIO_DEVICE_STATE_MIGRATION_RESUME,
>> +    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
>> +    VFIO_DEVICE_STATE_MIGRATION_FAILED,
>> +    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
>> +};
>> +
>> +#define VFIO_DEVICE_MIGRATION_INFO _IO(VFIO_TYPE, VFIO_BASE + 17)
>> +
> 
> I'm not entirely sure what this ioctl buys us that we can't do with a
> region alone.  For example we could define a migration region as:
> 
> struct vfio_region_migration {
> 	__u32 device_state;
> 	__u32 data_offset;
> 	__u64 data_size;
> 	__u64 pending_pre;
> 	__u64 pending_runtime;
> 	__u64 pending_post;
> 	__u64 bytes_available;
> 	__u64 bytes_written;
> };
> 
> A sparse mmap capability in the region could define whether the data
> area supports mmap, the above defined registers would only have
> read/write access.  The user can write device_state the same as the
> ioctl would set it.  The data_offset and size describe the offset in
> the region where migration data is relayed and the size of that area.
> Pending fields are as you have above, though I've omitted the threshold
> because I don't understand it.  The user reading bytes_available is the
> same as calling GET_BUFFER, bytes_written the same as SET_BUFFER.  

Yes, this could be done. A read/write on the structure would behave the
same as the ioctl, i.e. block until the expected operation is done.

> I've
> left out the dirty bitmap here, as it's a little less obvious to me how
> it would work.  It could be a stand-alone ioctl, it could be
> implemented as another region, or it could simply be a different offset
> within this region.  The dirty bitmap start address could simply be
> calculated as the offset into that region where a pread begins and the
> extent is the size of that pread.  With a bit per page, we're only
> looking at a 32MB region per 1TB of guest physical address space, so it
> doesn't seem too terrible even if guest memory is sparse.  Maybe the
> extremes are still problematic though, but if it were part of the same
> region above we could solve that with another register written by the
> user to indicate the base offset of the dirty bitmap window. 

Generally guest memory is sparse and log_sync is called for each
MemoryRegionSection. The same region can be used to get the dirty page
bitmap if we move to using a sparse region.
I think you are considering the worst case here, where a VM with 1TB of
memory has one MemoryRegionSection of 1TB.
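
For reference, the 32MB-per-1TB figure in the quoted paragraph works out
as follows, assuming one bit per 4K page (a sketch of the arithmetic,
not part of the proposed interface):

```c
#include <stdint.h>

/* Bitmap bytes needed to track 'mem_size' bytes of guest memory,
 * one bit per page of 'page_size' bytes. */
static uint64_t dirty_bitmap_bytes(uint64_t mem_size, uint64_t page_size)
{
    uint64_t pages = mem_size / page_size;   /* 1TB / 4K = 2^28 pages */
    return pages / 8;                        /* 8 pages per bitmap byte */
}
```

So 1TB of guest physical address space needs 2^28 / 8 = 2^25 bytes of
bitmap, i.e. 32MB, which is what the window-indexing scheme above is
sized around.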

> For
> example if the window is 32MB and page size 4k then it can index 1TB
> and the user would write 0 (default) to the register for pfns from
> 0 to (1TB - 1), 1 for pfns from 1TB to (2TB - 1), etc.
> 

Sorry, I didn't get this example. By window here, do you mean the sparse
region size?


> I don't see that this interface does anything to address the
> compatibility concerns that were discussed in the previous iterations
> from Intel though, nor is it documented exactly where and how a vendor
> driver would indicate a migration failure if it had its own internal
> consistency or compatibility checks fail.  Thanks,
> 

If the vendor driver finds any inconsistency in the data, or if any
operation fails, it should return an error from the corresponding ioctl.
An ioctl failure is treated as a failure of that particular device. If
one device returns a failure during the migration process, migration is
terminated and the VM resumes at the source.

This RFC is mainly focused on the core logic of live migration, which
includes the pre-copy, stop-and-copy and resume phases. We can continue
the discussion on compatibility concerns; that could be either a
separate change or merged into the core logic. But I would also like to
get some focus on the core logic as well.

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v1 0/4] Add migration support for VFIO device
  2018-10-17  8:49   ` [Qemu-devel] " Cornelia Huck
@ 2018-10-17 20:59     ` Kirti Wankhede
  -1 siblings, 0 replies; 36+ messages in thread
From: Kirti Wankhede @ 2018-10-17 20:59 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Tony Krowiak, linux-s390, cjia, Pierre Morel, kvm, qemu-devel,
	Halil Pasic, alex.williamson



On 10/17/2018 2:19 PM, Cornelia Huck wrote:
> On Tue, 16 Oct 2018 23:42:34 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Add migration support for VFIO device
> 
> I'd love to take a deeper look at this; but sadly, I'm currently low on
> spare time, and therefore will only add some general remarks.
> 

Thanks. Those would be really helpful for building common infrastructure
that can be used across different types of vfio devices.

>>
>> This Patch set include patches as below:
>> - Define KABI for VFIO device for migration support.
>> - Generic migration functionality for VFIO device.
>>   * This patch set adds functionality only for PCI devices, but can be
>>     extended to other VFIO devices.
> 
> I've been thinking about how best to add migration to vfio-ccw; it
> might be quite different from what we do for vfio-pci, but I hope that
> we can still use some common infrastructure.
> 
> Basically, we probably need something like the following:
> 
> - Wait until outstanding channel programs are finished, and don't allow
>   new ones. Don't report back to userspace (or maybe do; we just don't
>   want to fence off too many new requests.)
> - Collect subchannel and related state, copy it over. That probably
>   fits in with the interface you're proposing.
> - Re-setup on the other side and get going again. The problem here
>   might be state that the vfio-ccw driver can't be aware of (like
>   replaying storage server configurations). Maybe we can send CRWs
>   (notifications) to the guest and leverage the existing suspend/resume
>   driver code for channel devices.
> 
> For vfio-ap, I frankly have no idea how migration will work, given the
> setup with the matrix device and the interesting information in the SIE
> control blocks (but maybe it just can replay from the info in the
> matrix device?) Keys etc. are likely to be the bigger problem here.
> 
> Cc:ing some s390 folks for awareness.
> 
>>   * Added all the basic functions required for pre-copy, stop-and-copy and
>>     resume phases of migration.
>>   * Added state change notifier and from that notifier function, VFIO
>>     device's state change is conveyed to the VFIO vendor driver.
> 
> I assume that this will, in general, be transparent for the guest; but
> maybe there'll be need for some interaction for special cases?

Ideally, this should be transparent to the guest during live migration.
The guest shouldn't see any functional change.

Thanks,
Kirti

> 
>>   * During save setup phase and resume/load setup phase, migration region
>>     is queried from vendor driver and is mmaped by QEMU. This region is
>>     used to read/write data from and to vendor driver.
>>   * .save_live_pending, .save_live_iterate and .is_active_iterate are
>>     implemented to use QEMU's functionality of iteration during pre-copy
>>     phase.
>>   * In .save_live_complete_precopy, that is in stop-and-copy phase,
>>     iteration to read data from vendor driver is implemented till pending
>>     bytes returned by vendor driver are not zero.
>>   * .save_cleanup and .load_cleanup are implemented to unmap migration
>>     region that was set up during the setup phase.
>>   * Added function to get dirty pages bitmap from vendor driver.
>> - Add vfio_listerner_log_sync to mark dirty pages.
>> - Make VFIO PCI device migration capable.
>>
>> Thanks,
>> Kirti
>>
>> Kirti Wankhede (4):
>>   VFIO KABI for migration interface
>>   Add migration functions for VFIO devices
>>   Add vfio_listerner_log_sync to mark dirty pages
>>   Make vfio-pci device migration capable.
>>
>>  hw/vfio/Makefile.objs         |   2 +-
>>  hw/vfio/common.c              |  32 ++
>>  hw/vfio/migration.c           | 716 ++++++++++++++++++++++++++++++++++++++++++
>>  hw/vfio/pci.c                 |  13 +-
>>  include/hw/vfio/vfio-common.h |  23 ++
>>  linux-headers/linux/vfio.h    |  91 ++++++
>>  6 files changed, 869 insertions(+), 8 deletions(-)
>>  create mode 100644 hw/vfio/migration.c
>>
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v1 0/4] Add migration support for VFIO device
@ 2018-10-17 20:59     ` Kirti Wankhede
  0 siblings, 0 replies; 36+ messages in thread
From: Kirti Wankhede @ 2018-10-17 20:59 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: alex.williamson, cjia, qemu-devel, kvm, Halil Pasic,
	Tony Krowiak, Pierre Morel, linux-s390



On 10/17/2018 2:19 PM, Cornelia Huck wrote:
> On Tue, 16 Oct 2018 23:42:34 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Add migration support for VFIO device
> 
> I'd love to take a deeper look at this; but sadly, I'm currently low on
> spare time, and therefore will only add some general remarks.
> 

Thanks. Those would be really helpful for building common infrastructure
that can be used across different types of VFIO devices.

>>
>> This Patch set include patches as below:
>> - Define KABI for VFIO device for migration support.
>> - Generic migration functionality for VFIO device.
>>   * This patch set adds functionality only for PCI devices, but can be
>>     extended to other VFIO devices.
> 
> I've been thinking about how best to add migration to vfio-ccw; it
> might be quite different from what we do for vfio-pci, but I hope that
> we can still use some common infrastructure.
> 
> Basically, we probably need something like the following:
> 
> - Wait until outstanding channel programs are finished, and don't allow
>   new ones. Don't report back to userspace (or maybe do; we just don't
>   want to fence off too many new requests.)
> - Collect subchannel and related state, copy it over. That probably
>   fits in with the interface you're proposing.
> - Re-setup on the other side and get going again. The problem here
>   might be state that the vfio-ccw driver can't be aware of (like
>   replaying storage server configurations). Maybe we can send CRWs
>   (notifications) to the guest and leverage the existing suspend/resume
>   driver code for channel devices.
> 
> For vfio-ap, I frankly have no idea how migration will work, given the
> setup with the matrix device and the interesting information in the SIE
> control blocks (but maybe it just can replay from the info in the
> matrix device?) Keys etc. are likely to be the bigger problem here.
> 
> Cc:ing some s390 folks for awareness.
> 
>>   * Added all the basic functions required for pre-copy, stop-and-copy and
>>     resume phases of migration.
>>   * Added a state change notifier; from that notifier function, the VFIO
>>     device's state change is conveyed to the VFIO vendor driver.
> 
> I assume that this will, in general, be transparent for the guest; but
> maybe there'll be need for some interaction for special cases?

Ideally, this should be transparent to the guest during live migration.
The guest shouldn't see any functional change during live migration.

Thanks,
Kirti

> 
>>   * During save setup phase and resume/load setup phase, migration region
>>     is queried from vendor driver and is mmaped by QEMU. This region is
>>     used to read/write data from and to vendor driver.
>>   * .save_live_pending, .save_live_iterate and .is_active_iterate are
>>     implemented to use QEMU's functionality of iteration during pre-copy
>>     phase.
>>   * In .save_live_complete_precopy, that is, in the stop-and-copy phase,
>>     iteration to read data from the vendor driver is implemented until the
>>     pending bytes returned by the vendor driver reach zero.
>>   * .save_cleanup and .load_cleanup are implemented to unmap the migration
>>     region that was set up during the setup phase.
>>   * Added a function to get the dirty pages bitmap from the vendor driver.
>> - Add vfio_listerner_log_sync to mark dirty pages.
>> - Make VFIO PCI device migration capable.
>>
>> Thanks,
>> Kirti
>>
>> Kirti Wankhede (4):
>>   VFIO KABI for migration interface
>>   Add migration functions for VFIO devices
>>   Add vfio_listerner_log_sync to mark dirty pages
>>   Make vfio-pci device migration capable.
>>
>>  hw/vfio/Makefile.objs         |   2 +-
>>  hw/vfio/common.c              |  32 ++
>>  hw/vfio/migration.c           | 716 ++++++++++++++++++++++++++++++++++++++++++
>>  hw/vfio/pci.c                 |  13 +-
>>  include/hw/vfio/vfio-common.h |  23 ++
>>  linux-headers/linux/vfio.h    |  91 ++++++
>>  6 files changed, 869 insertions(+), 8 deletions(-)
>>  create mode 100644 hw/vfio/migration.c
>>
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v1 2/4] Add migration functions for VFIO devices
  2018-10-17  9:00     ` [Qemu-devel] " Cornelia Huck
@ 2018-10-17 21:03       ` Kirti Wankhede
  -1 siblings, 0 replies; 36+ messages in thread
From: Kirti Wankhede @ 2018-10-17 21:03 UTC (permalink / raw)
  To: Cornelia Huck; +Cc: alex.williamson, cjia, qemu-devel, kvm



On 10/17/2018 2:30 PM, Cornelia Huck wrote:
> On Tue, 16 Oct 2018 23:42:36 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> - Migration functions are implemented for VFIO_DEVICE_TYPE_PCI devices.
>> - Added SaveVMHandlers and implemented all basic functions required for live
>>   migration.
>> - Added a VM state change handler to track the running or stopped state of
>>   the VM.
>> - Added a migration state change notifier to get notifications on migration
>>   state changes. This state is translated to a VFIO device state and conveyed
>>   to the vendor driver.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  hw/vfio/Makefile.objs         |   2 +-
>>  hw/vfio/migration.c           | 716 ++++++++++++++++++++++++++++++++++++++++++
>>  include/hw/vfio/vfio-common.h |  23 ++
>>  3 files changed, 740 insertions(+), 1 deletion(-)
>>  create mode 100644 hw/vfio/migration.c
>>
>> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
>> index a2e7a0a7cf02..6206ad47e90e 100644
>> --- a/hw/vfio/Makefile.objs
>> +++ b/hw/vfio/Makefile.objs
>> @@ -1,6 +1,6 @@
>>  ifeq ($(CONFIG_LINUX), y)
>>  obj-$(CONFIG_SOFTMMU) += common.o
>> -obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
>> +obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o migration.o
> 
> I don't think you want to make this explicitly pci-specific, but build
> in a proper split of common infrastructure and pci handling from the
> beginning.
> 

Yes, I'll split it in the next version.

Thanks,
Kirti
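[Editor's note: a minimal sketch of what such a common/PCI split could look like. All names here are illustrative, not the actual QEMU API; the PCI callback would live in hw/vfio/pci.c while the common entry point stays in hw/vfio/migration.c.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: common migration code dispatches through a
 * per-device-type ops table instead of testing vbasedev->type inline. */

typedef struct VFIODev VFIODev;

typedef struct VFIOMigrationOps {
    /* Save/load device-type-specific config state; returns 0 or -errno. */
    int (*save_config)(VFIODev *dev, void *f);
    int (*load_config)(VFIODev *dev, void *f);
} VFIOMigrationOps;

struct VFIODev {
    const VFIOMigrationOps *mig_ops;
};

/* Common code: no PCI knowledge here. */
static int vfio_save_device_config_state(VFIODev *dev, void *f)
{
    if (dev->mig_ops && dev->mig_ops->save_config) {
        return dev->mig_ops->save_config(dev, f);
    }
    return 0; /* nothing type-specific to save */
}

/* PCI implementation, registered by the PCI code at realize time. */
static int vfio_pci_save_config(VFIODev *dev, void *f)
{
    (void)dev; (void)f;
    /* ... write BARs and MSI address/data as in the patch ... */
    return 0;
}

static const VFIOMigrationOps vfio_pci_mig_ops = {
    .save_config = vfio_pci_save_config,
};
```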

>>  obj-$(CONFIG_VFIO_CCW) += ccw.o
>>  obj-$(CONFIG_SOFTMMU) += platform.o
>>  obj-$(CONFIG_VFIO_XGMAC) += calxeda-xgmac.o
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> new file mode 100644
>> index 000000000000..8a4f515226e0
>> --- /dev/null
>> +++ b/hw/vfio/migration.c
> 
> (...)
> 
>> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
>> +
>> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
>> +        VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>> +        PCIDevice *pdev = &vdev->pdev;
>> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
>> +        bool msi_64bit;
>> +        int i;
>> +
>> +        for (i = 0; i < PCI_ROM_SLOT; i++) {
>> +            uint32_t bar;
>> +
>> +            bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
>> +            qemu_put_be32(f, bar);
>> +        }
>> +
>> +        msi_flags = pci_default_read_config(pdev,
>> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
>> +
>> +        msi_addr_lo = pci_default_read_config(pdev,
>> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
>> +        qemu_put_be32(f, msi_addr_lo);
>> +
>> +        if (msi_64bit) {
>> +            msi_addr_hi = pci_default_read_config(pdev,
>> +                                            pdev->msi_cap + PCI_MSI_ADDRESS_HI,
>> +                                            4);
>> +        }
>> +        qemu_put_be32(f, msi_addr_hi);
>> +
>> +        msi_data = pci_default_read_config(pdev,
>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
>> +                2);
>> +        qemu_put_be32(f, msi_data);
>> +    }
> 
> This is an example of a function that should go into a pci-specific
> file. Especially as config space is not a universal concept.
> 
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    return qemu_file_get_error(f);
>> +}

^ permalink raw reply	[flat|nested] 36+ messages in thread


* Re: [RFC PATCH v1 1/4] VFIO KABI for migration interface
  2018-10-17 10:09     ` [Qemu-devel] " Dr. David Alan Gilbert
@ 2018-10-17 21:17       ` Kirti Wankhede
  -1 siblings, 0 replies; 36+ messages in thread
From: Kirti Wankhede @ 2018-10-17 21:17 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: alex.williamson, cjia, qemu-devel, kvm, quintela



On 10/17/2018 3:39 PM, Dr. David Alan Gilbert wrote:
> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
>> - Added vfio_device_migration_info structure to interact with the vendor
>>   driver.
>> - Different flags are used to get or set migration related information
>>   from/to vendor driver.
>> Flag VFIO_MIGRATION_PROBE: To query if feature is supported
>> Flag VFIO_MIGRATION_GET_REGION: To get migration region info
>> Flag VFIO_MIGRATION_SET_STATE: To convey device state in vendor driver
>> Flag VFIO_MIGRATION_GET_PENDING: To get pending bytes yet to be migrated
>>   from vendor driver
>> Flag VFIO_MIGRATION_GET_BUFFER: On this flag, vendor driver should write
>>   data to migration region and return number of bytes written in the region
>> Flag VFIO_MIGRATION_SET_BUFFER: In migration resume path, user space app
>>   writes to migration region and communicates it to vendor driver with
>>   this ioctl with this flag.
>> Flag VFIO_MIGRATION_GET_DIRTY_PFNS: Get bitmap of dirty pages from vendor
>>   driver from given start address
>>
>> - Added enum for possible device states.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  linux-headers/linux/vfio.h | 91 ++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 91 insertions(+)
>>
>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>> index 3615a269d378..8e9045ed9aa8 100644
>> --- a/linux-headers/linux/vfio.h
>> +++ b/linux-headers/linux/vfio.h
>> @@ -602,6 +602,97 @@ struct vfio_device_ioeventfd {
>>  
>>  #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 16)
>>  
>> +/**
>> + * VFIO_DEVICE_MIGRATION_INFO - _IOW(VFIO_TYPE, VFIO_BASE + 17,
>> + *                              struct vfio_device_migration_info)
> 
> This is a little odd; what you really have here is 7 different
> operations that can be performed; they're independent operations all
> relating to part of migration; so instead of a 'flag' it's more of an
> operation type, and there's no reason to make them individual bit flags
> - for example there's no reason you'd want to OR together
> MIGRATION_GET_REGION and MIGRATION_GET_PENDING - they're just
> independent.
> (Similarly you could just make them 7 ioctls)
> 

Right, all flags are independent commands, but I tried to use a single
ioctl number.

> I guess what you're trying to do here is a direct 1-1 mapping of qemu's
> struct SaveVMHandlers interface to devices.
> That's not necessarily a bad idea, but remember it's not a stable
> interface, so QEMU will change it over time and you'll have to then
> figure out how to shim these interfaces to it, so it's worth keeping
> that in mind, just in case you can make these interfaces more general.
> 

Alex suggested using a sparse mmap region instead of an ioctl; I'm making
a note of your point to define separate structures when implementing this
interface using the sparse mmap region.
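[Editor's note: an illustrative sketch of one way a sparse-mmap migration region could be laid out. The field names and page-alignment choice are assumptions for illustration, not a proposed KABI: a small trapped header (reads/writes go to the vendor driver) followed by a page-aligned data area that QEMU mmaps directly.]

```c
#include <assert.h>
#include <stdint.h>

#define MIG_PAGE_SIZE 4096u

/* Hypothetical trapped header at offset 0 of the migration region. */
struct mig_region_header {
    uint32_t device_state;   /* written by userspace on state change */
    uint32_t reserved;
    uint64_t pending_bytes;  /* read by userspace during pre-copy */
    uint64_t data_offset;    /* start of the mmap-able data area */
    uint64_t data_size;      /* bytes valid in the data area */
};

/* The data area begins at the first page boundary after the header,
 * so the sparse-mmap capability can expose it alone. */
static uint64_t mig_data_offset(void)
{
    uint64_t hdr = sizeof(struct mig_region_header);
    return (hdr + MIG_PAGE_SIZE - 1) & ~(uint64_t)(MIG_PAGE_SIZE - 1);
}
```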


>> + * Flag VFIO_MIGRATION_PROBE:
>> + *      To query if feature is supported
>> + *
>> + * Flag VFIO_MIGRATION_GET_REGION:
>> + *      To get migration region info
>> + *      region_index [output] : region index to be used for migration region
>> + *      size [output]: size of migration region
>> + *
>> + * Flag VFIO_MIGRATION_SET_STATE:
>> + *      To set device state in vendor driver
>> + *      device_state [input] : User space app sends device state to vendor
>> + *           driver on state change
>> + *
>> + * Flag VFIO_MIGRATION_GET_PENDING:
>> + *      To get pending bytes yet to be migrated from vendor driver
>> + *      threshold_size [Input] : threshold of buffer in User space app.
>> + *      pending_precopy_only [output] : pending data which must be migrated in
>> + *          precopy phase or in stopped state, in other words - before target
>> + *          vm start
>> + *      pending_compatible [output] : pending data which may be migrated in any
>> + *           phase
>> + *      pending_postcopy_only [output] : pending data which must be migrated in
>> + *           postcopy phase or in stopped state, in other words - after source
>> + *           vm stop
>> + *      Sum of pending_precopy_only, pending_compatible and
>> + *      pending_postcopy_only is the whole amount of pending data.
>> + *
>> + * Flag VFIO_MIGRATION_GET_BUFFER:
>> + *      On this flag, vendor driver should write data to migration region and
>> + *      return number of bytes written in the region.
>> + *      bytes_written [output] : number of bytes written in migration buffer by
>> + *          vendor driver
>> + *
>> + * Flag VFIO_MIGRATION_SET_BUFFER
>> + *      In migration resume path, user space app writes to migration region and
>> + *      communicates it to vendor driver with this ioctl with this flag.
>> + *      bytes_written [Input] : number of bytes written in migration buffer by
>> + *          user space app.
> 
> I guess we call getbuffer/setbuffer multiple times.  Is there anything
> that needs to be passed to get the data to line up (e.g. offset into the
> data)
> 

I think the vendor driver can keep track of the offset, as it knows what
data has been copied and what remains.
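[Editor's note: a sketch of the point above, with illustrative names: the vendor driver holds the save offset itself, so userspace never passes one and simply repeats GET_BUFFER until 0 bytes are returned.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct vendor_mig_state {
    const uint8_t *snapshot;  /* device state staged at save time */
    uint64_t total;           /* total bytes to transfer */
    uint64_t offset;          /* advanced on every GET_BUFFER */
};

/* Handle one GET_BUFFER: copy the next chunk into the migration region
 * and return the number of bytes written (0 means no data left). */
static uint64_t vendor_get_buffer(struct vendor_mig_state *s,
                                  uint8_t *region, uint64_t region_size)
{
    uint64_t remaining = s->total - s->offset;
    uint64_t chunk = remaining < region_size ? remaining : region_size;

    memcpy(region, s->snapshot + s->offset, chunk);
    s->offset += chunk;
    return chunk;
}
```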


>> + * Flag VFIO_MIGRATION_GET_DIRTY_PFNS
>> + *      Get bitmap of dirty pages from vendor driver from given start address.
>> + *      start_addr [Input] : start address
>> + *      pfn_count [Input] : Total pfn count from start_addr for which dirty
>> + *          bitmap is requested
>> + *      dirty_bitmap [Output] : bitmap memory allocated by user space
>> + *           application, vendor driver should return the bitmap with bits set
>> + *           only for pages to be marked dirty.
>> + * Return: 0 on success, -errno on failure.
>> + */
>> +
>> +struct vfio_device_migration_info {
>> +	__u32 argsz;
>> +	__u32 flags;
>> +#define VFIO_MIGRATION_PROBE            (1 << 0)
>> +#define VFIO_MIGRATION_GET_REGION       (1 << 1)
>> +#define VFIO_MIGRATION_SET_STATE        (1 << 2)
>> +#define VFIO_MIGRATION_GET_PENDING      (1 << 3)
>> +#define VFIO_MIGRATION_GET_BUFFER       (1 << 4)
>> +#define VFIO_MIGRATION_SET_BUFFER       (1 << 5)
>> +#define VFIO_MIGRATION_GET_DIRTY_PFNS   (1 << 6)
>> +        __u32 region_index;	            /* region index */
>> +        __u64 size;	                    /* size */
>> +        __u32 device_state;                 /* VFIO device state */
>> +        __u64 pending_precopy_only;
>> +        __u64 pending_compatible;
>> +        __u64 pending_postcopy_only;
>> +        __u64 threshold_size;
>> +        __u64 bytes_written;
>> +        __u64 start_addr;
>> +        __u64 pfn_count;
>> +	__u8  dirty_bitmap[];
>> +};
> 
> This feels like it should be separate types for the different calls.
> 
> Also, is there no state to tell the userspace what version of migration
> data we have?  What happens if I try and load a migration state
> from an older driver into a newer one, or the other way around,
> or into a destination card that's not the same as the source.
> 

As I mentioned in the other reply, this RFC is mainly focused on the core
logic of live migration, which includes the implementation of the pre-copy,
stop-and-copy and resume phases and the definition of device states.

>> +enum {
>> +    VFIO_DEVICE_STATE_NONE,
>> +    VFIO_DEVICE_STATE_RUNNING,
>> +    VFIO_DEVICE_STATE_MIGRATION_SETUP,
>> +    VFIO_DEVICE_STATE_MIGRATION_PRECOPY_ACTIVE,
>> +    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY_ACTIVE,
>> +    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
>> +    VFIO_DEVICE_STATE_MIGRATION_RESUME,
>> +    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
>> +    VFIO_DEVICE_STATE_MIGRATION_FAILED,
>> +    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
>> +};
> 
> That's an interesting merge of QEMU's run-state and it's migration
> state.
> 
> You probably need to define which transitions you take as legal.
>

In my reply to Alex, I listed the allowable state transitions.

Thanks,
Kirti

> Dave
> 
>> +
>> +#define VFIO_DEVICE_MIGRATION_INFO _IO(VFIO_TYPE, VFIO_BASE + 17)
>> +
>>  /* -------- API for Type1 VFIO IOMMU -------- */
>>  
>>  /**
>> -- 
>> 2.7.0
>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread


* Re: [RFC PATCH v1 1/4] VFIO KABI for migration interface
  2018-10-17 20:47       ` [Qemu-devel] " Kirti Wankhede
@ 2018-10-17 23:14         ` Alex Williamson
  -1 siblings, 0 replies; 36+ messages in thread
From: Alex Williamson @ 2018-10-17 23:14 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kvm, Juan Quintela, qemu-devel, Dr. David Alan Gilbert,
	Wang, Zhi A

On Thu, 18 Oct 2018 02:17:19 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 10/17/2018 4:04 AM, Alex Williamson wrote:
> > On Tue, 16 Oct 2018 23:42:35 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> - Added vfio_device_migration_info structure to use interact with vendor
> >>   driver.
> >> - Different flags are used to get or set migration related information
> >>   from/to vendor driver.
> >> Flag VFIO_MIGRATION_PROBE: To query if feature is supported
> >> Flag VFIO_MIGRATION_GET_REGION: To get migration region info
> >> Flag VFIO_MIGRATION_SET_STATE: To convey device state in vendor driver
> >> Flag VFIO_MIGRATION_GET_PENDING: To get pending bytes yet to be migrated
> >>   from vendor driver
> >> Flag VFIO_MIGRATION_GET_BUFFER: On this flag, vendor driver should write
> >>   data to migration region and return number of bytes written in the region
> >> Flag VFIO_MIGRATION_SET_BUFFER: In migration resume path, user space app
> >>   writes to migration region and communicates it to vendor driver with
> >>   this ioctl with this flag.
> >> Flag VFIO_MIGRATION_GET_DIRTY_PFNS: Get bitmap of dirty pages from vendor
> >>   driver from given start address
> >>
> >> - Added enum for possible device states.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>  linux-headers/linux/vfio.h | 91 ++++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 91 insertions(+)
> >>
> >> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >> index 3615a269d378..8e9045ed9aa8 100644
> >> --- a/linux-headers/linux/vfio.h
> >> +++ b/linux-headers/linux/vfio.h
> >> @@ -602,6 +602,97 @@ struct vfio_device_ioeventfd {
> >>  
> >>  #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 16)
> >>  
> >> +/**
> >> + * VFIO_DEVICE_MIGRATION_INFO - _IOW(VFIO_TYPE, VFIO_BASE + 17,
> >> + *                              struct vfio_device_migration_info)  
> > 
> > This is quite a bit more than an "INFO" ioctl.
> >   
> >> + * Flag VFIO_MIGRATION_PROBE:
> >> + *      To query if feature is supported
> >> + *
> >> + * Flag VFIO_MIGRATION_GET_REGION:
> >> + *      To get migration region info
> >> + *      region_index [output] : region index to be used for migration region
> >> + *      size [output]: size of migration region  
> > 
> > Of course the region migration region can describe itself as being used
> > for migration, so this is unnecessary.  The presence of that region
> > could also negate the need for a probe.
> >   
> 
> Yes, that can be done.
> 
> 
> >> + *
> >> + * Flag VFIO_MIGRATION_SET_STATE:
> >> + *      To set device state in vendor driver
> >> + *      device_state [input] : User space app sends device state to vendor
> >> + *           driver on state change  
> > 
> > Valid states are the enum defined below, correct?
> >   
> 
> Yes, that's correct.
> 
> > Does setting STOPNCOPY_ACTIVE stop any state change of the device or is
> > that expected to happen through other means?
> >   
> 
> _PRECOPY_ACTIVE means vCPUs are still running, so VFIO device should
> still remain active.
> _STOPNCOPY_ACTIVE means vCPUs are not running and device should also be
> stopped and copy device's state.

But does that mean _STOPNCOPY_ACTIVE implicitly stops the device?  If
not, how do we do that?

> > What are the allowable state transitions?
> >   
> 
> Normal VM running case:
> _NONE -> _RUNNING

What effect does this transition actually have on the device?
Userspace currently doesn't support migration, so to be compatible with
current userspace the device needs to start in a functional state.

> In case of live migration, at source:
> _RUNNING -> _SETUP -> _PRECOPY_ACTIVE -> _STOPNCOPY_ACTIVE ->
> _SAVE_COMPLETED
> 
> at destination:
> _NONE -> _SETUP -> _RESUME -> _RESUME_COMPLETE -> _RUNNING
> 
> In save VM case:
> _RUNNING -> _SETUP -> _STOPNCOPY_ACTIVE -> _SAVE_COMPLETED
> 
> In case of resuming VM from saved state:
> _NONE -> _SETUP -> _RESUME -> _RESUME_COMPLETE -> _RUNNING
> 
> _FAILED or _CANCELLED can happen in any state.

What are the valid transitions out of _FAILED or _CANCELLED?

As I note below and Dave noted in his reply, mirroring the internal API
of QEMU seems destined to cause us problems.  I think we need to look
at what we're asking of the device in terms of the device, not QEMU.
QEMU's definition of RESUME could change over time and we can't expect
vendor driver authors to not only be QEMU migration experts, but also
to know how the definition of a QEMU migration state has changed over
time.
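The transition sequences Kirti enumerates can be captured in a small table. A sketch in C (state names follow the proposed enum; the table itself is inferred from this exchange and is not part of the patch — note it deliberately leaves exits from _FAILED/_CANCELLED undefined, which is exactly the open question raised above):

```c
#include <assert.h>
#include <stdbool.h>

enum {
    STATE_NONE, STATE_RUNNING, STATE_SETUP, STATE_PRECOPY_ACTIVE,
    STATE_STOPNCOPY_ACTIVE, STATE_SAVE_COMPLETED, STATE_RESUME,
    STATE_RESUME_COMPLETED, STATE_FAILED, STATE_CANCELLED,
};

/* Transition table built only from the sequences quoted in this thread. */
static bool valid_transition(int from, int to)
{
    /* "_FAILED or _CANCELLED can happen in any state." */
    if (to == STATE_FAILED || to == STATE_CANCELLED)
        return true;

    switch (from) {
    case STATE_NONE:              /* cold start, or resume/load path */
        return to == STATE_RUNNING || to == STATE_SETUP;
    case STATE_RUNNING:           /* live migration or savevm begins */
        return to == STATE_SETUP;
    case STATE_SETUP:             /* source pre-copy, savevm, or resume */
        return to == STATE_PRECOPY_ACTIVE ||
               to == STATE_STOPNCOPY_ACTIVE ||
               to == STATE_RESUME;
    case STATE_PRECOPY_ACTIVE:
        return to == STATE_STOPNCOPY_ACTIVE;
    case STATE_STOPNCOPY_ACTIVE:
        return to == STATE_SAVE_COMPLETED;
    case STATE_RESUME:
        return to == STATE_RESUME_COMPLETED;
    case STATE_RESUME_COMPLETED:
        return to == STATE_RUNNING;
    default:
        return false;             /* exits from _FAILED/_CANCELLED undefined */
    }
}
```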

> > How many bits in flags is a user allowed to set at once?
> >   
> 
> One bit at a time. Probably, I should use enum for flags rather than bits.
> 
> >> + * Flag VFIO_MIGRATION_GET_PENDING:
> >> + *      To get pending bytes yet to be migrated from vendor driver
> >> + *      threshold_size [Input] : threshold of buffer in User space app.
> >> + *      pending_precopy_only [output] : pending data which must be migrated in
> >> + *          precopy phase or in stopped state, in other words - before target
> >> + *          vm start
> >> + *      pending_compatible [output] : pending data which may be migrated in any
> >> + *           phase
> >> + *      pending_postcopy_only [output] : pending data which must be migrated in
> >> + *           postcopy phase or in stopped state, in other words - after source
> >> + *           vm stop
> >> + *      Sum of pending_precopy_only, pending_compatible and
> >> + *      pending_postcopy_only is the whole amount of pending data.  
> > 
> > What's the significance of the provided threshold, are the pending
> > bytes limited to threshold size?  It makes me nervous to define a
> > kernel API in terms of the internal API of a userspace program that can
> > change.  I wonder if it makes sense to define these in terms of the
> > state of the devices, pending initial data, runtime data, post-offline
> > data.
> >   
> 
> Threshold is required, because that will tell size in bytes that user
> space application buffer can accommodate. Driver can copy data less than
> threshold, but copying data more than threshold doesn't make sense
> because user space application won't be able to copy that extra data and
> that data might get overwritten or lost.

I still don't see why the kernel should care about the user's buffer
size, the user can fill multiple buffers from one get buffer request if
they need to.  The relation between this and _GET_BUFFER is awkward and
not intuitive.
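The point about buffer sizes can be made concrete: if the vendor driver exposes more data than one user-side buffer holds, the user can simply drain it in chunks, so the kernel never needs to know the user's threshold. A hypothetical sketch (the names drain_region and sink are invented for illustration; this is not QEMU code):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Consume 'avail' bytes of one GET_BUFFER result from the mmap'd migration
 * region into fixed-size stream buffers, calling emit() once per chunk.
 */
static size_t drain_region(const char *region, size_t avail,
                           char *buf, size_t buf_size,
                           void (*emit)(const char *, size_t))
{
    size_t done = 0, nchunks = 0;

    while (done < avail) {
        size_t n = avail - done < buf_size ? avail - done : buf_size;

        memcpy(buf, region + done, n);  /* copy out of migration region */
        emit(buf, n);                   /* hand chunk to the stream */
        done += n;
        nchunks++;
    }
    return nchunks;
}

/* Trivial sink used to exercise the loop. */
static size_t sink_bytes;
static void sink(const char *p, size_t n) { (void)p; sink_bytes += n; }
```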

> >> + *
> >> + * Flag VFIO_MIGRATION_GET_BUFFER:
> >> + *      On this flag, vendor driver should write data to migration
> >> + *      region and return number of bytes written in the region.
> >> + *      bytes_written [output] : number of bytes written in migration
> >> + *          buffer by vendor driver  
> > 
> > Does the data the user gets here depend on the device state set
> > earlier?  For example the user gets pending_precopy_only data while
> > PRECOPY_ACTIVE is the device state and pending_postcopy_only data
> > in STOPNCOPY_ACTIVE?  The user should continue to call GET_BUFFER
> > in a given state until the associated pending field reaches zero?
> > Jumping between the region and ioctl is rather awkward.
> >   
> 
> User gets pending_precopy_only data when device is in PRECOPY_ACTIVE
> state, but each time when user calls GET_BUFFER, pending bytes might
> change.
> VFIO device's driver is producer of data and user/QEMU is consumer of
> data. In pre-copy phase, when vCPUs are still running, driver will try
> to accumulate as much data as possible in this phase, but vCPUs are
> running and user of that device/application in guest is actively using
> that device, then there are chances that during next iteration of
> GET_BUFFER, driver might have more data.
> Even in case of STOPNCOPY_ACTIVE state of device, driver can start
> sending data in parts while a thread in vendor driver can still generate
> data after device has halted, producer and consumer can run in parallel.
> So User has to call GET_BUFFER until pending bytes are returned 0.

Zero bytes *pending* or zero bytes *written*?  If the latter, what's
the use of the pending fields if they're always subject to change?  How
does getting zero bytes now guarantee that the device is done
producing data in the situation you describe where even a stopped
device can produce data?
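The producer/consumer loop Kirti describes might look like the following mock. It adopts one possible reading of the termination rule (stop only when a read returns zero bytes *and* pending is zero) — precisely the ambiguity being questioned here; all names are invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Mock vendor side: a pending count that is drained in chunks, the way a
 * driver thread might keep producing data in parts even after device stop.
 */
struct mock_vendor {
    size_t pending;   /* bytes the driver still wants to send */
    size_t chunk;     /* max bytes produced per get_buffer call */
};

static size_t mock_get_buffer(struct mock_vendor *v)
{
    size_t n = v->pending < v->chunk ? v->pending : v->chunk;

    v->pending -= n;
    return n;         /* bytes "written" into the migration region */
}

/* User side: keep reading until a zero-byte read AND zero pending. */
static size_t drain_all(struct mock_vendor *v)
{
    size_t total = 0, n;

    do {
        n = mock_get_buffer(v);
        total += n;
    } while (n > 0 || v->pending > 0);
    return total;
}
```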

> >> + *
> >> + * Flag VFIO_MIGRATION_SET_BUFFER
> >> + *      In migration resume path, user space app writes to migration
> >> + *      region and communicates it to vendor driver with this ioctl
> >> + *      with this flag.
> >> + *      bytes_written [Input] : number of bytes written in migration
> >> + *          buffer by user space app.
> >> + *
> >> + * Flag VFIO_MIGRATION_GET_DIRTY_PFNS
> >> + *      Get bitmap of dirty pages from vendor driver from given
> >> + *      start address.
> >> + *      start_addr [Input] : start address
> >> + *      pfn_count [Input] : Total pfn count from start_addr for
> >> + *          which dirty bitmap is requested
> >> + *      dirty_bitmap [Output] : bitmap memory allocated by user space
> >> + *           application, vendor driver should return the bitmap
> >> + *           with bits set only for pages to be marked dirty.
> >> + * Return: 0 on success, -errno on failure.
> >> + */
> >> +
> >> +struct vfio_device_migration_info {
> >> +	__u32 argsz;
> >> +	__u32 flags;
> >> +#define VFIO_MIGRATION_PROBE            (1 << 0)
> >> +#define VFIO_MIGRATION_GET_REGION       (1 << 1)
> >> +#define VFIO_MIGRATION_SET_STATE        (1 << 2)
> >> +#define VFIO_MIGRATION_GET_PENDING      (1 << 3)
> >> +#define VFIO_MIGRATION_GET_BUFFER       (1 << 4)
> >> +#define VFIO_MIGRATION_SET_BUFFER       (1 << 5)
> >> +#define VFIO_MIGRATION_GET_DIRTY_PFNS   (1 << 6)
> >> +        __u32 region_index;	            /* region index */
> >> +        __u64 size;	                    /* size */
> >> +        __u32 device_state;                 /* VFIO device state */  
> > 
> > We'd need to swap device_state and size for alignment or else this
> > struct could vary by compiler.
> >   
> 
> Ok. Thanks for pointing that out. I'll change that.
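The alignment problem is easy to demonstrate: a __u64 that follows an odd number of __u32s picks up ABI-dependent padding (x86-64 aligns a __u64 to 8 bytes, i386 to 4), so swapping device_state and size pairs up the 32-bit fields and removes the padding. A minimal sketch using stdint types in place of the kernel's __u32/__u64:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct misaligned {            /* mirrors argsz/flags/region_index + size */
    uint32_t argsz, flags, region_index;
    uint64_t size;             /* offset 12 on i386, 16 on x86-64 */
};

struct aligned {               /* u32s paired: no hidden padding anywhere */
    uint32_t argsz, flags, region_index, device_state;
    uint64_t size;             /* offset 16 on every common ABI */
};
```

With the first layout, 32-bit and 64-bit userspace would disagree on the struct, exactly the compat problem being flagged; the second layout is identical everywhere.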
> 
> >> +        __u64 pending_precopy_only;
> >> +        __u64 pending_compatible;
> >> +        __u64 pending_postcopy_only;
> >> +        __u64 threshold_size;
> >> +        __u64 bytes_written;
> >> +        __u64 start_addr;
> >> +        __u64 pfn_count;
> >> +	__u8  dirty_bitmap[];
> >> +};
> >> +
> >> +enum {
> >> +    VFIO_DEVICE_STATE_NONE,
> >> +    VFIO_DEVICE_STATE_RUNNING,
> >> +    VFIO_DEVICE_STATE_MIGRATION_SETUP,
> >> +    VFIO_DEVICE_STATE_MIGRATION_PRECOPY_ACTIVE,
> >> +    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY_ACTIVE,
> >> +    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
> >> +    VFIO_DEVICE_STATE_MIGRATION_RESUME,
> >> +    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
> >> +    VFIO_DEVICE_STATE_MIGRATION_FAILED,
> >> +    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
> >> +};
> >> +
> >> +#define VFIO_DEVICE_MIGRATION_INFO _IO(VFIO_TYPE, VFIO_BASE + 17)
> >> +  
> > 
> > I'm not entirely sure what this ioctl buys us that we can't do with a
> > region alone.  For example we could define a migration region as:
> > 
> > struct vfio_region_migration {
> > 	__u32 device_state;
> > 	__u32 data_offset;
> > 	__u64 data_size;
> > 	__u64 pending_pre;
> > 	__u64 pending_runtime;
> > 	__u64 pending_post;
> > 	__u64 bytes_available;
> > 	__u64 bytes_written;
> > };
> > 
> > A sparse mmap capability in the region could define whether the data
> > area supports mmap, the above defined registers would only have
> > read/write access.  The user can write device_state the same as the
> > ioctl would set it.  The data_offset and size describe the offset in
> > the region where migration data is relayed and the size of that area.
> > Pending fields are as you have above, though I've omitted the threshold
> > because I don't understand it.  The user reading bytes_available is the
> > same as calling GET_BUFFER, bytes_written the same as SET_BUFFER.    
> 
> Yes, this could be done. Read/write on structure will be same as ioctl,
> i.e. block till expected operation is done.
> 
> > I've
> > left out the dirty bitmap here, as it's a little less obvious to me how
> > it would work.  It could be a stand-alone ioctl, it could be
> > implemented as another region, or it could simply be a different offset
> > within this region.  The dirty bitmap start address could simply be
> > calculated as the offset into that region where a pread begins and the
> > extent is the size of that pread.  With a bit per page, we're only
> > looking at a 32MB region per 1TB of guest physical address space, so it
> > doesn't seem too terrible even if guest memory is sparse.  Maybe the
> > extremes are still problematic though, but if it were part of the same
> > region above we could solve that with another register written by the
> > user to indicate the base offset of the dirty bitmap window.   
> 
> Generally guest memory is sparse and log_sync is called for each
> MemoryRegionSection. Same region can be used to get dirty page bitmap if
> we are going to move to using sparse region.
> I think here you are thinking of worst case where a VM with 1TB of
> memory have one MemoryRegionSection of 1TB.

I'm not thinking in terms of MemoryRegionSection, I'm thinking simply
of guest physical address.  The ioctl you proposed has a start
address (GPA) and number of pfns, so simply a range within the guest
physical address space.  Rather than the user providing a buffer for
the vendor driver to fill in a dirty bitmap, simply provide the
entire dirty bitmap as a section within the region and fill their
reads in on the fly from the offset and range of their pread.  But
then we need to adapt it a little for reality that the guest can have
quite a large physical address space, so perhaps we use more of a
windowed approach where a 1TB range is accessible at any given time
and the user can select which 1TB range via a "register" in the
region.  I don't see how sparse-ness factors in here, those are
simply indexes and offsets that the user never reads and if they do
read them the vendor driver returns that they're clean.  
 
> > For
> > example if the window is 32MB and page size 4k then it can index 1TB
> > and the user would write 0 (default) to the register for pfns from
> > 0 to (1TB - 1), 1 for pfns from 1TB to (2TB - 1), etc.
> >   
> 
> Sorry I didn't get this example. By window here, do you meant sparse
> region size?

No, hopefully the further explanation above makes sense.  Effectively
segmentation, a "register" in the region selects the base offset of the
window in units of the window size.  So the starting GPA of a pread in
the window would be (index_register * effective_window_size) +
offset_into_window.  Where the effective window size would be 1TB for
the case of a 32MB window with 4K page size, and offset into the window
is of course 1 bit per pfn.
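The window arithmetic above checks out: with 4K pages and one bit per pfn, a 32MB bitmap window spans exactly 1TB of guest physical address space. A small sketch of the GPA calculation (constant and function names are invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE   4096ULL
#define WINDOW_SIZE (32ULL << 20)   /* 32MB bitmap window in the region */
/* Guest physical space covered by one window: each bitmap byte maps
 * 8 pages, so 32MB * 8 * 4K = 1TB. */
#define EFFECTIVE_WINDOW (WINDOW_SIZE * 8 * PAGE_SIZE)

/* Starting GPA represented by a pread at 'offset' bytes into the window,
 * with the index "register" selecting which 1TB segment is visible. */
static uint64_t pread_base_gpa(uint64_t index_register, uint64_t offset)
{
    return index_register * EFFECTIVE_WINDOW + offset * 8 * PAGE_SIZE;
}
```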

> > I don't see that this interface does anything to address the
> > compatibility concerns that were discussed in the previous iterations
> > from Intel though, nor is it documented exactly where and how a vendor
> > driver would indicate a migration failure if it had its own internal
> > consistency or compatibility checks fail.  Thanks,
> >   
> 
> If vendor driver finds any inconsistency in data or if any operation is
> causing failure, vendor driver should return error to the corresponding
> ioctl. ioctl failure is treated as failure of the particular device. If
> one device returns failure during migration process, migration is
> terminated and VM at source resumes.

And as has been discussed in the previous threads, the user has no
visibility into predicting compatibility and must place blind faith in
the vendor driver to detect incompatibility.  I don't find that acceptable.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v1 1/4] VFIO KABI for migration interface
  2018-10-17 20:47       ` [Qemu-devel] " Kirti Wankhede
@ 2018-10-18  1:49         ` Gonglei (Arei)
  -1 siblings, 0 replies; 36+ messages in thread
From: Gonglei (Arei) @ 2018-10-18  1:49 UTC (permalink / raw)
  To: Kirti Wankhede, Alex Williamson
  Cc: cjia, kvm, Juan Quintela, qemu-devel, Dr. David Alan Gilbert,
	Yulei Zhang, Wang, Zhi A


> -----Original Message-----
> From: kvm-owner@vger.kernel.org [mailto:kvm-owner@vger.kernel.org] On
> Behalf Of Kirti Wankhede
> Sent: Thursday, October 18, 2018 4:47 AM
> To: Alex Williamson <alex.williamson@redhat.com>
> Cc: cjia@nvidia.com; qemu-devel@nongnu.org; kvm@vger.kernel.org; Yulei
> Zhang <yulei.zhang@intel.com>; Dr. David Alan Gilbert
> <dgilbert@redhat.com>; Juan Quintela <quintela@redhat.com>; Wang, Zhi A
> <zhi.a.wang@intel.com>
> Subject: Re: [RFC PATCH v1 1/4] VFIO KABI for migration interface
> 
> 
> On 10/17/2018 4:04 AM, Alex Williamson wrote:
> > On Tue, 16 Oct 2018 23:42:35 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >
> >> - Added vfio_device_migration_info structure to interact with vendor
> >>   driver.
> >> - Different flags are used to get or set migration related information
> >>   from/to vendor driver.
> >> Flag VFIO_MIGRATION_PROBE: To query if feature is supported
> >> Flag VFIO_MIGRATION_GET_REGION: To get migration region info
> >> Flag VFIO_MIGRATION_SET_STATE: To convey device state in vendor driver
> >> Flag VFIO_MIGRATION_GET_PENDING: To get pending bytes yet to be
> >>   migrated from vendor driver
> >> Flag VFIO_MIGRATION_GET_BUFFER: On this flag, vendor driver should
> >>   write data to migration region and return number of bytes written
> >>   in the region
> >> Flag VFIO_MIGRATION_SET_BUFFER: In migration resume path, user space
> >>   app writes to migration region and communicates it to vendor driver
> >>   with this ioctl with this flag.
> >> Flag VFIO_MIGRATION_GET_DIRTY_PFNS: Get bitmap of dirty pages from
> >>   vendor driver from given start address
> >>
> >> - Added enum for possible device states.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>  linux-headers/linux/vfio.h | 91 ++++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 91 insertions(+)
> >>
> >> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >> index 3615a269d378..8e9045ed9aa8 100644
> >> --- a/linux-headers/linux/vfio.h
> >> +++ b/linux-headers/linux/vfio.h
> >> @@ -602,6 +602,97 @@ struct vfio_device_ioeventfd {
> >>
> >>  #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE +
> 16)
> >>
> >> +/**
> >> + * VFIO_DEVICE_MIGRATION_INFO - _IOW(VFIO_TYPE, VFIO_BASE + 17,
> >> + *                              struct vfio_device_migration_info)
> >
> > This is quite a bit more than an "INFO" ioctl.
> >
> >> + * Flag VFIO_MIGRATION_PROBE:
> >> + *      To query if feature is supported
> >> + *
> >> + * Flag VFIO_MIGRATION_GET_REGION:
> >> + *      To get migration region info
> >> + *      region_index [output] : region index to be used for migration
> region
> >> + *      size [output]: size of migration region
> >
> > Of course the region migration region can describe itself as being used
> > for migration, so this is unnecessary.  The presence of that region
> > could also negate the need for a probe.
> >
> 
> Yes, that can be done.
> 
> 
> >> + *
> >> + * Flag VFIO_MIGRATION_SET_STATE:
> >> + *      To set device state in vendor driver
> >> + *      device_state [input] : User space app sends device state to
> vendor
> >> + *           driver on state change
> >
> > Valid states are the enum defined below, correct?
> >
> 
> Yes, that's correct.
> 
> > Does setting STOPNCOPY_ACTIVE stop any state change of the device or is
> > that expected to happen through other means?
> >
> 
> _PRECOPY_ACTIVE means vCPUs are still running, so VFIO device should
> still remain active.
> _STOPNCOPY_ACTIVE means vCPUs are not running and device should also be
> stopped and copy device's state.
> 
> > What are the allowable state transitions?
> >
> 
> Normal VM running case:
> _NONE -> _RUNNING
> 
> In case of live migration, at source:
> _RUNNING -> _SETUP -> _PRECOPY_ACTIVE -> _STOPNCOPY_ACTIVE ->
> _SAVE_COMPLETED
> 
> at destination:
> _NONE -> _SETUP -> _RESUME -> _RESUME_COMPLETE -> _RUNNING
> 
> In save VM case:
> _RUNNING -> _SETUP -> _STOPNCOPY_ACTIVE -> _SAVE_COMPLETED
> 
> In case of resuming VM from saved state:
> _NONE -> _SETUP -> _RESUME -> _RESUME_COMPLETE -> _RUNNING
> 
> _FAILED or _CANCELLED can happen in any state.
> 
> > How many bits in flags is a user allowed to set at once?
> >
> 
> One bit at a time. Probably, I should use enum for flags rather than bits.
> 
> >> + * Flag VFIO_MIGRATION_GET_PENDING:
> >> + *      To get pending bytes yet to be migrated from vendor driver
> >> + *      threshold_size [Input] : threshold of buffer in User space app.
> >> + *      pending_precopy_only [output] : pending data which must be
> migrated in
> >> + *          precopy phase or in stopped state, in other words - before
> target
> >> + *          vm start
> >> + *      pending_compatible [output] : pending data which may be
> migrated in any
> >> + *           phase
> >> + *      pending_postcopy_only [output] : pending data which must be
> migrated in
> >> + *           postcopy phase or in stopped state, in other words - after
> source
> >> + *           vm stop
> >> + *      Sum of pending_precopy_only, pending_compatible and
> >> + *      pending_postcopy_only is the whole amount of pending data.
> >
> > What's the significance of the provided threshold, are the pending
> > bytes limited to threshold size?  It makes me nervous to define a
> > kernel API in terms of the internal API of a userspace program that can
> > change.  I wonder if it makes sense to define these in terms of the
> > state of the devices, pending initial data, runtime data, post-offline
> > data.
> >
> 
> Threshold is required, because that will tell size in bytes that user
> space application buffer can accommodate. Driver can copy data less than
> threshold, but copying data more than threshold doesn't make sense
> because user space application won't be able to copy that extra data and
> that data might get overwritten or lost.
> 
> 
> >> + *
> >> + * Flag VFIO_MIGRATION_GET_BUFFER:
> >> + *      On this flag, vendor driver should write data to migration
> >> region and
> >> + *      return number of bytes written in the region.
> >> + *      bytes_written [output] : number of bytes written in
> >> migration buffer by
> >> + *          vendor driver
> >
> > Does the data the user gets here depend on the device state set
> > earlier?  For example the user gets pending_precopy_only data while
> > PRECOPY_ACTIVE is the device state and pending_postcopy_only data
> > in STOPNCOPY_ACTIVE?  The user should continue to call GET_BUFFER
> > in a given state until the associated pending field reaches zero?
> > Jumping between the region and ioctl is rather awkward.
> >
> 
> User gets pending_precopy_only data when device is in PRECOPY_ACTIVE
> state, but each time when user calls GET_BUFFER, pending bytes might
> change.
> VFIO device's driver is producer of data and user/QEMU is consumer of
> data. In pre-copy phase, when vCPUs are still running, driver will try
> to accumulate as much data as possible in this phase, but vCPUs are
> running and user of that device/application in guest is actively using
> that device, then there are chances that during next iteration of
> GET_BUFFER, driver might have more data.
> Even in case of STOPNCOPY_ACTIVE state of device, driver can start
> sending data in parts while a thread in vendor driver can still generate
> data after device has halted, producer and consumer can run in parallel.
> So User has to call GET_BUFFER until pending bytes are returned 0.
> 
How should we understand "the driver can still generate data after the device has halted"?
Are interrupts still generated after the device is halted? If so, the interrupt
information in pi_desc.pir will be lost.

Thanks,
-Gonglei

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v1 0/4] Add migration support for VFIO device
  2018-10-16 18:12 ` [Qemu-devel] " Kirti Wankhede
@ 2018-10-18  2:41   ` Tian, Kevin
  -1 siblings, 0 replies; 36+ messages in thread
From: Tian, Kevin @ 2018-10-18  2:41 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, cjia; +Cc: qemu-devel, kvm

> From: Kirti Wankhede
> Sent: Wednesday, October 17, 2018 2:13 AM
> 
> Add migration support for VFIO device
> 
> This Patch set include patches as below:
> - Define KABI for VFIO device for migration support.
> - Generic migration functionality for VFIO device.
>   * This patch set adds functionality only for PCI devices, but can be
>     extended to other VFIO devices.
>   * Added all the basic functions required for pre-copy, stop-and-copy and
>     resume phases of migration.
>   * Added state change notifier and from that notifier function, VFIO
>     device's state changed is conveyed to VFIO vendor driver.
>   * During save setup phase and resume/load setup phase, migration region
>     is queried from vendor driver and is mmaped by QEMU. This region is
>     used to read/write data from and to vendor driver.
>   * .save_live_pending, .save_live_iterate and .is_active_iterate are
>     implemented to use QEMU's functionality of iteration during pre-copy
>     phase.
>   * In .save_live_complete_precopy, that is in stop-and-copy phase,
>     iteration to read data from vendor driver is implemented till pending
>     bytes returned by vendor driver are not zero.
>   * .save_cleanup and .load_cleanup are implemented to unmap migration
>     region that was setup duing setup phase.
>   * Added function to get dirty pages bitmap from vendor driver.
> - Add vfio_listerner_log_sync to mark dirty pages.
> - Make VFIO PCI device migration capable.
> 

I didn't see a kernel-side change implementing the new KABI. If there is one,
can you point it out?

By the way, how is this work related to the previous efforts on adding live
migration to VFIO?

https://lists.gnu.org/archive/html/qemu-devel/2018-04/msg01134.html
https://www.spinics.net/linux/fedora/libvir/msg170669.html 

Thanks
Kevin

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v1 1/4] VFIO KABI for migration interface
  2018-10-17 21:17       ` [Qemu-devel] " Kirti Wankhede
@ 2018-10-18  9:23         ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 36+ messages in thread
From: Dr. David Alan Gilbert @ 2018-10-18  9:23 UTC (permalink / raw)
  To: Kirti Wankhede; +Cc: alex.williamson, cjia, qemu-devel, kvm, quintela

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> 
> 
> On 10/17/2018 3:39 PM, Dr. David Alan Gilbert wrote:
> > * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> >> - Added vfio_device_migration_info structure to use interact with vendor
> >>   driver.
> >> - Different flags are used to get or set migration related information
> >>   from/to vendor driver.
> >> Flag VFIO_MIGRATION_PROBE: To query if feature is supported
> >> Flag VFIO_MIGRATION_GET_REGION: To get migration region info
> >> Flag VFIO_MIGRATION_SET_STATE: To convey device state in vendor driver
> >> Flag VFIO_MIGRATION_GET_PENDING: To get pending bytes yet to be migrated
> >>   from vendor driver
> >> Flag VFIO_MIGRATION_GET_BUFFER: On this flag, vendor driver should write
> >>   data to migration region and return number of bytes written in the region
> >> Flag VFIO_MIGRATION_SET_BUFFER: In migration resume path, user space app
> >>   writes to migration region and communicates it to vendor driver with
> >>   this ioctl with this flag.
> >> Flag VFIO_MIGRATION_GET_DIRTY_PFNS: Get bitmap of dirty pages from vendor
> >>   driver from given start address
> >>
> >> - Added enum for possible device states.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>  linux-headers/linux/vfio.h | 91 ++++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 91 insertions(+)
> >>
> >> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >> index 3615a269d378..8e9045ed9aa8 100644
> >> --- a/linux-headers/linux/vfio.h
> >> +++ b/linux-headers/linux/vfio.h
> >> @@ -602,6 +602,97 @@ struct vfio_device_ioeventfd {
> >>  
> >>  #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 16)
> >>  
> >> +/**
> >> + * VFIO_DEVICE_MIGRATION_INFO - _IOW(VFIO_TYPE, VFIO_BASE + 17,
> >> + *                              struct vfio_device_migration_info)
> > 
> > This is a little odd; what you really have here is 7 different
> > operations that can be performed; they're independent operations all
> > relating to part of migration; so instead of a 'flag' it's more of an
> > operation type, and there's no reason to make them individual bit flags
> > - for example there's no reason you'd want to OR together
> > MIGRATION_GET_REGION and MIGRATION_GET_PENDING - they're just
> > independent.
> > (Similarly you could just make them 7 ioctls)
> > 
> 
> Right, all flags are independent commands. But I tried to use one ioctl
> number.
> 
> > I guess what you're trying to do here is a direct 1-1 mapping of qemu's
> > struct SaveVMHandlers interface to devices.
> > That's not necessarily a bad idea, but remember it's not a stable
> > interface, so QEMU will change it over time and you'll have to then
> > figure out how to shim these interfaces to it, so it's worth keeping
> > that in mind, just in case you can make these interfaces more general.
> > 
> 
> Alex suggested using sparse region instead of ioctl, I'm making note of
> your point to define structures when implementing this interface using
> sparse mmap region.
> 
> 
> >> + * Flag VFIO_MIGRATION_PROBE:
> >> + *      To query if feature is supported
> >> + *
> >> + * Flag VFIO_MIGRATION_GET_REGION:
> >> + *      To get migration region info
> >> + *      region_index [output] : region index to be used for migration region
> >> + *      size [output]: size of migration region
> >> + *
> >> + * Flag VFIO_MIGRATION_SET_STATE:
> >> + *      To set device state in vendor driver
> >> + *      device_state [input] : User space app sends device state to vendor
> >> + *           driver on state change
> >> + *
> >> + * Flag VFIO_MIGRATION_GET_PENDING:
> >> + *      To get pending bytes yet to be migrated from vendor driver
> >> + *      threshold_size [Input] : threshold of buffer in User space app.
> >> + *      pending_precopy_only [output] : pending data which must be migrated in
> >> + *          precopy phase or in stopped state, in other words - before target
> >> + *          vm start
> >> + *      pending_compatible [output] : pending data which may be migrated in any
> >> + *           phase
> >> + *      pending_postcopy_only [output] : pending data which must be migrated in
> >> + *           postcopy phase or in stopped state, in other words - after source
> >> + *           vm stop
> >> + *      Sum of pending_precopy_only, pending_compatible and
> >> + *      pending_postcopy_only is the whole amount of pending data.
> >> + *
> >> + * Flag VFIO_MIGRATION_GET_BUFFER:
> >> + *      On this flag, vendor driver should write data to migration region and
> >> + *      return number of bytes written in the region.
> >> + *      bytes_written [output] : number of bytes written in migration buffer by
> >> + *          vendor driver
> >> + *
> >> + * Flag VFIO_MIGRATION_SET_BUFFER
> >> + *      In migration resume path, user space app writes to migration region and
> >> + *      communicates it to vendor driver with this ioctl with this flag.
> >> + *      bytes_written [Input] : number of bytes written in migration buffer by
> >> + *          user space app.
> > 
> > I guess we call getbuffer/setbuffer multiple times.  Is there anything
> > that needs to be passed to get the data to line up (e.g. offset into the
> > data)
> > 
> 
> I think, vendor driver can keep track of offset as it knows what data
> copied and what is remaining.

OK.
I'm not sure if your 'threshold size' definition matches qemu's; QEMU's
idea is a threshold below which we should stop the precopy phase.

I'm also not too sure how your GET_BUFFER works;  if GET_BUFFER returns
a full buffer, I guess we have to call it again to get more data - how
do I know I need to call it again?

(Some of these things would be clearer if you posted an RFC for the QEMU
side as well).

> 
> >> + * Flag VFIO_MIGRATION_GET_DIRTY_PFNS
> >> + *      Get bitmap of dirty pages from vendor driver from given start address.
> >> + *      start_addr [Input] : start address
> >> + *      pfn_count [Input] : Total pfn count from start_addr for which dirty
> >> + *          bitmap is requested
> >> + *      dirty_bitmap [Output] : bitmap memory allocated by user space
> >> + *           application, vendor driver should return the bitmap with bits set
> >> + *           only for pages to be marked dirty.
> >> + * Return: 0 on success, -errno on failure.
> >> + */
> >> +
> >> +struct vfio_device_migration_info {
> >> +	__u32 argsz;
> >> +	__u32 flags;
> >> +#define VFIO_MIGRATION_PROBE            (1 << 0)
> >> +#define VFIO_MIGRATION_GET_REGION       (1 << 1)
> >> +#define VFIO_MIGRATION_SET_STATE        (1 << 2)
> >> +#define VFIO_MIGRATION_GET_PENDING      (1 << 3)
> >> +#define VFIO_MIGRATION_GET_BUFFER       (1 << 4)
> >> +#define VFIO_MIGRATION_SET_BUFFER       (1 << 5)
> >> +#define VFIO_MIGRATION_GET_DIRTY_PFNS   (1 << 6)
> >> +        __u32 region_index;	            /* region index */
> >> +        __u64 size;	                    /* size */
> >> +        __u32 device_state;                 /* VFIO device state */
> >> +        __u64 pending_precopy_only;
> >> +        __u64 pending_compatible;
> >> +        __u64 pending_postcopy_only;
> >> +        __u64 threshold_size;
> >> +        __u64 bytes_written;
> >> +        __u64 start_addr;
> >> +        __u64 pfn_count;
> >> +	__u8  dirty_bitmap[];
> >> +};
> > 
> > This feels like it should be separate types for the different calls.
> > 
> > Also, is there no state to tell the userspace what version of migration
> > data we have?  What happens if I try and load a migration state
> > from an older driver into a newer one, or the other way around,
> > or into a destination card that's not the same as the source.
> > 
> 
> As I mentioned in other reply, this RFC is mainly focused on core logic
> of live migration which include implementation for pre-copy, stop-n-copy
> and resume phases, defining device states.
> 
> >> +enum {
> >> +    VFIO_DEVICE_STATE_NONE,
> >> +    VFIO_DEVICE_STATE_RUNNING,
> >> +    VFIO_DEVICE_STATE_MIGRATION_SETUP,
> >> +    VFIO_DEVICE_STATE_MIGRATION_PRECOPY_ACTIVE,
> >> +    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY_ACTIVE,
> >> +    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
> >> +    VFIO_DEVICE_STATE_MIGRATION_RESUME,
> >> +    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
> >> +    VFIO_DEVICE_STATE_MIGRATION_FAILED,
> >> +    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
> >> +};
> > 
> > That's an interesting merge of QEMU's run-state and it's migration
> > state.
> > 
> > You probably need to define which transitions you take as legal.
> >
> 
> In reply to Alex, I had listed down allowable state transitions.

Yes; we regularly find that we have to add extra transitions we hadn't
thought of.

Can you think of any way of making postcopy work with your devices?
It's tricky because we'd have to find a way to stop your devices
accessing memory that hasn't arrived yet.

Dave

> Thanks,
> Kirti
> 
> > Dave
> > 
> >> +
> >> +#define VFIO_DEVICE_MIGRATION_INFO _IO(VFIO_TYPE, VFIO_BASE + 17)
> >> +
> >>  /* -------- API for Type1 VFIO IOMMU -------- */
> >>  
> >>  /**
> >> -- 
> >> 2.7.0
> >>
> >>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v1 1/4] VFIO KABI for migration interface
@ 2018-10-18  9:23         ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 36+ messages in thread
From: Dr. David Alan Gilbert @ 2018-10-18  9:23 UTC (permalink / raw)
  To: Kirti Wankhede; +Cc: alex.williamson, cjia, qemu-devel, kvm, quintela

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> 
> 
> On 10/17/2018 3:39 PM, Dr. David Alan Gilbert wrote:
> > * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> >> - Added vfio_device_migration_info structure to use interact with vendor
> >>   driver.
> >> - Different flags are used to get or set migration related information
> >>   from/to vendor driver.
> >> Flag VFIO_MIGRATION_PROBE: To query if feature is supported
> >> Flag VFIO_MIGRATION_GET_REGION: To get migration region info
> >> Flag VFIO_MIGRATION_SET_STATE: To convey device state in vendor driver
> >> Flag VFIO_MIGRATION_GET_PENDING: To get pending bytes yet to be migrated
> >>   from vendor driver
> >> Flag VFIO_MIGRATION_GET_BUFFER: On this flag, vendor driver should write
> >>   data to migration region and return number of bytes written in the region
> >> Flag VFIO_MIGRATION_SET_BUFFER: In migration resume path, user space app
> >>   writes to migration region and communicates it to vendor driver with
> >>   this ioctl with this flag.
> >> Flag VFIO_MIGRATION_GET_DIRTY_PFNS: Get bitmap of dirty pages from vendor
> >>   driver from given start address
> >>
> >> - Added enum for possible device states.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>  linux-headers/linux/vfio.h | 91 ++++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 91 insertions(+)
> >>
> >> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >> index 3615a269d378..8e9045ed9aa8 100644
> >> --- a/linux-headers/linux/vfio.h
> >> +++ b/linux-headers/linux/vfio.h
> >> @@ -602,6 +602,97 @@ struct vfio_device_ioeventfd {
> >>  
> >>  #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 16)
> >>  
> >> +/**
> >> + * VFIO_DEVICE_MIGRATION_INFO - _IOW(VFIO_TYPE, VFIO_BASE + 17,
> >> + *                              struct vfio_device_migration_info)
> > 
> > This is a little odd; what you really have here is 7 different
> > operations that can be performed; they're independent operations all
> > relating to part of migration; so instead of a 'flag' it's more of an
> > operation type, and there's no reason to make them individual bit flags
> > - for edxample there's no reason you'd want to OR together
> > MIGRATION_GET_REGION and MIGRATION_GET_PENDING - they're just
> > independent.
> > (Similarly you could just make them 7 ioctls)
> > 
> 
> Right, all flags are independent commands. But I tried to use one ioctl
> number.
> 
> > I guess what you're trying to do here is a direct 1-1 mapping of qemu's
> > struct SaveVMHandlers interface to devices.
> > That's not necessarily a bad idea, but remember it's not a stable
> > interface, so QEMU will change it over time and you'll have to then
> > figure out how to shim these interfaces to it, so it's worth keeping
> > that in mind, just in case you can make these interfaces more general.
> > 
> 
> Alex suggested using sparse region instead of ioctl, I'm making note of
> your point to define structures when implementing this interface using
> sparse mmap region.
> 
> 
> >> + * Flag VFIO_MIGRATION_PROBE:
> >> + *      To query if feature is supported
> >> + *
> >> + * Flag VFIO_MIGRATION_GET_REGION:
> >> + *      To get migration region info
> >> + *      region_index [output] : region index to be used for migration region
> >> + *      size [output]: size of migration region
> >> + *
> >> + * Flag VFIO_MIGRATION_SET_STATE:
> >> + *      To set device state in vendor driver
> >> + *      device_state [input] : User space app sends device state to vendor
> >> + *           driver on state change
> >> + *
> >> + * Flag VFIO_MIGRATION_GET_PENDING:
> >> + *      To get pending bytes yet to be migrated from vendor driver
> >> + *      threshold_size [Input] : threshold of buffer in User space app.
> >> + *      pending_precopy_only [output] : pending data which must be migrated in
> >> + *          precopy phase or in stopped state, in other words - before target
> >> + *          vm start
> >> + *      pending_compatible [output] : pending data which may be migrated in any
> >> + *           phase
> >> + *      pending_postcopy_only [output] : pending data which must be migrated in
> >> + *           postcopy phase or in stopped state, in other words - after source
> >> + *           vm stop
> >> + *      Sum of pending_precopy_only, pending_compatible and
> >> + *      pending_postcopy_only is the whole amount of pending data.
> >> + *
> >> + * Flag VFIO_MIGRATION_GET_BUFFER:
> >> + *      On this flag, vendor driver should write data to migration region and
> >> + *      return number of bytes written in the region.
> >> + *      bytes_written [output] : number of bytes written in migration buffer by
> >> + *          vendor driver
> >> + *
> >> + * Flag VFIO_MIGRATION_SET_BUFFER:
> >> + *      In the migration resume path, the user space app writes data to the
> >> + *      migration region and notifies the vendor driver via this flag.
> >> + *      bytes_written [input] : number of bytes written in migration buffer
> >> + *          by user space app.
> > 
> > I guess we call getbuffer/setbuffer multiple times.  Is there anything
> > that needs to be passed to get the data to line up (e.g. offset into the
> > data)
> > 
> 
> I think the vendor driver can keep track of the offset itself, since it
> knows what data has been copied and what remains.

OK.
I'm not sure if your 'threshold size' definition matches qemu's; QEMU's
idea is a threshold below which we should stop the precopy phase.

I'm also not too sure how your GET_BUFFER works;  if GET_BUFFER returns
a full buffer, I guess we have to call it again to get more data - how
do I know I need to call it again?

(Some of these things would be clearer if you posted an RFC for the QEMU
side as well).

> 
> >> + * Flag VFIO_MIGRATION_GET_DIRTY_PFNS
> >> + *      Get bitmap of dirty pages from vendor driver from given start address.
> >> + *      start_addr [Input] : start address
> >> + *      pfn_count [Input] : Total pfn count from start_addr for which dirty
> >> + *          bitmap is requested
> >> + *      dirty_bitmap [Output] : bitmap memory allocated by user space
> >> + *           application, vendor driver should return the bitmap with bits set
> >> + *           only for pages to be marked dirty.
> >> + * Return: 0 on success, -errno on failure.
> >> + */
> >> +
> >> +struct vfio_device_migration_info {
> >> +	__u32 argsz;
> >> +	__u32 flags;
> >> +#define VFIO_MIGRATION_PROBE            (1 << 0)
> >> +#define VFIO_MIGRATION_GET_REGION       (1 << 1)
> >> +#define VFIO_MIGRATION_SET_STATE        (1 << 2)
> >> +#define VFIO_MIGRATION_GET_PENDING      (1 << 3)
> >> +#define VFIO_MIGRATION_GET_BUFFER       (1 << 4)
> >> +#define VFIO_MIGRATION_SET_BUFFER       (1 << 5)
> >> +#define VFIO_MIGRATION_GET_DIRTY_PFNS   (1 << 6)
> >> +	__u32 region_index;		/* region index */
> >> +	__u64 size;			/* size */
> >> +	__u32 device_state;		/* VFIO device state */
> >> +	__u64 pending_precopy_only;
> >> +	__u64 pending_compatible;
> >> +	__u64 pending_postcopy_only;
> >> +	__u64 threshold_size;
> >> +	__u64 bytes_written;
> >> +	__u64 start_addr;
> >> +	__u64 pfn_count;
> >> +	__u8  dirty_bitmap[];
> >> +};
> > 
> > This feels like it should be separate types for the different calls.
> > 
> > Also, is there no state to tell the userspace what version of migration
> > data we have?  What happens if I try and load a migration state
> > from an older driver into a newer one, or the other way around,
> > or into a destination card that's not the same as the source.
> > 
> 
> As I mentioned in the other reply, this RFC is mainly focused on the core
> logic of live migration, which includes the implementation of the pre-copy,
> stop-and-copy and resume phases, and defining the device states.
> 
> >> +enum {
> >> +    VFIO_DEVICE_STATE_NONE,
> >> +    VFIO_DEVICE_STATE_RUNNING,
> >> +    VFIO_DEVICE_STATE_MIGRATION_SETUP,
> >> +    VFIO_DEVICE_STATE_MIGRATION_PRECOPY_ACTIVE,
> >> +    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY_ACTIVE,
> >> +    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
> >> +    VFIO_DEVICE_STATE_MIGRATION_RESUME,
> >> +    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
> >> +    VFIO_DEVICE_STATE_MIGRATION_FAILED,
> >> +    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
> >> +};
> > 
> > That's an interesting merge of QEMU's run-state and it's migration
> > state.
> > 
> > You probably need to define which transitions you take as legal.
> >
> 
> In my reply to Alex, I listed the allowable state transitions.

Yes; we regularly find that we have to add extra transitions we hadn't
thought of.

Can you think of any way of making postcopy work with your devices?
It's tricky because we'd have to find a way to stop your devices
accessing memory that hasn't arrived yet.

Dave

> Thanks,
> Kirti
> 
> > Dave
> > 
> >> +
> >> +#define VFIO_DEVICE_MIGRATION_INFO _IO(VFIO_TYPE, VFIO_BASE + 17)
> >> +
> >>  /* -------- API for Type1 VFIO IOMMU -------- */
> >>  
> >>  /**
> >> -- 
> >> 2.7.0
> >>
> >>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Thread overview: 36+ messages
2018-10-16 18:12 [RFC PATCH v1 0/4] Add migration support for VFIO device Kirti Wankhede
2018-10-16 18:12 ` [RFC PATCH v1 1/4] VFIO KABI for migration interface Kirti Wankhede
2018-10-16 22:34   ` Alex Williamson
2018-10-17 20:47     ` Kirti Wankhede
2018-10-17 23:14       ` Alex Williamson
2018-10-18  1:49       ` Gonglei (Arei)
2018-10-17  9:06   ` Christoph Hellwig
2018-10-17 10:09   ` Dr. David Alan Gilbert
2018-10-17 21:17     ` Kirti Wankhede
2018-10-18  9:23       ` Dr. David Alan Gilbert
2018-10-16 18:12 ` [RFC PATCH v1 2/4] Add migration functions for VFIO devices Kirti Wankhede
2018-10-17  9:00   ` Cornelia Huck
2018-10-17 21:03     ` Kirti Wankhede
2018-10-16 18:12 ` [RFC PATCH v1 3/4] Add vfio_listerner_log_sync to mark dirty pages Kirti Wankhede
2018-10-16 18:12 ` [RFC PATCH v1 4/4] Make vfio-pci device migration capable Kirti Wankhede
2018-10-17  8:49 ` [RFC PATCH v1 0/4] Add migration support for VFIO device Cornelia Huck
2018-10-17 20:59   ` Kirti Wankhede
2018-10-18  2:41 ` Tian, Kevin
