* [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device
@ 2019-06-20 14:37 Kirti Wankhede
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 01/13] vfio: KABI for migration interface Kirti Wankhede
                   ` (13 more replies)
  0 siblings, 14 replies; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-20 14:37 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Kirti Wankhede, Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao,
	eskultet, ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, yulei.zhang, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Add migration support for VFIO device

This patch set includes the following patches:
- Define KABI for VFIO device for migration support.
- Added save and restore functions for the PCI configuration space.
- Generic migration functionality for VFIO devices:
  * This patch set adds functionality only for PCI devices, but it can be
    extended to other VFIO devices.
  * Added all the basic functions required for the pre-copy, stop-and-copy
    and resume phases of migration.
  * Added a state change notifier; from that notifier function, the VFIO
    device's state change is conveyed to the VFIO device driver.
  * During the save setup phase and the resume/load setup phase, the
    migration region is queried and used to read/write VFIO device data.
  * .save_live_pending and .save_live_iterate are implemented to use QEMU's
    functionality of iteration during pre-copy phase.
  * In .save_live_complete_precopy, that is, in the stop-and-copy phase,
    iteration to read data from the VFIO device driver is implemented until
    the pending bytes returned by the driver reach zero.
  * Added a function to get the dirty pages bitmap for the pages which are
    used by the driver.
- Add vfio_listerner_log_sync to mark dirty pages.
- Make the VFIO PCI device migration capable. If the migration region is not
  provided by the driver, migration is blocked.

Below is the flow of state change for live migration where states in brackets
represent VM state, migration state and VFIO device state as:
    (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)

Live migration save path:
        QEMU normal running state
        (RUNNING, _NONE, _RUNNING)
                        |
    migrate_init spawns migration_thread.
    (RUNNING, _SETUP, _RUNNING|_SAVING)
    Migration thread then calls each device's .save_setup()
                        |
    (RUNNING, _ACTIVE, _RUNNING|_SAVING)
    If the device is active, get pending bytes by .save_live_pending().
    If pending bytes >= threshold_size, call .save_live_iterate().
    Data of the VFIO device for the pre-copy phase is copied.
    Iterate until pending bytes converge and are less than the threshold.
                        |
    On migration completion, vCPUs stop and .save_live_complete_precopy is
    called for each active device. The VFIO device is then transitioned to
    the _SAVING state.
    (FINISH_MIGRATE, _DEVICE, _SAVING)
    For the VFIO device, iterate in .save_live_complete_precopy until
    pending data is 0.
    (FINISH_MIGRATE, _DEVICE, _STOPPED)
                        |
    (FINISH_MIGRATE, _COMPLETED, _STOPPED)
    Migration thread schedules the cleanup bottom half and exits.

Live migration resume path:
    Incoming migration calls .load_setup for each device
    (RESTORE_VM, _ACTIVE, _STOPPED)
                        |
    For each device, .load_state is called for that device's section data.
                        |
    At the end, .load_cleanup is called for each device and vCPUs are started.
                        |
        (RUNNING, _NONE, _RUNNING)

Note that:
- Migration post copy is not supported.
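
For reference, the device states named in the flows above are bit
combinations of the flags defined in patch 1 of this series; a minimal
summary (the per-phase mapping restates the flows above, it is not new
behaviour):

    VFIO_DEVICE_STATE_STOPPED   (0)        /* all bits clear */
    VFIO_DEVICE_STATE_RUNNING   (1 << 0)
    VFIO_DEVICE_STATE_SAVING    (1 << 1)
    VFIO_DEVICE_STATE_RESUMING  (1 << 2)

    normal run       : _RUNNING
    pre-copy         : _RUNNING | _SAVING
    stop-and-copy    : _SAVING
    save complete    : _STOPPED
    destination load : _RESUMING (set by .load_setup, patch 9)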

v3 -> v4:
- Added one more bit for _RESUMING flag to be set explicitly.
- data_offset field is read-only for user space application.
- data_size is read on every iteration before reading data from the migration
  region; this removes the assumption that data extends to the end of the
  migration region.
- If the vendor driver supports mappable sparse regions, map those regions
  during the setup state of save/load; similarly, unmap them from the cleanup
  routines.
- Handled a race condition that causes data corruption in the migration
  region during device state save by adding a mutex and serializing the
  save_buffer and get_dirty_pages routines.
- Skip calling the get_dirty_pages routine for mapped MMIO regions of the
  device.
- Added trace events.
- Split into multiple functional patches.

v2 -> v3:
- Removed enum of VFIO device states. Defined VFIO device state with 2 bits.
- Re-structured vfio_device_migration_info to keep it minimal, and defined
  the action on read and write access for its members.

v1 -> v2:
- Defined MIGRATION region type and sub-type which should be used with region
  type capability.
- Re-structured vfio_device_migration_info. This structure will be placed at 0th
  offset of migration region.
- Replaced ioctl with read/write for trapped part of migration region.
- Added support for both types of access, trapped or mmapped, for the data
  section of the region.
- Moved PCI device functions to pci file.
- Added iteration to get the dirty page bitmap until the bitmap for all
  requested pages is copied.

Thanks,
Kirti


Kirti Wankhede (13):
  vfio: KABI for migration interface
  vfio: Add function to unmap VFIO region
  vfio: Add save and load functions for VFIO PCI devices
  vfio: Add migration region initialization and finalize function
  vfio: Add VM state change handler to know state of VM
  vfio: Add migration state change notifier
  vfio: Register SaveVMHandlers for VFIO device
  vfio: Add save state functions to SaveVMHandlers
  vfio: Add load state functions to SaveVMHandlers
  vfio: Add function to get dirty page list
  vfio: Add vfio_listerner_log_sync to mark dirty pages
  vfio: Make vfio-pci device migration capable.
  vfio: Add trace events in migration code path

 hw/vfio/Makefile.objs         |   2 +-
 hw/vfio/common.c              |  55 +++
 hw/vfio/migration.c           | 815 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                 | 126 ++++++-
 hw/vfio/pci.h                 |  29 ++
 hw/vfio/trace-events          |  19 +
 include/hw/vfio/vfio-common.h |  22 ++
 linux-headers/linux/vfio.h    |  71 ++++
 8 files changed, 1132 insertions(+), 7 deletions(-)
 create mode 100644 hw/vfio/migration.c

-- 
2.7.0




* [Qemu-devel] [PATCH v4 01/13] vfio: KABI for migration interface
  2019-06-20 14:37 [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device Kirti Wankhede
@ 2019-06-20 14:37 ` Kirti Wankhede
  2019-06-20 17:18   ` Alex Williamson
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 02/13] vfio: Add function to unmap VFIO region Kirti Wankhede
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-20 14:37 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Kirti Wankhede, Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao,
	eskultet, ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, yulei.zhang, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

- Defined MIGRATION region type and sub-type.
- Used 3 bits to define VFIO device states.
    Bit 0 => _RUNNING
    Bit 1 => _SAVING
    Bit 2 => _RESUMING
    Combination of these bits defines the VFIO device's state during
    migration:
    _STOPPED => All bits 0 indicates the VFIO device is stopped.
    _RUNNING => Normal VFIO device running state.
    _SAVING | _RUNNING => vCPUs are running and the VFIO device is running,
                          but it has started saving device state, i.e. the
                          pre-copy state.
    _SAVING  => vCPUs are stopped and the VFIO device should be stopped and
                should save its device state, i.e. the stop-and-copy state.
    _RESUMING => VFIO device resuming state.
    _SAVING | _RESUMING => Invalid state if both _SAVING and _RESUMING bits
                           are set.
- Defined vfio_device_migration_info structure which will be placed at 0th
  offset of migration region to get/set VFIO device related information.
  Defined members of structure and usage on read/write access:
    * device_state: (read/write)
        To convey the VFIO device state to be transitioned to. Only 3 bits
        are used as of now.
    * pending_bytes: (read only)
        To get the number of bytes yet to be migrated for the VFIO device.
    * data_offset: (read only)
        To get the offset in the migration region from where data exists
        during the _SAVING state, and where data should be written by the
        user space application during the _RESUMING state.
    * data_size: (read/write)
        To get and set the size of the data copied in the migration region
        during the _SAVING and _RESUMING states.
    * start_pfn, page_size, total_pfns: (write only)
        To request the bitmap of dirty pages from the vendor driver for
        total_pfns starting from the given start address (start_pfn).
    * copied_pfns: (read only)
        To get the number of pfns whose bitmap was copied to the migration
        region. The vendor driver should copy the bitmap with bits set only
        for the pages to be marked dirty in the migration region. The vendor
        driver should return 0 if there are no dirty pages in the requested
        range, and -1 to mark all pages in the section as dirty.

Migration region looks like:
 ------------------------------------------------------------------
|vfio_device_migration_info|    data section                      |
|                          |     ///////////////////////////////  |
 ------------------------------------------------------------------
 ^                              ^                              ^
 offset 0-trapped part        data_offset                 data_size

The vfio_device_migration_info structure is always at the start of the
region and is followed by the data section, so data_offset will always be
non-zero. The offset from where data is copied is decided by the kernel
driver; the data section can be trapped or mmapped depending on how the
kernel driver defines it. If mmapped, data_offset should be page aligned,
whereas the initial section which contains the vfio_device_migration_info
structure might not end at a page-aligned offset.
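
To make the intended access pattern concrete, here is a minimal user-space
sketch of one save iteration and one dirty-bitmap request against this
structure. It is not part of the patch: the helper names, the fd/region_off
parameters, the trapped-only data access and the write ordering in the
bitmap request are assumptions, and error handling is omitted.

#include <stddef.h>        /* offsetof */
#include <stdlib.h>        /* malloc, free */
#include <unistd.h>        /* pread, pwrite */
#include <linux/vfio.h>

#define MIG_OFF(field) offsetof(struct vfio_device_migration_info, field)

/* One pre-copy save iteration, assuming a trapped (non-mmapped) data
 * section. fd is the VFIO device fd; region_off is the migration region's
 * offset within it. Returns the number of bytes saved, 0 if none pending. */
static __u64 save_one_iteration(int fd, off_t region_off)
{
    __u64 pending, data_offset, data_size;
    void *buf;

    pread(fd, &pending, sizeof(pending),
          region_off + MIG_OFF(pending_bytes));
    if (!pending) {
        return 0;
    }
    /* reading data_offset asks the vendor driver to stage device data */
    pread(fd, &data_offset, sizeof(data_offset),
          region_off + MIG_OFF(data_offset));
    pread(fd, &data_size, sizeof(data_size),
          region_off + MIG_OFF(data_size));

    buf = malloc(data_size);
    pread(fd, buf, data_size, region_off + data_offset);
    /* ... forward buf/data_size to the migration stream here ... */
    free(buf);
    return data_size;
}

/* Request the dirty bitmap for total_pfns pages starting at start_pfn. */
static __s64 get_dirty_bitmap(int fd, off_t region_off, __u64 start_pfn,
                              __u64 page_size, __u64 total_pfns)
{
    __s64 copied = 0;

    pwrite(fd, &page_size, sizeof(page_size),
           region_off + MIG_OFF(page_size));
    pwrite(fd, &start_pfn, sizeof(start_pfn),
           region_off + MIG_OFF(start_pfn));
    pwrite(fd, &total_pfns, sizeof(total_pfns),
           region_off + MIG_OFF(total_pfns));
    /* 0 => no dirty pages; -1 => treat all pages in the range as dirty */
    pread(fd, &copied, sizeof(copied),
          region_off + MIG_OFF(copied_pfns));
    return copied;
}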

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 linux-headers/linux/vfio.h | 71 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 24f505199f83..274ec477eb82 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -372,6 +372,77 @@ struct vfio_region_gfx_edid {
  */
 #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
 
+/* Migration region type and sub-type */
+#define VFIO_REGION_TYPE_MIGRATION	        (2)
+#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
+
+/**
+ * Structure vfio_device_migration_info is placed at 0th offset of
+ * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
+ * information. Field accesses for this structure are only supported at their
+ * native width and alignment; otherwise, the access should return an error.
+ *
+ * device_state: (read/write)
+ *      To indicate to the vendor driver the state the VFIO device should be
+ *      transitioned to. If the device state transition fails, the write to
+ *      this field returns an error. It consists of 3 bits:
+ *      - If bit 0 is set, it indicates the _RUNNING state; when it is clear,
+ *        that indicates the _STOPPED state. When the device is changed to
+ *        _STOPPED, the driver should stop the device before the write
+ *        returns.
+ *      - If bit 1 set, indicates _SAVING state.
+ *      - If bit 2 set, indicates _RESUMING state.
+ *
+ * pending_bytes: (read only)
+ *      Number of pending bytes yet to be migrated from the vendor driver.
+ *
+ * data_offset: (read only)
+ *      The user application should read data_offset, which is the offset in
+ *      the migration region from where it should read data during the
+ *      _SAVING state or write data during the _RESUMING state.
+ *
+ * data_size: (read/write)
+ *      The user application should read data_size to know the amount of
+ *      data copied to the migration region during the _SAVING state, and
+ *      should write the size of the data copied to the migration region
+ *      during the _RESUMING state.
+ *
+ * start_pfn: (write only)
+ *      Start pfn from which to get the bitmap of dirty pages from the
+ *      vendor driver during the _SAVING state.
+ *
+ * page_size: (write only)
+ *      The user application should write the page size of the pfns.
+ *
+ * total_pfns: (write only)
+ *      Total pfn count from start_pfn for which dirty bitmap is requested.
+ *
+ * copied_pfns: (read only)
+ *      The pfn count for which the dirty bitmap was copied to the migration
+ *      region.
+ *      The vendor driver should copy the bitmap with bits set only for the
+ *      pages to be marked dirty in the migration region.
+ *      The vendor driver should return 0 if there are no dirty pages in the
+ *      requested range.
+ *      The vendor driver should return -1 to mark all pages in the section
+ *      as dirty.
+ */
+
+struct vfio_device_migration_info {
+        __u32 device_state;         /* VFIO device state */
+#define VFIO_DEVICE_STATE_STOPPED   (0)
+#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
+#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
+#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
+#define VFIO_DEVICE_STATE_INVALID   (VFIO_DEVICE_STATE_SAVING | \
+                                     VFIO_DEVICE_STATE_RESUMING)
+        __u32 reserved;
+        __u64 pending_bytes;
+        __u64 data_offset;
+        __u64 data_size;
+        __u64 start_pfn;
+        __u64 page_size;
+        __u64 total_pfns;
+        __s64 copied_pfns;
+} __attribute__((packed));
+
 /*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within
-- 
2.7.0




* [Qemu-devel] [PATCH v4 02/13] vfio: Add function to unmap VFIO region
  2019-06-20 14:37 [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device Kirti Wankhede
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 01/13] vfio: KABI for migration interface Kirti Wankhede
@ 2019-06-20 14:37 ` Kirti Wankhede
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 03/13] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-20 14:37 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Kirti Wankhede, Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao,
	eskultet, ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, yulei.zhang, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

This function is used in the following patches in this series.
The migration region is mmapped when migration starts and will be unmapped
when migration completes.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/common.c              | 20 ++++++++++++++++++++
 hw/vfio/trace-events          |  1 +
 include/hw/vfio/vfio-common.h |  1 +
 3 files changed, 22 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index a859298fdad9..de74dae8d6a6 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -964,6 +964,26 @@ int vfio_region_mmap(VFIORegion *region)
     return 0;
 }
 
+void vfio_region_unmap(VFIORegion *region)
+{
+    int i;
+
+    if (!region->mem) {
+        return;
+    }
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        trace_vfio_region_unmap(memory_region_name(&region->mmaps[i].mem),
+                                region->mmaps[i].offset,
+                                region->mmaps[i].offset +
+                                region->mmaps[i].size - 1);
+        memory_region_del_subregion(region->mem, &region->mmaps[i].mem);
+        munmap(region->mmaps[i].mmap, region->mmaps[i].size);
+        object_unparent(OBJECT(&region->mmaps[i].mem));
+        region->mmaps[i].mmap = NULL;
+    }
+}
+
 void vfio_region_exit(VFIORegion *region)
 {
     int i;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index b1ef55a33ffd..8cdc27946cb8 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -111,6 +111,7 @@ vfio_region_mmap(const char *name, unsigned long offset, unsigned long end) "Reg
 vfio_region_exit(const char *name, int index) "Device %s, region %d"
 vfio_region_finalize(const char *name, int index) "Device %s, region %d"
 vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
+vfio_region_unmap(const char *name, unsigned long offset, unsigned long end) "Region %s unmap [0x%lx - 0x%lx]"
 vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Device %s region %d: %d sparse mmap entries"
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index a88b69b6750e..ef078cf60ef9 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -176,6 +176,7 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
                       int index, const char *name);
 int vfio_region_mmap(VFIORegion *region);
 void vfio_region_mmaps_set_enabled(VFIORegion *region, bool enabled);
+void vfio_region_unmap(VFIORegion *region);
 void vfio_region_exit(VFIORegion *region);
 void vfio_region_finalize(VFIORegion *region);
 void vfio_reset_handler(void *opaque);
-- 
2.7.0




* [Qemu-devel] [PATCH v4 03/13] vfio: Add save and load functions for VFIO PCI devices
  2019-06-20 14:37 [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device Kirti Wankhede
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 01/13] vfio: KABI for migration interface Kirti Wankhede
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 02/13] vfio: Add function to unmap VFIO region Kirti Wankhede
@ 2019-06-20 14:37 ` Kirti Wankhede
  2019-06-21  0:12   ` Yan Zhao
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 04/13] vfio: Add migration region initialization and finalize function Kirti Wankhede
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-20 14:37 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Kirti Wankhede, Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao,
	eskultet, ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, yulei.zhang, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

These functions save and restore PCI device specific data, i.e. the config
space of the PCI device.
Save and restore were tested with the MSI and MSI-X interrupt types.
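
For reference, the stream layout these functions produce and consume
(derived from the hunks below; values are written with qemu_put_be32/be16)
is:

    be32 x 6 : BAR0..BAR5 (PCI_BASE_ADDRESS_0 + i * 4)
    be32     : interrupt type (vdev->interrupt)
    if MSI   : be32 msi_addr_lo, be32 msi_addr_hi (0 for 32-bit MSI),
               be32 msi_data
    if MSI-X : be16 enable/mask-all bits, then the msix_save() payload
    be16     : PCI_COMMAND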

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/pci.c | 112 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.h |  29 +++++++++++++++
 2 files changed, 141 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index ce3fe96efe2c..09a0821a5b1c 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1187,6 +1187,118 @@ void vfio_pci_write_config(PCIDevice *pdev,
     }
 }
 
+void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+    PCIDevice *pdev = &vdev->pdev;
+    uint16_t pci_cmd;
+    int i;
+
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        uint32_t bar;
+
+        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
+        qemu_put_be32(f, bar);
+    }
+
+    qemu_put_be32(f, vdev->interrupt);
+    if (vdev->interrupt == VFIO_INT_MSI) {
+        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
+        bool msi_64bit;
+
+        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+                                            2);
+        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
+
+        msi_addr_lo = pci_default_read_config(pdev,
+                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
+        qemu_put_be32(f, msi_addr_lo);
+
+        if (msi_64bit) {
+            msi_addr_hi = pci_default_read_config(pdev,
+                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
+                                             4);
+        }
+        qemu_put_be32(f, msi_addr_hi);
+
+        msi_data = pci_default_read_config(pdev,
+                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
+                2);
+        qemu_put_be32(f, msi_data);
+    } else if (vdev->interrupt == VFIO_INT_MSIX) {
+        uint16_t offset;
+
+        /* save enable bit and maskall bit */
+        offset = pci_default_read_config(pdev,
+                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
+        qemu_put_be16(f, offset);
+        msix_save(pdev, f);
+    }
+    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
+    qemu_put_be16(f, pci_cmd);
+}
+
+void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+    PCIDevice *pdev = &vdev->pdev;
+    uint32_t interrupt_type;
+    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
+    uint16_t pci_cmd;
+    bool msi_64bit;
+    int i;
+
+    /* restore pci bar configuration */
+    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
+    /* disable IO and memory decoding while the BARs are rewritten */
+    vfio_pci_write_config(pdev, PCI_COMMAND,
+                          pci_cmd & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        uint32_t bar = qemu_get_be32(f);
+
+        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
+    }
+    vfio_pci_write_config(pdev, PCI_COMMAND,
+                          pci_cmd | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
+
+    interrupt_type = qemu_get_be32(f);
+
+    if (interrupt_type == VFIO_INT_MSI) {
+        /* restore msi configuration */
+        msi_flags = pci_default_read_config(pdev,
+                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
+        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
+
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+                              msi_flags & ~PCI_MSI_FLAGS_ENABLE, 2);
+
+        msi_addr_lo = qemu_get_be32(f);
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
+                              msi_addr_lo, 4);
+
+        msi_addr_hi = qemu_get_be32(f);
+        if (msi_64bit) {
+            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
+                                  msi_addr_hi, 4);
+        }
+        msi_data = qemu_get_be32(f);
+        vfio_pci_write_config(pdev,
+                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
+                msi_data, 2);
+
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
+    } else if (interrupt_type == VFIO_INT_MSIX) {
+        uint16_t offset = qemu_get_be16(f);
+
+        /* load enable bit and maskall bit */
+        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
+                              offset, 2);
+        msix_load(pdev, f);
+    }
+    pci_cmd = qemu_get_be16(f);
+    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
+}
+
 /*
  * Interrupt setup
  */
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 834a90d64686..847be5f56478 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -19,6 +19,7 @@
 #include "qemu/queue.h"
 #include "qemu/timer.h"
 
+#ifdef CONFIG_LINUX
 #define PCI_ANY_ID (~0)
 
 struct VFIOPCIDevice;
@@ -202,4 +203,32 @@ void vfio_display_reset(VFIOPCIDevice *vdev);
 int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
 void vfio_display_finalize(VFIOPCIDevice *vdev);
 
+void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f);
+void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f);
+
+static inline Object *vfio_pci_get_object(VFIODevice *vbasedev)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+
+    return OBJECT(vdev);
+}
+
+#else
+static inline void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+    g_assert(false);
+}
+
+static inline void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+    g_assert(false);
+}
+
+static inline Object *vfio_pci_get_object(VFIODevice *vbasedev)
+{
+    return NULL;
+}
+
+#endif
+
 #endif /* HW_VFIO_VFIO_PCI_H */
-- 
2.7.0




* [Qemu-devel] [PATCH v4 04/13] vfio: Add migration region initialization and finalize function
  2019-06-20 14:37 [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (2 preceding siblings ...)
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 03/13] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
@ 2019-06-20 14:37 ` Kirti Wankhede
  2019-06-24 14:00   ` Cornelia Huck
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 05/13] vfio: Add VM state change handler to know state of VM Kirti Wankhede
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-20 14:37 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Kirti Wankhede, Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao,
	eskultet, ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, yulei.zhang, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

- Migration functions are implemented for the VFIO_DEVICE_TYPE_PCI device in
  this patch series.
- Whether a VFIO device supports migration is decided based on the migration
  region query. If the migration region query is successful, migration is
  supported; otherwise migration is blocked.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/Makefile.objs         |   2 +-
 hw/vfio/migration.c           | 137 ++++++++++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h |  14 +++++
 3 files changed, 152 insertions(+), 1 deletion(-)
 create mode 100644 hw/vfio/migration.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index abad8b818c9b..36033d1437c5 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,4 +1,4 @@
-obj-y += common.o spapr.o
+obj-y += common.o spapr.o migration.o
 obj-$(CONFIG_VFIO_PCI) += pci.o pci-quirks.o display.o
 obj-$(CONFIG_VFIO_CCW) += ccw.o
 obj-$(CONFIG_VFIO_PLATFORM) += platform.o
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
new file mode 100644
index 000000000000..ba58d9253d26
--- /dev/null
+++ b/hw/vfio/migration.c
@@ -0,0 +1,137 @@
+/*
+ * Migration support for VFIO devices
+ *
+ * Copyright NVIDIA, Inc. 2019
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <linux/vfio.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "cpu.h"
+#include "migration/migration.h"
+#include "migration/qemu-file.h"
+#include "migration/register.h"
+#include "migration/blocker.h"
+#include "migration/misc.h"
+#include "qapi/error.h"
+#include "exec/ramlist.h"
+#include "exec/ram_addr.h"
+#include "pci.h"
+
+static void vfio_migration_region_exit(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+
+    if (!migration) {
+        return;
+    }
+
+    if (migration->region.buffer.size) {
+        vfio_region_exit(&migration->region.buffer);
+        vfio_region_finalize(&migration->region.buffer);
+    }
+}
+
+static int vfio_migration_region_init(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    Object *obj = NULL;
+    int ret = -EINVAL;
+
+    if (!migration) {
+        return ret;
+    }
+
+    /* Migration support added for PCI device only */
+    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
+        obj = vfio_pci_get_object(vbasedev);
+    }
+
+    if (!obj) {
+        return ret;
+    }
+
+    ret = vfio_region_setup(obj, vbasedev, &migration->region.buffer,
+                            migration->region.index, "migration");
+    if (ret) {
+        error_report("Failed to setup VFIO migration region %d: %s",
+                      migration->region.index, strerror(-ret));
+        goto err;
+    }
+
+    if (!migration->region.buffer.size) {
+        ret = -EINVAL;
+        error_report("Invalid region size of VFIO migration region %d: %s",
+                     migration->region.index, strerror(-ret));
+        goto err;
+    }
+
+    return 0;
+
+err:
+    vfio_migration_region_exit(vbasedev);
+    return ret;
+}
+
+static int vfio_migration_init(VFIODevice *vbasedev,
+                               struct vfio_region_info *info)
+{
+    int ret;
+
+    vbasedev->migration = g_new0(VFIOMigration, 1);
+    vbasedev->migration->region.index = info->index;
+
+    ret = vfio_migration_region_init(vbasedev);
+    if (ret) {
+        error_report("Failed to initialise migration region");
+        return ret;
+    }
+
+    return 0;
+}
+
+/* ---------------------------------------------------------------------- */
+
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
+{
+    struct vfio_region_info *info;
+    int ret;
+
+    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
+                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
+    if (ret) {
+        Error *local_err = NULL;
+
+        error_setg(&vbasedev->migration_blocker,
+                   "VFIO device doesn't support migration");
+        ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            error_free(vbasedev->migration_blocker);
+            return ret;
+        }
+    } else {
+        return vfio_migration_init(vbasedev, info);
+    }
+
+    return 0;
+}
+
+void vfio_migration_finalize(VFIODevice *vbasedev)
+{
+    if (!vbasedev->migration) {
+        return;
+    }
+
+    if (vbasedev->migration_blocker) {
+        migrate_del_blocker(vbasedev->migration_blocker);
+        error_free(vbasedev->migration_blocker);
+    }
+
+    vfio_migration_region_exit(vbasedev);
+    g_free(vbasedev->migration);
+}
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index ef078cf60ef9..1374a03470d8 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -57,6 +57,15 @@ typedef struct VFIORegion {
     uint8_t nr; /* cache the region number for debug */
 } VFIORegion;
 
+typedef struct VFIOMigration {
+    struct {
+        VFIORegion buffer;
+        uint32_t index;
+    } region;
+    uint64_t pending_bytes;
+    QemuMutex lock;
+} VFIOMigration;
+
 typedef struct VFIOAddressSpace {
     AddressSpace *as;
     QLIST_HEAD(, VFIOContainer) containers;
@@ -118,6 +127,8 @@ typedef struct VFIODevice {
     unsigned int num_irqs;
     unsigned int num_regions;
     unsigned int flags;
+    VFIOMigration *migration;
+    Error *migration_blocker;
 } VFIODevice;
 
 struct VFIODeviceOps {
@@ -206,4 +217,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
 int vfio_spapr_remove_window(VFIOContainer *container,
                              hwaddr offset_within_address_space);
 
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
+void vfio_migration_finalize(VFIODevice *vbasedev);
+
 #endif /* HW_VFIO_VFIO_COMMON_H */
-- 
2.7.0




* [Qemu-devel] [PATCH v4 05/13] vfio: Add VM state change handler to know state of VM
  2019-06-20 14:37 [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (3 preceding siblings ...)
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 04/13] vfio: Add migration region initialization and finalize function Kirti Wankhede
@ 2019-06-20 14:37 ` Kirti Wankhede
  2019-06-25 10:29   ` Dr. David Alan Gilbert
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 06/13] vfio: Add migration state change notifier Kirti Wankhede
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-20 14:37 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Kirti Wankhede, Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao,
	eskultet, ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, yulei.zhang, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

The VM state change handler gets called on a change in the VM's state. This is
used to set the VFIO device state to _RUNNING.
The VM state change handler, the migration state change handler and the
log_sync listener are called asynchronously, which sometimes leads to data
corruption in the migration region. Initialise a mutex that is used to
serialize operations on the migration data region during the saving of device
state.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c           | 45 +++++++++++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h |  4 ++++
 2 files changed, 49 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index ba58d9253d26..15af218c23d1 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -77,6 +77,41 @@ err:
     return ret;
 }
 
+static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region.buffer;
+    int ret = 0;
+
+    ret = pwrite(vbasedev->fd, &state, sizeof(state),
+                 region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                              device_state));
+    if (ret < 0) {
+        error_report("Failed to set migration state %d %s",
+                     ret, strerror(errno));
+        return ret;
+    }
+
+    vbasedev->device_state = state;
+    return 0;
+}
+
+static void vfio_vmstate_change(void *opaque, int running, RunState state)
+{
+    VFIODevice *vbasedev = opaque;
+
+    if ((vbasedev->vm_running != running) && running) {
+        int ret;
+
+        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING);
+        if (ret) {
+            error_report("Failed to set state RUNNING");
+        }
+    }
+
+    vbasedev->vm_running = running;
+}
+
 static int vfio_migration_init(VFIODevice *vbasedev,
                                struct vfio_region_info *info)
 {
@@ -91,6 +126,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
         return ret;
     }
 
+    qemu_mutex_init(&vbasedev->migration->lock);
+
+    vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
+                                                          vbasedev);
+
     return 0;
 }
 
@@ -127,11 +167,16 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
         return;
     }
 
+    if (vbasedev->vm_state) {
+        qemu_del_vm_change_state_handler(vbasedev->vm_state);
+    }
+
     if (vbasedev->migration_blocker) {
         migrate_del_blocker(vbasedev->migration_blocker);
         error_free(vbasedev->migration_blocker);
     }
 
+    qemu_mutex_destroy(&vbasedev->migration->lock);
     vfio_migration_region_exit(vbasedev);
     g_free(vbasedev->migration);
 }
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 1374a03470d8..f2392e97fa57 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -29,6 +29,7 @@
 #ifdef CONFIG_LINUX
 #include <linux/vfio.h>
 #endif
+#include "sysemu/sysemu.h"
 
 #define VFIO_MSG_PREFIX "vfio %s: "
 
@@ -129,6 +130,9 @@ typedef struct VFIODevice {
     unsigned int flags;
     VFIOMigration *migration;
     Error *migration_blocker;
+    uint32_t device_state;
+    VMChangeStateEntry *vm_state;
+    int vm_running;
 } VFIODevice;
 
 struct VFIODeviceOps {
-- 
2.7.0




* [Qemu-devel] [PATCH v4 06/13] vfio: Add migration state change notifier
  2019-06-20 14:37 [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (4 preceding siblings ...)
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 05/13] vfio: Add VM state change handler to know state of VM Kirti Wankhede
@ 2019-06-20 14:37 ` Kirti Wankhede
  2019-06-27 10:33   ` Dr. David Alan Gilbert
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 07/13] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-20 14:37 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Kirti Wankhede, Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao,
	eskultet, ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, yulei.zhang, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Added a migration state change notifier to get notifications on migration
state changes. These states are translated to the VFIO device state and
conveyed to the vendor driver.
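
In summary, the translation implemented by the notifier in the hunk below is:

    MIGRATION_STATUS_ACTIVE, device _RUNNING, VM running -> _RUNNING | _SAVING
    MIGRATION_STATUS_ACTIVE, device _RUNNING, VM stopped -> _SAVING
    MIGRATION_STATUS_ACTIVE, device not _RUNNING         -> _RESUMING
    CANCELLING / CANCELLED / FAILED                      -> _RUNNING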

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c           | 49 +++++++++++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h |  1 +
 2 files changed, 50 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 15af218c23d1..7f9858e6c995 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -112,6 +112,48 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state)
     vbasedev->vm_running = running;
 }
 
+static void vfio_migration_state_notifier(Notifier *notifier, void *data)
+{
+    MigrationState *s = data;
+    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
+    int ret;
+
+    switch (s->state) {
+    case MIGRATION_STATUS_ACTIVE:
+        if (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING) {
+            if (vbasedev->vm_running) {
+                ret = vfio_migration_set_state(vbasedev,
+                          VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING);
+                if (ret) {
+                    error_report("Failed to set state RUNNING and SAVING");
+                }
+            } else {
+                ret = vfio_migration_set_state(vbasedev,
+                                               VFIO_DEVICE_STATE_SAVING);
+                if (ret) {
+                    error_report("Failed to set state STOP and SAVING");
+                }
+            }
+        } else {
+            ret = vfio_migration_set_state(vbasedev,
+                                           VFIO_DEVICE_STATE_RESUMING);
+            if (ret) {
+                error_report("Failed to set state RESUMING");
+            }
+        }
+        return;
+
+    case MIGRATION_STATUS_CANCELLING:
+    case MIGRATION_STATUS_CANCELLED:
+    case MIGRATION_STATUS_FAILED:
+        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING);
+        if (ret) {
+            error_report("Failed to set state RUNNING");
+        }
+        return;
+    }
+}
+
 static int vfio_migration_init(VFIODevice *vbasedev,
                                struct vfio_region_info *info)
 {
@@ -131,6 +173,9 @@ static int vfio_migration_init(VFIODevice *vbasedev,
     vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
                                                           vbasedev);
 
+    vbasedev->migration_state.notify = vfio_migration_state_notifier;
+    add_migration_state_change_notifier(&vbasedev->migration_state);
+
     return 0;
 }
 
@@ -167,6 +212,10 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
         return;
     }
 
+    if (vbasedev->migration_state.notify) {
+        remove_migration_state_change_notifier(&vbasedev->migration_state);
+    }
+
     if (vbasedev->vm_state) {
         qemu_del_vm_change_state_handler(vbasedev->vm_state);
     }
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index f2392e97fa57..1d26e6be8d48 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -133,6 +133,7 @@ typedef struct VFIODevice {
     uint32_t device_state;
     VMChangeStateEntry *vm_state;
     int vm_running;
+    Notifier migration_state;
 } VFIODevice;
 
 struct VFIODeviceOps {
-- 
2.7.0




* [Qemu-devel] [PATCH v4 07/13] vfio: Register SaveVMHandlers for VFIO device
  2019-06-20 14:37 [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (5 preceding siblings ...)
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 06/13] vfio: Add migration state change notifier Kirti Wankhede
@ 2019-06-20 14:37 ` Kirti Wankhede
  2019-06-27 10:01   ` Dr. David Alan Gilbert
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 08/13] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-20 14:37 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Kirti Wankhede, Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao,
	eskultet, ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, yulei.zhang, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Define flags to be used as delimiters in the migration file stream.
Added .save_setup and .save_cleanup functions. The migration region is mapped
and unmapped in these functions at the source during the saving or pre-copy
phase.
Set the VFIO device state depending on the VM's state. During live migration,
the VM is running when .save_setup is called, so the _SAVING | _RUNNING state
is set for the VFIO device. During save-restore, the VM is paused, so the
_SAVING state is set for the VFIO device.
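
The delimiter values (defined in the hunk below) pack a fixed prefix with a
16-bit flag number; for example, VFIO_MIG_FLAG_DEV_SETUP_STATE decodes as:

    0xffffffffef100003
      bits 63..32 : 0xffffffff  (MSB 32 bits all 1s)
      bits 31..16 : 0xef10      (emulated/virtual function IO)
      bits 15..0  : 0x0003      (flag number, here DEV_SETUP_STATE)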

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 75 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 7f9858e6c995..fe0887c27664 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -22,6 +22,17 @@
 #include "exec/ram_addr.h"
 #include "pci.h"
 
+/*
+ * Flags used as delimiter:
+ * 0xffffffff => MSB 32-bit all 1s
+ * 0xef10     => emulated (virtual) function IO
+ * 0x0000     => 16-bits reserved for flags
+ */
+#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
+#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
+#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
+#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
+
 static void vfio_migration_region_exit(VFIODevice *vbasedev)
 {
     VFIOMigration *migration = vbasedev->migration;
@@ -96,6 +107,69 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
     return 0;
 }
 
+/* ---------------------------------------------------------------------- */
+
+static int vfio_save_setup(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
+
+    if (migration->region.buffer.mmaps) {
+        qemu_mutex_lock_iothread();
+        ret = vfio_region_mmap(&migration->region.buffer);
+        qemu_mutex_unlock_iothread();
+        if (ret) {
+            error_report("Failed to mmap VFIO migration region %d: %s",
+                         migration->region.index, strerror(-ret));
+            return ret;
+        }
+    }
+
+    if (vbasedev->vm_running) {
+        ret = vfio_migration_set_state(vbasedev,
+                         VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING);
+        if (ret) {
+            error_report("Failed to set state RUNNING and SAVING");
+            return ret;
+        }
+    } else {
+        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_SAVING);
+        if (ret) {
+            error_report("Failed to set state STOP and SAVING");
+            return ret;
+        }
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    return 0;
+}
+
+static void vfio_save_cleanup(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    if (migration->region.buffer.mmaps) {
+        vfio_region_unmap(&migration->region.buffer);
+    }
+}
+
+static SaveVMHandlers savevm_vfio_handlers = {
+    .save_setup = vfio_save_setup,
+    .save_cleanup = vfio_save_cleanup,
+};
+
+/* ---------------------------------------------------------------------- */
+
 static void vfio_vmstate_change(void *opaque, int running, RunState state)
 {
     VFIODevice *vbasedev = opaque;
@@ -169,7 +243,7 @@ static int vfio_migration_init(VFIODevice *vbasedev,
     }
 
     qemu_mutex_init(&vbasedev->migration->lock);
-
+    register_savevm_live(NULL, "vfio", -1, 1, &savevm_vfio_handlers, vbasedev);
     vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
                                                           vbasedev);
 
-- 
2.7.0




* [Qemu-devel] [PATCH v4 08/13] vfio: Add save state functions to SaveVMHandlers
  2019-06-20 14:37 [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (6 preceding siblings ...)
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 07/13] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
@ 2019-06-20 14:37 ` Kirti Wankhede
  2019-06-20 19:25   ` Alex Williamson
                     ` (2 more replies)
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 09/13] vfio: Add load " Kirti Wankhede
                   ` (5 subsequent siblings)
  13 siblings, 3 replies; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-20 14:37 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Kirti Wankhede, Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao,
	eskultet, ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, yulei.zhang, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Added the .save_live_pending, .save_live_iterate and
.save_live_complete_precopy functions. These functions handle the pre-copy
and stop-and-copy phases.

In the _SAVING|_RUNNING device state, i.e. the pre-copy phase:
- read pending_bytes
- read data_offset - this indicates to the kernel driver to write data to the
  staging buffer, which is mmapped.
- read data_size - the amount of data in bytes written by the vendor driver
  in the migration region.
- if the data section is trapped, pread() data_size bytes from data_offset.
- if the data section is mmapped, read the mmapped buffer of size data_size.
- Write the data packet to the file stream as below:
  {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
   VFIO_MIG_FLAG_END_OF_STATE}

In the _SAVING device state, i.e. the stop-and-copy phase:
a. read the config space of the device and save it to the migration file
   stream. This doesn't need to come from the vendor driver; any other
   special config state from the driver can be saved as data in the
   following iterations.
b. read pending_bytes - this indicates to the kernel driver to write data to
   the staging buffer, which is mmapped.
c. read data_size - the amount of data in bytes written by the vendor driver
   in the migration region.
d. if the data section is trapped, pread() data_size bytes from data_offset.
e. if the data section is mmapped, read the mmapped buffer of size data_size.
f. Write the data packet as below:
   {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
g. iterate through steps b to f while (pending_bytes > 0)
h. Write {VFIO_MIG_FLAG_END_OF_STATE}

.save_live_iterate runs outside the iothread lock in the migration case, which
could race with the asynchronous call to get the dirty page list, causing data
corruption in the mapped migration region. A mutex is added here to serialize
the migration buffer read operations. The resulting stream framing is sketched
below.
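
Putting the phases together, the byte stream emitted for one VFIO device
section is framed as follows (a sketch derived from the hunk below; each
flag and data_size is a big-endian 64-bit value):

    .save_setup (patch 7):
        VFIO_MIG_FLAG_DEV_SETUP_STATE
        VFIO_MIG_FLAG_END_OF_STATE
    each .save_live_iterate call (pre-copy):
        VFIO_MIG_FLAG_DEV_DATA_STATE
        data_size, <data_size bytes of device data>
        VFIO_MIG_FLAG_END_OF_STATE
    .save_live_complete_precopy (stop-and-copy):
        VFIO_MIG_FLAG_DEV_CONFIG_STATE
        <device config data, patch 3>
        VFIO_MIG_FLAG_END_OF_STATE
        {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, <data>}  repeated while
                                                           pending_bytes > 0
        VFIO_MIG_FLAG_END_OF_STATE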

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c | 212 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 212 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index fe0887c27664..0a2f30872316 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -107,6 +107,111 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
     return 0;
 }
 
+static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region.buffer;
+    uint64_t data_offset = 0, data_size = 0;
+    int ret;
+
+    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             data_offset));
+    if (ret != sizeof(data_offset)) {
+        error_report("Failed to get migration buffer data offset %d",
+                     ret);
+        return -EINVAL;
+    }
+
+    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             data_size));
+    if (ret != sizeof(data_size)) {
+        error_report("Failed to get migration buffer data size %d",
+                     ret);
+        return -EINVAL;
+    }
+
+    if (data_size > 0) {
+        void *buf = NULL;
+        bool buffer_mmaped = false;
+
+        if (region->mmaps) {
+            int i;
+
+            for (i = 0; i < region->nr_mmaps; i++) {
+                if ((data_offset >= region->mmaps[i].offset) &&
+                    (data_offset < region->mmaps[i].offset +
+                                   region->mmaps[i].size)) {
+                    buf = region->mmaps[i].mmap + (data_offset -
+                                                   region->mmaps[i].offset);
+                    buffer_mmaped = true;
+                    break;
+                }
+            }
+        }
+
+        if (!buffer_mmaped) {
+            buf = g_malloc0(data_size);
+            ret = pread(vbasedev->fd, buf, data_size,
+                        region->fd_offset + data_offset);
+            if (ret != data_size) {
+                error_report("Failed to get migration data %d", ret);
+                g_free(buf);
+                return -EINVAL;
+            }
+        }
+
+        qemu_put_be64(f, data_size);
+        qemu_put_buffer(f, buf, data_size);
+
+        if (!buffer_mmaped) {
+            g_free(buf);
+        }
+        migration->pending_bytes -= data_size;
+    } else {
+        qemu_put_be64(f, data_size);
+    }
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    return data_size;
+}
+
+static int vfio_update_pending(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region.buffer;
+    uint64_t pending_bytes = 0;
+    int ret;
+
+    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             pending_bytes));
+    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
+        error_report("Failed to get pending bytes %d", ret);
+        migration->pending_bytes = 0;
+        return (ret < 0) ? ret : -EINVAL;
+    }
+
+    migration->pending_bytes = pending_bytes;
+    return 0;
+}
+
+static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
+
+    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
+        vfio_pci_save_config(vbasedev, f);
+    }
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    return qemu_file_get_error(f);
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -163,9 +268,116 @@ static void vfio_save_cleanup(void *opaque)
     }
 }
 
+static void vfio_save_pending(QEMUFile *f, void *opaque,
+                              uint64_t threshold_size,
+                              uint64_t *res_precopy_only,
+                              uint64_t *res_compatible,
+                              uint64_t *res_postcopy_only)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    ret = vfio_update_pending(vbasedev);
+    if (ret) {
+        return;
+    }
+
+    if (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING) {
+        *res_precopy_only += migration->pending_bytes;
+    } else {
+        *res_postcopy_only += migration->pending_bytes;
+    }
+    *res_compatible += 0;
+}
+
+static int vfio_save_iterate(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
+
+    qemu_mutex_lock(&migration->lock);
+    ret = vfio_save_buffer(f, vbasedev);
+    qemu_mutex_unlock(&migration->lock);
+
+    if (ret < 0) {
+        error_report("vfio_save_buffer failed %s",
+                     strerror(errno));
+        return ret;
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    return ret;
+}
+
+static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_SAVING);
+    if (ret) {
+        error_report("Failed to set state STOP and SAVING");
+        return ret;
+    }
+
+    ret = vfio_save_device_config_state(f, opaque);
+    if (ret) {
+        return ret;
+    }
+
+    ret = vfio_update_pending(vbasedev);
+    if (ret) {
+        return ret;
+    }
+
+    while (migration->pending_bytes > 0) {
+        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
+        ret = vfio_save_buffer(f, vbasedev);
+        if (ret < 0) {
+            error_report("Failed to save buffer");
+            return ret;
+        } else if (ret == 0) {
+            break;
+        }
+
+        ret = vfio_update_pending(vbasedev);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOPPED);
+    if (ret) {
+        error_report("Failed to set state STOPPED");
+        return ret;
+    }
+    return ret;
+}
+
 static SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
+    .save_live_pending = vfio_save_pending,
+    .save_live_iterate = vfio_save_iterate,
+    .save_live_complete_precopy = vfio_save_complete_precopy,
 };
 
 /* ---------------------------------------------------------------------- */
-- 
2.7.0




* [Qemu-devel] [PATCH v4 09/13] vfio: Add load state functions to SaveVMHandlers
  2019-06-20 14:37 [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (7 preceding siblings ...)
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 08/13] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
@ 2019-06-20 14:37 ` Kirti Wankhede
  2019-06-28  9:18   ` Dr. David Alan Gilbert
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 10/13] vfio: Add function to get dirty page list Kirti Wankhede
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-20 14:37 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Kirti Wankhede, Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao,
	eskultet, ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, yulei.zhang, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

During the _RESUMING device state:
- If the vendor driver defines a mappable region, mmap the migration region.
- Load config state.
- For each data packet, until VFIO_MIG_FLAG_END_OF_STATE is reached
  (see the sketch below):
    - read data_size from the packet, then read a buffer of data_size bytes.
    - read data_offset, which tells QEMU where to write the data.
        If the region is mmaped, write data_size bytes of data to the
        mmaped region.
    - write data_size.
        For an mmaped region, the write to data_size tells the kernel
        driver that the data has been written to the staging buffer.
    - if the region is trapped, pwrite() data_size bytes of data at
      data_offset.
- Unmap the migration region.
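
A rough sketch of the per-packet write-back (stand-in names, not the
exact code in this patch; it assumes that when the data section is
mmaped, the mapping covers the region from offset 0):

#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Stand-in for the first two fields of vfio_device_migration_info. */
struct mig_info {
    uint64_t data_offset;   /* read: where the driver wants this packet */
    uint64_t data_size;     /* write: tells the driver how much was written */
};

/* Write one resume packet of 'size' bytes from 'buf' into the migration
 * region located at 'region_off' within 'fd'. */
static int resume_write_packet(int fd, off_t region_off, void *mmap_base,
                               const void *buf, uint64_t size)
{
    uint64_t data_offset;

    /* Ask the vendor driver where this packet must be written. */
    if (pread(fd, &data_offset, sizeof(data_offset),
              region_off + offsetof(struct mig_info, data_offset)) !=
        sizeof(data_offset)) {
        return -1;
    }

    if (mmap_base) {
        /* mmaped: copy into the staging buffer first... */
        memcpy((char *)mmap_base + data_offset, buf, size);
    }

    /* ...the write to data_size signals the driver that data is ready. */
    if (pwrite(fd, &size, sizeof(size),
               region_off + offsetof(struct mig_info, data_size)) !=
        sizeof(size)) {
        return -1;
    }

    if (!mmap_base) {
        /* trapped: pwrite the payload at data_offset */
        if (pwrite(fd, buf, size, region_off + data_offset) !=
            (ssize_t)size) {
            return -1;
        }
    }
    return 0;
}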

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 153 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 0a2f30872316..e4895f91761d 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -212,6 +212,22 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
     return qemu_file_get_error(f);
 }
 
+static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
+        vfio_pci_load_config(vbasedev, f);
+    }
+
+    if (qemu_get_be64(f) != VFIO_MIG_FLAG_END_OF_STATE) {
+        error_report("Wrong end of block while loading device config space");
+        return -EINVAL;
+    }
+
+    return qemu_file_get_error(f);
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -372,12 +388,149 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     return ret;
 }
 
+static int vfio_load_setup(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret = 0;
+
+    if (migration->region.buffer.mmaps) {
+        ret = vfio_region_mmap(&migration->region.buffer);
+        if (ret) {
+            error_report("Failed to mmap VFIO migration region %d: %s",
+                         migration->region.index, strerror(-ret));
+            return ret;
+        }
+    }
+
+    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING);
+    if (ret) {
+        error_report("Failed to set state RESUMING");
+    }
+    return ret;
+}
+
+static int vfio_load_cleanup(void *opaque)
+{
+    vfio_save_cleanup(opaque);
+    return 0;
+}
+
+static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret = 0;
+    uint64_t data, data_size;
+
+    data = qemu_get_be64(f);
+    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
+        switch (data) {
+        case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
+        {
+            ret = vfio_load_device_config_state(f, opaque);
+            if (ret) {
+                return ret;
+            }
+            break;
+        }
+        case VFIO_MIG_FLAG_DEV_SETUP_STATE:
+        {
+            data = qemu_get_be64(f);
+            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
+                return ret;
+            } else {
+                error_report("SETUP STATE: EOS not found 0x%lx", data);
+                return -EINVAL;
+            }
+            break;
+        }
+        case VFIO_MIG_FLAG_DEV_DATA_STATE:
+        {
+            VFIORegion *region = &migration->region.buffer;
+            void *buf = NULL;
+            bool buffer_mmaped = false;
+            uint64_t data_offset = 0;
+
+            data_size = qemu_get_be64(f);
+            if (data_size == 0) {
+                break;
+            }
+
+            ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
+                        region->fd_offset +
+                        offsetof(struct vfio_device_migration_info,
+                        data_offset));
+            if (ret != sizeof(data_offset)) {
+                error_report("Failed to get migration buffer data offset %d",
+                              ret);
+                return -EINVAL;
+            }
+
+            if (region->mmaps) {
+                int i;
+
+                for (i = 0; i < region->nr_mmaps; i++) {
+                    if (region->mmaps[i].mmap &&
+                        (data_offset >= region->mmaps[i].offset) &&
+                        (data_offset < region->mmaps[i].offset +
+                                       region->mmaps[i].size)) {
+                        buf = region->mmaps[i].mmap + (data_offset -
+                                                      region->mmaps[i].offset);
+                        buffer_mmaped = true;
+                        break;
+                    }
+                }
+            }
+
+            if (!buffer_mmaped) {
+                buf = g_malloc0(data_size);
+            }
+
+            qemu_get_buffer(f, buf, data_size);
+
+            ret = pwrite(vbasedev->fd, &data_size, sizeof(data_size),
+                         region->fd_offset +
+                       offsetof(struct vfio_device_migration_info, data_size));
+            if (ret != sizeof(data_size)) {
+                error_report("Failed to set migration buffer data size %d",
+                             ret);
+                return -EINVAL;
+            }
+
+            if (!buffer_mmaped) {
+                ret = pwrite(vbasedev->fd, buf, data_size,
+                             region->fd_offset + data_offset);
+                g_free(buf);
+
+                if (ret != data_size) {
+                    error_report("Failed to set migration buffer %d", ret);
+                    return -EINVAL;
+                }
+            }
+            break;
+        }
+        }
+
+        ret = qemu_file_get_error(f);
+        if (ret) {
+            return ret;
+        }
+        data = qemu_get_be64(f);
+    }
+
+    return ret;
+}
+
 static SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
     .save_live_pending = vfio_save_pending,
     .save_live_iterate = vfio_save_iterate,
     .save_live_complete_precopy = vfio_save_complete_precopy,
+    .load_setup = vfio_load_setup,
+    .load_cleanup = vfio_load_cleanup,
+    .load_state = vfio_load_state,
 };
 
 /* ---------------------------------------------------------------------- */
-- 
2.7.0




* [Qemu-devel] [PATCH v4 10/13] vfio: Add function to get dirty page list
  2019-06-20 14:37 [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (8 preceding siblings ...)
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 09/13] vfio: Add load " Kirti Wankhede
@ 2019-06-20 14:37 ` Kirti Wankhede
  2019-06-26  0:40   ` Yan Zhao
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 11/13] vfio: Add vfio_listerner_log_sync to mark dirty pages Kirti Wankhede
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-20 14:37 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Kirti Wankhede, Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao,
	eskultet, ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, yulei.zhang, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Dirty page tracking (.log_sync) is part of the RAM copying state: the
vendor driver provides, through the migration region, a bitmap of the
pages it has dirtied, and as part of the RAM copy those pages get
copied to the file stream.

To get the dirty page bitmap:
- write the start pfn, page_size and pfn count.
- read the count of pfns copied.
    - The vendor driver should return 0 if it has no dirty pages to
      report in the given range.
    - The vendor driver should return -1 to mark all pages in the given
      range dirty.
- read data_offset, where the vendor driver has written the bitmap.
- read the bitmap from the region or from the mmaped part of the region.
  This copy is iterated until the page bitmap for all requested pfns has
  been copied; see the sketch below.
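
A rough sketch of one round of that query (stand-in names; it assumes a
trapped, non-mmaped data section, and elides error checks for brevity):

#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define BITS_PER_LONG    (8 * sizeof(unsigned long))
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Stand-in for the dirty-tracking fields of vfio_device_migration_info. */
struct dirty_query {
    uint64_t data_offset;   /* read: where the bitmap was written */
    uint64_t start_pfn;     /* write */
    uint64_t page_size;     /* write */
    uint64_t total_pfns;    /* write */
    int64_t  copied_pfns;   /* read: 0 = none dirty, -1 = all dirty */
};

/* One query round: returns the number of pfns whose bitmap was read
 * into 'bitmap', 0 if nothing is dirty, or -1 if the whole range must
 * be treated as dirty.  The caller repeats this, advancing start_pfn
 * by the return value, until total_pfns have been covered. */
static int64_t query_dirty_bitmap(int fd, off_t region_off,
                                  uint64_t start_pfn, uint64_t page_size,
                                  uint64_t total_pfns, unsigned long *bitmap)
{
    uint64_t data_offset;
    int64_t copied;

    pwrite(fd, &start_pfn, sizeof(start_pfn),
           region_off + offsetof(struct dirty_query, start_pfn));
    pwrite(fd, &page_size, sizeof(page_size),
           region_off + offsetof(struct dirty_query, page_size));
    pwrite(fd, &total_pfns, sizeof(total_pfns),
           region_off + offsetof(struct dirty_query, total_pfns));

    /* Reading copied_pfns asks the driver how much bitmap is ready. */
    pread(fd, &copied, sizeof(copied),
          region_off + offsetof(struct dirty_query, copied_pfns));
    if (copied <= 0) {
        return copied;
    }

    /* The bitmap itself is exposed at data_offset within the region. */
    pread(fd, &data_offset, sizeof(data_offset),
          region_off + offsetof(struct dirty_query, data_offset));
    pread(fd, bitmap, BITS_TO_LONGS((uint64_t)copied) * sizeof(unsigned long),
          region_off + data_offset);
    return copied;
}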

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c           | 119 ++++++++++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h |   2 +
 2 files changed, 121 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index e4895f91761d..68775b5dec11 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -228,6 +228,125 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
     return qemu_file_get_error(f);
 }
 
+void vfio_get_dirty_page_list(VFIODevice *vbasedev,
+                              uint64_t start_pfn,
+                              uint64_t pfn_count,
+                              uint64_t page_size)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region.buffer;
+    uint64_t count = 0;
+    int64_t copied_pfns = 0;
+    int ret;
+
+    qemu_mutex_lock(&migration->lock);
+    ret = pwrite(vbasedev->fd, &start_pfn, sizeof(start_pfn),
+                 region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                              start_pfn));
+    if (ret < 0) {
+        error_report("Failed to set dirty pages start address %d %s",
+                ret, strerror(errno));
+        goto dpl_unlock;
+    }
+
+    ret = pwrite(vbasedev->fd, &page_size, sizeof(page_size),
+                 region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                              page_size));
+    if (ret < 0) {
+        error_report("Failed to set dirty page size %d %s",
+                ret, strerror(errno));
+        goto dpl_unlock;
+    }
+
+    ret = pwrite(vbasedev->fd, &pfn_count, sizeof(pfn_count),
+                 region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                              total_pfns));
+    if (ret < 0) {
+        error_report("Failed to set dirty page total pfns %d %s",
+                ret, strerror(errno));
+        goto dpl_unlock;
+    }
+
+    do {
+        uint64_t bitmap_size, data_offset = 0;
+        void *buf = NULL;
+        bool buffer_mmaped = false;
+
+        /* Read copied dirty pfns */
+        ret = pread(vbasedev->fd, &copied_pfns, sizeof(copied_pfns),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             copied_pfns));
+        if (ret < 0) {
+            error_report("Failed to get dirty pages bitmap count %d %s",
+                    ret, strerror(errno));
+            goto dpl_unlock;
+        }
+
+        if (copied_pfns == 0) {
+            /*
+             * copied_pfns could be 0 if driver doesn't have any page to
+             * report dirty in given range
+             */
+            break;
+        } else if (copied_pfns == -1) {
+            /* Mark all pages dirty for this range */
+            cpu_physical_memory_set_dirty_range(start_pfn * page_size,
+                                                pfn_count * page_size,
+                                                DIRTY_MEMORY_MIGRATION);
+            break;
+        }
+
+        bitmap_size = (BITS_TO_LONGS(copied_pfns) + 1) * sizeof(unsigned long);
+
+        ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             data_offset));
+        if (ret != sizeof(data_offset)) {
+            error_report("Failed to get migration buffer data offset %d",
+                         ret);
+            goto dpl_unlock;
+        }
+
+        if (region->mmaps) {
+            int i;
+            for (i = 0; i < region->nr_mmaps; i++) {
+                if ((region->mmaps[i].offset >= data_offset) &&
+                    (data_offset < region->mmaps[i].offset +
+                                   region->mmaps[i].size)) {
+                    buf = region->mmaps[i].mmap + (data_offset -
+                                                   region->mmaps[i].offset);
+                    buffer_mmaped = true;
+                    break;
+                }
+            }
+        }
+
+        if (!buffer_mmaped) {
+            buf = g_malloc0(bitmap_size);
+
+            ret = pread(vbasedev->fd, buf, bitmap_size,
+                        region->fd_offset + data_offset);
+            if (ret != bitmap_size) {
+                error_report("Failed to get dirty pages bitmap %d", ret);
+                g_free(buf);
+                goto dpl_unlock;
+            }
+        }
+
+        cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
+                                               (start_pfn + count) * page_size,
+                                                copied_pfns);
+        count +=  copied_pfns;
+
+        if (!buffer_mmaped) {
+            g_free(buf);
+        }
+    } while (count < pfn_count);
+
+dpl_unlock:
+    qemu_mutex_unlock(&migration->lock);
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 1d26e6be8d48..423d6dbccace 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -224,5 +224,7 @@ int vfio_spapr_remove_window(VFIOContainer *container,
 
 int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
 void vfio_migration_finalize(VFIODevice *vbasedev);
+void vfio_get_dirty_page_list(VFIODevice *vbasedev, uint64_t start_pfn,
+                               uint64_t pfn_count, uint64_t page_size);
 
 #endif /* HW_VFIO_VFIO_COMMON_H */
-- 
2.7.0




* [Qemu-devel] [PATCH v4 11/13] vfio: Add vfio_listerner_log_sync to mark dirty pages
  2019-06-20 14:37 [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (9 preceding siblings ...)
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 10/13] vfio: Add function to get dirty page list Kirti Wankhede
@ 2019-06-20 14:37 ` Kirti Wankhede
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 12/13] vfio: Make vfio-pci device migration capable Kirti Wankhede
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-20 14:37 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Kirti Wankhede, Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao,
	eskultet, ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, yulei.zhang, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

vfio_listerner_log_sync gets the list of dirty pages from the vendor
driver and marks those pages dirty when the device is in _SAVING state.
It returns early for the RAM block section of a mapped MMIO region.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/common.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index de74dae8d6a6..d5ee35c95e76 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -36,6 +36,7 @@
 #include "sysemu/kvm.h"
 #include "trace.h"
 #include "qapi/error.h"
+#include "migration/migration.h"
 
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -794,9 +795,43 @@ static void vfio_listener_region_del(MemoryListener *listener,
     }
 }
 
+static void vfio_listerner_log_sync(MemoryListener *listener,
+                                    MemoryRegionSection *section)
+{
+    uint64_t start_addr, size, pfn_count;
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+
+    if (memory_region_is_ram_device(section->mr)) {
+        return;
+    }
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            if (vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) {
+                continue;
+            } else {
+                return;
+            }
+        }
+    }
+
+    start_addr = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+    size = int128_get64(section->size);
+    pfn_count = size >> TARGET_PAGE_BITS;
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            vfio_get_dirty_page_list(vbasedev, start_addr >> TARGET_PAGE_BITS,
+                                     pfn_count, TARGET_PAGE_SIZE);
+        }
+    }
+}
+
 static const MemoryListener vfio_memory_listener = {
     .region_add = vfio_listener_region_add,
     .region_del = vfio_listener_region_del,
+    .log_sync = vfio_listerner_log_sync,
 };
 
 static void vfio_listener_release(VFIOContainer *container)
-- 
2.7.0




* [Qemu-devel] [PATCH v4 12/13] vfio: Make vfio-pci device migration capable.
  2019-06-20 14:37 [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (10 preceding siblings ...)
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 11/13] vfio: Add vfio_listerner_log_sync to mark dirty pages Kirti Wankhede
@ 2019-06-20 14:37 ` Kirti Wankhede
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 13/13] vfio: Add trace events in migration code path Kirti Wankhede
  2019-06-21  0:25 ` [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device Yan Zhao
  13 siblings, 0 replies; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-20 14:37 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Kirti Wankhede, Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao,
	eskultet, ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, yulei.zhang, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Call vfio_migration_probe() and vfio_migration_finalize() for the
vfio-pci device to enable its migration support.
Remove the vfio_pci_vmstate structure, which marked the device
unmigratable.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/pci.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 09a0821a5b1c..b230b0ab9282 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2839,6 +2839,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     vdev->vbasedev.ops = &vfio_pci_ops;
     vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
     vdev->vbasedev.dev = DEVICE(vdev);
+    vdev->vbasedev.device_state = VFIO_DEVICE_STATE_STOPPED;
 
     tmp = g_strdup_printf("%s/iommu_group", vdev->vbasedev.sysfsdev);
     len = readlink(tmp, group_path, sizeof(group_path));
@@ -3099,6 +3100,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         }
     }
 
+    ret = vfio_migration_probe(&vdev->vbasedev, errp);
+    if (ret) {
+            error_report("Failed to setup migration region");
+    }
+
     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);
@@ -3118,6 +3124,7 @@ static void vfio_instance_finalize(Object *obj)
     VFIOPCIDevice *vdev = PCI_VFIO(obj);
     VFIOGroup *group = vdev->vbasedev.group;
 
+    vdev->vbasedev.device_state = VFIO_DEVICE_STATE_STOPPED;
     vfio_display_finalize(vdev);
     vfio_bars_finalize(vdev);
     g_free(vdev->emulated_config_bits);
@@ -3146,6 +3153,7 @@ static void vfio_exitfn(PCIDevice *pdev)
     }
     vfio_teardown_msi(vdev);
     vfio_bars_exit(vdev);
+    vfio_migration_finalize(&vdev->vbasedev);
 }
 
 static void vfio_pci_reset(DeviceState *dev)
@@ -3254,11 +3262,6 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
-static const VMStateDescription vfio_pci_vmstate = {
-    .name = "vfio-pci",
-    .unmigratable = 1,
-};
-
 static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
@@ -3266,7 +3269,6 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 
     dc->reset = vfio_pci_reset;
     dc->props = vfio_pci_dev_properties;
-    dc->vmsd = &vfio_pci_vmstate;
     dc->desc = "VFIO-based PCI device assignment";
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
     pdc->realize = vfio_realize;
-- 
2.7.0




* [Qemu-devel] [PATCH v4 13/13] vfio: Add trace events in migration code path
  2019-06-20 14:37 [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (11 preceding siblings ...)
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 12/13] vfio: Make vfio-pci device migration capable Kirti Wankhede
@ 2019-06-20 14:37 ` Kirti Wankhede
  2019-06-20 18:50   ` Dr. David Alan Gilbert
  2019-06-21  0:25 ` [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device Yan Zhao
  13 siblings, 1 reply; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-20 14:37 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Kirti Wankhede, Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao,
	eskultet, ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, yulei.zhang, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c  | 26 ++++++++++++++++++++++++++
 hw/vfio/trace-events | 18 ++++++++++++++++++
 2 files changed, 44 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 68775b5dec11..70c03f1a969f 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -21,6 +21,7 @@
 #include "exec/ramlist.h"
 #include "exec/ram_addr.h"
 #include "pci.h"
+#include "trace.h"
 
 /*
  * Flags used as delimiter:
@@ -104,6 +105,7 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
     }
 
     vbasedev->device_state = state;
+    trace_vfio_migration_set_state(vbasedev->name, state);
     return 0;
 }
 
@@ -173,6 +175,8 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
         qemu_put_be64(f, data_size);
     }
 
+    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
+                           migration->pending_bytes);
     ret = qemu_file_get_error(f);
 
     return data_size;
@@ -195,6 +199,7 @@ static int vfio_update_pending(VFIODevice *vbasedev)
     }
 
     migration->pending_bytes = pending_bytes;
+    trace_vfio_update_pending(vbasedev->name, pending_bytes);
     return 0;
 }
 
@@ -209,6 +214,8 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
     }
     qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
 
+    trace_vfio_save_device_config_state(vbasedev->name);
+
     return qemu_file_get_error(f);
 }
 
@@ -225,6 +232,7 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
         return -EINVAL;
     }
 
+    trace_vfio_load_device_config_state(vbasedev->name);
     return qemu_file_get_error(f);
 }
 
@@ -343,6 +351,9 @@ void vfio_get_dirty_page_list(VFIODevice *vbasedev,
         }
     } while (count < pfn_count);
 
+    trace_vfio_get_dirty_page_list(vbasedev->name, start_pfn, pfn_count,
+                                   page_size);
+
 dpl_unlock:
     qemu_mutex_unlock(&migration->lock);
 }
@@ -390,6 +401,7 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
         return ret;
     }
 
+    trace_vfio_save_setup(vbasedev->name);
     return 0;
 }
 
@@ -401,6 +413,7 @@ static void vfio_save_cleanup(void *opaque)
     if (migration->region.buffer.mmaps) {
         vfio_region_unmap(&migration->region.buffer);
     }
+    trace_vfio_cleanup(vbasedev->name);
 }
 
 static void vfio_save_pending(QEMUFile *f, void *opaque,
@@ -424,6 +437,7 @@ static void vfio_save_pending(QEMUFile *f, void *opaque,
         *res_postcopy_only += migration->pending_bytes;
     }
     *res_compatible += 0;
+    trace_vfio_save_pending(vbasedev->name);
 }
 
 static int vfio_save_iterate(QEMUFile *f, void *opaque)
@@ -451,6 +465,7 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
         return ret;
     }
 
+    trace_vfio_save_iterate(vbasedev->name);
     return ret;
 }
 
@@ -504,6 +519,8 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
         error_report("Failed to set state STOPPED");
         return ret;
     }
+
+    trace_vfio_save_complete_precopy(vbasedev->name);
     return ret;
 }
 
@@ -544,6 +561,9 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
 
     data = qemu_get_be64(f);
     while (data != VFIO_MIG_FLAG_END_OF_STATE) {
+
+        trace_vfio_load_state(vbasedev->name, data);
+
         switch (data) {
         case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
         {
@@ -627,6 +647,8 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
                     return -EINVAL;
                 }
             }
+            trace_vfio_load_state_device_data(vbasedev->name, data_offset,
+                                              data_size);
             break;
         }
         }
@@ -668,6 +690,7 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state)
     }
 
     vbasedev->vm_running = running;
+    trace_vfio_vmstate_change(vbasedev->name, running);
 }
 
 static void vfio_migration_state_notifier(Notifier *notifier, void *data)
@@ -676,6 +699,8 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
     VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
     int ret;
 
+    trace_vfio_migration_state_notifier(vbasedev->name, s->state);
+
     switch (s->state) {
     case MIGRATION_STATUS_ACTIVE:
         if (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING) {
@@ -758,6 +783,7 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
             return ret;
         }
     } else {
+        trace_vfio_migration_probe(vbasedev->name, info->index);
         return vfio_migration_init(vbasedev, info);
     }
 
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 8cdc27946cb8..b1f19ae7a806 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -143,3 +143,21 @@ vfio_display_edid_link_up(void) ""
 vfio_display_edid_link_down(void) ""
 vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
 vfio_display_edid_write_error(void) ""
+
+# migration.c
+vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
+vfio_save_buffer(char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s), Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
+vfio_update_pending(char *name, uint64_t pending) " (%s), pending 0x%"PRIx64
+vfio_save_device_config_state(char *name) " (%s)"
+vfio_load_device_config_state(char *name) " (%s)"
+vfio_get_dirty_page_list(char *name, uint64_t start, uint64_t pfn_count, uint64_t page_size) " (%s) start 0x%"PRIx64" pfn_count 0x%"PRIx64 " page size 0x%"PRIx64
+vfio_save_setup(char *name) " (%s)"
+vfio_cleanup(char *name) " (%s)"
+vfio_save_pending(char *name) " (%s)"
+vfio_save_iterate(char *name) " (%s)"
+vfio_save_complete_precopy(char *name) " (%s)"
+vfio_load_state(char *name, uint64_t data) " (%s) data 0x%"PRIx64
+vfio_load_state_device_data(char *name, uint64_t data_offset, uint64_t data_size) " (%s), Offset 0x%"PRIx64" size 0x%"PRIx64
+vfio_vmstate_change(char *name, int running) " (%s) running %d"
+vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
+vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
-- 
2.7.0




* Re: [Qemu-devel] [PATCH v4 01/13] vfio: KABI for migration interface
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 01/13] vfio: KABI for migration interface Kirti Wankhede
@ 2019-06-20 17:18   ` Alex Williamson
  2019-06-21  5:52     ` Kirti Wankhede
  0 siblings, 1 reply; 64+ messages in thread
From: Alex Williamson @ 2019-06-20 17:18 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, yulei.zhang, eauger, felipe,
	jonathan.davies, yan.y.zhao, changpeng.liu, Ken.Xue

On Thu, 20 Jun 2019 20:07:29 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> - Defined MIGRATION region type and sub-type.
> - Used 3 bits to define VFIO device states.
>     Bit 0 => _RUNNING
>     Bit 1 => _SAVING
>     Bit 2 => _RESUMING
>     Combination of these bits defines VFIO device's state during migration
>     _STOPPED => All bits 0 indicates VFIO device stopped.
>     _RUNNING => Normal VFIO device running state.
>     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but start
>                           saving state of device i.e. pre-copy state
>     _SAVING  => vCPUs are stoppped, VFIO device should be stopped, and
>                           save device state,i.e. stop-n-copy state
>     _RESUMING => VFIO device resuming state.
>     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING bits are set
> - Defined vfio_device_migration_info structure which will be placed at 0th
>   offset of migration region to get/set VFIO device related information.
>   Defined members of structure and usage on read/write access:
>     * device_state: (read/write)
>         To convey VFIO device state to be transitioned to. Only 3 bits are used
>         as of now.
>     * pending bytes: (read only)
>         To get pending bytes yet to be migrated for VFIO device.
>     * data_offset: (read only)
>         To get data offset in migration from where data exist during _SAVING
>         and from where data should be written by user space application during
>          _RESUMING state
>     * data_size: (read/write)
>         To get and set size of data copied in migration region during _SAVING
>         and _RESUMING state.
>     * start_pfn, page_size, total_pfns: (write only)
>         To get bitmap of dirty pages from vendor driver from given
>         start address for total_pfns.
>     * copied_pfns: (read only)
>         To get number of pfns bitmap copied in migration region.
>         Vendor driver should copy the bitmap with bits set only for
>         pages to be marked dirty in migration region. Vendor driver
>         should return 0 if there are 0 pages dirty in requested
>         range. Vendor driver should return -1 to mark all pages in the section
>         as dirty
> 
> Migration region looks like:
>  ------------------------------------------------------------------
> |vfio_device_migration_info|    data section                      |
> |                          |     ///////////////////////////////  |
>  ------------------------------------------------------------------
>  ^                              ^                              ^
>  offset 0-trapped part        data_offset                 data_size
> 
> Data section is always followed by vfio_device_migration_info
> structure in the region, so data_offset will always be none-0.
> Offset from where data is copied is decided by kernel driver, data
> section can be trapped or mapped depending on how kernel driver
> defines data section. If mmapped, then data_offset should be page
> aligned, where as initial section which contain
> vfio_device_migration_info structure might not end at offset which
> is page aligned.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  linux-headers/linux/vfio.h | 71 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 71 insertions(+)
> 
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 24f505199f83..274ec477eb82 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -372,6 +372,77 @@ struct vfio_region_gfx_edid {
>   */
>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
>  
> +/* Migration region type and sub-type */
> +#define VFIO_REGION_TYPE_MIGRATION	        (2)
> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> +
> +/**
> + * Structure vfio_device_migration_info is placed at 0th offset of
> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> + * information. Field accesses from this structure are only supported at their
> + * native width and alignment, otherwise should return error.
> + *
> + * device_state: (read/write)
> + *      To indicate vendor driver the state VFIO device should be transitioned
> + *      to. If device state transition fails, write to this field return error.
> + *      It consists of 3 bits:
> + *      - If bit 0 set, indicates _RUNNING state. When its reset, that indicates
> + *        _STOPPED state. When device is changed to _STOPPED, driver should stop
> + *        device before write returns.
> + *      - If bit 1 set, indicates _SAVING state.
> + *      - If bit 2 set, indicates _RESUMING state.
> + *
> + * pending bytes: (read only)
> + *      Read pending bytes yet to be migrated from vendor driver
> + *
> + * data_offset: (read only)
> + *      User application should read data_offset in migration region from where
> + *      user application should read data during _SAVING state or write data
> + *      during _RESUMING state.
> + *
> + * data_size: (read/write)
> + *      User application should read data_size to know data copied in migration
> + *      region during _SAVING state and write size of data copied in migration
> + *      region during _RESUMING state.
> + *
> + * start_pfn: (write only)
> + *      Start address pfn to get bitmap of dirty pages from vendor driver duing
> + *      _SAVING state.
> + *
> + * page_size: (write only)
> + *      User application should write the page_size of pfn.
> + *
> + * total_pfns: (write only)
> + *      Total pfn count from start_pfn for which dirty bitmap is requested.
> + *
> + * copied_pfns: (read only)
> + *      pfn count for which dirty bitmap is copied to migration region.
> + *      Vendor driver should copy the bitmap with bits set only for pages to be
> + *      marked dirty in migration region.
> + *      Vendor driver should return 0 if there are 0 pages dirty in requested
> + *      range.
> + *      Vendor driver should return -1 to mark all pages in the section as
> + *      dirty.

Is the protocol that the user writes start_pfn/page_size/total_pfns in
any order and then the read of copied_pfns is what triggers the
snapshot?  Are start_pfn/page_size/total_pfns sticky such that a user
can write them once and get repeated refreshes of the dirty bitmap by
re-reading copied_pfns?  What's the advantage to returning -1 versus
returning copied_pfns == total_pfns?

If the user then wants to switch back to reading device migration
state, is it a read of data_size that switches the data area back to
making that address space available?  In each case, is it the user's
responsibility to consume all the data provided before triggering the
next data area?  For example, if I ask for a range of dirty bitmap, the
vendor driver will provide that range and and clear it, such that the
pages are considered clean regardless of whether the user consumed the
data area.  Likewise if the user asks for data_size, that would be
deducted from pending_bytes regardless of the user reading the data
area.  Are there any read side-effects to pending_bytes?  Are there
read side-effects to the data area on SAVING?  Are there write
side-effects on RESUMING, or is it only the write of data_size that
triggers the buffer to be consumed?  Is it the user's responsibility to
write only full "packets" on RESUMING?  For example if the SAVING side
provides data_size X, that full data_size X must be written to the
RESUMING side, the user cannot write half of it to the data area on the
RESUMING side, write data_size with X/2, write the second half, and
again write X/2.  IOW, the data_size "packet" is indivisible at the
point of resuming.

What are the ordering requirements?  Must the user write data_size
packets in the same order that they're read, or is it the vendor
driver's responsibility to include sequence information and allow
restore in any order?

> + */
> +
> +struct vfio_device_migration_info {
> +        __u32 device_state;         /* VFIO device state */
> +#define VFIO_DEVICE_STATE_STOPPED   (0)

We need to be careful with how this is used if we want to leave the
possibility of using the remaining 29 bits of this register.  Maybe we
want to define VFIO_DEVICE_STATE_MASK and be sure that we only do
read-modify-write ops within the mask (ex. set_bit and clear_bit
helpers).  Also, above we define STOPPED to indicate simply
not-RUNNING, but here it seems STOPPED means not-RUNNING, not-SAVING,
and not-RESUMING.
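
For illustration, something along these lines (hypothetical names, not
part of the proposed header):

#include <stdint.h>

/* Hypothetical: covers only the defined state bits, so read-modify-write
 * never touches the remaining 29 bits of the register. */
#define VFIO_DEVICE_STATE_MASK  (VFIO_DEVICE_STATE_RUNNING | \
                                 VFIO_DEVICE_STATE_SAVING | \
                                 VFIO_DEVICE_STATE_RESUMING)

static inline void vfio_device_state_set(uint32_t *device_state,
                                         uint32_t bits)
{
    *device_state |= (bits & VFIO_DEVICE_STATE_MASK);
}

static inline void vfio_device_state_clear(uint32_t *device_state,
                                           uint32_t bits)
{
    *device_state &= ~(bits & VFIO_DEVICE_STATE_MASK);
}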

> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> +#define VFIO_DEVICE_STATE_INVALID   (VFIO_DEVICE_STATE_SAVING | \
> +                                     VFIO_DEVICE_STATE_RESUMING)
> +        __u32 reserved;
> +        __u64 pending_bytes;
> +        __u64 data_offset;

Placing the data more than 4GB into the region seems a bit absurd, so
this could probably be a __u32 and take the place of the reserved field.

> +        __u64 data_size;
> +        __u64 start_pfn;
> +        __u64 page_size;
> +        __u64 total_pfns;
> +        __s64 copied_pfns;

If this is signed so that we can get -1 then the user could
theoretically specify total_pfns that we can't represent in
copied_pfns.  Probably best to use unsigned and specify ~0 rather than
-1.
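
Taken together with the data_offset comment above, the layout might then
look like this (a sketch only, not an agreed interface):

struct vfio_device_migration_info {
        __u32 device_state;         /* VFIO device state */
        __u32 data_offset;          /* was __u64; takes the reserved slot */
        __u64 pending_bytes;
        __u64 data_size;
        __u64 start_pfn;
        __u64 page_size;
        __u64 total_pfns;
        __u64 copied_pfns;          /* unsigned; ~0 means "all dirty" */
} __attribute__((packed));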

Overall this looks like a good interface, but we need to more
thoroughly define the protocol with the data area and set expectations
we're placing on the user and vendor driver.  There should be no usage
assumptions, it should all be spelled out.  Thanks,

Alex

> +} __attribute__((packed));
> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within




* Re: [Qemu-devel] [PATCH v4 13/13] vfio: Add trace events in migration code path
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 13/13] vfio: Add trace events in migration code path Kirti Wankhede
@ 2019-06-20 18:50   ` Dr. David Alan Gilbert
  2019-06-21  5:54     ` Kirti Wankhede
  0 siblings, 1 reply; 64+ messages in thread
From: Dr. David Alan Gilbert @ 2019-06-20 18:50 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	yulei.zhang, cohuck, shuangtai.tst, qemu-devel, zhi.a.wang,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, yan.y.zhao, changpeng.liu, Ken.Xue

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>

Thanks, adding traces really helps; however, it might be easier
if you just add them in your previous patches where you're
adding the functions.

Dave

> ---
>  hw/vfio/migration.c  | 26 ++++++++++++++++++++++++++
>  hw/vfio/trace-events | 18 ++++++++++++++++++
>  2 files changed, 44 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 68775b5dec11..70c03f1a969f 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -21,6 +21,7 @@
>  #include "exec/ramlist.h"
>  #include "exec/ram_addr.h"
>  #include "pci.h"
> +#include "trace.h"
>  
>  /*
>   * Flags used as delimiter:
> @@ -104,6 +105,7 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
>      }
>  
>      vbasedev->device_state = state;
> +    trace_vfio_migration_set_state(vbasedev->name, state);
>      return 0;
>  }
>  
> @@ -173,6 +175,8 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
>          qemu_put_be64(f, data_size);
>      }
>  
> +    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
> +                           migration->pending_bytes);
>      ret = qemu_file_get_error(f);
>  
>      return data_size;
> @@ -195,6 +199,7 @@ static int vfio_update_pending(VFIODevice *vbasedev)
>      }
>  
>      migration->pending_bytes = pending_bytes;
> +    trace_vfio_update_pending(vbasedev->name, pending_bytes);
>      return 0;
>  }
>  
> @@ -209,6 +214,8 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
>      }
>      qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>  
> +    trace_vfio_save_device_config_state(vbasedev->name);
> +
>      return qemu_file_get_error(f);
>  }
>  
> @@ -225,6 +232,7 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>          return -EINVAL;
>      }
>  
> +    trace_vfio_load_device_config_state(vbasedev->name);
>      return qemu_file_get_error(f);
>  }
>  
> @@ -343,6 +351,9 @@ void vfio_get_dirty_page_list(VFIODevice *vbasedev,
>          }
>      } while (count < pfn_count);
>  
> +    trace_vfio_get_dirty_page_list(vbasedev->name, start_pfn, pfn_count,
> +                                   page_size);
> +
>  dpl_unlock:
>      qemu_mutex_unlock(&migration->lock);
>  }
> @@ -390,6 +401,7 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>          return ret;
>      }
>  
> +    trace_vfio_save_setup(vbasedev->name);
>      return 0;
>  }
>  
> @@ -401,6 +413,7 @@ static void vfio_save_cleanup(void *opaque)
>      if (migration->region.buffer.mmaps) {
>          vfio_region_unmap(&migration->region.buffer);
>      }
> +    trace_vfio_cleanup(vbasedev->name);
>  }
>  
>  static void vfio_save_pending(QEMUFile *f, void *opaque,
> @@ -424,6 +437,7 @@ static void vfio_save_pending(QEMUFile *f, void *opaque,
>          *res_postcopy_only += migration->pending_bytes;
>      }
>      *res_compatible += 0;
> +    trace_vfio_save_pending(vbasedev->name);
>  }
>  
>  static int vfio_save_iterate(QEMUFile *f, void *opaque)
> @@ -451,6 +465,7 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>          return ret;
>      }
>  
> +    trace_vfio_save_iterate(vbasedev->name);
>      return ret;
>  }
>  
> @@ -504,6 +519,8 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>          error_report("Failed to set state STOPPED");
>          return ret;
>      }
> +
> +    trace_vfio_save_complete_precopy(vbasedev->name);
>      return ret;
>  }
>  
> @@ -544,6 +561,9 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>  
>      data = qemu_get_be64(f);
>      while (data != VFIO_MIG_FLAG_END_OF_STATE) {
> +
> +        trace_vfio_load_state(vbasedev->name, data);
> +
>          switch (data) {
>          case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
>          {
> @@ -627,6 +647,8 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>                      return -EINVAL;
>                  }
>              }
> +            trace_vfio_load_state_device_data(vbasedev->name, data_offset,
> +                                              data_size);
>              break;
>          }
>          }
> @@ -668,6 +690,7 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state)
>      }
>  
>      vbasedev->vm_running = running;
> +    trace_vfio_vmstate_change(vbasedev->name, running);
>  }
>  
>  static void vfio_migration_state_notifier(Notifier *notifier, void *data)
> @@ -676,6 +699,8 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>      VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
>      int ret;
>  
> +    trace_vfio_migration_state_notifier(vbasedev->name, s->state);
> +
>      switch (s->state) {
>      case MIGRATION_STATUS_ACTIVE:
>          if (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING) {
> @@ -758,6 +783,7 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>              return ret;
>          }
>      } else {
> +        trace_vfio_migration_probe(vbasedev->name, info->index);
>          return vfio_migration_init(vbasedev, info);
>      }
>  
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 8cdc27946cb8..b1f19ae7a806 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -143,3 +143,21 @@ vfio_display_edid_link_up(void) ""
>  vfio_display_edid_link_down(void) ""
>  vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
>  vfio_display_edid_write_error(void) ""
> +
> +# migration.c
> +vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
> +vfio_save_buffer(char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s), Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
> +vfio_update_pending(char *name, uint64_t pending) " (%s), pending 0x%"PRIx64
> +vfio_save_device_config_state(char *name) " (%s)"
> +vfio_load_device_config_state(char *name) " (%s)"
> +vfio_get_dirty_page_list(char *name, uint64_t start, uint64_t pfn_count, uint64_t page_size) " (%s) start 0x%"PRIx64" pfn_count 0x%"PRIx64 " page size 0x%"PRIx64
> +vfio_save_setup(char *name) " (%s)"
> +vfio_cleanup(char *name) " (%s)"
> +vfio_save_pending(char *name) " (%s)"
> +vfio_save_iterate(char *name) " (%s)"
> +vfio_save_complete_precopy(char *name) " (%s)"
> +vfio_load_state(char *name, uint64_t data) " (%s) data 0x%"PRIx64
> +vfio_load_state_device_data(char *name, uint64_t data_offset, uint64_t data_size) " (%s), Offset 0x%"PRIx64" size 0x%"PRIx64
> +vfio_vmstate_change(char *name, int running) " (%s) running %d"
> +vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
> +vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Qemu-devel] [PATCH v4 08/13] vfio: Add save state functions to SaveVMHandlers
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 08/13] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
@ 2019-06-20 19:25   ` Alex Williamson
  2019-06-21  6:38     ` Kirti Wankhede
  2019-06-21  0:31   ` Yan Zhao
  2019-06-28  9:09   ` Dr. David Alan Gilbert
  2 siblings, 1 reply; 64+ messages in thread
From: Alex Williamson @ 2019-06-20 19:25 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, yulei.zhang, eauger, felipe,
	jonathan.davies, yan.y.zhao, changpeng.liu, Ken.Xue

On Thu, 20 Jun 2019 20:07:36 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
> functions. These functions handles pre-copy and stop-and-copy phase.
> 
> In _SAVING|_RUNNING device state or pre-copy phase:
> - read pending_bytes
> - read data_offset - indicates kernel driver to write data to staging
>   buffer which is mmapped.

Why is data_offset the trigger rather than data_size?  It seems that
data_offset can't really change dynamically since it might be mmap'd,
so it seems unnatural to bother re-reading it.

> - read data_size - amount of data in bytes written by vendor driver in migration
>   region.
> - if data section is trapped, pread() number of bytes in data_size, from
>   data_offset.
> - if data section is mmaped, read mmaped buffer of size data_size.
> - Write data packet to file stream as below:
> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> VFIO_MIG_FLAG_END_OF_STATE }
> 
> In _SAVING device state or stop-and-copy phase
> a. read config space of device and save to migration file stream. This
>    doesn't need to be from vendor driver. Any other special config state
>    from driver can be saved as data in following iteration.
> b. read pending_bytes - indicates kernel driver to write data to staging
>    buffer which is mmapped.

Is it pending_bytes or data_offset that triggers the write out of
data?  Why pending_bytes vs data_size?  I was interpreting
pending_bytes as the total data size while data_size is the size
available to read now, so assumed data_size would be more closely
aligned to making the data available.

> c. read data_size - amount of data in bytes written by vendor driver in
>    migration region.
> d. if data section is trapped, pread() from data_offset of size data_size.
> e. if data section is mmaped, read mmaped buffer of size data_size.

Should this read as "pread() from data_offset of data_size, or
optionally if mmap is supported on the data area, read data_size from
start of mapped buffer"?  IOW, pread should always work.  Same in
previous section.

> f. Write data packet as below:
>    {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> g. iterate through steps b to f until (pending_bytes > 0)

s/until/while/

> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
> 
> .save_live_iterate runs outside the iothread lock in the migration case, which
> could race with asynchronous call to get dirty page list causing data corruption
> in mapped migration region. Mutex added here to serial migration buffer read
> operation.

Would we be ahead to use different offsets within the region for device
data vs dirty bitmap to avoid this?
 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c | 212 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 212 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index fe0887c27664..0a2f30872316 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -107,6 +107,111 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
>      return 0;
>  }
>  
> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    uint64_t data_offset = 0, data_size = 0;
> +    int ret;
> +
> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data_offset));
> +    if (ret != sizeof(data_offset)) {
> +        error_report("Failed to get migration buffer data offset %d",
> +                     ret);
> +        return -EINVAL;
> +    }
> +
> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data_size));
> +    if (ret != sizeof(data_size)) {
> +        error_report("Failed to get migration buffer data size %d",
> +                     ret);
> +        return -EINVAL;
> +    }
> +
> +    if (data_size > 0) {
> +        void *buf = NULL;
> +        bool buffer_mmaped = false;
> +
> +        if (region->mmaps) {
> +            int i;
> +
> +            for (i = 0; i < region->nr_mmaps; i++) {
> +                if ((data_offset >= region->mmaps[i].offset) &&
> +                    (data_offset < region->mmaps[i].offset +
> +                                   region->mmaps[i].size)) {
> +                    buf = region->mmaps[i].mmap + (data_offset -
> +                                                   region->mmaps[i].offset);

So you're expecting that data_offset is somewhere within the data
area.  Why doesn't the data always simply start at the beginning of the
data area?  ie. data_offset would coincide with the beginning of the
mmap'able area (if supported) and be static.  Does this enable some
functionality in the vendor driver?  Does resume data need to be
written from the same offset where it's read?

> +                    buffer_mmaped = true;
> +                    break;
> +                }
> +            }
> +        }
> +
> +        if (!buffer_mmaped) {
> +            buf = g_malloc0(data_size);
> +            ret = pread(vbasedev->fd, buf, data_size,
> +                        region->fd_offset + data_offset);
> +            if (ret != data_size) {
> +                error_report("Failed to get migration data %d", ret);
> +                g_free(buf);
> +                return -EINVAL;
> +            }
> +        }
> +
> +        qemu_put_be64(f, data_size);
> +        qemu_put_buffer(f, buf, data_size);
> +
> +        if (!buffer_mmaped) {
> +            g_free(buf);
> +        }
> +        migration->pending_bytes -= data_size;
> +    } else {
> +        qemu_put_be64(f, data_size);
> +    }
> +
> +    ret = qemu_file_get_error(f);
> +
> +    return data_size;
> +}
> +
> +static int vfio_update_pending(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    uint64_t pending_bytes = 0;
> +    int ret;
> +
> +    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             pending_bytes));

Did this trigger the vendor driver to write out to the data area when
we don't need it to?

> +    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
> +        error_report("Failed to get pending bytes %d", ret);
> +        migration->pending_bytes = 0;
> +        return (ret < 0) ? ret : -EINVAL;
> +    }
> +
> +    migration->pending_bytes = pending_bytes;
> +    return 0;
> +}
> +
> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
> +
> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +        vfio_pci_save_config(vbasedev, f);
> +    }
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    return qemu_file_get_error(f);
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -163,9 +268,116 @@ static void vfio_save_cleanup(void *opaque)
>      }
>  }
>  
> +static void vfio_save_pending(QEMUFile *f, void *opaque,
> +                              uint64_t threshold_size,
> +                              uint64_t *res_precopy_only,
> +                              uint64_t *res_compatible,
> +                              uint64_t *res_postcopy_only)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    ret = vfio_update_pending(vbasedev);
> +    if (ret) {
> +        return;
> +    }
> +
> +    if (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING) {
> +        *res_precopy_only += migration->pending_bytes;
> +    } else {
> +        *res_postcopy_only += migration->pending_bytes;
> +    }
> +    *res_compatible += 0;
> +}
> +
> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +
> +    qemu_mutex_lock(&migration->lock);
> +    ret = vfio_save_buffer(f, vbasedev);
> +    qemu_mutex_unlock(&migration->lock);
> +
> +    if (ret < 0) {
> +        error_report("vfio_save_buffer failed %s",
> +                     strerror(errno));
> +        return ret;
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return ret;
> +}
> +
> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_SAVING);
> +    if (ret) {
> +        error_report("Failed to set state STOP and SAVING");
> +        return ret;
> +    }
> +
> +    ret = vfio_save_device_config_state(f, opaque);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_update_pending(vbasedev);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    while (migration->pending_bytes > 0) {
> +        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +        ret = vfio_save_buffer(f, vbasedev);
> +        if (ret < 0) {
> +            error_report("Failed to save buffer");
> +            return ret;
> +        } else if (ret == 0) {
> +            break;
> +        }
> +
> +        ret = vfio_update_pending(vbasedev);
> +        if (ret) {
> +            return ret;
> +        }
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOPPED);
> +    if (ret) {
> +        error_report("Failed to set state STOPPED");
> +        return ret;
> +    }
> +    return ret;
> +}
> +
>  static SaveVMHandlers savevm_vfio_handlers = {
>      .save_setup = vfio_save_setup,
>      .save_cleanup = vfio_save_cleanup,
> +    .save_live_pending = vfio_save_pending,
> +    .save_live_iterate = vfio_save_iterate,
> +    .save_live_complete_precopy = vfio_save_complete_precopy,
>  };
>  
>  /* ---------------------------------------------------------------------- */




* Re: [Qemu-devel] [PATCH v4 03/13] vfio: Add save and load functions for VFIO PCI devices
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 03/13] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
@ 2019-06-21  0:12   ` Yan Zhao
  2019-06-21  6:44     ` Kirti Wankhede
  0 siblings, 1 reply; 64+ messages in thread
From: Yan Zhao @ 2019-06-21  0:12 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	yulei.zhang, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, qemu-devel,
	felipe, jonathan.davies, Liu, Changpeng, Ken.Xue

On Thu, Jun 20, 2019 at 10:37:31PM +0800, Kirti Wankhede wrote:
> These functions save and restore PCI device specific data - config
> space of PCI device.
> Tested save and restore with MSI and MSIX type.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/pci.c | 112 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.h |  29 +++++++++++++++
>  2 files changed, 141 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index ce3fe96efe2c..09a0821a5b1c 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1187,6 +1187,118 @@ void vfio_pci_write_config(PCIDevice *pdev,
>      }
>  }
>  
> +void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> +{
> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint16_t pci_cmd;
> +    int i;
> +
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        uint32_t bar;
> +
> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> +        qemu_put_be32(f, bar);
> +    }
> +
> +    qemu_put_be32(f, vdev->interrupt);
> +    if (vdev->interrupt == VFIO_INT_MSI) {
> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> +        bool msi_64bit;
> +
> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                                            2);
> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> +
> +        msi_addr_lo = pci_default_read_config(pdev,
> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> +        qemu_put_be32(f, msi_addr_lo);
> +
> +        if (msi_64bit) {
> +            msi_addr_hi = pci_default_read_config(pdev,
> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                                             4);
> +        }
> +        qemu_put_be32(f, msi_addr_hi);
> +
> +        msi_data = pci_default_read_config(pdev,
> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +                2);
> +        qemu_put_be32(f, msi_data);
> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
> +        uint16_t offset;
> +
> +        /* save enable bit and maskall bit */
> +        offset = pci_default_read_config(pdev,
> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> +        qemu_put_be16(f, offset);
> +        msix_save(pdev, f);
> +    }
> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> +    qemu_put_be16(f, pci_cmd);
> +}
> +
> +void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> +{
> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint32_t interrupt_type;
> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> +    uint16_t pci_cmd;
> +    bool msi_64bit;
> +    int i;
> +
> +    /* restore pci bar configuration */
> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> +                          pci_cmd & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        uint32_t bar = qemu_get_be32(f);
> +
> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> +    }
> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> +                          pci_cmd | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
> +
> +    interrupt_type = qemu_get_be32(f);
> +
> +    if (interrupt_type == VFIO_INT_MSI) {
> +        /* restore msi configuration */
> +        msi_flags = pci_default_read_config(pdev,
> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> +
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                              msi_flags & ~PCI_MSI_FLAGS_ENABLE, 2);
> +
> +        msi_addr_lo = qemu_get_be32(f);
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
> +                              msi_addr_lo, 4);
> +
> +        msi_addr_hi = qemu_get_be32(f);
> +        if (msi_64bit) {
> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                                  msi_addr_hi, 4);
> +        }
> +        msi_data = qemu_get_be32(f);
> +        vfio_pci_write_config(pdev,
> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +                msi_data, 2);
> +
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
> +    } else if (interrupt_type == VFIO_INT_MSIX) {
> +        uint16_t offset = qemu_get_be16(f);
> +
> +        /* load enable bit and maskall bit */
> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> +                              offset, 2);
> +        msix_load(pdev, f);
> +    }
> +    pci_cmd = qemu_get_be16(f);
> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
> +}
> +
Per the previous discussion, PCI config state save/restore is better
defined in fields of a VMStateDescription; see the sketch below.
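
For example, a minimal sketch of what that could look like (the shadow
fields bars[]/pci_cmd, the callback names, and the field list are
illustrative assumptions, not part of this patch):

    /* Hypothetical: .pre_save fills the shadow fields from config space,
     * .post_load writes them back through vfio_pci_write_config(). */
    static int vfio_pci_config_pre_save(void *opaque);
    static int vfio_pci_config_post_load(void *opaque, int version_id);

    static const VMStateDescription vmstate_vfio_pci_config = {
        .name = "vfio-pci/config",
        .version_id = 1,
        .minimum_version_id = 1,
        .pre_save = vfio_pci_config_pre_save,
        .post_load = vfio_pci_config_post_load,
        .fields = (VMStateField[]) {
            VMSTATE_UINT32_ARRAY(bars, VFIOPCIDevice, PCI_ROM_SLOT),
            VMSTATE_UINT16(pci_cmd, VFIOPCIDevice),
            VMSTATE_END_OF_LIST()
        }
    };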


>  /*
>   * Interrupt setup
>   */
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 834a90d64686..847be5f56478 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -19,6 +19,7 @@
>  #include "qemu/queue.h"
>  #include "qemu/timer.h"
>  
> +#ifdef CONFIG_LINUX
>  #define PCI_ANY_ID (~0)
>  
>  struct VFIOPCIDevice;
> @@ -202,4 +203,32 @@ void vfio_display_reset(VFIOPCIDevice *vdev);
>  int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
>  void vfio_display_finalize(VFIOPCIDevice *vdev);
>  
> +void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f);
> +void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f);
> +
> +static inline Object *vfio_pci_get_object(VFIODevice *vbasedev)
> +{
> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +
> +    return OBJECT(vdev);
> +}
> +
> +#else
> +static inline void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> +{
> +    g_assert(false);
> +}
> +
> +static inline void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> +{
> +    g_assert(false);
> +}
> +
> +static inline Object *vfio_pci_get_object(VFIODevice *vbasedev)
> +{
> +    return NULL;
> +}
> +
> +#endif
> +
>  #endif /* HW_VFIO_VFIO_PCI_H */
> -- 
> 2.7.0
> 



* Re: [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device
  2019-06-20 14:37 [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (12 preceding siblings ...)
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 13/13] vfio: Add trace events in migration code path Kirti Wankhede
@ 2019-06-21  0:25 ` Yan Zhao
  2019-06-21  1:24   ` Yan Zhao
  13 siblings, 1 reply; 64+ messages in thread
From: Yan Zhao @ 2019-06-21  0:25 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	yulei.zhang, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, qemu-devel,
	felipe, jonathan.davies, Liu, Changpeng, Ken.Xue

On Thu, Jun 20, 2019 at 10:37:28PM +0800, Kirti Wankhede wrote:
> Add migration support for VFIO device
> 
> This Patch set include patches as below:
> - Define KABI for VFIO device for migration support.
> - Added save and restore functions for PCI configuration space
> - Generic migration functionality for VFIO device.
>   * This patch set adds functionality only for PCI devices, but can be
>     extended to other VFIO devices.
>   * Added all the basic functions required for pre-copy, stop-and-copy and
>     resume phases of migration.
>   * Added state change notifier and from that notifier function, VFIO
>     device's state changed is conveyed to VFIO device driver.
>   * During save setup phase and resume/load setup phase, migration region
>     is queried and is used to read/write VFIO device data.
>   * .save_live_pending and .save_live_iterate are implemented to use QEMU's
>     functionality of iteration during pre-copy phase.
>   * In .save_live_complete_precopy, that is in stop-and-copy phase,
>     iteration to read data from VFIO device driver is implemented till pending
>     bytes returned by driver are not zero.
>   * Added function to get dirty pages bitmap for the pages which are used by
>     driver.
> - Add vfio_listerner_log_sync to mark dirty pages.
> - Make VFIO PCI device migration capable. If migration region is not provided by
>   driver, migration is blocked.
> 
> Below is the flow of state change for live migration where states in brackets
> represent VM state, migration state and VFIO device state as:
>     (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)
> 
> Live migration save path:
>         QEMU normal running state
>         (RUNNING, _NONE, _RUNNING)
>                         |
>     migrate_init spawns migration_thread.
>     (RUNNING, _SETUP, _RUNNING|_SAVING)
>     Migration thread then calls each device's .save_setup()
>                         |
>     (RUNNING, _ACTIVE, _RUNNING|_SAVING)
>     If device is active, get pending bytes by .save_live_pending()
>     if pending bytes >= threshold_size,  call save_live_iterate()
>     Data of VFIO device for pre-copy phase is copied.
>     Iterate till pending bytes converge and are less than threshold
>                         |
>     On migration completion, vCPUs stops and calls .save_live_complete_precopy
>     for each active device. VFIO device is then transitioned in
>      _SAVING state.
>     (FINISH_MIGRATE, _DEVICE, _SAVING)
>     For VFIO device, iterate in  .save_live_complete_precopy  until
>     pending data is 0.
>     (FINISH_MIGRATE, _DEVICE, _STOPPED)

I suggest we also register a VMStateDescription, whose .pre_save
handler would get called after .save_live_complete_precopy in the pre-copy
only case, and will be called before .save_live_iterate in the post-copy
enabled case.
In the .pre_save handler, we can save all device state which must be
copied after device stop in the source vm and before device start in the
target vm.

>                         |
>     (FINISH_MIGRATE, _COMPLETED, STOPPED)
>     Migraton thread schedule cleanup bottom half and exit
> 
> Live migration resume path:
>     Incomming migration calls .load_setup for each device
>     (RESTORE_VM, _ACTIVE, STOPPED)
>                         |
>     For each device, .load_state is called for that device section data
>                         |
>     At the end, called .load_cleanup for each device and vCPUs are started.
>                         |
>         (RUNNING, _NONE, _RUNNING)
> 
> Note that:
> - Migration post copy is not supported.
> 
> v3 -> v4:
> - Added one more bit for _RESUMING flag to be set explicitly.
> - data_offset field is read-only for user space application.
> - data_size is read for every iteration before reading data from migration, that
>   is removed assumption that data will be till end of migration region.
> - If vendor driver supports mappable sparsed region, map those region during
>   setup state of save/load, similarly unmap those from cleanup routines.
> - Handles race condition that causes data corruption in migration region during
>   save device state by adding mutex and serialiaing save_buffer and
>   get_dirty_pages routines.
> - Skip called get_dirty_pages routine for mapped MMIO region of device.
> - Added trace events.
> - Splitted into multiple functional patches.
> 
> v2 -> v3:
> - Removed enum of VFIO device states. Defined VFIO device state with 2 bits.
> - Re-structured vfio_device_migration_info to keep it minimal and defined action
>   on read and write access on its members.
> 
> v1 -> v2:
> - Defined MIGRATION region type and sub-type which should be used with region
>   type capability.
> - Re-structured vfio_device_migration_info. This structure will be placed at 0th
>   offset of migration region.
> - Replaced ioctl with read/write for trapped part of migration region.
> - Added both type of access support, trapped or mmapped, for data section of the
>   region.
> - Moved PCI device functions to pci file.
> - Added iteration to get dirty page bitmap until bitmap for all requested pages
>   are copied.
> 
> Thanks,
> Kirti
> 
> 
> Kirti Wankhede (13):
>   vfio: KABI for migration interface
>   vfio: Add function to unmap VFIO region
>   vfio: Add save and load functions for VFIO PCI devices
>   vfio: Add migration region initialization and finalize function
>   vfio: Add VM state change handler to know state of VM
>   vfio: Add migration state change notifier
>   vfio: Register SaveVMHandlers for VFIO device
>   vfio: Add save state functions to SaveVMHandlers
>   vfio: Add load state functions to SaveVMHandlers
>   vfio: Add function to get dirty page list
>   vfio: Add vfio_listerner_log_sync to mark dirty pages
>   vfio: Make vfio-pci device migration capable.
>   vfio: Add trace events in migration code path
> 
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/common.c              |  55 +++
>  hw/vfio/migration.c           | 815 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.c                 | 126 ++++++-
>  hw/vfio/pci.h                 |  29 ++
>  hw/vfio/trace-events          |  19 +
>  include/hw/vfio/vfio-common.h |  22 ++
>  linux-headers/linux/vfio.h    |  71 ++++
>  8 files changed, 1132 insertions(+), 7 deletions(-)
>  create mode 100644 hw/vfio/migration.c
> 
> -- 
> 2.7.0
> 



* Re: [Qemu-devel] [PATCH v4 08/13] vfio: Add save state functions to SaveVMHandlers
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 08/13] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
  2019-06-20 19:25   ` Alex Williamson
@ 2019-06-21  0:31   ` Yan Zhao
  2019-06-25  3:30     ` Yan Zhao
  2019-06-28  9:09   ` Dr. David Alan Gilbert
  2 siblings, 1 reply; 64+ messages in thread
From: Yan Zhao @ 2019-06-21  0:31 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	yulei.zhang, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, qemu-devel,
	felipe, jonathan.davies, Liu, Changpeng, Ken.Xue

On Thu, Jun 20, 2019 at 10:37:36PM +0800, Kirti Wankhede wrote:
> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
> functions. These functions handle the pre-copy and stop-and-copy phases.
> 
> In _SAVING|_RUNNING device state or pre-copy phase:
> - read pending_bytes
> - read data_offset - indicates kernel driver to write data to staging
>   buffer which is mmapped.
> - read data_size - amount of data in bytes written by vendor driver in migration
>   region.
> - if data section is trapped, pread() number of bytes in data_size, from
>   data_offset.
> - if data section is mmaped, read mmaped buffer of size data_size.
> - Write data packet to file stream as below:
> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> VFIO_MIG_FLAG_END_OF_STATE }
> 
> In _SAVING device state or stop-and-copy phase
> a. read config space of device and save to migration file stream. This
>    doesn't need to be from vendor driver. Any other special config state
>    from driver can be saved as data in following iteration.
> b. read pending_bytes - indicates kernel driver to write data to staging
>    buffer which is mmapped.
> c. read data_size - amount of data in bytes written by vendor driver in
>    migration region.
> d. if data section is trapped, pread() from data_offset of size data_size.
> e. if data section is mmaped, read mmaped buffer of size data_size.
> f. Write data packet as below:
>    {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> g. iterate through steps b to f until (pending_bytes > 0)
> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
> 
> .save_live_iterate runs outside the iothread lock in the migration case, which
> could race with asynchronous call to get dirty page list causing data corruption
> in mapped migration region. Mutex added here to serialize migration buffer
> read operations.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c | 212 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 212 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index fe0887c27664..0a2f30872316 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -107,6 +107,111 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
>      return 0;
>  }
>  
> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    uint64_t data_offset = 0, data_size = 0;
> +    int ret;
> +
> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data_offset));
> +    if (ret != sizeof(data_offset)) {
> +        error_report("Failed to get migration buffer data offset %d",
> +                     ret);
> +        return -EINVAL;
> +    }
> +
> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data_size));
> +    if (ret != sizeof(data_size)) {
> +        error_report("Failed to get migration buffer data size %d",
> +                     ret);
> +        return -EINVAL;
> +    }
> +
How big is data_size?
If this size is too big, reading it may take too much time and block others.

> +    if (data_size > 0) {
> +        void *buf = NULL;
> +        bool buffer_mmaped = false;
> +
> +        if (region->mmaps) {
> +            int i;
> +
> +            for (i = 0; i < region->nr_mmaps; i++) {
> +                if ((data_offset >= region->mmaps[i].offset) &&
> +                    (data_offset < region->mmaps[i].offset +
> +                                   region->mmaps[i].size)) {
> +                    buf = region->mmaps[i].mmap + (data_offset -
> +                                                   region->mmaps[i].offset);
> +                    buffer_mmaped = true;
> +                    break;
> +                }
> +            }
> +        }
> +
> +        if (!buffer_mmaped) {
> +            buf = g_malloc0(data_size);
> +            ret = pread(vbasedev->fd, buf, data_size,
> +                        region->fd_offset + data_offset);
> +            if (ret != data_size) {
> +                error_report("Failed to get migration data %d", ret);
> +                g_free(buf);
> +                return -EINVAL;
> +            }
> +        }
> +
> +        qemu_put_be64(f, data_size);
> +        qemu_put_buffer(f, buf, data_size);
> +
> +        if (!buffer_mmaped) {
> +            g_free(buf);
> +        }
> +        migration->pending_bytes -= data_size;
> +    } else {
> +        qemu_put_be64(f, data_size);
> +    }
> +
> +    ret = qemu_file_get_error(f);
> +
> +    return data_size;
> +}
> +
> +static int vfio_update_pending(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    uint64_t pending_bytes = 0;
> +    int ret;
> +
> +    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             pending_bytes));
> +    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
> +        error_report("Failed to get pending bytes %d", ret);
> +        migration->pending_bytes = 0;
> +        return (ret < 0) ? ret : -EINVAL;
> +    }
> +
> +    migration->pending_bytes = pending_bytes;
> +    return 0;
> +}
> +
> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
> +
> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +        vfio_pci_save_config(vbasedev, f);
> +    }
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    return qemu_file_get_error(f);
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -163,9 +268,116 @@ static void vfio_save_cleanup(void *opaque)
>      }
>  }
>  
> +static void vfio_save_pending(QEMUFile *f, void *opaque,
> +                              uint64_t threshold_size,
> +                              uint64_t *res_precopy_only,
> +                              uint64_t *res_compatible,
> +                              uint64_t *res_postcopy_only)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    ret = vfio_update_pending(vbasedev);
> +    if (ret) {
> +        return;
> +    }
> +
> +    if (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING) {
> +        *res_precopy_only += migration->pending_bytes;
> +    } else {
> +        *res_postcopy_only += migration->pending_bytes;
> +    }
> +    *res_compatible += 0;
> +}
> +
> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +
> +    qemu_mutex_lock(&migration->lock);
> +    ret = vfio_save_buffer(f, vbasedev);
> +    qemu_mutex_unlock(&migration->lock);
> +
> +    if (ret < 0) {
> +        error_report("vfio_save_buffer failed %s",
> +                     strerror(errno));
> +        return ret;
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return ret;
> +}
> +
> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_SAVING);
> +    if (ret) {
> +        error_report("Failed to set state STOP and SAVING");
> +        return ret;
> +    }
> +
> +    ret = vfio_save_device_config_state(f, opaque);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_update_pending(vbasedev);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    while (migration->pending_bytes > 0) {
> +        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +        ret = vfio_save_buffer(f, vbasedev);
> +        if (ret < 0) {
> +            error_report("Failed to save buffer");
> +            return ret;
> +        } else if (ret == 0) {
> +            break;
> +        }
> +
> +        ret = vfio_update_pending(vbasedev);
> +        if (ret) {
> +            return ret;
> +        }
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOPPED);
> +    if (ret) {
> +        error_report("Failed to set state STOPPED");
> +        return ret;
> +    }
> +    return ret;
> +}
> +
>  static SaveVMHandlers savevm_vfio_handlers = {
>      .save_setup = vfio_save_setup,
>      .save_cleanup = vfio_save_cleanup,
> +    .save_live_pending = vfio_save_pending,
> +    .save_live_iterate = vfio_save_iterate,
> +    .save_live_complete_precopy = vfio_save_complete_precopy,
>  };
>  
>  /* ---------------------------------------------------------------------- */
> -- 
> 2.7.0
> 



* Re: [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device
  2019-06-21  0:25 ` [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device Yan Zhao
@ 2019-06-21  1:24   ` Yan Zhao
  2019-06-21  8:02     ` Kirti Wankhede
  0 siblings, 1 reply; 64+ messages in thread
From: Yan Zhao @ 2019-06-21  1:24 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	yulei.zhang, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, qemu-devel,
	felipe, jonathan.davies, Liu, Changpeng, Ken.Xue

On Fri, Jun 21, 2019 at 08:25:18AM +0800, Yan Zhao wrote:
> On Thu, Jun 20, 2019 at 10:37:28PM +0800, Kirti Wankhede wrote:
> > Add migration support for VFIO device
> > 
> > This Patch set include patches as below:
> > - Define KABI for VFIO device for migration support.
> > - Added save and restore functions for PCI configuration space
> > - Generic migration functionality for VFIO device.
> >   * This patch set adds functionality only for PCI devices, but can be
> >     extended to other VFIO devices.
> >   * Added all the basic functions required for pre-copy, stop-and-copy and
> >     resume phases of migration.
> >   * Added state change notifier and from that notifier function, VFIO
> >     device's state changed is conveyed to VFIO device driver.
> >   * During save setup phase and resume/load setup phase, migration region
> >     is queried and is used to read/write VFIO device data.
> >   * .save_live_pending and .save_live_iterate are implemented to use QEMU's
> >     functionality of iteration during pre-copy phase.
> >   * In .save_live_complete_precopy, that is in stop-and-copy phase,
> >     iteration to read data from VFIO device driver is implemented till pending
> >     bytes returned by driver are not zero.
> >   * Added function to get dirty pages bitmap for the pages which are used by
> >     driver.
> > - Add vfio_listerner_log_sync to mark dirty pages.
> > - Make VFIO PCI device migration capable. If migration region is not provided by
> >   driver, migration is blocked.
> > 
> > Below is the flow of state change for live migration where states in brackets
> > represent VM state, migration state and VFIO device state as:
> >     (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)
> > 
> > Live migration save path:
> >         QEMU normal running state
> >         (RUNNING, _NONE, _RUNNING)
> >                         |
> >     migrate_init spawns migration_thread.
> >     (RUNNING, _SETUP, _RUNNING|_SAVING)
> >     Migration thread then calls each device's .save_setup()
> >                         |
> >     (RUNNING, _ACTIVE, _RUNNING|_SAVING)
> >     If device is active, get pending bytes by .save_live_pending()
> >     if pending bytes >= threshold_size,  call save_live_iterate()
> >     Data of VFIO device for pre-copy phase is copied.
> >     Iterate till pending bytes converge and are less than threshold
> >                         |
> >     On migration completion, vCPUs stops and calls .save_live_complete_precopy
> >     for each active device. VFIO device is then transitioned in
> >      _SAVING state.
> >     (FINISH_MIGRATE, _DEVICE, _SAVING)
> >     For VFIO device, iterate in  .save_live_complete_precopy  until
> >     pending data is 0.
> >     (FINISH_MIGRATE, _DEVICE, _STOPPED)
> 
> I suggest we also register a VMStateDescription, whose .pre_save
> handler would get called after .save_live_complete_precopy in the pre-copy
> only case, and will be called before .save_live_iterate in the post-copy
> enabled case.
> In the .pre_save handler, we can save all device state which must be
> copied after device stop in the source vm and before device start in the
> target vm.
> 
Hi,
to better describe this idea:

In the pre-copy only case, the flow is:

start migration --> .save_live_iterate (several rounds) --> stop source vm
--> .save_live_complete_precopy --> .pre_save --> start target vm
--> migration complete


In the post-copy enabled case, the flow is:

start migration --> .save_live_iterate (several rounds) --> start post copy -->
stop source vm --> .pre_save --> start target vm --> .save_live_iterate (several rounds)
--> migration complete

Therefore, we should put the saving of device state in the .pre_save
interface rather than in .save_live_complete_precopy.
The device state includes PCI config data, page tables, register state, etc.

The .save_live_iterate and .save_live_complete_precopy handlers should only
deal with saving dirty memory; a sketch of the resulting registration follows.
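
A rough sketch of the resulting registration (vmstate_vfio_device here is a
hypothetical VMStateDescription along the lines described above):

    /* dirty memory via SaveVMHandlers, device state via VMState;
     * .pre_save of vmstate_vfio_device runs after the source vm stops
     * in the pre-copy case, or before post-copy starts otherwise. */
    register_savevm_live(NULL, "vfio", -1, 1, &savevm_vfio_handlers, vbasedev);
    vmstate_register(NULL, -1, &vmstate_vfio_device, vbasedev);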


I know the current implementation does not support post-copy, but at least
it should not require huge changes when we decide to enable it in the
future.

Thanks
Yan

> >                         |
> >     (FINISH_MIGRATE, _COMPLETED, STOPPED)
> >     Migraton thread schedule cleanup bottom half and exit
> > 
> > Live migration resume path:
> >     Incomming migration calls .load_setup for each device
> >     (RESTORE_VM, _ACTIVE, STOPPED)
> >                         |
> >     For each device, .load_state is called for that device section data
> >                         |
> >     At the end, called .load_cleanup for each device and vCPUs are started.
> >                         |
> >         (RUNNING, _NONE, _RUNNING)
> > 
> > Note that:
> > - Migration post copy is not supported.
> > 
> > v3 -> v4:
> > - Added one more bit for _RESUMING flag to be set explicitly.
> > - data_offset field is read-only for user space application.
> > - data_size is read for every iteration before reading data from migration, that
> >   is removed assumption that data will be till end of migration region.
> > - If vendor driver supports mappable sparsed region, map those region during
> >   setup state of save/load, similarly unmap those from cleanup routines.
> > - Handles race condition that causes data corruption in migration region during
> >   save device state by adding mutex and serialiaing save_buffer and
> >   get_dirty_pages routines.
> > - Skip called get_dirty_pages routine for mapped MMIO region of device.
> > - Added trace events.
> > - Splitted into multiple functional patches.
> > 
> > v2 -> v3:
> > - Removed enum of VFIO device states. Defined VFIO device state with 2 bits.
> > - Re-structured vfio_device_migration_info to keep it minimal and defined action
> >   on read and write access on its members.
> > 
> > v1 -> v2:
> > - Defined MIGRATION region type and sub-type which should be used with region
> >   type capability.
> > - Re-structured vfio_device_migration_info. This structure will be placed at 0th
> >   offset of migration region.
> > - Replaced ioctl with read/write for trapped part of migration region.
> > - Added both type of access support, trapped or mmapped, for data section of the
> >   region.
> > - Moved PCI device functions to pci file.
> > - Added iteration to get dirty page bitmap until bitmap for all requested pages
> >   are copied.
> > 
> > Thanks,
> > Kirti
> > 
> > 
> > Kirti Wankhede (13):
> >   vfio: KABI for migration interface
> >   vfio: Add function to unmap VFIO region
> >   vfio: Add save and load functions for VFIO PCI devices
> >   vfio: Add migration region initialization and finalize function
> >   vfio: Add VM state change handler to know state of VM
> >   vfio: Add migration state change notifier
> >   vfio: Register SaveVMHandlers for VFIO device
> >   vfio: Add save state functions to SaveVMHandlers
> >   vfio: Add load state functions to SaveVMHandlers
> >   vfio: Add function to get dirty page list
> >   vfio: Add vfio_listerner_log_sync to mark dirty pages
> >   vfio: Make vfio-pci device migration capable.
> >   vfio: Add trace events in migration code path
> > 
> >  hw/vfio/Makefile.objs         |   2 +-
> >  hw/vfio/common.c              |  55 +++
> >  hw/vfio/migration.c           | 815 ++++++++++++++++++++++++++++++++++++++++++
> >  hw/vfio/pci.c                 | 126 ++++++-
> >  hw/vfio/pci.h                 |  29 ++
> >  hw/vfio/trace-events          |  19 +
> >  include/hw/vfio/vfio-common.h |  22 ++
> >  linux-headers/linux/vfio.h    |  71 ++++
> >  8 files changed, 1132 insertions(+), 7 deletions(-)
> >  create mode 100644 hw/vfio/migration.c
> > 
> > -- 
> > 2.7.0
> > 



* Re: [Qemu-devel] [PATCH v4 01/13] vfio: KABI for migration interface
  2019-06-20 17:18   ` Alex Williamson
@ 2019-06-21  5:52     ` Kirti Wankhede
  2019-06-21 15:03       ` Alex Williamson
  0 siblings, 1 reply; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-21  5:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, yulei.zhang, eauger, felipe,
	jonathan.davies, yan.y.zhao, changpeng.liu, Ken.Xue



On 6/20/2019 10:48 PM, Alex Williamson wrote:
> On Thu, 20 Jun 2019 20:07:29 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> - Defined MIGRATION region type and sub-type.
>> - Used 3 bits to define VFIO device states.
>>     Bit 0 => _RUNNING
>>     Bit 1 => _SAVING
>>     Bit 2 => _RESUMING
>>     Combination of these bits defines VFIO device's state during migration
>>     _STOPPED => All bits 0 indicates VFIO device stopped.
>>     _RUNNING => Normal VFIO device running state.
>>     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but start
>>                           saving state of device i.e. pre-copy state
>>     _SAVING  => vCPUs are stopped, VFIO device should be stopped, and
>>                           save device state,i.e. stop-n-copy state
>>     _RESUMING => VFIO device resuming state.
>>     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING bits are set
>> - Defined vfio_device_migration_info structure which will be placed at 0th
>>   offset of migration region to get/set VFIO device related information.
>>   Defined members of structure and usage on read/write access:
>>     * device_state: (read/write)
>>         To convey VFIO device state to be transitioned to. Only 3 bits are used
>>         as of now.
>>     * pending bytes: (read only)
>>         To get pending bytes yet to be migrated for VFIO device.
>>     * data_offset: (read only)
>>         To get data offset in migration from where data exist during _SAVING
>>         and from where data should be written by user space application during
>>          _RESUMING state
>>     * data_size: (read/write)
>>         To get and set size of data copied in migration region during _SAVING
>>         and _RESUMING state.
>>     * start_pfn, page_size, total_pfns: (write only)
>>         To get bitmap of dirty pages from vendor driver from given
>>         start address for total_pfns.
>>     * copied_pfns: (read only)
>>         To get number of pfns bitmap copied in migration region.
>>         Vendor driver should copy the bitmap with bits set only for
>>         pages to be marked dirty in migration region. Vendor driver
>>         should return 0 if there are 0 pages dirty in requested
>>         range. Vendor driver should return -1 to mark all pages in the section
>>         as dirty
>>
>> Migration region looks like:
>>  ------------------------------------------------------------------
>> |vfio_device_migration_info|    data section                      |
>> |                          |     ///////////////////////////////  |
>>  ------------------------------------------------------------------
>>  ^                              ^                              ^
>>  offset 0-trapped part        data_offset                 data_size
>>
>> The data section always follows the vfio_device_migration_info
>> structure in the region, so data_offset will always be non-0.
>> The offset from where data is copied is decided by the kernel driver; the
>> data section can be trapped or mapped depending on how the kernel driver
>> defines the data section. If mmapped, then data_offset should be page
>> aligned, whereas the initial section which contains the
>> vfio_device_migration_info structure might not end at an offset which
>> is page aligned.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  linux-headers/linux/vfio.h | 71 ++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 71 insertions(+)
>>
>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>> index 24f505199f83..274ec477eb82 100644
>> --- a/linux-headers/linux/vfio.h
>> +++ b/linux-headers/linux/vfio.h
>> @@ -372,6 +372,77 @@ struct vfio_region_gfx_edid {
>>   */
>>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
>>  
>> +/* Migration region type and sub-type */
>> +#define VFIO_REGION_TYPE_MIGRATION	        (2)
>> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
>> +
>> +/**
>> + * Structure vfio_device_migration_info is placed at 0th offset of
>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
>> + * information. Field accesses from this structure are only supported at their
>> + * native width and alignment, otherwise should return error.
>> + *
>> + * device_state: (read/write)
>> + *      To indicate vendor driver the state VFIO device should be transitioned
>> + *      to. If device state transition fails, a write to this field returns error.
>> + *      It consists of 3 bits:
>> + *      - If bit 0 set, indicates _RUNNING state. When it's reset, that indicates
>> + *        _STOPPED state. When device is changed to _STOPPED, driver should stop
>> + *        device before write returns.
>> + *      - If bit 1 set, indicates _SAVING state.
>> + *      - If bit 2 set, indicates _RESUMING state.
>> + *
>> + * pending bytes: (read only)
>> + *      Read pending bytes yet to be migrated from vendor driver
>> + *
>> + * data_offset: (read only)
>> + *      User application should read data_offset in migration region from where
>> + *      user application should read data during _SAVING state or write data
>> + *      during _RESUMING state.
>> + *
>> + * data_size: (read/write)
>> + *      User application should read data_size to know data copied in migration
>> + *      region during _SAVING state and write size of data copied in migration
>> + *      region during _RESUMING state.
>> + *
>> + * start_pfn: (write only)
>> + *      Start address pfn to get bitmap of dirty pages from vendor driver during
>> + *      _SAVING state.
>> + *
>> + * page_size: (write only)
>> + *      User application should write the page_size of pfn.
>> + *
>> + * total_pfns: (write only)
>> + *      Total pfn count from start_pfn for which dirty bitmap is requested.
>> + *
>> + * copied_pfns: (read only)
>> + *      pfn count for which dirty bitmap is copied to migration region.
>> + *      Vendor driver should copy the bitmap with bits set only for pages to be
>> + *      marked dirty in migration region.
>> + *      Vendor driver should return 0 if there are 0 pages dirty in requested
>> + *      range.
>> + *      Vendor driver should return -1 to mark all pages in the section as
>> + *      dirty.
> 
> Is the protocol that the user writes start_pfn/page_size/total_pfns in
> any order and then the read of copied_pfns is what triggers the
> snapshot?

Yes.

>  Are start_pfn/page_size/total_pfns sticky such that a user
> can write them once and get repeated refreshes of the dirty bitmap by
> re-reading copied_pfns?

Yes, and that bitmap should be for the given range (from start_pfn till
start_pfn + total_pfns).
Re-reading copied_pfns handles the case where the vendor driver might have
reserved a bitmap area smaller than the total bitmap size for the range
(start_pfn to start_pfn + total_pfns); the user then has to iterate till
copied_pfns == total_pfns or till copied_pfns == 0 (that is, there are no
pages dirty in the rest of the range).

>  What's the advantage to returning -1 versus
> returning copied_pfns == total_pfns?
> 

If all bits in the bitmap are 1, then return -1, that is, all pages in the
given range are to be marked dirty.

If all bits in the bitmap are 0, then return 0, that is, no page is to be
marked dirty in the given range or the rest of the range.

Otherwise the vendor driver should return copied_pfns == total_pfns and
provide the bitmap for total_pfns, which means that the bitmap copied for
the given range contains information for all pages, where some bits are 0s
and some are 1s. See the sketch below.
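
Putting the protocol above together as a sketch (fd/region_base stand for
the device fd and the migration region's file offset; the helpers
mark_dirty()/read_bitmap(), and treating -1 as covering the remaining
range, are illustrative assumptions):

    /* start_pfn, page_size, total_pfns already chosen by the caller */
    uint64_t count = 0;
    int64_t copied;

    pwrite(fd, &start_pfn, sizeof(start_pfn), region_base +
           offsetof(struct vfio_device_migration_info, start_pfn));
    pwrite(fd, &page_size, sizeof(page_size), region_base +
           offsetof(struct vfio_device_migration_info, page_size));
    pwrite(fd, &total_pfns, sizeof(total_pfns), region_base +
           offsetof(struct vfio_device_migration_info, total_pfns));

    do {
        /* this read is what triggers the snapshot */
        pread(fd, &copied, sizeof(copied), region_base +
              offsetof(struct vfio_device_migration_info, copied_pfns));

        if (copied == 0) {
            break;                      /* nothing dirty in rest of range */
        } else if (copied == -1) {
            mark_dirty(start_pfn + count, total_pfns - count);
            break;                      /* everything remaining is dirty */
        }

        read_bitmap(fd, start_pfn + count, copied);  /* from data section */
        count += copied;
    } while (count < total_pfns);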

> If the user then wants to switch back to reading device migration
> state, is it a read of data_size that switches the data area back to
> making that address space available? 

No, it's not just read(data_size); before that there is a read(data_offset).
If the vendor driver wants to have different sub-regions for device data and
the dirty page bitmap, it should return the corresponding offset on
read(data_offset).
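
For example, the region might then look like this (a hypothetical layout;
the user only ever learns the offsets through read(data_offset)):

     ------------------------------------------------------------------
    |vfio_device_migration_info| device data     | dirty pages bitmap  |
     ------------------------------------------------------------------
     ^                          ^                 ^
     offset 0-trapped part      data_offset read  data_offset read
                                during _SAVING    during bitmap query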

> In each case, is it the user's
> responsibility to consume all the data provided before triggering the
> next data area?  For example, if I ask for a range of dirty bitmap, the
> vendor driver will provide that range and and clear it, such that the
> pages are considered clean regardless of whether the user consumed the
> data area.  

Yes.

> Likewise if the user asks for data_size, that would be
> deducted from pending_bytes regardless of the user reading the data
> area. 

The user should read the data before deducting data_size from pending_bytes.
From the vendor driver's point of view, data_size will be deducted from
pending_bytes once the data is copied to the data region.

> Are there any read side-effects to pending_bytes?

No, it's a query to the vendor driver about the bytes yet to be
migrated/read from the vendor driver.

>  Are there
> read side-effects to the data area on SAVING?

No.

>  Are there write
> side-effects on RESUMING, or is it only the write of data_size that
> triggers the buffer to be consumed?

It's the write(data_size) that triggers the buffer to be consumed. If the
region is mmapped, then the data is already copied to the region; if it's
trapped, then the data written starting at data_offset is what is consumed.
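
A sketch of the resume-side sequence for a trapped data section (buf and
data_size here come from the migration stream; for an mmapped section the
payload pwrite() becomes a memcpy() into the mapped area):

    uint64_t data_offset;

    /* learn where the driver wants this iteration's data */
    pread(vbasedev->fd, &data_offset, sizeof(data_offset),
          region->fd_offset + offsetof(struct vfio_device_migration_info,
                                       data_offset));

    /* write the payload ... */
    pwrite(vbasedev->fd, buf, data_size, region->fd_offset + data_offset);

    /* ... then the write of data_size tells the driver to consume it */
    pwrite(vbasedev->fd, &data_size, sizeof(data_size),
          region->fd_offset + offsetof(struct vfio_device_migration_info,
                                       data_size));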

>  Is it the user's responsibility to
> write only full "packets" on RESUMING?  For example if the SAVING side
> provides data_size X, that full data_size X must be written to the
> RESUMING side, the user cannot write half of it to the data area on the
> RESUMING side, write data_size with X/2, write the second half, and
> again write X/2.  IOW, the data_size "packet" is indivisible at the
> point of resuming.
> 

If source and destination are compatible or of the same driver version, then
if the user reads data_size X at the source/SAVING side, the destination
should be able to consume data_size X at the restoring/RESUMING side. Then
why should the user write X/2 and iterate?

> What are the ordering requirements?  Must the user write data_size
> packets in the same order that they're read, or is it the vendor
> driver's responsibility to include sequence information and allow
> restore in any order?
> 

For the user, the data is opaque. The user should write the data in the
same order as it was received.

>> + */
>> +
>> +struct vfio_device_migration_info {
>> +        __u32 device_state;         /* VFIO device state */
>> +#define VFIO_DEVICE_STATE_STOPPED   (0)
> 
> We need to be careful with how this is used if we want to leave the
> possibility of using the remaining 29 bits of this register.  Maybe we
> want to define VFIO_DEVICE_STATE_MASK and be sure that we only do
> read-modify-write ops within the mask (ex. set_bit and clear_bit
> helpers).

Makes sense, I'll make these changes in the next iteration.
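
Something along these lines (a sketch of the suggested mask; the helper
name is illustrative):

    #define VFIO_DEVICE_STATE_MASK     (VFIO_DEVICE_STATE_RUNNING | \
                                        VFIO_DEVICE_STATE_SAVING | \
                                        VFIO_DEVICE_STATE_RESUMING)

    /* read-modify-write only the state bits, leaving the remaining
     * bits of the register available for future use */
    static inline __u32 vfio_device_state_set_bits(__u32 reg, __u32 state)
    {
        return (reg & ~VFIO_DEVICE_STATE_MASK) |
               (state & VFIO_DEVICE_STATE_MASK);
    }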

>  Also, above we define STOPPED to indicate simply
> not-RUNNING, but here it seems STOPPED means not-RUNNING, not-SAVING,
> and not-RESUMING.
> 

That's correct.

>> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
>> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
>> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
>> +#define VFIO_DEVICE_STATE_INVALID   (VFIO_DEVICE_STATE_SAVING | \
>> +                                     VFIO_DEVICE_STATE_RESUMING)
>> +        __u32 reserved;
>> +        __u64 pending_bytes;
>> +        __u64 data_offset;
> 
> Placing the data more than 4GB into the region seems a bit absurd, so
> this could probably be a __u32 and take the place of the reserved field.
> 

Is there a maximum limit on VFIO region size?
There isn't any such limit, right? The vendor driver can define a region of
any size and then place the data section anywhere in the region. I prefer
to keep it __u64.

>> +        __u64 data_size;
>> +        __u64 start_pfn;
>> +        __u64 page_size;
>> +        __u64 total_pfns;
>> +        __s64 copied_pfns;
> 
> If this is signed so that we can get -1 then the user could
> theoretically specify total_pfns that we can't represent in
> copied_pfns.  Probably best to use unsigned and specify ~0 rather than
> -1.
> 

Ok.

> Overall this looks like a good interface, but we need to more
> thoroughly define the protocol with the data area and set expectations
> we're placing on the user and vendor driver.  There should be no usage
> assumptions, it should all be spelled out.  Thanks,
>

Thanks for your feedback. I'll update comments above to be more specific.

Thanks,
Kirti

> Alex
> 
>> +} __attribute__((packed));
>> +
>>  /*
>>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>>   * which allows direct access to non-MSIX registers which happened to be within
> 



* Re: [Qemu-devel] [PATCH v4 13/13] vfio: Add trace events in migration code path
  2019-06-20 18:50   ` Dr. David Alan Gilbert
@ 2019-06-21  5:54     ` Kirti Wankhede
  0 siblings, 0 replies; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-21  5:54 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	yulei.zhang, cohuck, shuangtai.tst, qemu-devel, zhi.a.wang,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, yan.y.zhao, changpeng.liu, Ken.Xue



On 6/21/2019 12:20 AM, Dr. David Alan Gilbert wrote:
> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> 
> Thanks, adding traces really helps; however, it might be easier
> if you just add them in your previous patches where you're
> adding the functions.
> 

Ok. I'll change it.

Thanks,
Kirti

> Dave
> 
>> ---
>>  hw/vfio/migration.c  | 26 ++++++++++++++++++++++++++
>>  hw/vfio/trace-events | 18 ++++++++++++++++++
>>  2 files changed, 44 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 68775b5dec11..70c03f1a969f 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -21,6 +21,7 @@
>>  #include "exec/ramlist.h"
>>  #include "exec/ram_addr.h"
>>  #include "pci.h"
>> +#include "trace.h"
>>  
>>  /*
>>   * Flags used as delimiter:
>> @@ -104,6 +105,7 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
>>      }
>>  
>>      vbasedev->device_state = state;
>> +    trace_vfio_migration_set_state(vbasedev->name, state);
>>      return 0;
>>  }
>>  
>> @@ -173,6 +175,8 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
>>          qemu_put_be64(f, data_size);
>>      }
>>  
>> +    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
>> +                           migration->pending_bytes);
>>      ret = qemu_file_get_error(f);
>>  
>>      return data_size;
>> @@ -195,6 +199,7 @@ static int vfio_update_pending(VFIODevice *vbasedev)
>>      }
>>  
>>      migration->pending_bytes = pending_bytes;
>> +    trace_vfio_update_pending(vbasedev->name, pending_bytes);
>>      return 0;
>>  }
>>  
>> @@ -209,6 +214,8 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
>>      }
>>      qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>  
>> +    trace_vfio_save_device_config_state(vbasedev->name);
>> +
>>      return qemu_file_get_error(f);
>>  }
>>  
>> @@ -225,6 +232,7 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>>          return -EINVAL;
>>      }
>>  
>> +    trace_vfio_load_device_config_state(vbasedev->name);
>>      return qemu_file_get_error(f);
>>  }
>>  
>> @@ -343,6 +351,9 @@ void vfio_get_dirty_page_list(VFIODevice *vbasedev,
>>          }
>>      } while (count < pfn_count);
>>  
>> +    trace_vfio_get_dirty_page_list(vbasedev->name, start_pfn, pfn_count,
>> +                                   page_size);
>> +
>>  dpl_unlock:
>>      qemu_mutex_unlock(&migration->lock);
>>  }
>> @@ -390,6 +401,7 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>>          return ret;
>>      }
>>  
>> +    trace_vfio_save_setup(vbasedev->name);
>>      return 0;
>>  }
>>  
>> @@ -401,6 +413,7 @@ static void vfio_save_cleanup(void *opaque)
>>      if (migration->region.buffer.mmaps) {
>>          vfio_region_unmap(&migration->region.buffer);
>>      }
>> +    trace_vfio_cleanup(vbasedev->name);
>>  }
>>  
>>  static void vfio_save_pending(QEMUFile *f, void *opaque,
>> @@ -424,6 +437,7 @@ static void vfio_save_pending(QEMUFile *f, void *opaque,
>>          *res_postcopy_only += migration->pending_bytes;
>>      }
>>      *res_compatible += 0;
>> +    trace_vfio_save_pending(vbasedev->name);
>>  }
>>  
>>  static int vfio_save_iterate(QEMUFile *f, void *opaque)
>> @@ -451,6 +465,7 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>>          return ret;
>>      }
>>  
>> +    trace_vfio_save_iterate(vbasedev->name);
>>      return ret;
>>  }
>>  
>> @@ -504,6 +519,8 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>          error_report("Failed to set state STOPPED");
>>          return ret;
>>      }
>> +
>> +    trace_vfio_save_complete_precopy(vbasedev->name);
>>      return ret;
>>  }
>>  
>> @@ -544,6 +561,9 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>  
>>      data = qemu_get_be64(f);
>>      while (data != VFIO_MIG_FLAG_END_OF_STATE) {
>> +
>> +        trace_vfio_load_state(vbasedev->name, data);
>> +
>>          switch (data) {
>>          case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
>>          {
>> @@ -627,6 +647,8 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>                      return -EINVAL;
>>                  }
>>              }
>> +            trace_vfio_load_state_device_data(vbasedev->name, data_offset,
>> +                                              data_size);
>>              break;
>>          }
>>          }
>> @@ -668,6 +690,7 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state)
>>      }
>>  
>>      vbasedev->vm_running = running;
>> +    trace_vfio_vmstate_change(vbasedev->name, running);
>>  }
>>  
>>  static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>> @@ -676,6 +699,8 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>>      VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
>>      int ret;
>>  
>> +    trace_vfio_migration_state_notifier(vbasedev->name, s->state);
>> +
>>      switch (s->state) {
>>      case MIGRATION_STATUS_ACTIVE:
>>          if (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING) {
>> @@ -758,6 +783,7 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>>              return ret;
>>          }
>>      } else {
>> +        trace_vfio_migration_probe(vbasedev->name, info->index);
>>          return vfio_migration_init(vbasedev, info);
>>      }
>>  
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 8cdc27946cb8..b1f19ae7a806 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -143,3 +143,21 @@ vfio_display_edid_link_up(void) ""
>>  vfio_display_edid_link_down(void) ""
>>  vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
>>  vfio_display_edid_write_error(void) ""
>> +
>> +# migration.c
>> +vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
>> +vfio_save_buffer(char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s), Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
>> +vfio_update_pending(char *name, uint64_t pending) " (%s), pending 0x%"PRIx64
>> +vfio_save_device_config_state(char *name) " (%s)"
>> +vfio_load_device_config_state(char *name) " (%s)"
>> +vfio_get_dirty_page_list(char *name, uint64_t start, uint64_t pfn_count, uint64_t page_size) " (%s) start 0x%"PRIx64" pfn_count 0x%"PRIx64 " page size 0x%"PRIx64
>> +vfio_save_setup(char *name) " (%s)"
>> +vfio_cleanup(char *name) " (%s)"
>> +vfio_save_pending(char *name) " (%s)"
>> +vfio_save_iterate(char *name) " (%s)"
>> +vfio_save_complete_precopy(char *name) " (%s)"
>> +vfio_load_state(char *name, uint64_t data) " (%s) data 0x%"PRIx64
>> +vfio_load_state_device_data(char *name, uint64_t data_offset, uint64_t data_size) " (%s), Offset 0x%"PRIx64" size 0x%"PRIx64
>> +vfio_vmstate_change(char *name, int running) " (%s) running %d"
>> +vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
>> +vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
>> -- 
>> 2.7.0
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 



* Re: [Qemu-devel] [PATCH v4 08/13] vfio: Add save state functions to SaveVMHandlers
  2019-06-20 19:25   ` Alex Williamson
@ 2019-06-21  6:38     ` Kirti Wankhede
  2019-06-21 15:16       ` Alex Williamson
  0 siblings, 1 reply; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-21  6:38 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, yulei.zhang, eauger, felipe,
	jonathan.davies, yan.y.zhao, changpeng.liu, Ken.Xue



On 6/21/2019 12:55 AM, Alex Williamson wrote:
> On Thu, 20 Jun 2019 20:07:36 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
>> functions. These functions handle the pre-copy and stop-and-copy phases.
>>
>> In _SAVING|_RUNNING device state or pre-copy phase:
>> - read pending_bytes
>> - read data_offset - indicates kernel driver to write data to staging
>>   buffer which is mmapped.
> 
> Why is data_offset the trigger rather than data_size?  It seems that
> data_offset can't really change dynamically since it might be mmap'd,
> so it seems unnatural to bother re-reading it.
> 

The vendor driver can change data_offset; it can have a different
data_offset for device data and for the dirty pages bitmap.

>> - read data_size - amount of data in bytes written by vendor driver in migration
>>   region.
>> - if data section is trapped, pread() number of bytes in data_size, from
>>   data_offset.
>> - if data section is mmaped, read mmaped buffer of size data_size.
>> - Write data packet to file stream as below:
>> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
>> VFIO_MIG_FLAG_END_OF_STATE }
>>
>> In _SAVING device state or stop-and-copy phase
>> a. read config space of device and save to migration file stream. This
>>    doesn't need to be from vendor driver. Any other special config state
>>    from driver can be saved as data in following iteration.
>> b. read pending_bytes - indicates kernel driver to write data to staging
>>    buffer which is mmapped.
> 
> Is it pending_bytes or data_offset that triggers the write out of
> data?  Why pending_bytes vs data_size?  I was interpreting
> pending_bytes as the total data size while data_size is the size
> available to read now, so assumed data_size would be more closely
> aligned to making the data available.
> 

Sorry, that's my mistake while editing; it's read(data_offset), as in the
above case.

>> c. read data_size - amount of data in bytes written by vendor driver in
>>    migration region.
>> d. if data section is trapped, pread() from data_offset of size data_size.
>> e. if data section is mmaped, read mmaped buffer of size data_size.
> 
> Should this read as "pread() from data_offset of data_size, or
> optionally if mmap is supported on the data area, read data_size from
> start of mapped buffer"?  IOW, pread should always work.  Same in
> previous section.
> 

ok. I'll update.

>> f. Write data packet as below:
>>    {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
>> g. iterate through steps b to f until (pending_bytes > 0)
> 
> s/until/while/

Ok.

> 
>> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
>>
>> .save_live_iterate runs outside the iothread lock in the migration case, which
>> could race with asynchronous call to get dirty page list causing data corruption
>> in mapped migration region. Mutex added here to serial migration buffer read
>> operation.
> 
> Would we be ahead to use different offsets within the region for device
> data vs dirty bitmap to avoid this?
>

A lock will still be required to serialize the read/write operations on
the vfio_device_migration_info structure in the region.


>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  hw/vfio/migration.c | 212 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 212 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index fe0887c27664..0a2f30872316 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -107,6 +107,111 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
>>      return 0;
>>  }
>>  
>> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIORegion *region = &migration->region.buffer;
>> +    uint64_t data_offset = 0, data_size = 0;
>> +    int ret;
>> +
>> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             data_offset));
>> +    if (ret != sizeof(data_offset)) {
>> +        error_report("Failed to get migration buffer data offset %d",
>> +                     ret);
>> +        return -EINVAL;
>> +    }
>> +
>> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             data_size));
>> +    if (ret != sizeof(data_size)) {
>> +        error_report("Failed to get migration buffer data size %d",
>> +                     ret);
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (data_size > 0) {
>> +        void *buf = NULL;
>> +        bool buffer_mmaped = false;
>> +
>> +        if (region->mmaps) {
>> +            int i;
>> +
>> +            for (i = 0; i < region->nr_mmaps; i++) {
>> +                if ((data_offset >= region->mmaps[i].offset) &&
>> +                    (data_offset < region->mmaps[i].offset +
>> +                                   region->mmaps[i].size)) {
>> +                    buf = region->mmaps[i].mmap + (data_offset -
>> +                                                   region->mmaps[i].offset);
> 
> So you're expecting that data_offset is somewhere within the data
> area.  Why doesn't the data always simply start at the beginning of the
> data area?  ie. data_offset would coincide with the beginning of the
> mmap'able area (if supported) and be static.  Does this enable some
> functionality in the vendor driver?

Do you want to enforce that on the vendor driver?
From the feedback on the previous version I understood that the vendor
driver should define data_offset within the region:
"I'd suggest that the vendor driver expose a read-only
data_offset that matches a sparse mmap capability entry should the
driver support mmap.  The use should always read or write data from the
vendor defined data_offset"

This also gives the vendor driver flexibility: it can define different
data_offset values for device data and for the dirty page bitmap within
the same mmapped region, as illustrated below.
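
For illustration only (all offsets hypothetical), one migration region
could be laid out by a vendor driver as:

     ------------------------------------------------------------------
    | 0x0000: vfio_device_migration_info        (trapped)              |
    | 0x1000: device data                       (sparse mmap entry)    |
    | 0x9000: dirty page bitmap                 (sparse mmap entry)    |
     ------------------------------------------------------------------

data_offset would then read back 0x1000 while device data is being
saved and 0x9000 while the dirty page bitmap is being read out.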

>  Does resume data need to be
> written from the same offset where it's read?

No, resume data should be written at the data_offset that the vendor
driver provides during resume.
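
A rough sketch of the resume side under that reading (same illustrative
names as in the save sketch above, error handling elided; the
write-then-signal ordering is an assumption):

    /* while the device is in _RESUMING state, the vendor driver
     * exposes where it wants the next chunk */
    pread(fd, &data_offset, sizeof(data_offset),
          region_base + offsetof(struct vfio_device_migration_info,
                                 data_offset));
    /* write the chunk there, then tell the driver how much arrived */
    pwrite(fd, buf, data_size, region_base + data_offset);
    pwrite(fd, &data_size, sizeof(data_size),
           region_base + offsetof(struct vfio_device_migration_info,
                                  data_size));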

> 
>> +                    buffer_mmaped = true;
>> +                    break;
>> +                }
>> +            }
>> +        }
>> +
>> +        if (!buffer_mmaped) {
>> +            buf = g_malloc0(data_size);
>> +            ret = pread(vbasedev->fd, buf, data_size,
>> +                        region->fd_offset + data_offset);
>> +            if (ret != data_size) {
>> +                error_report("Failed to get migration data %d", ret);
>> +                g_free(buf);
>> +                return -EINVAL;
>> +            }
>> +        }
>> +
>> +        qemu_put_be64(f, data_size);
>> +        qemu_put_buffer(f, buf, data_size);
>> +
>> +        if (!buffer_mmaped) {
>> +            g_free(buf);
>> +        }
>> +        migration->pending_bytes -= data_size;
>> +    } else {
>> +        qemu_put_be64(f, data_size);
>> +    }
>> +
>> +    ret = qemu_file_get_error(f);
>> +
>> +    return data_size;
>> +}
>> +
>> +static int vfio_update_pending(VFIODevice *vbasedev)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIORegion *region = &migration->region.buffer;
>> +    uint64_t pending_bytes = 0;
>> +    int ret;
>> +
>> +    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             pending_bytes));
> 
> Did this trigger the vendor driver to write out to the data area when
> we don't need it to?
> 

No, as I mentioned above, I'll update the description.

Thanks,
Kirti

>> +    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
>> +        error_report("Failed to get pending bytes %d", ret);
>> +        migration->pending_bytes = 0;
>> +        return (ret < 0) ? ret : -EINVAL;
>> +    }
>> +
>> +    migration->pending_bytes = pending_bytes;
>> +    return 0;
>> +}
>> +
>> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
>> +
>> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
>> +        vfio_pci_save_config(vbasedev, f);
>> +    }
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    return qemu_file_get_error(f);
>> +}
>> +
>>  /* ---------------------------------------------------------------------- */
>>  
>>  static int vfio_save_setup(QEMUFile *f, void *opaque)
>> @@ -163,9 +268,116 @@ static void vfio_save_cleanup(void *opaque)
>>      }
>>  }
>>  
>> +static void vfio_save_pending(QEMUFile *f, void *opaque,
>> +                              uint64_t threshold_size,
>> +                              uint64_t *res_precopy_only,
>> +                              uint64_t *res_compatible,
>> +                              uint64_t *res_postcopy_only)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret;
>> +
>> +    ret = vfio_update_pending(vbasedev);
>> +    if (ret) {
>> +        return;
>> +    }
>> +
>> +    if (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING) {
>> +        *res_precopy_only += migration->pending_bytes;
>> +    } else {
>> +        *res_postcopy_only += migration->pending_bytes;
>> +    }
>> +    *res_compatible += 0;
>> +}
>> +
>> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret;
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
>> +
>> +    qemu_mutex_lock(&migration->lock);
>> +    ret = vfio_save_buffer(f, vbasedev);
>> +    qemu_mutex_unlock(&migration->lock);
>> +
>> +    if (ret < 0) {
>> +        error_report("vfio_save_buffer failed %s",
>> +                     strerror(errno));
>> +        return ret;
>> +    }
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret;
>> +
>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_SAVING);
>> +    if (ret) {
>> +        error_report("Failed to set state STOP and SAVING");
>> +        return ret;
>> +    }
>> +
>> +    ret = vfio_save_device_config_state(f, opaque);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    ret = vfio_update_pending(vbasedev);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    while (migration->pending_bytes > 0) {
>> +        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
>> +        ret = vfio_save_buffer(f, vbasedev);
>> +        if (ret < 0) {
>> +            error_report("Failed to save buffer");
>> +            return ret;
>> +        } else if (ret == 0) {
>> +            break;
>> +        }
>> +
>> +        ret = vfio_update_pending(vbasedev);
>> +        if (ret) {
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOPPED);
>> +    if (ret) {
>> +        error_report("Failed to set state STOPPED");
>> +        return ret;
>> +    }
>> +    return ret;
>> +}
>> +
>>  static SaveVMHandlers savevm_vfio_handlers = {
>>      .save_setup = vfio_save_setup,
>>      .save_cleanup = vfio_save_cleanup,
>> +    .save_live_pending = vfio_save_pending,
>> +    .save_live_iterate = vfio_save_iterate,
>> +    .save_live_complete_precopy = vfio_save_complete_precopy,
>>  };
>>  
>>  /* ---------------------------------------------------------------------- */
> 


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH v4 03/13] vfio: Add save and load functions for VFIO PCI devices
  2019-06-21  0:12   ` Yan Zhao
@ 2019-06-21  6:44     ` Kirti Wankhede
  2019-06-21  7:50       ` Yan Zhao
  0 siblings, 1 reply; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-21  6:44 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang,  Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue



On 6/21/2019 5:42 AM, Yan Zhao wrote:
> On Thu, Jun 20, 2019 at 10:37:31PM +0800, Kirti Wankhede wrote:
>> These functions save and restore PCI device specific data - config
>> space of PCI device.
>> Tested save and restore with MSI and MSIX type.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  hw/vfio/pci.c | 112 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  hw/vfio/pci.h |  29 +++++++++++++++
>>  2 files changed, 141 insertions(+)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index ce3fe96efe2c..09a0821a5b1c 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -1187,6 +1187,118 @@ void vfio_pci_write_config(PCIDevice *pdev,
>>      }
>>  }
>>  
>> +void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
>> +{
>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    uint16_t pci_cmd;
>> +    int i;
>> +
>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
>> +        uint32_t bar;
>> +
>> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
>> +        qemu_put_be32(f, bar);
>> +    }
>> +
>> +    qemu_put_be32(f, vdev->interrupt);
>> +    if (vdev->interrupt == VFIO_INT_MSI) {
>> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
>> +        bool msi_64bit;
>> +
>> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>> +                                            2);
>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
>> +
>> +        msi_addr_lo = pci_default_read_config(pdev,
>> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
>> +        qemu_put_be32(f, msi_addr_lo);
>> +
>> +        if (msi_64bit) {
>> +            msi_addr_hi = pci_default_read_config(pdev,
>> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
>> +                                             4);
>> +        }
>> +        qemu_put_be32(f, msi_addr_hi);
>> +
>> +        msi_data = pci_default_read_config(pdev,
>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
>> +                2);
>> +        qemu_put_be32(f, msi_data);
>> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
>> +        uint16_t offset;
>> +
>> +        /* save enable bit and maskall bit */
>> +        offset = pci_default_read_config(pdev,
>> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
>> +        qemu_put_be16(f, offset);
>> +        msix_save(pdev, f);
>> +    }
>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
>> +    qemu_put_be16(f, pci_cmd);
>> +}
>> +
>> +void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
>> +{
>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    uint32_t interrupt_type;
>> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
>> +    uint16_t pci_cmd;
>> +    bool msi_64bit;
>> +    int i;
>> +
>> +    /* retore pci bar configuration */
>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
>> +    vfio_pci_write_config(pdev, PCI_COMMAND,
>> +                        pci_cmd & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
>> +        uint32_t bar = qemu_get_be32(f);
>> +
>> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
>> +    }
>> +    vfio_pci_write_config(pdev, PCI_COMMAND,
>> +                          pci_cmd | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
>> +
>> +    interrupt_type = qemu_get_be32(f);
>> +
>> +    if (interrupt_type == VFIO_INT_MSI) {
>> +        /* restore msi configuration */
>> +        msi_flags = pci_default_read_config(pdev,
>> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
>> +
>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>> +                              msi_flags & (!PCI_MSI_FLAGS_ENABLE), 2);
>> +
>> +        msi_addr_lo = qemu_get_be32(f);
>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
>> +                              msi_addr_lo, 4);
>> +
>> +        msi_addr_hi = qemu_get_be32(f);
>> +        if (msi_64bit) {
>> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
>> +                                  msi_addr_hi, 4);
>> +        }
>> +        msi_data = qemu_get_be32(f);
>> +        vfio_pci_write_config(pdev,
>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
>> +                msi_data, 2);
>> +
>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
>> +    } else if (interrupt_type == VFIO_INT_MSIX) {
>> +        uint16_t offset = qemu_get_be16(f);
>> +
>> +        /* load enable bit and maskall bit */
>> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
>> +                              offset, 2);
>> +        msix_load(pdev, f);
>> +    }
>> +    pci_cmd = qemu_get_be16(f);
>> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
>> +}
>> +
> per the previous discussion, pci config state save/restore are better
> defined in fields of VMStateDescription.
> 
> 

With that route there is no pre-copy phase, and we do want a pre-copy
phase for VFIO devices. A vendor driver can skip the pre-copy phase by
doing nothing on any read/write operation during that phase.
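
For instance, a vendor driver with nothing useful to send while vCPUs
run could behave roughly like this when userspace reads pending_bytes
(vendor-side pseudo-C, names illustrative):

    if ((device_state & VFIO_DEVICE_STATE_SAVING) &&
        (device_state & VFIO_DEVICE_STATE_RUNNING)) {
        /* pre-copy: report nothing pending, QEMU converges at once */
        info->pending_bytes = 0;
    } else if (device_state & VFIO_DEVICE_STATE_SAVING) {
        /* stop-and-copy: vCPUs stopped, report the full state size */
        info->pending_bytes = total_device_state_size;
    }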

Thanks,
Kirti

>>  /*
>>   * Interrupt setup
>>   */
>> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
>> index 834a90d64686..847be5f56478 100644
>> --- a/hw/vfio/pci.h
>> +++ b/hw/vfio/pci.h
>> @@ -19,6 +19,7 @@
>>  #include "qemu/queue.h"
>>  #include "qemu/timer.h"
>>  
>> +#ifdef CONFIG_LINUX
>>  #define PCI_ANY_ID (~0)
>>  
>>  struct VFIOPCIDevice;
>> @@ -202,4 +203,32 @@ void vfio_display_reset(VFIOPCIDevice *vdev);
>>  int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
>>  void vfio_display_finalize(VFIOPCIDevice *vdev);
>>  
>> +void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f);
>> +void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f);
>> +
>> +static inline Object *vfio_pci_get_object(VFIODevice *vbasedev)
>> +{
>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>> +
>> +    return OBJECT(vdev);
>> +}
>> +
>> +#else
>> +static inline void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
>> +{
>> +    g_assert(false);
>> +}
>> +
>> +static inline void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
>> +{
>> +    g_assert(false);
>> +}
>> +
>> +static inline Object *vfio_pci_get_object(VFIODevice *vbasedev)
>> +{
>> +    return NULL;
>> +}
>> +
>> +#endif
>> +
>>  #endif /* HW_VFIO_VFIO_PCI_H */
>> -- 
>> 2.7.0
>>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH v4 03/13] vfio: Add save and load functions for VFIO PCI devices
  2019-06-21  6:44     ` Kirti Wankhede
@ 2019-06-21  7:50       ` Yan Zhao
  0 siblings, 0 replies; 64+ messages in thread
From: Yan Zhao @ 2019-06-21  7:50 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Fri, Jun 21, 2019 at 02:44:30PM +0800, Kirti Wankhede wrote:
> 
> 
> On 6/21/2019 5:42 AM, Yan Zhao wrote:
> > On Thu, Jun 20, 2019 at 10:37:31PM +0800, Kirti Wankhede wrote:
> >> These functions save and restore PCI device specific data - config
> >> space of PCI device.
> >> Tested save and restore with MSI and MSIX type.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>  hw/vfio/pci.c | 112 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>  hw/vfio/pci.h |  29 +++++++++++++++
> >>  2 files changed, 141 insertions(+)
> >>
> >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >> index ce3fe96efe2c..09a0821a5b1c 100644
> >> --- a/hw/vfio/pci.c
> >> +++ b/hw/vfio/pci.c
> >> @@ -1187,6 +1187,118 @@ void vfio_pci_write_config(PCIDevice *pdev,
> >>      }
> >>  }
> >>  
> >> +void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> >> +{
> >> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> >> +    PCIDevice *pdev = &vdev->pdev;
> >> +    uint16_t pci_cmd;
> >> +    int i;
> >> +
> >> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> >> +        uint32_t bar;
> >> +
> >> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> >> +        qemu_put_be32(f, bar);
> >> +    }
> >> +
> >> +    qemu_put_be32(f, vdev->interrupt);
> >> +    if (vdev->interrupt == VFIO_INT_MSI) {
> >> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> >> +        bool msi_64bit;
> >> +
> >> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> >> +                                            2);
> >> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> >> +
> >> +        msi_addr_lo = pci_default_read_config(pdev,
> >> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> >> +        qemu_put_be32(f, msi_addr_lo);
> >> +
> >> +        if (msi_64bit) {
> >> +            msi_addr_hi = pci_default_read_config(pdev,
> >> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> >> +                                             4);
> >> +        }
> >> +        qemu_put_be32(f, msi_addr_hi);
> >> +
> >> +        msi_data = pci_default_read_config(pdev,
> >> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> >> +                2);
> >> +        qemu_put_be32(f, msi_data);
> >> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
> >> +        uint16_t offset;
> >> +
> >> +        /* save enable bit and maskall bit */
> >> +        offset = pci_default_read_config(pdev,
> >> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> >> +        qemu_put_be16(f, offset);
> >> +        msix_save(pdev, f);
> >> +    }
> >> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> >> +    qemu_put_be16(f, pci_cmd);
> >> +}
> >> +
> >> +void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> >> +{
> >> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> >> +    PCIDevice *pdev = &vdev->pdev;
> >> +    uint32_t interrupt_type;
> >> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> >> +    uint16_t pci_cmd;
> >> +    bool msi_64bit;
> >> +    int i;
> >> +
> >> +    /* retore pci bar configuration */
> >> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> >> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> >> +                        pci_cmd & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
> >> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> >> +        uint32_t bar = qemu_get_be32(f);
> >> +
> >> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> >> +    }
> >> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> >> +                          pci_cmd | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
> >> +
> >> +    interrupt_type = qemu_get_be32(f);
> >> +
> >> +    if (interrupt_type == VFIO_INT_MSI) {
> >> +        /* restore msi configuration */
> >> +        msi_flags = pci_default_read_config(pdev,
> >> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> >> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> >> +
> >> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> >> +                              msi_flags & (!PCI_MSI_FLAGS_ENABLE), 2);
> >> +
> >> +        msi_addr_lo = qemu_get_be32(f);
> >> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
> >> +                              msi_addr_lo, 4);
> >> +
> >> +        msi_addr_hi = qemu_get_be32(f);
> >> +        if (msi_64bit) {
> >> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> >> +                                  msi_addr_hi, 4);
> >> +        }
> >> +        msi_data = qemu_get_be32(f);
> >> +        vfio_pci_write_config(pdev,
> >> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> >> +                msi_data, 2);
> >> +
> >> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> >> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
> >> +    } else if (interrupt_type == VFIO_INT_MSIX) {
> >> +        uint16_t offset = qemu_get_be16(f);
> >> +
> >> +        /* load enable bit and maskall bit */
> >> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> >> +                              offset, 2);
> >> +        msix_load(pdev, f);
> >> +    }
> >> +    pci_cmd = qemu_get_be16(f);
> >> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
> >> +}
> >> +
> > per the previous discussion, pci config state save/restore are better
> > defined in fields of VMStateDescription.
> > 
> > 
> 
> With that route there is no pre-copy phase, and we do want a pre-copy
> phase for VFIO devices. A vendor driver can skip the pre-copy phase by
> doing nothing on any read/write operation during that phase.
> 
> Thanks,
> Kirti
>
hi Kirti
It is possible to register both a VMStateDescription and SaveVMHandlers
at the same time.
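
Something along these lines, perhaps (a sketch only; vfio_dummy and
its pre_save hook are illustrative names, and the empty field list is
a placeholder, none of it from this series):

    static int vfio_dummy_pre_save(void *opaque)
    {
        /* capture state that must be saved after the device stops */
        return 0;
    }

    static const VMStateDescription vmstate_vfio_dummy = {
        .name = "vfio-dummy",
        .version_id = 1,
        .minimum_version_id = 1,
        .pre_save = vfio_dummy_pre_save,
        .fields = (VMStateField[]) {
            VMSTATE_END_OF_LIST()
        },
    };

    /* in the device's migration setup, alongside the existing
     * register_savevm_live() call for savevm_vfio_handlers */
    vmstate_register(NULL, -1, &vmstate_vfio_dummy, vbasedev);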

Thanks
Yan

> >>  /*
> >>   * Interrupt setup
> >>   */
> >> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> >> index 834a90d64686..847be5f56478 100644
> >> --- a/hw/vfio/pci.h
> >> +++ b/hw/vfio/pci.h
> >> @@ -19,6 +19,7 @@
> >>  #include "qemu/queue.h"
> >>  #include "qemu/timer.h"
> >>  
> >> +#ifdef CONFIG_LINUX
> >>  #define PCI_ANY_ID (~0)
> >>  
> >>  struct VFIOPCIDevice;
> >> @@ -202,4 +203,32 @@ void vfio_display_reset(VFIOPCIDevice *vdev);
> >>  int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
> >>  void vfio_display_finalize(VFIOPCIDevice *vdev);
> >>  
> >> +void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f);
> >> +void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f);
> >> +
> >> +static inline Object *vfio_pci_get_object(VFIODevice *vbasedev)
> >> +{
> >> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> >> +
> >> +    return OBJECT(vdev);
> >> +}
> >> +
> >> +#else
> >> +static inline void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> >> +{
> >> +    g_assert(false);
> >> +}
> >> +
> >> +static inline void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> >> +{
> >> +    g_assert(false);
> >> +}
> >> +
> >> +static inline Object *vfio_pci_get_object(VFIODevice *vbasedev)
> >> +{
> >> +    return NULL;
> >> +}
> >> +
> >> +#endif
> >> +
> >>  #endif /* HW_VFIO_VFIO_PCI_H */
> >> -- 
> >> 2.7.0
> >>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device
  2019-06-21  1:24   ` Yan Zhao
@ 2019-06-21  8:02     ` Kirti Wankhede
  2019-06-21  8:46       ` Yan Zhao
  0 siblings, 1 reply; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-21  8:02 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang,  Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue



On 6/21/2019 6:54 AM, Yan Zhao wrote:
> On Fri, Jun 21, 2019 at 08:25:18AM +0800, Yan Zhao wrote:
>> On Thu, Jun 20, 2019 at 10:37:28PM +0800, Kirti Wankhede wrote:
>>> Add migration support for VFIO device
>>>
>>> This Patch set include patches as below:
>>> - Define KABI for VFIO device for migration support.
>>> - Added save and restore functions for PCI configuration space
>>> - Generic migration functionality for VFIO device.
>>>   * This patch set adds functionality only for PCI devices, but can be
>>>     extended to other VFIO devices.
>>>   * Added all the basic functions required for pre-copy, stop-and-copy and
>>>     resume phases of migration.
>>>   * Added state change notifier and from that notifier function, VFIO
>>>     device's state changed is conveyed to VFIO device driver.
>>>   * During save setup phase and resume/load setup phase, migration region
>>>     is queried and is used to read/write VFIO device data.
>>>   * .save_live_pending and .save_live_iterate are implemented to use QEMU's
>>>     functionality of iteration during pre-copy phase.
>>>   * In .save_live_complete_precopy, that is in stop-and-copy phase,
>>>     iteration to read data from VFIO device driver is implemented till pending
>>>     bytes returned by driver are not zero.
>>>   * Added function to get dirty pages bitmap for the pages which are used by
>>>     driver.
>>> - Add vfio_listerner_log_sync to mark dirty pages.
>>> - Make VFIO PCI device migration capable. If migration region is not provided by
>>>   driver, migration is blocked.
>>>
>>> Below is the flow of state change for live migration where states in brackets
>>> represent VM state, migration state and VFIO device state as:
>>>     (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)
>>>
>>> Live migration save path:
>>>         QEMU normal running state
>>>         (RUNNING, _NONE, _RUNNING)
>>>                         |
>>>     migrate_init spawns migration_thread.
>>>     (RUNNING, _SETUP, _RUNNING|_SAVING)
>>>     Migration thread then calls each device's .save_setup()
>>>                         |
>>>     (RUNNING, _ACTIVE, _RUNNING|_SAVING)
>>>     If device is active, get pending bytes by .save_live_pending()
>>>     if pending bytes >= threshold_size,  call save_live_iterate()
>>>     Data of VFIO device for pre-copy phase is copied.
>>>     Iterate till pending bytes converge and are less than threshold
>>>                         |
>>>     On migration completion, vCPUs stops and calls .save_live_complete_precopy
>>>     for each active device. VFIO device is then transitioned in
>>>      _SAVING state.
>>>     (FINISH_MIGRATE, _DEVICE, _SAVING)
>>>     For VFIO device, iterate in  .save_live_complete_precopy  until
>>>     pending data is 0.
>>>     (FINISH_MIGRATE, _DEVICE, _STOPPED)
>>
>> I suggest we also register to VMStateDescription, whose .pre_save
>> handler would get called after .save_live_complete_precopy in pre-copy
>> only case, and will be called before .save_live_iterate in post-copy
>> enabled case.
>> In the .pre_save handler, we can save all device state which must be
>> copied after device stop in source vm and before device start in target vm.
>>
> hi
> to better describe this idea:
> 
> in pre-copy only case, the flow is
> 
> start migration --> .save_live_iterate (several round) -> stop source vm
> --> .save_live_complete_precopy --> .pre_save  -->start target vm
> -->migration complete
> 
> 
> in post-copy enabled case, the flow is
> 
> start migration --> .save_live_iterate (several round) --> start post copy --> 
> stop source vm --> .pre_save --> start target vm --> .save_live_iterate (several round) 
> -->migration complete
> 
> Therefore, we should put saving of device state in .pre_save interface
> rather than in .save_live_complete_precopy. 
> The device state includes pci config data, page tables, register state, etc.
> 
> The .save_live_iterate and .save_live_complete_precopy should only deal
> with saving dirty memory.
> 

The vendor driver can decide when to save device state depending on the
VFIO device state set by the user; it doesn't have to depend on which
callback function QEMU or the user application calls. In the pre-copy
case, save_live_complete_precopy sets the VFIO device state to
VFIO_DEVICE_STATE_SAVING, which means vCPUs are stopped and the vendor
driver should save all device state.
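
In other words (vendor-side sketch, illustrative only), the driver keys
off the state written through device_state rather than off which QEMU
callback ran:

    switch (info->device_state) {
    case VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING:
        /* pre-copy: vCPUs running, stage data that may be re-sent */
        break;
    case VFIO_DEVICE_STATE_SAVING:
        /* stop-and-copy: vCPUs stopped, stage all device state */
        break;
    case VFIO_DEVICE_STATE_RESUMING:
        /* accept incoming device state written by userspace */
        break;
    }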

> 
> I know current implementation does not support post-copy. but at least
> it should not require huge change when we decide to enable it in future.
> 

.has_postcopy and .save_live_complete_postcopy need to be implemented to
support post-copy. I think .save_live_complete_postcopy should be
similar to vfio_save_complete_precopy.
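
If we do that, the hookup could look roughly like this (sketch only;
vfio_has_postcopy and vfio_save_complete_postcopy are hypothetical
functions, not in this series):

    static SaveVMHandlers savevm_vfio_handlers = {
        .save_setup                  = vfio_save_setup,
        .save_cleanup                = vfio_save_cleanup,
        .save_live_pending           = vfio_save_pending,
        .save_live_iterate           = vfio_save_iterate,
        .save_live_complete_precopy  = vfio_save_complete_precopy,
        .has_postcopy                = vfio_has_postcopy,
        .save_live_complete_postcopy = vfio_save_complete_postcopy,
    };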

Thanks,
Kirti

> Thanks
> Yan
> 
>>>                         |
>>>     (FINISH_MIGRATE, _COMPLETED, STOPPED)
>>>     Migraton thread schedule cleanup bottom half and exit
>>>
>>> Live migration resume path:
>>>     Incomming migration calls .load_setup for each device
>>>     (RESTORE_VM, _ACTIVE, STOPPED)
>>>                         |
>>>     For each device, .load_state is called for that device section data
>>>                         |
>>>     At the end, called .load_cleanup for each device and vCPUs are started.
>>>                         |
>>>         (RUNNING, _NONE, _RUNNING)
>>>
>>> Note that:
>>> - Migration post copy is not supported.
>>>
>>> v3 -> v4:
>>> - Added one more bit for _RESUMING flag to be set explicitly.
>>> - data_offset field is read-only for user space application.
>>> - data_size is read for every iteration before reading data from migration, that
>>>   is removed assumption that data will be till end of migration region.
>>> - If vendor driver supports mappable sparsed region, map those region during
>>>   setup state of save/load, similarly unmap those from cleanup routines.
>>> - Handles race condition that causes data corruption in migration region during
>>>   save device state by adding mutex and serialiaing save_buffer and
>>>   get_dirty_pages routines.
>>> - Skip called get_dirty_pages routine for mapped MMIO region of device.
>>> - Added trace events.
>>> - Splitted into multiple functional patches.
>>>
>>> v2 -> v3:
>>> - Removed enum of VFIO device states. Defined VFIO device state with 2 bits.
>>> - Re-structured vfio_device_migration_info to keep it minimal and defined action
>>>   on read and write access on its members.
>>>
>>> v1 -> v2:
>>> - Defined MIGRATION region type and sub-type which should be used with region
>>>   type capability.
>>> - Re-structured vfio_device_migration_info. This structure will be placed at 0th
>>>   offset of migration region.
>>> - Replaced ioctl with read/write for trapped part of migration region.
>>> - Added both type of access support, trapped or mmapped, for data section of the
>>>   region.
>>> - Moved PCI device functions to pci file.
>>> - Added iteration to get dirty page bitmap until bitmap for all requested pages
>>>   are copied.
>>>
>>> Thanks,
>>> Kirti
>>>
>>>
>>> Kirti Wankhede (13):
>>>   vfio: KABI for migration interface
>>>   vfio: Add function to unmap VFIO region
>>>   vfio: Add save and load functions for VFIO PCI devices
>>>   vfio: Add migration region initialization and finalize function
>>>   vfio: Add VM state change handler to know state of VM
>>>   vfio: Add migration state change notifier
>>>   vfio: Register SaveVMHandlers for VFIO device
>>>   vfio: Add save state functions to SaveVMHandlers
>>>   vfio: Add load state functions to SaveVMHandlers
>>>   vfio: Add function to get dirty page list
>>>   vfio: Add vfio_listerner_log_sync to mark dirty pages
>>>   vfio: Make vfio-pci device migration capable.
>>>   vfio: Add trace events in migration code path
>>>
>>>  hw/vfio/Makefile.objs         |   2 +-
>>>  hw/vfio/common.c              |  55 +++
>>>  hw/vfio/migration.c           | 815 ++++++++++++++++++++++++++++++++++++++++++
>>>  hw/vfio/pci.c                 | 126 ++++++-
>>>  hw/vfio/pci.h                 |  29 ++
>>>  hw/vfio/trace-events          |  19 +
>>>  include/hw/vfio/vfio-common.h |  22 ++
>>>  linux-headers/linux/vfio.h    |  71 ++++
>>>  8 files changed, 1132 insertions(+), 7 deletions(-)
>>>  create mode 100644 hw/vfio/migration.c
>>>
>>> -- 
>>> 2.7.0
>>>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device
  2019-06-21  8:02     ` Kirti Wankhede
@ 2019-06-21  8:46       ` Yan Zhao
  2019-06-21  9:22         ` Kirti Wankhede
  0 siblings, 1 reply; 64+ messages in thread
From: Yan Zhao @ 2019-06-21  8:46 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Fri, Jun 21, 2019 at 04:02:50PM +0800, Kirti Wankhede wrote:
> 
> 
> On 6/21/2019 6:54 AM, Yan Zhao wrote:
> > On Fri, Jun 21, 2019 at 08:25:18AM +0800, Yan Zhao wrote:
> >> On Thu, Jun 20, 2019 at 10:37:28PM +0800, Kirti Wankhede wrote:
> >>> Add migration support for VFIO device
> >>>
> >>> This Patch set include patches as below:
> >>> - Define KABI for VFIO device for migration support.
> >>> - Added save and restore functions for PCI configuration space
> >>> - Generic migration functionality for VFIO device.
> >>>   * This patch set adds functionality only for PCI devices, but can be
> >>>     extended to other VFIO devices.
> >>>   * Added all the basic functions required for pre-copy, stop-and-copy and
> >>>     resume phases of migration.
> >>>   * Added state change notifier and from that notifier function, VFIO
> >>>     device's state changed is conveyed to VFIO device driver.
> >>>   * During save setup phase and resume/load setup phase, migration region
> >>>     is queried and is used to read/write VFIO device data.
> >>>   * .save_live_pending and .save_live_iterate are implemented to use QEMU's
> >>>     functionality of iteration during pre-copy phase.
> >>>   * In .save_live_complete_precopy, that is in stop-and-copy phase,
> >>>     iteration to read data from VFIO device driver is implemented till pending
> >>>     bytes returned by driver are not zero.
> >>>   * Added function to get dirty pages bitmap for the pages which are used by
> >>>     driver.
> >>> - Add vfio_listerner_log_sync to mark dirty pages.
> >>> - Make VFIO PCI device migration capable. If migration region is not provided by
> >>>   driver, migration is blocked.
> >>>
> >>> Below is the flow of state change for live migration where states in brackets
> >>> represent VM state, migration state and VFIO device state as:
> >>>     (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)
> >>>
> >>> Live migration save path:
> >>>         QEMU normal running state
> >>>         (RUNNING, _NONE, _RUNNING)
> >>>                         |
> >>>     migrate_init spawns migration_thread.
> >>>     (RUNNING, _SETUP, _RUNNING|_SAVING)
> >>>     Migration thread then calls each device's .save_setup()
> >>>                         |
> >>>     (RUNNING, _ACTIVE, _RUNNING|_SAVING)
> >>>     If device is active, get pending bytes by .save_live_pending()
> >>>     if pending bytes >= threshold_size,  call save_live_iterate()
> >>>     Data of VFIO device for pre-copy phase is copied.
> >>>     Iterate till pending bytes converge and are less than threshold
> >>>                         |
> >>>     On migration completion, vCPUs stops and calls .save_live_complete_precopy
> >>>     for each active device. VFIO device is then transitioned in
> >>>      _SAVING state.
> >>>     (FINISH_MIGRATE, _DEVICE, _SAVING)
> >>>     For VFIO device, iterate in  .save_live_complete_precopy  until
> >>>     pending data is 0.
> >>>     (FINISH_MIGRATE, _DEVICE, _STOPPED)
> >>
> >> I suggest we also register to VMStateDescription, whose .pre_save
> >> handler would get called after .save_live_complete_precopy in pre-copy
> >> only case, and will be called before .save_live_iterate in post-copy
> >> enabled case.
> >> In the .pre_save handler, we can save all device state which must be
> >> copied after device stop in source vm and before device start in target vm.
> >>
> > hi
> > to better describe this idea:
> > 
> > in pre-copy only case, the flow is
> > 
> > start migration --> .save_live_iterate (several round) -> stop source vm
> > --> .save_live_complete_precopy --> .pre_save  -->start target vm
> > -->migration complete
> > 
> > 
> > in post-copy enabled case, the flow is
> > 
> > start migration --> .save_live_iterate (several round) --> start post copy --> 
> > stop source vm --> .pre_save --> start target vm --> .save_live_iterate (several round) 
> > -->migration complete
> > 
> > Therefore, we should put saving of device state in .pre_save interface
> > rather than in .save_live_complete_precopy. 
> > The device state includes pci config data, page tables, register state, etc.
> > 
> > The .save_live_iterate and .save_live_complete_precopy should only deal
> > with saving dirty memory.
> > 
> 
> >> The vendor driver can decide when to save device state depending on the
> >> VFIO device state set by the user; it doesn't have to depend on which
> >> callback function QEMU or the user application calls. In the pre-copy
> >> case, save_live_complete_precopy sets the VFIO device state to
> >> VFIO_DEVICE_STATE_SAVING, which means vCPUs are stopped and the vendor
> >> driver should save all device state.
>
When post-copy stops the vCPUs and the VFIO device, the vendor driver
only needs to provide device state. But how does the vendor driver know
that, if no extra interface or no extra device state is provided?

> > 
> > I know current implementation does not support post-copy. but at least
> > it should not require huge change when we decide to enable it in future.
> > 
> 
> .has_postcopy and .save_live_complete_postcopy need to be implemented to
> support post-copy. I think .save_live_complete_postcopy should be
> similar to vfio_save_complete_precopy.
> 
> Thanks,
> Kirti
> 
> > Thanks
> > Yan
> > 
> >>>                         |
> >>>     (FINISH_MIGRATE, _COMPLETED, STOPPED)
> >>>     Migraton thread schedule cleanup bottom half and exit
> >>>
> >>> Live migration resume path:
> >>>     Incomming migration calls .load_setup for each device
> >>>     (RESTORE_VM, _ACTIVE, STOPPED)
> >>>                         |
> >>>     For each device, .load_state is called for that device section data
> >>>                         |
> >>>     At the end, called .load_cleanup for each device and vCPUs are started.
> >>>                         |
> >>>         (RUNNING, _NONE, _RUNNING)
> >>>
> >>> Note that:
> >>> - Migration post copy is not supported.
> >>>
> >>> v3 -> v4:
> >>> - Added one more bit for _RESUMING flag to be set explicitly.
> >>> - data_offset field is read-only for user space application.
> >>> - data_size is read for every iteration before reading data from migration, that
> >>>   is removed assumption that data will be till end of migration region.
> >>> - If vendor driver supports mappable sparsed region, map those region during
> >>>   setup state of save/load, similarly unmap those from cleanup routines.
> >>> - Handles race condition that causes data corruption in migration region during
> >>>   save device state by adding mutex and serialiaing save_buffer and
> >>>   get_dirty_pages routines.
> >>> - Skip called get_dirty_pages routine for mapped MMIO region of device.
> >>> - Added trace events.
> >>> - Splitted into multiple functional patches.
> >>>
> >>> v2 -> v3:
> >>> - Removed enum of VFIO device states. Defined VFIO device state with 2 bits.
> >>> - Re-structured vfio_device_migration_info to keep it minimal and defined action
> >>>   on read and write access on its members.
> >>>
> >>> v1 -> v2:
> >>> - Defined MIGRATION region type and sub-type which should be used with region
> >>>   type capability.
> >>> - Re-structured vfio_device_migration_info. This structure will be placed at 0th
> >>>   offset of migration region.
> >>> - Replaced ioctl with read/write for trapped part of migration region.
> >>> - Added both type of access support, trapped or mmapped, for data section of the
> >>>   region.
> >>> - Moved PCI device functions to pci file.
> >>> - Added iteration to get dirty page bitmap until bitmap for all requested pages
> >>>   are copied.
> >>>
> >>> Thanks,
> >>> Kirti
> >>>
> >>>
> >>> Kirti Wankhede (13):
> >>>   vfio: KABI for migration interface
> >>>   vfio: Add function to unmap VFIO region
> >>>   vfio: Add save and load functions for VFIO PCI devices
> >>>   vfio: Add migration region initialization and finalize function
> >>>   vfio: Add VM state change handler to know state of VM
> >>>   vfio: Add migration state change notifier
> >>>   vfio: Register SaveVMHandlers for VFIO device
> >>>   vfio: Add save state functions to SaveVMHandlers
> >>>   vfio: Add load state functions to SaveVMHandlers
> >>>   vfio: Add function to get dirty page list
> >>>   vfio: Add vfio_listerner_log_sync to mark dirty pages
> >>>   vfio: Make vfio-pci device migration capable.
> >>>   vfio: Add trace events in migration code path
> >>>
> >>>  hw/vfio/Makefile.objs         |   2 +-
> >>>  hw/vfio/common.c              |  55 +++
> >>>  hw/vfio/migration.c           | 815 ++++++++++++++++++++++++++++++++++++++++++
> >>>  hw/vfio/pci.c                 | 126 ++++++-
> >>>  hw/vfio/pci.h                 |  29 ++
> >>>  hw/vfio/trace-events          |  19 +
> >>>  include/hw/vfio/vfio-common.h |  22 ++
> >>>  linux-headers/linux/vfio.h    |  71 ++++
> >>>  8 files changed, 1132 insertions(+), 7 deletions(-)
> >>>  create mode 100644 hw/vfio/migration.c
> >>>
> >>> -- 
> >>> 2.7.0
> >>>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device
  2019-06-21  8:46       ` Yan Zhao
@ 2019-06-21  9:22         ` Kirti Wankhede
  2019-06-21 10:45           ` Yan Zhao
  2019-06-24 19:00           ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-21  9:22 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang,  Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue



On 6/21/2019 2:16 PM, Yan Zhao wrote:
> On Fri, Jun 21, 2019 at 04:02:50PM +0800, Kirti Wankhede wrote:
>>
>>
>> On 6/21/2019 6:54 AM, Yan Zhao wrote:
>>> On Fri, Jun 21, 2019 at 08:25:18AM +0800, Yan Zhao wrote:
>>>> On Thu, Jun 20, 2019 at 10:37:28PM +0800, Kirti Wankhede wrote:
>>>>> Add migration support for VFIO device
>>>>>
>>>>> This Patch set include patches as below:
>>>>> - Define KABI for VFIO device for migration support.
>>>>> - Added save and restore functions for PCI configuration space
>>>>> - Generic migration functionality for VFIO device.
>>>>>   * This patch set adds functionality only for PCI devices, but can be
>>>>>     extended to other VFIO devices.
>>>>>   * Added all the basic functions required for pre-copy, stop-and-copy and
>>>>>     resume phases of migration.
>>>>>   * Added state change notifier and from that notifier function, VFIO
>>>>>     device's state changed is conveyed to VFIO device driver.
>>>>>   * During save setup phase and resume/load setup phase, migration region
>>>>>     is queried and is used to read/write VFIO device data.
>>>>>   * .save_live_pending and .save_live_iterate are implemented to use QEMU's
>>>>>     functionality of iteration during pre-copy phase.
>>>>>   * In .save_live_complete_precopy, that is in stop-and-copy phase,
>>>>>     iteration to read data from VFIO device driver is implemented till pending
>>>>>     bytes returned by driver are not zero.
>>>>>   * Added function to get dirty pages bitmap for the pages which are used by
>>>>>     driver.
>>>>> - Add vfio_listerner_log_sync to mark dirty pages.
>>>>> - Make VFIO PCI device migration capable. If migration region is not provided by
>>>>>   driver, migration is blocked.
>>>>>
>>>>> Below is the flow of state change for live migration where states in brackets
>>>>> represent VM state, migration state and VFIO device state as:
>>>>>     (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)
>>>>>
>>>>> Live migration save path:
>>>>>         QEMU normal running state
>>>>>         (RUNNING, _NONE, _RUNNING)
>>>>>                         |
>>>>>     migrate_init spawns migration_thread.
>>>>>     (RUNNING, _SETUP, _RUNNING|_SAVING)
>>>>>     Migration thread then calls each device's .save_setup()
>>>>>                         |
>>>>>     (RUNNING, _ACTIVE, _RUNNING|_SAVING)
>>>>>     If device is active, get pending bytes by .save_live_pending()
>>>>>     if pending bytes >= threshold_size,  call save_live_iterate()
>>>>>     Data of VFIO device for pre-copy phase is copied.
>>>>>     Iterate till pending bytes converge and are less than threshold
>>>>>                         |
>>>>>     On migration completion, vCPUs stops and calls .save_live_complete_precopy
>>>>>     for each active device. VFIO device is then transitioned in
>>>>>      _SAVING state.
>>>>>     (FINISH_MIGRATE, _DEVICE, _SAVING)
>>>>>     For VFIO device, iterate in  .save_live_complete_precopy  until
>>>>>     pending data is 0.
>>>>>     (FINISH_MIGRATE, _DEVICE, _STOPPED)
>>>>
>>>> I suggest we also register to VMStateDescription, whose .pre_save
>>>> handler would get called after .save_live_complete_precopy in pre-copy
>>>> only case, and will be called before .save_live_iterate in post-copy
>>>> enabled case.
>>>> In the .pre_save handler, we can save all device state which must be
>>>> copied after device stop in source vm and before device start in target vm.
>>>>
>>> hi
>>> to better describe this idea:
>>>
>>> in pre-copy only case, the flow is
>>>
>>> start migration --> .save_live_iterate (several round) -> stop source vm
>>> --> .save_live_complete_precopy --> .pre_save  -->start target vm
>>> -->migration complete
>>>
>>>
>>> in post-copy enabled case, the flow is
>>>
>>> start migration --> .save_live_iterate (several round) --> start post copy --> 
>>> stop source vm --> .pre_save --> start target vm --> .save_live_iterate (several round) 
>>> -->migration complete
>>>
>>> Therefore, we should put saving of device state in .pre_save interface
>>> rather than in .save_live_complete_precopy. 
>>> The device state includes pci config data, page tables, register state, etc.
>>>
>>> The .save_live_iterate and .save_live_complete_precopy should only deal
>>> with saving dirty memory.
>>>
>>
>> The vendor driver can decide when to save device state depending on the
>> VFIO device state set by the user; it doesn't have to depend on which
>> callback function QEMU or the user application calls. In the pre-copy
>> case, save_live_complete_precopy sets the VFIO device state to
>> VFIO_DEVICE_STATE_SAVING, which means vCPUs are stopped and the vendor
>> driver should save all device state.
>>
> When post-copy stops the vCPUs and the VFIO device, the vendor driver
> only needs to provide device state. But how does the vendor driver know
> that, if no extra interface or no extra device state is provided?
> 

The .save_live_complete_postcopy interface for post-copy will get
called, right?

Thanks,
Kirti

>>>
>>> I know current implementation does not support post-copy. but at least
>>> it should not require huge change when we decide to enable it in future.
>>>
>>
>> .has_postcopy and .save_live_complete_postcopy need to be implemented to
>> support post-copy. I think .save_live_complete_postcopy should be
>> similar to vfio_save_complete_precopy.
>>
>> Thanks,
>> Kirti
>>
>>> Thanks
>>> Yan
>>>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device
  2019-06-21  9:22         ` Kirti Wankhede
@ 2019-06-21 10:45           ` Yan Zhao
  2019-06-24 19:00           ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 64+ messages in thread
From: Yan Zhao @ 2019-06-21 10:45 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Fri, Jun 21, 2019 at 05:22:37PM +0800, Kirti Wankhede wrote:
> 
> 
> On 6/21/2019 2:16 PM, Yan Zhao wrote:
> > On Fri, Jun 21, 2019 at 04:02:50PM +0800, Kirti Wankhede wrote:
> >>
> >>
> >> On 6/21/2019 6:54 AM, Yan Zhao wrote:
> >>> On Fri, Jun 21, 2019 at 08:25:18AM +0800, Yan Zhao wrote:
> >>>> On Thu, Jun 20, 2019 at 10:37:28PM +0800, Kirti Wankhede wrote:
> >>>>> Add migration support for VFIO device
> >>>>>
> >>>>> This Patch set include patches as below:
> >>>>> - Define KABI for VFIO device for migration support.
> >>>>> - Added save and restore functions for PCI configuration space
> >>>>> - Generic migration functionality for VFIO device.
> >>>>>   * This patch set adds functionality only for PCI devices, but can be
> >>>>>     extended to other VFIO devices.
> >>>>>   * Added all the basic functions required for pre-copy, stop-and-copy and
> >>>>>     resume phases of migration.
> >>>>>   * Added state change notifier and from that notifier function, VFIO
> >>>>>     device's state changed is conveyed to VFIO device driver.
> >>>>>   * During save setup phase and resume/load setup phase, migration region
> >>>>>     is queried and is used to read/write VFIO device data.
> >>>>>   * .save_live_pending and .save_live_iterate are implemented to use QEMU's
> >>>>>     functionality of iteration during pre-copy phase.
> >>>>>   * In .save_live_complete_precopy, that is in stop-and-copy phase,
> >>>>>     iteration to read data from VFIO device driver is implemented till pending
> >>>>>     bytes returned by driver are not zero.
> >>>>>   * Added function to get dirty pages bitmap for the pages which are used by
> >>>>>     driver.
> >>>>> - Add vfio_listerner_log_sync to mark dirty pages.
> >>>>> - Make VFIO PCI device migration capable. If migration region is not provided by
> >>>>>   driver, migration is blocked.
> >>>>>
> >>>>> Below is the flow of state change for live migration where states in brackets
> >>>>> represent VM state, migration state and VFIO device state as:
> >>>>>     (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)
> >>>>>
> >>>>> Live migration save path:
> >>>>>         QEMU normal running state
> >>>>>         (RUNNING, _NONE, _RUNNING)
> >>>>>                         |
> >>>>>     migrate_init spawns migration_thread.
> >>>>>     (RUNNING, _SETUP, _RUNNING|_SAVING)
> >>>>>     Migration thread then calls each device's .save_setup()
> >>>>>                         |
> >>>>>     (RUNNING, _ACTIVE, _RUNNING|_SAVING)
> >>>>>     If device is active, get pending bytes by .save_live_pending()
> >>>>>     if pending bytes >= threshold_size,  call save_live_iterate()
> >>>>>     Data of VFIO device for pre-copy phase is copied.
> >>>>>     Iterate till pending bytes converge and are less than threshold
> >>>>>                         |
> >>>>>     On migration completion, vCPUs stops and calls .save_live_complete_precopy
> >>>>>     for each active device. VFIO device is then transitioned in
> >>>>>      _SAVING state.
> >>>>>     (FINISH_MIGRATE, _DEVICE, _SAVING)
> >>>>>     For VFIO device, iterate in  .save_live_complete_precopy  until
> >>>>>     pending data is 0.
> >>>>>     (FINISH_MIGRATE, _DEVICE, _STOPPED)
> >>>>
> >>>> I suggest we also register to VMStateDescription, whose .pre_save
> >>>> handler would get called after .save_live_complete_precopy in pre-copy
> >>>> only case, and will be called before .save_live_iterate in post-copy
> >>>> enabled case.
> >>>> In the .pre_save handler, we can save all device state which must be
> >>>> copied after device stop in source vm and before device start in target vm.
> >>>>
> >>> hi
> >>> to better describe this idea:
> >>>
> >>> in pre-copy only case, the flow is
> >>>
> >>> start migration --> .save_live_iterate (several round) -> stop source vm
> >>> --> .save_live_complete_precopy --> .pre_save  -->start target vm
> >>> -->migration complete
> >>>
> >>>
> >>> in post-copy enabled case, the flow is
> >>>
> >>> start migration --> .save_live_iterate (several round) --> start post copy --> 
> >>> stop source vm --> .pre_save --> start target vm --> .save_live_iterate (several round) 
> >>> -->migration complete
> >>>
> >>> Therefore, we should put saving of device state in .pre_save interface
> >>> rather than in .save_live_complete_precopy. 
> >>> The device state includes pci config data, page tables, register state, etc.
> >>>
> >>> The .save_live_iterate and .save_live_complete_precopy should only deal
> >>> with saving dirty memory.
> >>>
> >>
> >> The vendor driver can decide when to save device state depending on the
> >> VFIO device state set by the user; it doesn't have to depend on which
> >> callback function QEMU or the user application calls. In the pre-copy
> >> case, save_live_complete_precopy sets the VFIO device state to
> >> VFIO_DEVICE_STATE_SAVING, which means vCPUs are stopped and the vendor
> >> driver should save all device state.
> >>
> > When post-copy stops the vCPUs and the VFIO device, the vendor driver
> > only needs to provide device state. But how does the vendor driver know
> > that, if no extra interface or no extra device state is provided?
> > 
> 
> The .save_live_complete_postcopy interface for post-copy will get
> called, right?
>
Yes, but that's too late; it's only called after post-copy completion.

> Thanks,
> Kirti
> 
> >>>
> >>> I know current implementation does not support post-copy. but at least
> >>> it should not require huge change when we decide to enable it in future.
> >>>
> >>
> >> .has_postcopy and .save_live_complete_postcopy need to be implemented to
> >> support post-copy. I think .save_live_complete_postcopy should be
> >> similar to vfio_save_complete_precopy.
> >>
> >> Thanks,
> >> Kirti
> >>
> >>> Thanks
> >>> Yan
> >>>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH v4 01/13] vfio: KABI for migration interface
  2019-06-21  5:52     ` Kirti Wankhede
@ 2019-06-21 15:03       ` Alex Williamson
  2019-06-21 19:35         ` Kirti Wankhede
  0 siblings, 1 reply; 64+ messages in thread
From: Alex Williamson @ 2019-06-21 15:03 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, yulei.zhang, eauger, felipe,
	jonathan.davies, yan.y.zhao, changpeng.liu, Ken.Xue

On Fri, 21 Jun 2019 11:22:15 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 6/20/2019 10:48 PM, Alex Williamson wrote:
> > On Thu, 20 Jun 2019 20:07:29 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> - Defined MIGRATION region type and sub-type.
> >> - Used 3 bits to define VFIO device states.
> >>     Bit 0 => _RUNNING
> >>     Bit 1 => _SAVING
> >>     Bit 2 => _RESUMING
> >>     Combination of these bits defines VFIO device's state during migration
> >>     _STOPPED => All bits 0 indicates VFIO device stopped.
> >>     _RUNNING => Normal VFIO device running state.
> >>     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but start
> >>                           saving state of device i.e. pre-copy state
> >>     _SAVING  => vCPUs are stopped, VFIO device should be stopped, and
> >>                           save device state, i.e. stop-n-copy state
> >>     _RESUMING => VFIO device resuming state.
> >>     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING bits are set
> >> - Defined vfio_device_migration_info structure which will be placed at 0th
> >>   offset of migration region to get/set VFIO device related information.
> >>   Defined members of structure and usage on read/write access:
> >>     * device_state: (read/write)
> >>         To convey VFIO device state to be transitioned to. Only 3 bits are used
> >>         as of now.
> >>     * pending bytes: (read only)
> >>         To get pending bytes yet to be migrated for VFIO device.
> >>     * data_offset: (read only)
> >>         To get data offset in migration from where data exist during _SAVING
> >>         and from where data should be written by user space application during
> >>          _RESUMING state
> >>     * data_size: (read/write)
> >>         To get and set size of data copied in migration region during _SAVING
> >>         and _RESUMING state.
> >>     * start_pfn, page_size, total_pfns: (write only)
> >>         To get bitmap of dirty pages from vendor driver from given
> >>         start address for total_pfns.
> >>     * copied_pfns: (read only)
> >>         To get number of pfns bitmap copied in migration region.
> >>         Vendor driver should copy the bitmap with bits set only for
> >>         pages to be marked dirty in migration region. Vendor driver
> >>         should return 0 if there are 0 pages dirty in requested
> >>         range. Vendor driver should return -1 to mark all pages in the section
> >>         as dirty
> >>
> >> Migration region looks like:
> >>  ------------------------------------------------------------------
> >> |vfio_device_migration_info|    data section                      |
> >> |                          |     ///////////////////////////////  |
> >>  ------------------------------------------------------------------
> >>  ^                              ^                              ^
> >>  offset 0-trapped part        data_offset                 data_size
> >>
> >> The data section always follows the vfio_device_migration_info
> >> structure in the region, so data_offset will always be non-0.
> >> The offset from where data is copied is decided by the kernel driver;
> >> the data section can be trapped or mapped depending on how the kernel
> >> driver defines it. If mmapped, then data_offset should be page
> >> aligned, whereas the initial section which contains the
> >> vfio_device_migration_info structure might not end at an offset which
> >> is page aligned.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>  linux-headers/linux/vfio.h | 71 ++++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 71 insertions(+)
> >>
> >> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >> index 24f505199f83..274ec477eb82 100644
> >> --- a/linux-headers/linux/vfio.h
> >> +++ b/linux-headers/linux/vfio.h
> >> @@ -372,6 +372,77 @@ struct vfio_region_gfx_edid {
> >>   */
> >>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
> >>  
> >> +/* Migration region type and sub-type */
> >> +#define VFIO_REGION_TYPE_MIGRATION	        (2)
> >> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> >> +
> >> +/**
> >> + * Structure vfio_device_migration_info is placed at 0th offset of
> >> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> >> + * information. Field accesses from this structure are only supported at their
> >> + * native width and alignment, otherwise should return error.
> >> + *
> >> + * device_state: (read/write)
> >> + *      To indicate to the vendor driver the state the VFIO device should
> >> + *      be transitioned to. If the device state transition fails, a write
> >> + *      to this field returns an error.
> >> + *      It consists of 3 bits:
> >> + *      - If bit 0 is set, it indicates _RUNNING state. When it is cleared,
> >> + *        that indicates _STOPPED state. When the device is changed to
> >> + *        _STOPPED, the driver should stop the device before the write
> >> + *        returns.
> >> + *      - If bit 1 set, indicates _SAVING state.
> >> + *      - If bit 2 set, indicates _RESUMING state.
> >> + *
> >> + * pending bytes: (read only)
> >> + *      Read pending bytes yet to be migrated from vendor driver
> >> + *
> >> + * data_offset: (read only)
> >> + *      User application should read data_offset in migration region from where
> >> + *      user application should read data during _SAVING state or write data
> >> + *      during _RESUMING state.
> >> + *
> >> + * data_size: (read/write)
> >> + *      User application should read data_size to know data copied in migration
> >> + *      region during _SAVING state and write size of data copied in migration
> >> + *      region during _RESUMING state.
> >> + *
> >> + * start_pfn: (write only)
> >> + *      Start address pfn to get bitmap of dirty pages from vendor driver during
> >> + *      _SAVING state.
> >> + *
> >> + * page_size: (write only)
> >> + *      User application should write the page_size of pfn.
> >> + *
> >> + * total_pfns: (write only)
> >> + *      Total pfn count from start_pfn for which dirty bitmap is requested.
> >> + *
> >> + * copied_pfns: (read only)
> >> + *      pfn count for which dirty bitmap is copied to migration region.
> >> + *      Vendor driver should copy the bitmap with bits set only for pages to be
> >> + *      marked dirty in migration region.
> >> + *      Vendor driver should return 0 if there are 0 pages dirty in requested
> >> + *      range.
> >> + *      Vendor driver should return -1 to mark all pages in the section as
> >> + *      dirty.  
> > 
> > Is the protocol that the user writes start_pfn/page_size/total_pfns in
> > any order and then the read of copied_pfns is what triggers the
> > snapshot?  
> 
> Yes.
> 
> >  Are start_pfn/page_size/total_pfns sticky such that a user
> > can write them once and get repeated refreshes of the dirty bitmap by
> > re-reading copied_pfns?  
> 
> Yes, and that bitmap should be for the given range (from start_pfn till
> start_pfn + total_pfns).
> Re-reading of copied_pfns is to handle the case where it might be
> possible that the vendor driver reserved an area for the bitmap < the total
> bitmap size for the range (start_pfn to start_pfn + total_pfns); then the
> user will have to iterate till copied_pfns == total_pfns or till
> copied_pfns == 0 (that is, there are no pages dirty in the rest of the
> range).

So reading copied_pfns triggers the data range to be updated, but the
caller cannot assume it to be synchronous and uses total_pfns to poll
that the update is complete?  How does the vendor driver differentiate
the user polling for the previous update to finish versus requesting a
new update?

> >  What's the advantage to returning -1 versus
> > returning copied_pfns == total_pfns?
> >   
> 
> If all bits in bitmap are 1, then return -1, that is, all pages in the
> given range to be marked dirty.
> 
> If all bits in bitmap are 0, then return 0, that is, no page to be
> marked dirty in given range or rest of the range.
> 
> Otherwise vendor driver should return copied_pfns == total_pfn and
> provide bitmap for total_pfn, which means that bitmap copied for given
> range contains information for all pages where some bits are 0s and some
> are 1s.

Given that the vendor driver can indicate zero dirty pfns and all dirty
pfns, I interpreted copied_pfns as a synchronous operation where the
return value could indicate the number of dirty pages within the
requested range.

> > If the user then wants to switch back to reading device migration
> > state, is it a read of data_size that switches the data area back to
> > making that address space available?   
> 
> No, it's not just read(data_size); before that there is a
> read(data_offset). If the vendor driver wants to have different sub-regions
> for device data and the dirty page bitmap, the vendor driver should return
> the corresponding offset on read(data_offset).

The dynamic use of data_offset was not at all evident to me until I got
further into the QEMU series.  The usage model needs to be well
specified in the linux header.  I infer this behavior is such that the
vendor driver can effectively identity map portions of device memory
and the user will restore to the same offset.  I suppose this is a
valid approach but it seems specifically tuned to devices which allow
full direct mapping, whereas many devices have more device memory than
is directly map'able and state beyond simple device memory.  Does this
model unnecessarily burden such devices?  It is a nice feature that
the data range can contain both mmap'd sections and trapped sections
and by adjusting data_offset the vendor driver can select which is
currently being used, but we really need to formalize all these details.
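
As a purely illustrative sketch of that flexibility (my_vdev and the
offset macros below are made up, not from any existing driver), the
vendor driver's read handler for data_offset might steer the user
between a trapped sub-region and an mmap'able one:

/* hypothetical vendor-driver-side handler for read(data_offset) */
static int my_vdev_read_data_offset(struct my_vdev *vdev, __u64 *val)
{
    if (vdev->staging_dirty_bitmap)
        *val = MY_VDEV_BITMAP_OFFSET;    /* trapped sub-region */
    else
        *val = MY_VDEV_DATA_MMAP_OFFSET; /* sparse-mmap'd section */
    return 0;
}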

> > In each case, is it the user's
> > responsibility to consume all the data provided before triggering the
> > next data area?  For example, if I ask for a range of dirty bitmap, the
> > vendor driver will provide that range and clear it, such that the
> > pages are considered clean regardless of whether the user consumed the
> > data area.    
> 
> Yes.
> 
> > Likewise if the user asks for data_size, that would be
> > deducted from pending_bytes regardless of the user reading the data
> > area.   
> 
> User should read data before deducting data_size from pending_bytes.

The user deducts data_size from pending_bytes?  pending_bytes is
read-only, how does this work?

> From vendor driver point of view, data_size will be deducted from
> pending_bytes once data is copied to data region.

If the data is entirely from an mmap'd range, how does the vendor
driver know when the data is copied?

> > Are there any read side-effects to pending_bytes?  
> 
> No, its query to vendor driver about pending bytes yet to be
> migrated/read from vendor driver.
> 
> >  Are there
> > read side-effects to the data area on SAVING?  
> 
> No.

So the vendor driver must make an assumption somewhere in the usage
protocol that it's the user's responsibility; this needs to be
specified.

> >  Are there write
> > side-effects on RESUMING, or is it only the write of data_size that
> > triggers the buffer to be consumed?  
> 
> It's the write(data_size) that triggers the buffer to be consumed; if the
> region is mmaped, then data is already copied to the region, and if it's
> trapped then the following writes from data_offset are the data to be
> consumed.
> 
> >  Is it the user's responsibility to
> > write only full "packets" on RESUMING?  For example if the SAVING side
> > provides data_size X, that full data_size X must be written to the
> > RESUMING side, the user cannot write half of it to the data area on the
> > RESUMING side, write data_size with X/2, write the second half, and
> > again write X/2.  IOW, the data_size "packet" is indivisible at the
> > point of resuming.
> >   
> 
> If source and destination are compatible or of the same driver version,
> then if the user is reading data_size X at source/SAVING, the destination
> should be able to consume data_size X at restoring/RESUMING. Then why
> should the user write X/2 and iterate?

Because users do things we don't expect ;)  Maybe they decide to chunk
the data into smaller packets over the network, but the receiving side
would rather write the packet immediately rather than queuing it.
OTOH, does it necessarily matter so long as data_size is written on
completion of a full "packet"?

> > What are the ordering requirements?  Must the user write data_size
> > packets in the same order that they're read, or is it the vendor
> > driver's responsibility to include sequence information and allow
> > restore in any order?
> >   
> 
> For the user, data is opaque. The user should write data in the same order
> as it was received.

Let's make sure that's specified.

> >> + */
> >> +
> >> +struct vfio_device_migration_info {
> >> +        __u32 device_state;         /* VFIO device state */
> >> +#define VFIO_DEVICE_STATE_STOPPED   (0)  
> > 
> > We need to be careful with how this is used if we want to leave the
> > possibility of using the remaining 29 bits of this register.  Maybe we
> > want to define VFIO_DEVICE_STATE_MASK and be sure that we only do
> > read-modify-write ops within the mask (ex. set_bit and clear_bit
> > helpers).  
> 
> Makes sense, I'll do changes in next iteration.
> 
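
For illustration, the masked update could look like this sketch
(VFIO_DEVICE_STATE_MASK and the helper name are hypothetical):

#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
                                     VFIO_DEVICE_STATE_SAVING | \
                                     VFIO_DEVICE_STATE_RESUMING)

static int vfio_set_device_state(int fd, off_t info_offset, uint32_t state)
{
    uint32_t val;

    if (pread(fd, &val, sizeof(val), info_offset) != sizeof(val)) {
        return -1;
    }
    /* touch only the state bits, preserve the reserved upper bits */
    val = (val & ~VFIO_DEVICE_STATE_MASK) | (state & VFIO_DEVICE_STATE_MASK);
    if (pwrite(fd, &val, sizeof(val), info_offset) != sizeof(val)) {
        return -1;
    }
    return 0;
}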
> >  Also, above we define STOPPED to indicate simply
> > not-RUNNING, but here it seems STOPPED means not-RUNNING, not-SAVING,
> > and not-RESUMING.
> >   
> 
> That's correct.
> 
> >> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> >> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> >> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> >> +#define VFIO_DEVICE_STATE_INVALID   (VFIO_DEVICE_STATE_SAVING | \
> >> +                                     VFIO_DEVICE_STATE_RESUMING)
> >> +        __u32 reserved;
> >> +        __u64 pending_bytes;
> >> +        __u64 data_offset;  
> > 
> > Placing the data more than 4GB into the region seems a bit absurd, so
> > this could probably be a __u32 and take the place of the reserved field.
> >   
> 
> Is there a maximum limit on VFIO region size?
> There isn't any such limit, right? The vendor driver can define a region of
> any size and then place the data section anywhere in the region. I prefer
> to keep it __u64.

We have a single file descriptor for all accesses to the device, which
gives us quite a bit of per device address space.  As I mention above,
it wasn't clear to me that data_offset is used dynamically until I got
further into the series, so it seemed strange to me that we'd choose
such a large offset, but given my new understanding I agree it requires
a __u64 currently.  Thanks,

Alex

> >> +        __u64 data_size;
> >> +        __u64 start_pfn;
> >> +        __u64 page_size;
> >> +        __u64 total_pfns;
> >> +        __s64 copied_pfns;  
> > 
> > If this is signed so that we can get -1 then the user could
> > theoretically specify total_pfns that we can't represent in
> > copied_pfns.  Probably best to use unsigned and specify ~0 rather than
> > -1.
> >   
> 
> Ok.
> 
> > Overall this looks like a good interface, but we need to more
> > thoroughly define the protocol with the data area and set expectations
> > we're placing on the user and vendor driver.  There should be no usage
> > assumptions, it should all be spelled out.  Thanks,
> >  
> 
> Thanks for your feedback. I'll update comments above to be more specific.
> 
> Thanks,
> Kirti
> 
> > Alex
> >   
> >> +} __attribute__((packed));
> >> +
> >>  /*
> >>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
> >>   * which allows direct access to non-MSIX registers which happened to be within  
> >   




* Re: [Qemu-devel] [PATCH v4 08/13] vfio: Add save state functions to SaveVMHandlers
  2019-06-21  6:38     ` Kirti Wankhede
@ 2019-06-21 15:16       ` Alex Williamson
  2019-06-21 19:38         ` Kirti Wankhede
  0 siblings, 1 reply; 64+ messages in thread
From: Alex Williamson @ 2019-06-21 15:16 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, yulei.zhang, eauger, felipe,
	jonathan.davies, yan.y.zhao, changpeng.liu, Ken.Xue

On Fri, 21 Jun 2019 12:08:26 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 6/21/2019 12:55 AM, Alex Williamson wrote:
> > On Thu, 20 Jun 2019 20:07:36 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
> >> functions. These functions handle the pre-copy and stop-and-copy phases.
> >>
> >> In _SAVING|_RUNNING device state or pre-copy phase:
> >> - read pending_bytes
> >> - read data_offset - indicates to the kernel driver to write data to the
> >>   staging buffer, which is mmapped.
> > 
> > Why is data_offset the trigger rather than data_size?  It seems that
> > data_offset can't really change dynamically since it might be mmap'd,
> > so it seems unnatural to bother re-reading it.
> >   
> 
> The vendor driver can change data_offset; it can have a different
> data_offset for device data and the dirty pages bitmap.
> 
> >> - read data_size - amount of data in bytes written by vendor driver in migration
> >>   region.
> >> - if data section is trapped, pread() number of bytes in data_size, from
> >>   data_offset.
> >> - if data section is mmaped, read mmaped buffer of size data_size.
> >> - Write data packet to file stream as below:
> >> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> >> VFIO_MIG_FLAG_END_OF_STATE }
> >>
> >> In _SAVING device state or stop-and-copy phase
> >> a. read config space of device and save to migration file stream. This
> >>    doesn't need to be from vendor driver. Any other special config state
> >>    from driver can be saved as data in following iteration.
> >> b. read pending_bytes - indicates kernel driver to write data to staging
> >>    buffer which is mmapped.  
> > 
> > Is it pending_bytes or data_offset that triggers the write out of
> > data?  Why pending_bytes vs data_size?  I was interpreting
> > pending_bytes as the total data size while data_size is the size
> > available to read now, so assumed data_size would be more closely
> > aligned to making the data available.
> >   
> 
> Sorry, that's my mistake while editing; it's read data_offset as in the
> above case.
> 
> >> c. read data_size - amount of data in bytes written by vendor driver in
> >>    migration region.
> >> d. if data section is trapped, pread() from data_offset of size data_size.
> >> e. if data section is mmaped, read mmaped buffer of size data_size.  
> > 
> > Should this read as "pread() from data_offset of data_size, or
> > optionally if mmap is supported on the data area, read data_size from
> > start of mapped buffer"?  IOW, pread should always work.  Same in
> > previous section.
> >   
> 
> ok. I'll update.
> 
> >> f. Write data packet as below:
> >>    {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> >> g. iterate through steps b to f until (pending_bytes > 0)  
> > 
> > s/until/while/  
> 
> Ok.
> 
> >   
> >> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
> >>
> >> .save_live_iterate runs outside the iothread lock in the migration case,
> >> which could race with an asynchronous call to get the dirty page list,
> >> causing data corruption in the mapped migration region. A mutex is added
> >> here to serialize migration buffer read operations.
> > 
> > Would we be ahead to use different offsets within the region for device
> > data vs dirty bitmap to avoid this?
> >  
> 
> A lock will still be required to serialize the read/write operations on
> the vfio_device_migration_info structure in the region.
> 
> 
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>  hw/vfio/migration.c | 212 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 212 insertions(+)
> >>
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >> index fe0887c27664..0a2f30872316 100644
> >> --- a/hw/vfio/migration.c
> >> +++ b/hw/vfio/migration.c
> >> @@ -107,6 +107,111 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
> >>      return 0;
> >>  }
> >>  
> >> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
> >> +{
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    VFIORegion *region = &migration->region.buffer;
> >> +    uint64_t data_offset = 0, data_size = 0;
> >> +    int ret;
> >> +
> >> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> >> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> >> +                                             data_offset));
> >> +    if (ret != sizeof(data_offset)) {
> >> +        error_report("Failed to get migration buffer data offset %d",
> >> +                     ret);
> >> +        return -EINVAL;
> >> +    }
> >> +
> >> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
> >> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> >> +                                             data_size));
> >> +    if (ret != sizeof(data_size)) {
> >> +        error_report("Failed to get migration buffer data size %d",
> >> +                     ret);
> >> +        return -EINVAL;
> >> +    }
> >> +
> >> +    if (data_size > 0) {
> >> +        void *buf = NULL;
> >> +        bool buffer_mmaped = false;
> >> +
> >> +        if (region->mmaps) {
> >> +            int i;
> >> +
> >> +            for (i = 0; i < region->nr_mmaps; i++) {
> >> +                if ((data_offset >= region->mmaps[i].offset) &&
> >> +                    (data_offset < region->mmaps[i].offset +
> >> +                                   region->mmaps[i].size)) {
> >> +                    buf = region->mmaps[i].mmap + (data_offset -
> >> +                                                   region->mmaps[i].offset);  
> > 
> > So you're expecting that data_offset is somewhere within the data
> > area.  Why doesn't the data always simply start at the beginning of the
> > data area?  ie. data_offset would coincide with the beginning of the
> > mmap'able area (if supported) and be static.  Does this enable some
> > functionality in the vendor driver?  
> 
> Do you want to enforce that on the vendor driver?
> From the feedback on the previous version I thought the vendor driver
> should define data_offset within the region:
> "I'd suggest that the vendor driver expose a read-only
> data_offset that matches a sparse mmap capability entry should the
> driver support mmap.  The use should always read or write data from the
> vendor defined data_offset"
> 
> This also adds flexibility to the vendor driver such that the vendor
> driver can define different data_offsets for device data and the dirty
> page bitmap within the same mmaped region.

I agree, it adds flexibility, the protocol was not evident to me until
I got here though.

> >  Does resume data need to be
> > written from the same offset where it's read?  
> 
> No, resume data should be written from the data_offset that the vendor
> driver provided during resume.

s/resume/save/?

Or is this saying that on resume the vendor driver is requesting a
specific block of data via data_offset?  I think resume is going to be
directed by the user, writing in the same order they received the
data.  Thanks,

Alex

> >> +                    buffer_mmaped = true;
> >> +                    break;
> >> +                }
> >> +            }
> >> +        }
> >> +
> >> +        if (!buffer_mmaped) {
> >> +            buf = g_malloc0(data_size);
> >> +            ret = pread(vbasedev->fd, buf, data_size,
> >> +                        region->fd_offset + data_offset);
> >> +            if (ret != data_size) {
> >> +                error_report("Failed to get migration data %d", ret);
> >> +                g_free(buf);
> >> +                return -EINVAL;
> >> +            }
> >> +        }
> >> +
> >> +        qemu_put_be64(f, data_size);
> >> +        qemu_put_buffer(f, buf, data_size);
> >> +
> >> +        if (!buffer_mmaped) {
> >> +            g_free(buf);
> >> +        }
> >> +        migration->pending_bytes -= data_size;
> >> +    } else {
> >> +        qemu_put_be64(f, data_size);
> >> +    }
> >> +
> >> +    ret = qemu_file_get_error(f);
> >> +
> >> +    return data_size;
> >> +}
> >> +
> >> +static int vfio_update_pending(VFIODevice *vbasedev)
> >> +{
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    VFIORegion *region = &migration->region.buffer;
> >> +    uint64_t pending_bytes = 0;
> >> +    int ret;
> >> +
> >> +    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
> >> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> >> +                                             pending_bytes));  
> > 
> > Did this trigger the vendor driver to write out to the data area when
> > we don't need it to?
> >   
> 
> No, as I mentioned above, I'll update the description.
> 
> Thanks,
> Kirti
> 
> >> +    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
> >> +        error_report("Failed to get pending bytes %d", ret);
> >> +        migration->pending_bytes = 0;
> >> +        return (ret < 0) ? ret : -EINVAL;
> >> +    }
> >> +
> >> +    migration->pending_bytes = pending_bytes;
> >> +    return 0;
> >> +}
> >> +
> >> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +
> >> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
> >> +
> >> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> >> +        vfio_pci_save_config(vbasedev, f);
> >> +    }
> >> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> >> +
> >> +    return qemu_file_get_error(f);
> >> +}
> >> +
> >>  /* ---------------------------------------------------------------------- */
> >>  
> >>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> >> @@ -163,9 +268,116 @@ static void vfio_save_cleanup(void *opaque)
> >>      }
> >>  }
> >>  
> >> +static void vfio_save_pending(QEMUFile *f, void *opaque,
> >> +                              uint64_t threshold_size,
> >> +                              uint64_t *res_precopy_only,
> >> +                              uint64_t *res_compatible,
> >> +                              uint64_t *res_postcopy_only)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    int ret;
> >> +
> >> +    ret = vfio_update_pending(vbasedev);
> >> +    if (ret) {
> >> +        return;
> >> +    }
> >> +
> >> +    if (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING) {
> >> +        *res_precopy_only += migration->pending_bytes;
> >> +    } else {
> >> +        *res_postcopy_only += migration->pending_bytes;
> >> +    }
> >> +    *res_compatible += 0;
> >> +}
> >> +
> >> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    int ret;
> >> +
> >> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> >> +
> >> +    qemu_mutex_lock(&migration->lock);
> >> +    ret = vfio_save_buffer(f, vbasedev);
> >> +    qemu_mutex_unlock(&migration->lock);
> >> +
> >> +    if (ret < 0) {
> >> +        error_report("vfio_save_buffer failed %s",
> >> +                     strerror(errno));
> >> +        return ret;
> >> +    }
> >> +
> >> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> >> +
> >> +    ret = qemu_file_get_error(f);
> >> +    if (ret) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    int ret;
> >> +
> >> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_SAVING);
> >> +    if (ret) {
> >> +        error_report("Failed to set state STOP and SAVING");
> >> +        return ret;
> >> +    }
> >> +
> >> +    ret = vfio_save_device_config_state(f, opaque);
> >> +    if (ret) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    ret = vfio_update_pending(vbasedev);
> >> +    if (ret) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    while (migration->pending_bytes > 0) {
> >> +        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> >> +        ret = vfio_save_buffer(f, vbasedev);
> >> +        if (ret < 0) {
> >> +            error_report("Failed to save buffer");
> >> +            return ret;
> >> +        } else if (ret == 0) {
> >> +            break;
> >> +        }
> >> +
> >> +        ret = vfio_update_pending(vbasedev);
> >> +        if (ret) {
> >> +            return ret;
> >> +        }
> >> +    }
> >> +
> >> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> >> +
> >> +    ret = qemu_file_get_error(f);
> >> +    if (ret) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOPPED);
> >> +    if (ret) {
> >> +        error_report("Failed to set state STOPPED");
> >> +        return ret;
> >> +    }
> >> +    return ret;
> >> +}
> >> +
> >>  static SaveVMHandlers savevm_vfio_handlers = {
> >>      .save_setup = vfio_save_setup,
> >>      .save_cleanup = vfio_save_cleanup,
> >> +    .save_live_pending = vfio_save_pending,
> >> +    .save_live_iterate = vfio_save_iterate,
> >> +    .save_live_complete_precopy = vfio_save_complete_precopy,
> >>  };
> >>  
> >>  /* ---------------------------------------------------------------------- */  
> >   




* Re: [Qemu-devel] [PATCH v4 01/13] vfio: KABI for migration interface
  2019-06-21 15:03       ` Alex Williamson
@ 2019-06-21 19:35         ` Kirti Wankhede
  2019-06-21 20:00           ` Alex Williamson
  0 siblings, 1 reply; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-21 19:35 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue



On 6/21/2019 8:33 PM, Alex Williamson wrote:
> On Fri, 21 Jun 2019 11:22:15 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 6/20/2019 10:48 PM, Alex Williamson wrote:
>>> On Thu, 20 Jun 2019 20:07:29 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>   
>>>> - Defined MIGRATION region type and sub-type.
>>>> - Used 3 bits to define VFIO device states.
>>>>     Bit 0 => _RUNNING
>>>>     Bit 1 => _SAVING
>>>>     Bit 2 => _RESUMING
>>>>     Combination of these bits defines VFIO device's state during migration
>>>>     _STOPPED => All bits 0 indicates VFIO device stopped.
>>>>     _RUNNING => Normal VFIO device running state.
>>>>     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but start
>>>>                           saving state of device i.e. pre-copy state
>>>>     _SAVING  => vCPUs are stopped, VFIO device should be stopped, and
>>>>                           save device state, i.e. stop-n-copy state
>>>>     _RESUMING => VFIO device resuming state.
>>>>     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING bits are set
>>>> - Defined vfio_device_migration_info structure which will be placed at 0th
>>>>   offset of migration region to get/set VFIO device related information.
>>>>   Defined members of structure and usage on read/write access:
>>>>     * device_state: (read/write)
>>>>         To convey VFIO device state to be transitioned to. Only 3 bits are used
>>>>         as of now.
>>>>     * pending bytes: (read only)
>>>>         To get pending bytes yet to be migrated for VFIO device.
>>>>     * data_offset: (read only)
>>>>         To get data offset in migration from where data exist during _SAVING
>>>>         and from where data should be written by user space application during
>>>>          _RESUMING state
>>>>     * data_size: (read/write)
>>>>         To get and set size of data copied in migration region during _SAVING
>>>>         and _RESUMING state.
>>>>     * start_pfn, page_size, total_pfns: (write only)
>>>>         To get bitmap of dirty pages from vendor driver from given
>>>>         start address for total_pfns.
>>>>     * copied_pfns: (read only)
>>>>         To get number of pfns bitmap copied in migration region.
>>>>         Vendor driver should copy the bitmap with bits set only for
>>>>         pages to be marked dirty in migration region. Vendor driver
>>>>         should return 0 if there are 0 pages dirty in requested
>>>>         range. Vendor driver should return -1 to mark all pages in the section
>>>>         as dirty
>>>>
>>>> Migration region looks like:
>>>>  ------------------------------------------------------------------
>>>> |vfio_device_migration_info|    data section                      |
>>>> |                          |     ///////////////////////////////  |
>>>>  ------------------------------------------------------------------
>>>>  ^                              ^                              ^
>>>>  offset 0-trapped part        data_offset                 data_size
>>>>
>>>> The data section always follows the vfio_device_migration_info
>>>> structure in the region, so data_offset will always be non-0.
>>>> The offset from where data is copied is decided by the kernel driver;
>>>> the data section can be trapped or mapped depending on how the kernel
>>>> driver defines it. If mmapped, then data_offset should be page
>>>> aligned, whereas the initial section which contains the
>>>> vfio_device_migration_info structure might not end at an offset which
>>>> is page aligned.
>>>>
>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>>> ---
>>>>  linux-headers/linux/vfio.h | 71 ++++++++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 71 insertions(+)
>>>>
>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>>>> index 24f505199f83..274ec477eb82 100644
>>>> --- a/linux-headers/linux/vfio.h
>>>> +++ b/linux-headers/linux/vfio.h
>>>> @@ -372,6 +372,77 @@ struct vfio_region_gfx_edid {
>>>>   */
>>>>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
>>>>  
>>>> +/* Migration region type and sub-type */
>>>> +#define VFIO_REGION_TYPE_MIGRATION	        (2)
>>>> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
>>>> +
>>>> +/**
>>>> + * Structure vfio_device_migration_info is placed at 0th offset of
>>>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
>>>> + * information. Field accesses from this structure are only supported at their
>>>> + * native width and alignment, otherwise should return error.
>>>> + *
>>>> + * device_state: (read/write)
>>>> + *      To indicate to the vendor driver the state the VFIO device should
>>>> + *      be transitioned to. If the device state transition fails, a write
>>>> + *      to this field returns an error.
>>>> + *      It consists of 3 bits:
>>>> + *      - If bit 0 is set, it indicates _RUNNING state. When it is cleared,
>>>> + *        that indicates _STOPPED state. When the device is changed to
>>>> + *        _STOPPED, the driver should stop the device before the write
>>>> + *        returns.
>>>> + *      - If bit 1 set, indicates _SAVING state.
>>>> + *      - If bit 2 set, indicates _RESUMING state.
>>>> + *
>>>> + * pending bytes: (read only)
>>>> + *      Read pending bytes yet to be migrated from vendor driver
>>>> + *
>>>> + * data_offset: (read only)
>>>> + *      User application should read data_offset in migration region from where
>>>> + *      user application should read data during _SAVING state or write data
>>>> + *      during _RESUMING state.
>>>> + *
>>>> + * data_size: (read/write)
>>>> + *      User application should read data_size to know data copied in migration
>>>> + *      region during _SAVING state and write size of data copied in migration
>>>> + *      region during _RESUMING state.
>>>> + *
>>>> + * start_pfn: (write only)
>>>> + *      Start address pfn to get bitmap of dirty pages from vendor driver during
>>>> + *      _SAVING state.
>>>> + *
>>>> + * page_size: (write only)
>>>> + *      User application should write the page_size of pfn.
>>>> + *
>>>> + * total_pfns: (write only)
>>>> + *      Total pfn count from start_pfn for which dirty bitmap is requested.
>>>> + *
>>>> + * copied_pfns: (read only)
>>>> + *      pfn count for which dirty bitmap is copied to migration region.
>>>> + *      Vendor driver should copy the bitmap with bits set only for pages to be
>>>> + *      marked dirty in migration region.
>>>> + *      Vendor driver should return 0 if there are 0 pages dirty in requested
>>>> + *      range.
>>>> + *      Vendor driver should return -1 to mark all pages in the section as
>>>> + *      dirty.  
>>>
>>> Is the protocol that the user writes start_pfn/page_size/total_pfns in
>>> any order and then the read of copied_pfns is what triggers the
>>> snapshot?  
>>
>> Yes.
>>
>>>  Are start_pfn/page_size/total_pfns sticky such that a user
>>> can write them once and get repeated refreshes of the dirty bitmap by
>>> re-reading copied_pfns?  
>>
>> Yes, and that bitmap should be for the given range (from start_pfn till
>> start_pfn + total_pfns).
>> Re-reading of copied_pfns is to handle the case where it might be
>> possible that the vendor driver reserved an area for the bitmap < the total
>> bitmap size for the range (start_pfn to start_pfn + total_pfns); then the
>> user will have to iterate till copied_pfns == total_pfns or till
>> copied_pfns == 0 (that is, there are no pages dirty in the rest of the
>> range).
> 
> So reading copied_pfns triggers the data range to be updated, but the
> caller cannot assume it to be synchronous and uses total_pfns to poll
> that the update is complete?  How does the vendor driver differentiate
> the user polling for the previous update to finish versus requesting a
> new update?
> 

A write to start_pfn/page_size/total_pfns followed by a read of copied_pfns
indicates a new update, whereas a sequential read of copied_pfns indicates
polling for the previous update.
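
For illustration, the user-side sequence might look like this sketch
(mark_range_dirty and consume_bitmap are hypothetical helpers; error
handling and the info_off plumbing are omitted):

__u64 start_pfn = 0, page_size = 4096, total_pfns = 1024, done = 0;
__s64 copied;

pwrite(fd, &start_pfn, sizeof(start_pfn),
       info_off + offsetof(struct vfio_device_migration_info, start_pfn));
pwrite(fd, &page_size, sizeof(page_size),
       info_off + offsetof(struct vfio_device_migration_info, page_size));
pwrite(fd, &total_pfns, sizeof(total_pfns),
       info_off + offsetof(struct vfio_device_migration_info, total_pfns));

do {
    /* the first read after the writes requests a new update; any
     * further read continues/polls the same update */
    pread(fd, &copied, sizeof(copied),
          info_off + offsetof(struct vfio_device_migration_info, copied_pfns));
    if (copied == -1) {              /* all pages in the range are dirty */
        mark_range_dirty(start_pfn, total_pfns);
        break;
    }
    if (copied == 0) {               /* nothing dirty in the rest */
        break;
    }
    consume_bitmap(copied);          /* bitmap is in the data section */
    done += copied;
} while (done < total_pfns);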

>>>  What's the advantage to returning -1 versus
>>> returning copied_pfns == total_pfns?
>>>   
>>
>> If all bits in bitmap are 1, then return -1, that is, all pages in the
>> given range to be marked dirty.
>>
>> If all bits in bitmap are 0, then return 0, that is, no page to be
>> marked dirty in given range or rest of the range.
>>
>> Otherwise vendor driver should return copied_pfns == total_pfn and
>> provide bitmap for total_pfn, which means that bitmap copied for given
>> range contains information for all pages where some bits are 0s and some
>> are 1s.
> 
> Given that the vendor driver can indicate zero dirty pfns and all dirty
> pfns, I interpreted copied_pfns as a synchronous operation where the
> return value could indicate the number of dirty pages within the
> requested range.
> 
>>> If the user then wants to switch back to reading device migration
>>> state, is it a read of data_size that switches the data area back to
>>> making that address space available?   
>>
>> No, it's not just read(data_size); before that there is a
>> read(data_offset). If the vendor driver wants to have different sub-regions
>> for device data and the dirty page bitmap, the vendor driver should return
>> the corresponding offset on read(data_offset).
> 
> The dynamic use of data_offset was not at all evident to me until I got
> further into the QEMU series.  The usage model needs to be well
> specified in the linux header.  I infer this behavior is such that the
> vendor driver can effectively identity map portions of device memory
> and the user will restore to the same offset.  I suppose this is a
> valid approach but it seems specifically tuned to devices which allow
> full direct mapping, whereas many devices have more device memory than
> is directly map'able and state beyond simple device memory.  Does this
> model unnecessarily burden such devices?  It is a nice feature that
> the data range can contain both mmap'd sections and trapped sections
> and by adjusting data_offset the vendor driver can select which is
> currently being used, but we really need to formalize all these details.
> 
>>> In each case, is it the user's
>>> responsibility to consume all the data provided before triggering the
>>> next data area?  For example, if I ask for a range of dirty bitmap, the
>>> vendor driver will provide that range and clear it, such that the
>>> pages are considered clean regardless of whether the user consumed the
>>> data area.    
>>
>> Yes.
>>
>>> Likewise if the user asks for data_size, that would be
>>> deducted from pending_bytes regardless of the user reading the data
>>> area.   
>>
>> User should read data before deducting data_size from pending_bytes.
> 
> The user deducts data_size from pending_bytes?  pending_bytes is
> read-only, how does this work?

pending_bytes is read-only from the migration region. The user should read
device data while pending_bytes > 0. How would the user decide whether to
iterate or not? The user will have to check if the previously read
pending_bytes - data_size is still > 0; if yes, then iterate. Before
iterating, it is the user's responsibility to read data from the data
section.
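
Put as a sketch, the user-side save iteration would be roughly
(forward_data is a hypothetical helper, error handling omitted):

__u64 pending, data_offset, data_size;

pread(fd, &pending, sizeof(pending),
      info_off + offsetof(struct vfio_device_migration_info, pending_bytes));
while (pending > 0) {
    pread(fd, &data_offset, sizeof(data_offset),
          info_off + offsetof(struct vfio_device_migration_info, data_offset));
    pread(fd, &data_size, sizeof(data_size),
          info_off + offsetof(struct vfio_device_migration_info, data_size));
    /* read the staged data first ... */
    forward_data(fd, region_off + data_offset, data_size);
    /* ... then deduct, and iterate while the remainder is > 0 */
    pending -= data_size;
}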

> 
>> From vendor driver point of view, data_size will be deducted from
>> pending_bytes once data is copied to data region.
> 
> If the data is entirely from an mmap'd range, how does the vendor
> driver know when the data is copied?
> 
>>> Are there any read side-effects to pending_bytes?  
>>
>> No, its query to vendor driver about pending bytes yet to be
>> migrated/read from vendor driver.
>>
>>>  Are there
>>> read side-effects to the data area on SAVING?  
>>
>> No.
> 
> So the vendor driver must make an assumption somewhere in the usage
> protocol that it's the user's responsibility; this needs to be
> specified.
> 

Ok.

>>>  Are there write
>>> side-effects on RESUMING, or is it only the write of data_size that
>>> triggers the buffer to be consumed?  
>>
>> It's the write(data_size) that triggers the buffer to be consumed; if the
>> region is mmaped, then data is already copied to the region, and if it's
>> trapped then the following writes from data_offset are the data to be
>> consumed.
>>
>>>  Is it the user's responsibility to
>>> write only full "packets" on RESUMING?  For example if the SAVING side
>>> provides data_size X, that full data_size X must be written to the
>>> RESUMING side, the user cannot write half of it to the data area on the
>>> RESUMING side, write data_size with X/2, write the second half, and
>>> again write X/2.  IOW, the data_size "packet" is indivisible at the
>>> point of resuming.
>>>   
>>
>> If source and destination are compatible or of the same driver version,
>> then if the user is reading data_size X at source/SAVING, the destination
>> should be able to consume data_size X at restoring/RESUMING. Then why
>> should the user write X/2 and iterate?
> 
> Because users do things we don't expect ;)  Maybe they decide to chunk
> the data into smaller packets over the network, but the receiving side
> would rather write the packet immediately rather than queuing it.
> OTOH, does it necessarily matter so long as data_size is written on
> completion of a full "packet"?
> 

It doesn't matter. As long as data is written in the same order as it was
read, the size doesn't matter.
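
So on the RESUMING side, chunked input could be handled like this
sketch (recv_chunk, chunk_buf and packet_size are hypothetical; the
point is that data_size is written only once the full packet is in
place):

__u64 data_offset, data_size;
size_t off = 0;

pread(fd, &data_offset, sizeof(data_offset),
      info_off + offsetof(struct vfio_device_migration_info, data_offset));
while (off < packet_size) {
    ssize_t n = recv_chunk(chunk_buf, sizeof(chunk_buf));
    pwrite(fd, chunk_buf, n, region_off + data_offset + off);
    off += n;
}
data_size = packet_size;
/* this write is what tells the vendor driver to consume the buffer */
pwrite(fd, &data_size, sizeof(data_size),
       info_off + offsetof(struct vfio_device_migration_info, data_size));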

>>> What are the ordering requirements?  Must the user write data_size
>>> packets in the same order that they're read, or is it the vendor
>>> driver's responsibility to include sequence information and allow
>>> restore in any order?
>>>   
>>
>> For the user, data is opaque. The user should write data in the same order
>> as it was received.
> 
> Let's make sure that's specified.
> 

Ok.

Thanks,
Kirti

>>>> + */
>>>> +
>>>> +struct vfio_device_migration_info {
>>>> +        __u32 device_state;         /* VFIO device state */
>>>> +#define VFIO_DEVICE_STATE_STOPPED   (0)  
>>>
>>> We need to be careful with how this is used if we want to leave the
>>> possibility of using the remaining 29 bits of this register.  Maybe we
>>> want to define VFIO_DEVICE_STATE_MASK and be sure that we only do
>>> read-modify-write ops within the mask (ex. set_bit and clear_bit
>>> helpers).  
>>
>> Makes sense, I'll do changes in next iteration.
>>
>>>  Also, above we define STOPPED to indicate simply
>>> not-RUNNING, but here it seems STOPPED means not-RUNNING, not-SAVING,
>>> and not-RESUMING.
>>>   
>>
>> That's correct.
>>
>>>> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
>>>> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
>>>> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
>>>> +#define VFIO_DEVICE_STATE_INVALID   (VFIO_DEVICE_STATE_SAVING | \
>>>> +                                     VFIO_DEVICE_STATE_RESUMING)
>>>> +        __u32 reserved;
>>>> +        __u64 pending_bytes;
>>>> +        __u64 data_offset;  
>>>
>>> Placing the data more than 4GB into the region seems a bit absurd, so
>>> this could probably be a __u32 and take the place of the reserved field.
>>>   
>>
>> Is there a maximum limit on VFIO region size?
>> There isn't any such limit, right? The vendor driver can define a region of
>> any size and then place the data section anywhere in the region. I prefer
>> to keep it __u64.
> 
> We have a single file descriptor for all accesses to the device, which
> gives us quite a bit of per device address space.  As I mention above,
> it wasn't clear to me that data_offset is used dynamically until I got
> further into the series, so it seemed strange to me that we'd choose
> such a large offset, but given my new understanding I agree it requires
> a __u64 currently.  Thanks,
> 
> Alex
> 
>>>> +        __u64 data_size;
>>>> +        __u64 start_pfn;
>>>> +        __u64 page_size;
>>>> +        __u64 total_pfns;
>>>> +        __s64 copied_pfns;  
>>>
>>> If this is signed so that we can get -1 then the user could
>>> theoretically specify total_pfns that we can't represent in
>>> copied_pfns.  Probably best to use unsigned and specify ~0 rather than
>>> -1.
>>>   
>>
>> Ok.
>>
>>> Overall this looks like a good interface, but we need to more
>>> thoroughly define the protocol with the data area and set expectations
>>> we're placing on the user and vendor driver.  There should be no usage
>>> assumptions, it should all be spelled out.  Thanks,
>>>  
>>
>> Thanks for your feedback. I'll update comments above to be more specific.
>>
>> Thanks,
>> Kirti
>>
>>> Alex
>>>   
>>>> +} __attribute__((packed));
>>>> +
>>>>  /*
>>>>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>>>>   * which allows direct access to non-MSIX registers which happened to be within  
>>>   
> 



* Re: [Qemu-devel] [PATCH v4 08/13] vfio: Add save state functions to SaveVMHandlers
  2019-06-21 15:16       ` Alex Williamson
@ 2019-06-21 19:38         ` Kirti Wankhede
  2019-06-21 20:02           ` Alex Williamson
  0 siblings, 1 reply; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-21 19:38 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue



On 6/21/2019 8:46 PM, Alex Williamson wrote:
> On Fri, 21 Jun 2019 12:08:26 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 6/21/2019 12:55 AM, Alex Williamson wrote:
>>> On Thu, 20 Jun 2019 20:07:36 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>   
>>>> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
>>>> functions. These functions handle the pre-copy and stop-and-copy phases.
>>>>
>>>> In _SAVING|_RUNNING device state or pre-copy phase:
>>>> - read pending_bytes
>>>> - read data_offset - indicates to the kernel driver to write data to the
>>>>   staging buffer, which is mmapped.
>>>
>>> Why is data_offset the trigger rather than data_size?  It seems that
>>> data_offset can't really change dynamically since it might be mmap'd,
>>> so it seems unnatural to bother re-reading it.
>>>   
>>
>> The vendor driver can change data_offset; it can have a different
>> data_offset for device data and the dirty pages bitmap.
>>
>>>> - read data_size - amount of data in bytes written by vendor driver in migration
>>>>   region.
>>>> - if data section is trapped, pread() number of bytes in data_size, from
>>>>   data_offset.
>>>> - if data section is mmaped, read mmaped buffer of size data_size.
>>>> - Write data packet to file stream as below:
>>>> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
>>>> VFIO_MIG_FLAG_END_OF_STATE }
>>>>
>>>> In _SAVING device state or stop-and-copy phase
>>>> a. read config space of device and save to migration file stream. This
>>>>    doesn't need to be from vendor driver. Any other special config state
>>>>    from driver can be saved as data in following iteration.
>>>> b. read pending_bytes - indicates kernel driver to write data to staging
>>>>    buffer which is mmapped.  
>>>
>>> Is it pending_bytes or data_offset that triggers the write out of
>>> data?  Why pending_bytes vs data_size?  I was interpreting
>>> pending_bytes as the total data size while data_size is the size
>>> available to read now, so assumed data_size would be more closely
>>> aligned to making the data available.
>>>   
>>
>> Sorry, that's my mistake while editing; it's read data_offset as in the
>> above case.
>>
>>>> c. read data_size - amount of data in bytes written by vendor driver in
>>>>    migration region.
>>>> d. if data section is trapped, pread() from data_offset of size data_size.
>>>> e. if data section is mmaped, read mmaped buffer of size data_size.  
>>>
>>> Should this read as "pread() from data_offset of data_size, or
>>> optionally if mmap is supported on the data area, read data_size from
>>> start of mapped buffer"?  IOW, pread should always work.  Same in
>>> previous section.
>>>   
>>
>> ok. I'll update.
>>
>>>> f. Write data packet as below:
>>>>    {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
>>>> g. iterate through steps b to f until (pending_bytes > 0)  
>>>
>>> s/until/while/  
>>
>> Ok.
>>
>>>   
>>>> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
>>>>
>>>> .save_live_iterate runs outside the iothread lock in the migration case,
>>>> which could race with an asynchronous call to get the dirty page list,
>>>> causing data corruption in the mapped migration region. A mutex is added
>>>> here to serialize migration buffer read operations.
>>>
>>> Would we be ahead to use different offsets within the region for device
>>> data vs dirty bitmap to avoid this?
>>>  
>>
>> A lock will still be required to serialize the read/write operations on
>> the vfio_device_migration_info structure in the region.
>>
>>
>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>>> ---
>>>>  hw/vfio/migration.c | 212 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 212 insertions(+)
>>>>
>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>> index fe0887c27664..0a2f30872316 100644
>>>> --- a/hw/vfio/migration.c
>>>> +++ b/hw/vfio/migration.c
>>>> @@ -107,6 +107,111 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
>>>>      return 0;
>>>>  }
>>>>  
>>>> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
>>>> +{
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>> +    VFIORegion *region = &migration->region.buffer;
>>>> +    uint64_t data_offset = 0, data_size = 0;
>>>> +    int ret;
>>>> +
>>>> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
>>>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>>>> +                                             data_offset));
>>>> +    if (ret != sizeof(data_offset)) {
>>>> +        error_report("Failed to get migration buffer data offset %d",
>>>> +                     ret);
>>>> +        return -EINVAL;
>>>> +    }
>>>> +
>>>> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
>>>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>>>> +                                             data_size));
>>>> +    if (ret != sizeof(data_size)) {
>>>> +        error_report("Failed to get migration buffer data size %d",
>>>> +                     ret);
>>>> +        return -EINVAL;
>>>> +    }
>>>> +
>>>> +    if (data_size > 0) {
>>>> +        void *buf = NULL;
>>>> +        bool buffer_mmaped = false;
>>>> +
>>>> +        if (region->mmaps) {
>>>> +            int i;
>>>> +
>>>> +            for (i = 0; i < region->nr_mmaps; i++) {
>>>> +                if ((data_offset >= region->mmaps[i].offset) &&
>>>> +                    (data_offset < region->mmaps[i].offset +
>>>> +                                   region->mmaps[i].size)) {
>>>> +                    buf = region->mmaps[i].mmap + (data_offset -
>>>> +                                                   region->mmaps[i].offset);  
>>>
>>> So you're expecting that data_offset is somewhere within the data
>>> area.  Why doesn't the data always simply start at the beginning of the
>>> data area?  ie. data_offset would coincide with the beginning of the
>>> mmap'able area (if supported) and be static.  Does this enable some
>>> functionality in the vendor driver?  
>>
>> Do you want to enforce that on the vendor driver?
>> From the feedback on the previous version I thought the vendor driver
>> should define data_offset within the region:
>> "I'd suggest that the vendor driver expose a read-only
>> data_offset that matches a sparse mmap capability entry should the
>> driver support mmap.  The use should always read or write data from the
>> vendor defined data_offset"
>>
>> This also adds flexibility to the vendor driver such that the vendor
>> driver can define different data_offsets for device data and the dirty
>> page bitmap within the same mmaped region.
> 
> I agree, it adds flexibility, the protocol was not evident to me until
> I got here though.
> 
>>>  Does resume data need to be
>>> written from the same offset where it's read?  
>>
>> No, resume data should be written from the data_offset that the vendor
>> driver provided during resume.
> 
> s/resume/save/?
> 
> Or is this saying that on resume the vendor driver is requesting a
> specific block of data via data_offset? 

Correct.

Thanks,
Kirti

> I think resume is going to be
> directed by the user, writing in the same order they received the
> data.  Thanks,
> 
> Alex
> 
>>>> +                    buffer_mmaped = true;
>>>> +                    break;
>>>> +                }
>>>> +            }
>>>> +        }
>>>> +
>>>> +        if (!buffer_mmaped) {
>>>> +            buf = g_malloc0(data_size);
>>>> +            ret = pread(vbasedev->fd, buf, data_size,
>>>> +                        region->fd_offset + data_offset);
>>>> +            if (ret != data_size) {
>>>> +                error_report("Failed to get migration data %d", ret);
>>>> +                g_free(buf);
>>>> +                return -EINVAL;
>>>> +            }
>>>> +        }
>>>> +
>>>> +        qemu_put_be64(f, data_size);
>>>> +        qemu_put_buffer(f, buf, data_size);
>>>> +
>>>> +        if (!buffer_mmaped) {
>>>> +            g_free(buf);
>>>> +        }
>>>> +        migration->pending_bytes -= data_size;
>>>> +    } else {
>>>> +        qemu_put_be64(f, data_size);
>>>> +    }
>>>> +
>>>> +    ret = qemu_file_get_error(f);
>>>> +
>>>> +    return data_size;
>>>> +}
>>>> +
>>>> +static int vfio_update_pending(VFIODevice *vbasedev)
>>>> +{
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>> +    VFIORegion *region = &migration->region.buffer;
>>>> +    uint64_t pending_bytes = 0;
>>>> +    int ret;
>>>> +
>>>> +    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
>>>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>>>> +                                             pending_bytes));  
>>>
>>> Did this trigger the vendor driver to write out to the data area when
>>> we don't need it to?
>>>   
>>
>> No, as I mentioned above, I'll update the description.
>>
>> Thanks,
>> Kirti
>>
>>>> +    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
>>>> +        error_report("Failed to get pending bytes %d", ret);
>>>> +        migration->pending_bytes = 0;
>>>> +        return (ret < 0) ? ret : -EINVAL;
>>>> +    }
>>>> +
>>>> +    migration->pending_bytes = pending_bytes;
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
>>>> +{
>>>> +    VFIODevice *vbasedev = opaque;
>>>> +
>>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
>>>> +
>>>> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
>>>> +        vfio_pci_save_config(vbasedev, f);
>>>> +    }
>>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>>> +
>>>> +    return qemu_file_get_error(f);
>>>> +}
>>>> +
>>>>  /* ---------------------------------------------------------------------- */
>>>>  
>>>>  static int vfio_save_setup(QEMUFile *f, void *opaque)
>>>> @@ -163,9 +268,116 @@ static void vfio_save_cleanup(void *opaque)
>>>>      }
>>>>  }
>>>>  
>>>> +static void vfio_save_pending(QEMUFile *f, void *opaque,
>>>> +                              uint64_t threshold_size,
>>>> +                              uint64_t *res_precopy_only,
>>>> +                              uint64_t *res_compatible,
>>>> +                              uint64_t *res_postcopy_only)
>>>> +{
>>>> +    VFIODevice *vbasedev = opaque;
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>> +    int ret;
>>>> +
>>>> +    ret = vfio_update_pending(vbasedev);
>>>> +    if (ret) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING) {
>>>> +        *res_precopy_only += migration->pending_bytes;
>>>> +    } else {
>>>> +        *res_postcopy_only += migration->pending_bytes;
>>>> +    }
>>>> +    *res_compatible += 0;
>>>> +}
>>>> +
>>>> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
>>>> +{
>>>> +    VFIODevice *vbasedev = opaque;
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>> +    int ret;
>>>> +
>>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
>>>> +
>>>> +    qemu_mutex_lock(&migration->lock);
>>>> +    ret = vfio_save_buffer(f, vbasedev);
>>>> +    qemu_mutex_unlock(&migration->lock);
>>>> +
>>>> +    if (ret < 0) {
>>>> +        error_report("vfio_save_buffer failed %s",
>>>> +                     strerror(errno));
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>>> +
>>>> +    ret = qemu_file_get_error(f);
>>>> +    if (ret) {
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>>> +{
>>>> +    VFIODevice *vbasedev = opaque;
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>> +    int ret;
>>>> +
>>>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_SAVING);
>>>> +    if (ret) {
>>>> +        error_report("Failed to set state STOP and SAVING");
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    ret = vfio_save_device_config_state(f, opaque);
>>>> +    if (ret) {
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    ret = vfio_update_pending(vbasedev);
>>>> +    if (ret) {
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    while (migration->pending_bytes > 0) {
>>>> +        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
>>>> +        ret = vfio_save_buffer(f, vbasedev);
>>>> +        if (ret < 0) {
>>>> +            error_report("Failed to save buffer");
>>>> +            return ret;
>>>> +        } else if (ret == 0) {
>>>> +            break;
>>>> +        }
>>>> +
>>>> +        ret = vfio_update_pending(vbasedev);
>>>> +        if (ret) {
>>>> +            return ret;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>>> +
>>>> +    ret = qemu_file_get_error(f);
>>>> +    if (ret) {
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOPPED);
>>>> +    if (ret) {
>>>> +        error_report("Failed to set state STOPPED");
>>>> +        return ret;
>>>> +    }
>>>> +    return ret;
>>>> +}
>>>> +
>>>>  static SaveVMHandlers savevm_vfio_handlers = {
>>>>      .save_setup = vfio_save_setup,
>>>>      .save_cleanup = vfio_save_cleanup,
>>>> +    .save_live_pending = vfio_save_pending,
>>>> +    .save_live_iterate = vfio_save_iterate,
>>>> +    .save_live_complete_precopy = vfio_save_complete_precopy,
>>>>  };
>>>>  
>>>>  /* ---------------------------------------------------------------------- */  
>>>   
> 



* Re: [Qemu-devel] [PATCH v4 01/13] vfio: KABI for migration interface
  2019-06-21 19:35         ` Kirti Wankhede
@ 2019-06-21 20:00           ` Alex Williamson
  2019-06-21 20:30             ` Kirti Wankhede
  0 siblings, 1 reply; 64+ messages in thread
From: Alex Williamson @ 2019-06-21 20:00 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Sat, 22 Jun 2019 01:05:48 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 6/21/2019 8:33 PM, Alex Williamson wrote:
> > On Fri, 21 Jun 2019 11:22:15 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 6/20/2019 10:48 PM, Alex Williamson wrote:  
> >>> On Thu, 20 Jun 2019 20:07:29 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>     
> >>>> - Defined MIGRATION region type and sub-type.
> >>>> - Used 3 bits to define VFIO device states.
> >>>>     Bit 0 => _RUNNING
> >>>>     Bit 1 => _SAVING
> >>>>     Bit 2 => _RESUMING
> >>>>     Combination of these bits defines VFIO device's state during migration
> >>>>     _STOPPED => All bits 0 indicates VFIO device stopped.
> >>>>     _RUNNING => Normal VFIO device running state.
> >>>>     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but start
> >>>>                           saving state of device i.e. pre-copy state
> >>>>     _SAVING  => vCPUs are stopped, VFIO device should be stopped, and
> >>>>                           save device state, i.e. stop-n-copy state
> >>>>     _RESUMING => VFIO device resuming state.
> >>>>     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING bits are set
> >>>> - Defined vfio_device_migration_info structure which will be placed at 0th
> >>>>   offset of migration region to get/set VFIO device related information.
> >>>>   Defined members of structure and usage on read/write access:
> >>>>     * device_state: (read/write)
> >>>>         To convey VFIO device state to be transitioned to. Only 3 bits are used
> >>>>         as of now.
> >>>>     * pending bytes: (read only)
> >>>>         To get pending bytes yet to be migrated for VFIO device.
> >>>>     * data_offset: (read only)
> >>>>         To get data offset in migration region from where data exists
> >>>>         during _SAVING and from where data should be written by user
> >>>>         space application during _RESUMING state
> >>>>     * data_size: (read/write)
> >>>>         To get and set size of data copied in migration region during _SAVING
> >>>>         and _RESUMING state.
> >>>>     * start_pfn, page_size, total_pfns: (write only)
> >>>>         To get bitmap of dirty pages from vendor driver from given
> >>>>         start address for total_pfns.
> >>>>     * copied_pfns: (read only)
> >>>>         To get number of pfns bitmap copied in migration region.
> >>>>         Vendor driver should copy the bitmap with bits set only for
> >>>>         pages to be marked dirty in migration region. Vendor driver
> >>>>         should return 0 if there are 0 pages dirty in requested
> >>>>         range. Vendor driver should return -1 to mark all pages in the section
> >>>>         as dirty
> >>>>
> >>>> Migration region looks like:
> >>>>  ------------------------------------------------------------------
> >>>> |vfio_device_migration_info|    data section                      |
> >>>> |                          |     ///////////////////////////////  |
> >>>>  ------------------------------------------------------------------
> >>>>  ^                              ^                              ^
> >>>>  offset 0-trapped part        data_offset                 data_size
> >>>>
> >>>> The data section always follows the vfio_device_migration_info
> >>>> structure in the region, so data_offset will always be non-0.
> >>>> The offset from where data is copied is decided by the kernel driver;
> >>>> the data section can be trapped or mapped depending on how the kernel
> >>>> driver defines it. If mmapped, then data_offset should be page
> >>>> aligned, whereas the initial section which contains the
> >>>> vfio_device_migration_info structure might not end at an offset which
> >>>> is page aligned.
> >>>>
> >>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >>>> ---
> >>>>  linux-headers/linux/vfio.h | 71 ++++++++++++++++++++++++++++++++++++++++++++++
> >>>>  1 file changed, 71 insertions(+)
> >>>>
> >>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >>>> index 24f505199f83..274ec477eb82 100644
> >>>> --- a/linux-headers/linux/vfio.h
> >>>> +++ b/linux-headers/linux/vfio.h
> >>>> @@ -372,6 +372,77 @@ struct vfio_region_gfx_edid {
> >>>>   */
> >>>>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
> >>>>  
> >>>> +/* Migration region type and sub-type */
> >>>> +#define VFIO_REGION_TYPE_MIGRATION	        (2)
> >>>> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> >>>> +
> >>>> +/**
> >>>> + * Structure vfio_device_migration_info is placed at 0th offset of
> >>>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> >>>> + * information. Field accesses from this structure are only supported at their
> >>>> + * native width and alignment, otherwise should return error.
> >>>> + *
> >>>> + * device_state: (read/write)
> >>>> + *      To indicate to the vendor driver the state the VFIO device should
> >>>> + *      be transitioned to. If the device state transition fails, a write
> >>>> + *      to this field returns an error. It consists of 3 bits:
> >>>> + *      - If bit 0 is set, it indicates the _RUNNING state. When it is
> >>>> + *        cleared, that indicates the _STOPPED state. When the device is
> >>>> + *        changed to _STOPPED, the driver should stop the device before the
> >>>> + *        write returns.
> >>>> + *      - If bit 1 set, indicates _SAVING state.
> >>>> + *      - If bit 2 set, indicates _RESUMING state.
> >>>> + *
> >>>> + * pending bytes: (read only)
> >>>> + *      Read pending bytes yet to be migrated from vendor driver
> >>>> + *
> >>>> + * data_offset: (read only)
> >>>> + *      User application should read data_offset in migration region from where
> >>>> + *      user application should read data during _SAVING state or write data
> >>>> + *      during _RESUMING state.
> >>>> + *
> >>>> + * data_size: (read/write)
> >>>> + *      User application should read data_size to know data copied in migration
> >>>> + *      region during _SAVING state and write size of data copied in migration
> >>>> + *      region during _RESUMING state.
> >>>> + *
> >>>> + * start_pfn: (write only)
> >>>> + *      Start address pfn to get bitmap of dirty pages from vendor driver during
> >>>> + *      _SAVING state.
> >>>> + *
> >>>> + * page_size: (write only)
> >>>> + *      User application should write the page_size of pfn.
> >>>> + *
> >>>> + * total_pfns: (write only)
> >>>> + *      Total pfn count from start_pfn for which dirty bitmap is requested.
> >>>> + *
> >>>> + * copied_pfns: (read only)
> >>>> + *      pfn count for which dirty bitmap is copied to migration region.
> >>>> + *      Vendor driver should copy the bitmap with bits set only for pages to be
> >>>> + *      marked dirty in migration region.
> >>>> + *      Vendor driver should return 0 if there are 0 pages dirty in requested
> >>>> + *      range.
> >>>> + *      Vendor driver should return -1 to mark all pages in the section as
> >>>> + *      dirty.    
> >>>
> >>> Is the protocol that the user writes start_pfn/page_size/total_pfns in
> >>> any order and then the read of copied_pfns is what triggers the
> >>> snapshot?    
> >>
> >> Yes.
> >>  
> >>>  Are start_pfn/page_size/total_pfns sticky such that a user
> >>> can write them once and get repeated refreshes of the dirty bitmap by
> >>> re-reading copied_pfns?    
> >>
>> Yes, and that bitmap should be for the given range (from start_pfn till
>> start_pfn + total_pfns).
>> Re-reading of copied_pfns is to handle the case where it might be
>> possible that the vendor driver reserved an area for the bitmap < the
>> total bitmap size for the range (start_pfn to start_pfn + total_pfns);
>> then the user will have to iterate till copied_pfns == total_pfns or
>> till copied_pfns == 0 (that is, there are no pages dirty in the rest of
>> the range)
> > 
> > So reading copied_pfns triggers the data range to be updated, but the
> > caller cannot assume it to be synchronous and uses total_pfns to poll
> > that the update is complete?  How does the vendor driver differentiate
> > the user polling for the previous update to finish versus requesting a
> > new update?
> >   
> 
> A write to start_pfn/page_size/total_pfns followed by a read of
> copied_pfns indicates a new update, whereas a sequential read of
> copied_pfns indicates polling for the previous update.
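
In code form, that sequence would look roughly like this (a sketch
against the vfio_device_migration_info layout proposed in patch 1;
error handling is trimmed and consume_bitmap() is a made-up
placeholder for reading the bitmap out of the data section):

#include <stddef.h>
#include <stdint.h>
#include <unistd.h>
#include <linux/vfio.h>

static int query_dirty_bitmap(int fd, uint64_t region_off,
                              uint64_t start_pfn, uint64_t page_size,
                              uint64_t total_pfns)
{
    uint64_t done = 0;

    /* Writing the range fields starts a new update... */
    pwrite(fd, &start_pfn, sizeof(start_pfn), region_off +
           offsetof(struct vfio_device_migration_info, start_pfn));
    pwrite(fd, &page_size, sizeof(page_size), region_off +
           offsetof(struct vfio_device_migration_info, page_size));
    pwrite(fd, &total_pfns, sizeof(total_pfns), region_off +
           offsetof(struct vfio_device_migration_info, total_pfns));

    /* ...and repeated reads of copied_pfns continue that update until
     * the whole range has been covered. */
    while (done < total_pfns) {
        int64_t copied = 0;

        pread(fd, &copied, sizeof(copied), region_off +
              offsetof(struct vfio_device_migration_info, copied_pfns));
        if (copied == 0)
            break;          /* rest of the range is clean */
        if (copied == -1)
            return 1;       /* treat every page in the range as dirty */

        consume_bitmap(fd, done, copied);   /* made-up helper */
        done += copied;
    }
    return 0;
}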

Hmm, this seems to contradict the answer to my question above where I
ask if the write fields are sticky so a user can trigger a refresh via
copied_pfns.  Does it really make sense that this is asynchronous?  Are
we going to need to specify polling intervals and completion eventfds?
data_size is synchronous, right?  Thanks,

Alex

> >>>  What's the advantage to returning -1 versus
> >>> returning copied_pfns == total_pfns?
> >>>     
> >>
> >> If all bits in bitmap are 1, then return -1, that is, all pages in the
> >> given range to be marked dirty.
> >>
> >> If all bits in bitmap are 0, then return 0, that is, no page to be
> >> marked dirty in given range or rest of the range.
> >>
>> Otherwise the vendor driver should return copied_pfns == total_pfns and
>> provide a bitmap for total_pfns, which means that the bitmap copied for
>> the given range contains information for all pages, where some bits are
>> 0s and some are 1s.
> > 
> > Given that the vendor driver can indicate zero dirty pfns and all dirty
> > pfns, I interpreted copied_pfns as a synchronous operation where the
> > return value could indicate the number of dirty pages within the
> > requested range.
> >   
> >>> If the user then wants to switch back to reading device migration
> >>> state, is it a read of data_size that switches the data area back to
> >>> making that address space available?     
> >>
>> No, it's not just read(data_size); before that there is a
>> read(data_offset). If the vendor driver wants to have different
>> sub-regions for device data and the dirty page bitmap, it should return
>> the corresponding offset on read(data_offset).
> > 
> > The dynamic use of data_offset was not at all evident to me until I got
> > further into the QEMU series.  The usage model needs to be well
> > specified in the linux header.  I infer this behavior is such that the
> > vendor driver can effectively identity map portions of device memory
> > and the user will restore to the same offset.  I suppose this is a
> > valid approach but it seems specifically tuned to devices which allow
> > full direct mapping, whereas many devices have more device memory than
> > is directly map'able and state beyond simple device memory.  Does this
> > model unnecessarily burden such devices?  It is a nice feature that
> > the data range can contain both mmap'd sections and trapped sections
> > and by adjusting data_offset the vendor driver can select which is
> > currently being used, but we really need to formalize all these details.
> >   
> >>> In each case, is it the user's
> >>> responsibility to consume all the data provided before triggering the
> >>> next data area? For example, if I ask for a range of dirty bitmap, the
> >>> vendor driver will provide that range and clear it, such that the
> >>> pages are considered clean regardless of whether the user consumed the
> >>> data area.      
> >>
> >> Yes.
> >>  
> >>> Likewise if the user asks for data_size, that would be
> >>> deducted from pending_bytes regardless of the user reading the data
> >>> area.     
> >>
> >> User should read data before deducting data_size from pending_bytes.  
> > 
> > The user deducts data_size from pending_bytes?  pending_bytes is
> > read-only, how does this work?
> 
> pending_bytes is read-only from the migration region. The user should
> read device data while pending_bytes > 0. How would the user decide
> whether to iterate? The user will have to check if the previously read
> pending_bytes - data_size is still > 0; if yes, then iterate. Before
> iterating, it is the user's responsibility to read data from the data
> section.
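
Roughly, the iteration being described is (a sketch; read_one_chunk()
is a made-up placeholder for the data_offset/data_size/data reads):

uint64_t pending = 0;

pread(fd, &pending, sizeof(pending), region_off +
      offsetof(struct vfio_device_migration_info, pending_bytes));
while (pending > 0) {
    /* made-up helper: read data_offset, then data_size, then the
     * data itself; the vendor driver deducts data_size from
     * pending_bytes once the chunk has been copied out */
    read_one_chunk(fd, region_off);

    pread(fd, &pending, sizeof(pending), region_off +
          offsetof(struct vfio_device_migration_info, pending_bytes));
}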
> 
> >   
> >> From vendor driver point of view, data_size will be deducted from
> >> pending_bytes once data is copied to data region.  
> > 
> > If the data is entirely from an mmap'd range, how does the vendor
> > driver know when the data is copied?
> >   
> >>> Are there any read side-effects to pending_bytes?    
> >>
>> No, it's a query to the vendor driver about pending bytes yet to be
>> migrated/read from the vendor driver.
> >>  
> >>>  Are there
> >>> read side-effects to the data area on SAVING?    
> >>
> >> No.  
> > 
> > So the vendor driver must make an assumption somewhere in the usage
> > protocol that it's the user's responsibility; this needs to be
> > specified.
> >   
> 
> Ok.
> 
> >>>  Are there write
> >>> side-effects on RESUMING, or is it only the write of data_size that
> >>> triggers the buffer to be consumed?    
> >>
>> It's the write(data_size) that triggers the buffer to be consumed. If
>> the region is mmapped, then the data is already copied to the region;
>> if it's trapped, then the following writes from data_offset are the
>> data to be consumed.
> >>  
> >>>  Is it the user's responsibility to
> >>> write only full "packets" on RESUMING?  For example if the SAVING side
> >>> provides data_size X, that full data_size X must be written to the
> >>> RESUMING side, the user cannot write half of it to the data area on the
> >>> RESUMING side, write data_size with X/2, write the second half, and
> >>> again write X/2.  IOW, the data_size "packet" is indivisible at the
> >>> point of resuming.
> >>>     
> >>
>> If source and destination are compatible or of the same driver version,
>> then if the user is reading data_size X at the source/SAVING side, the
>> destination should be able to consume data_size X at the
>> restoring/RESUMING side. Then why should the user write X/2 and iterate?
> > 
> > Because users do things we don't expect ;)  Maybe they decide to chunk
> > the data into smaller packets over the network, but the receiving side
> > would rather write the packet immediately rather than queuing it.
> > OTOH, does it necessarily matter so long as data_size is written on
> > completion of a full "packet"?
> >   
> 
> Doesn't matter. As long as data is written in the same order as it was
> read, size doesn't matter.
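
For example, the receiving side could place one full "packet" in the
data area with two writes and still write data_size only once, at the
end (a sketch; 'off' stands for the data_offset the vendor driver
returned, 'buf'/'size' for the received packet):

/* Two partial writes of the packet body, then a single data_size
 * write for the complete packet. */
pwrite(fd, buf, size / 2, region_off + off);
pwrite(fd, buf + size / 2, size - size / 2, region_off + off + size / 2);
pwrite(fd, &size, sizeof(size), region_off +
       offsetof(struct vfio_device_migration_info, data_size));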
> 
> >>> What are the ordering requirements?  Must the user write data_size
> >>> packets in the same order that they're read, or is it the vendor
> >>> driver's responsibility to include sequence information and allow
> >>> restore in any order?
> >>>     
> >>
>> For the user, data is opaque. The user should write data in the same
>> order as it was received.
> > 
> > Let's make sure that's specified.
> >   
> 
> Ok.
> 
> Thanks,
> Kirti
> 
> >>>> + */
> >>>> +
> >>>> +struct vfio_device_migration_info {
> >>>> +        __u32 device_state;         /* VFIO device state */
> >>>> +#define VFIO_DEVICE_STATE_STOPPED   (0)    
> >>>
> >>> We need to be careful with how this is used if we want to leave the
> >>> possibility of using the remaining 29 bits of this register.  Maybe we
> >>> want to define VFIO_DEVICE_STATE_MASK and be sure that we only do
> >>> read-modify-write ops within the mask (ex. set_bit and clear_bit
> >>> helpers).    
> >>
> >> Makes sense, I'll do changes in next iteration.
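
For instance (a sketch only; VFIO_DEVICE_STATE_MASK and this helper are
proposed names, not yet part of the header):

#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
                                     VFIO_DEVICE_STATE_SAVING  | \
                                     VFIO_DEVICE_STATE_RESUMING)

/* Read-modify-write only within the mask, preserving the remaining
 * 29 bits of device_state for future use. */
static inline uint32_t vfio_device_state_update(uint32_t reg,
                                                uint32_t state)
{
    return (reg & ~VFIO_DEVICE_STATE_MASK) | (state & VFIO_DEVICE_STATE_MASK);
}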
> >>  
> >>>  Also, above we define STOPPED to indicate simply
> >>> not-RUNNING, but here it seems STOPPED means not-RUNNING, not-SAVING,
> >>> and not-RESUMING.
> >>>     
> >>
> >> That's correct.
> >>  
> >>>> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> >>>> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> >>>> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> >>>> +#define VFIO_DEVICE_STATE_INVALID   (VFIO_DEVICE_STATE_SAVING | \
> >>>> +                                     VFIO_DEVICE_STATE_RESUMING)
> >>>> +        __u32 reserved;
> >>>> +        __u64 pending_bytes;
> >>>> +        __u64 data_offset;    
> >>>
> >>> Placing the data more than 4GB into the region seems a bit absurd, so
> >>> this could probably be a __u32 and take the place of the reserved field.
> >>>     
> >>
>> Is there a maximum limit on VFIO region size?
>> There isn't any such limit, right? The vendor driver can define a region
>> of any size and then place the data section anywhere in the region. I
>> prefer to keep it __u64.
> > 
> > We have a single file descriptor for all accesses to the device, which
> > gives us quite a bit of per device address space.  As I mention above,
> > it wasn't clear to me that data_offset is used dynamically until I got
> > further into the series, so it seemed strange to me that we'd choose
> > such a large offset, but given my new understanding I agree it requires
> > a __u64 currently.  Thanks,
> > 
> > Alex
> >   
> >>>> +        __u64 data_size;
> >>>> +        __u64 start_pfn;
> >>>> +        __u64 page_size;
> >>>> +        __u64 total_pfns;
> >>>> +        __s64 copied_pfns;    
> >>>
> >>> If this is signed so that we can get -1 then the user could
> >>> theoretically specify total_pfns that we can't represent in
> >>> copied_pfns.  Probably best to use unsigned and specify ~0 rather than
> >>> -1.
> >>>     
> >>
> >> Ok.
> >>  
> >>> Overall this looks like a good interface, but we need to more
> >>> thoroughly define the protocol with the data area and set expectations
> >>> we're placing on the user and vendor driver.  There should be no usage
> >>> assumptions, it should all be spelled out.  Thanks,
> >>>    
> >>
> >> Thanks for your feedback. I'll update comments above to be more specific.
> >>
> >> Thanks,
> >> Kirti
> >>  
> >>> Alex
> >>>     
> >>>> +} __attribute__((packed));
> >>>> +
> >>>>  /*
> >>>>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
> >>>>   * which allows direct access to non-MSIX registers which happened to be within    
> >>>     
> >   




* Re: [Qemu-devel] [PATCH v4 08/13] vfio: Add save state functions to SaveVMHandlers
  2019-06-21 19:38         ` Kirti Wankhede
@ 2019-06-21 20:02           ` Alex Williamson
  2019-06-21 20:07             ` Kirti Wankhede
  0 siblings, 1 reply; 64+ messages in thread
From: Alex Williamson @ 2019-06-21 20:02 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Sat, 22 Jun 2019 01:08:40 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 6/21/2019 8:46 PM, Alex Williamson wrote:
> > On Fri, 21 Jun 2019 12:08:26 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 6/21/2019 12:55 AM, Alex Williamson wrote:  
> >>> On Thu, 20 Jun 2019 20:07:36 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>     
> >>>> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
> >>>> functions. These functions handle the pre-copy and stop-and-copy phases.
> >>>>
> >>>> In _SAVING|_RUNNING device state or pre-copy phase:
> >>>> - read pending_bytes
> >>>> - read data_offset - indicates to the kernel driver to write data to the
> >>>>   staging buffer, which is mmapped.
> >>>
> >>> Why is data_offset the trigger rather than data_size?  It seems that
> >>> data_offset can't really change dynamically since it might be mmap'd,
> >>> so it seems unnatural to bother re-reading it.
> >>>     
> >>
> >> Vendor driver can change data_offset, he can have different data_offset
> >> for device data and dirty pages bitmap.
> >>  
> >>>> - read data_size - amount of data in bytes written by vendor driver in migration
> >>>>   region.
> >>>> - if data section is trapped, pread() number of bytes in data_size, from
> >>>>   data_offset.
> >>>> - if data section is mmaped, read mmaped buffer of size data_size.
> >>>> - Write data packet to file stream as below:
> >>>> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> >>>> VFIO_MIG_FLAG_END_OF_STATE }
> >>>>
> >>>> In _SAVING device state or stop-and-copy phase
> >>>> a. read config space of device and save to migration file stream. This
> >>>>    doesn't need to be from vendor driver. Any other special config state
> >>>>    from driver can be saved as data in following iteration.
> >>>> b. read pending_bytes - indicates to the kernel driver to write data to
> >>>>    the staging buffer, which is mmapped.
> >>>
> >>> Is it pending_bytes or data_offset that triggers the write out of
> >>> data?  Why pending_bytes vs data_size?  I was interpreting
> >>> pending_bytes as the total data size while data_size is the size
> >>> available to read now, so assumed data_size would be more closely
> >>> aligned to making the data available.
> >>>     
> >>
> >> Sorry, that's my mistake while editing; it's read data_offset, as in the
> >> above case.
> >>  
> >>>> c. read data_size - amount of data in bytes written by vendor driver in
> >>>>    migration region.
> >>>> d. if data section is trapped, pread() from data_offset of size data_size.
> >>>> e. if data section is mmaped, read mmaped buffer of size data_size.    
> >>>
> >>> Should this read as "pread() from data_offset of data_size, or
> >>> optionally if mmap is supported on the data area, read data_size from
> >>> start of mapped buffer"?  IOW, pread should always work.  Same in
> >>> previous section.
> >>>     
> >>
> >> ok. I'll update.
> >>  
> >>>> f. Write data packet as below:
> >>>>    {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> >>>> g. iterate through steps b to f until (pending_bytes > 0)    
> >>>
> >>> s/until/while/    
> >>
> >> Ok.
> >>  
> >>>     
> >>>> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
> >>>>
> >>>> .save_live_iterate runs outside the iothread lock in the migration case, which
> >>>> could race with an asynchronous call to get the dirty page list, causing data
> >>>> corruption in the mapped migration region. A mutex is added here to serialize
> >>>> migration buffer read operations.
> >>>
> >>> Would we be ahead to use different offsets within the region for device
> >>> data vs dirty bitmap to avoid this?
> >>>    
> >>
> >> A lock will still be required to serialize the read/write operations on
> >> the vfio_device_migration_info structure in the region.
> >>
> >>  
> >>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >>>> ---
> >>>>  hw/vfio/migration.c | 212 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>  1 file changed, 212 insertions(+)
> >>>>
> >>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >>>> index fe0887c27664..0a2f30872316 100644
> >>>> --- a/hw/vfio/migration.c
> >>>> +++ b/hw/vfio/migration.c
> >>>> @@ -107,6 +107,111 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
> >>>>      return 0;
> >>>>  }
> >>>>  
> >>>> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
> >>>> +{
> >>>> +    VFIOMigration *migration = vbasedev->migration;
> >>>> +    VFIORegion *region = &migration->region.buffer;
> >>>> +    uint64_t data_offset = 0, data_size = 0;
> >>>> +    int ret;
> >>>> +
> >>>> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> >>>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> >>>> +                                             data_offset));
> >>>> +    if (ret != sizeof(data_offset)) {
> >>>> +        error_report("Failed to get migration buffer data offset %d",
> >>>> +                     ret);
> >>>> +        return -EINVAL;
> >>>> +    }
> >>>> +
> >>>> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
> >>>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> >>>> +                                             data_size));
> >>>> +    if (ret != sizeof(data_size)) {
> >>>> +        error_report("Failed to get migration buffer data size %d",
> >>>> +                     ret);
> >>>> +        return -EINVAL;
> >>>> +    }
> >>>> +
> >>>> +    if (data_size > 0) {
> >>>> +        void *buf = NULL;
> >>>> +        bool buffer_mmaped = false;
> >>>> +
> >>>> +        if (region->mmaps) {
> >>>> +            int i;
> >>>> +
> >>>> +            for (i = 0; i < region->nr_mmaps; i++) {
> >>>> +                if ((data_offset >= region->mmaps[i].offset) &&
> >>>> +                    (data_offset < region->mmaps[i].offset +
> >>>> +                                   region->mmaps[i].size)) {
> >>>> +                    buf = region->mmaps[i].mmap + (data_offset -
> >>>> +                                                   region->mmaps[i].offset);    
> >>>
> >>> So you're expecting that data_offset is somewhere within the data
> >>> area.  Why doesn't the data always simply start at the beginning of the
> >>> data area?  ie. data_offset would coincide with the beginning of the
> >>> mmap'able area (if supported) and be static.  Does this enable some
> >>> functionality in the vendor driver?    
> >>
> >> Do you want to enforce that on the vendor driver?
> >> From the feedback on the previous version I thought the vendor driver
> >> should define data_offset within the region:
> >> "I'd suggest that the vendor driver expose a read-only
> >> data_offset that matches a sparse mmap capability entry should the
> >> driver support mmap.  The user should always read or write data from the
> >> vendor defined data_offset"
> >>
> >> This also adds flexibility for the vendor driver, which can define
> >> different data_offsets for device data and the dirty page bitmap
> >> within the same mmapped region.
> > 
> > I agree, it adds flexibility, the protocol was not evident to me until
> > I got here though.
> >   
> >>>  Does resume data need to be
> >>> written from the same offset where it's read?    
> >>
> >> No, resume data should be written from the data_offset that the vendor
> >> driver provided during resume.

A)

> > s/resume/save/?

B)
 
> > Or is this saying that on resume the vendor driver is requesting a
> > specific block of data via data_offset?
> 
> Correct.

Which one is correct?  Thanks,

Alex

> > I think resume is going to be
> > directed by the user, writing in the same order they received the
> > data.  Thanks,



* Re: [Qemu-devel] [PATCH v4 08/13] vfio: Add save state functions to SaveVMHandlers
  2019-06-21 20:02           ` Alex Williamson
@ 2019-06-21 20:07             ` Kirti Wankhede
  2019-06-21 20:32               ` Alex Williamson
  0 siblings, 1 reply; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-21 20:07 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue



On 6/22/2019 1:32 AM, Alex Williamson wrote:
> On Sat, 22 Jun 2019 01:08:40 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 6/21/2019 8:46 PM, Alex Williamson wrote:
>>> On Fri, 21 Jun 2019 12:08:26 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>   
>>>> On 6/21/2019 12:55 AM, Alex Williamson wrote:  
>>>>> On Thu, 20 Jun 2019 20:07:36 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>     
>>>>>> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
>>>>>> functions. These functions handle the pre-copy and stop-and-copy phases.
>>>>>>
>>>>>> In _SAVING|_RUNNING device state or pre-copy phase:
>>>>>> - read pending_bytes
>>>>>> - read data_offset - indicates to the kernel driver to write data to the
>>>>>>   staging buffer, which is mmapped.
>>>>>
>>>>> Why is data_offset the trigger rather than data_size?  It seems that
>>>>> data_offset can't really change dynamically since it might be mmap'd,
>>>>> so it seems unnatural to bother re-reading it.
>>>>>     
>>>>
>>>> Vendor driver can change data_offset, he can have different data_offset
>>>> for device data and dirty pages bitmap.
>>>>  
>>>>>> - read data_size - amount of data in bytes written by vendor driver in migration
>>>>>>   region.
>>>>>> - if data section is trapped, pread() number of bytes in data_size, from
>>>>>>   data_offset.
>>>>>> - if data section is mmaped, read mmaped buffer of size data_size.
>>>>>> - Write data packet to file stream as below:
>>>>>> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
>>>>>> VFIO_MIG_FLAG_END_OF_STATE }
>>>>>>
>>>>>> In _SAVING device state or stop-and-copy phase
>>>>>> a. read config space of device and save to migration file stream. This
>>>>>>    doesn't need to be from vendor driver. Any other special config state
>>>>>>    from driver can be saved as data in following iteration.
>>>>>> b. read pending_bytes - indicates to the kernel driver to write data to
>>>>>>    the staging buffer, which is mmapped.
>>>>>
>>>>> Is it pending_bytes or data_offset that triggers the write out of
>>>>> data?  Why pending_bytes vs data_size?  I was interpreting
>>>>> pending_bytes as the total data size while data_size is the size
>>>>> available to read now, so assumed data_size would be more closely
>>>>> aligned to making the data available.
>>>>>     
>>>>
>>>> Sorry, that's my mistake while editing; it's read data_offset, as in the
>>>> above case.
>>>>  
>>>>>> c. read data_size - amount of data in bytes written by vendor driver in
>>>>>>    migration region.
>>>>>> d. if data section is trapped, pread() from data_offset of size data_size.
>>>>>> e. if data section is mmaped, read mmaped buffer of size data_size.    
>>>>>
>>>>> Should this read as "pread() from data_offset of data_size, or
>>>>> optionally if mmap is supported on the data area, read data_size from
>>>>> start of mapped buffer"?  IOW, pread should always work.  Same in
>>>>> previous section.
>>>>>     
>>>>
>>>> ok. I'll update.
>>>>  
>>>>>> f. Write data packet as below:
>>>>>>    {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
>>>>>> g. iterate through steps b to f until (pending_bytes > 0)    
>>>>>
>>>>> s/until/while/    
>>>>
>>>> Ok.
>>>>  
>>>>>     
>>>>>> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
>>>>>>
>>>>>> .save_live_iterate runs outside the iothread lock in the migration case, which
>>>>>> could race with an asynchronous call to get the dirty page list, causing data
>>>>>> corruption in the mapped migration region. A mutex is added here to serialize
>>>>>> migration buffer read operations.
>>>>>
>>>>> Would we be ahead to use different offsets within the region for device
>>>>> data vs dirty bitmap to avoid this?
>>>>>    
>>>>
>>>> A lock will still be required to serialize the read/write operations on
>>>> the vfio_device_migration_info structure in the region.
>>>>
>>>>  
>>>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>>>>> ---
>>>>>>  hw/vfio/migration.c | 212 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>  1 file changed, 212 insertions(+)
>>>>>>
>>>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>>>> index fe0887c27664..0a2f30872316 100644
>>>>>> --- a/hw/vfio/migration.c
>>>>>> +++ b/hw/vfio/migration.c
>>>>>> @@ -107,6 +107,111 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
>>>>>>      return 0;
>>>>>>  }
>>>>>>  
>>>>>> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
>>>>>> +{
>>>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>>>> +    VFIORegion *region = &migration->region.buffer;
>>>>>> +    uint64_t data_offset = 0, data_size = 0;
>>>>>> +    int ret;
>>>>>> +
>>>>>> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
>>>>>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>>>>>> +                                             data_offset));
>>>>>> +    if (ret != sizeof(data_offset)) {
>>>>>> +        error_report("Failed to get migration buffer data offset %d",
>>>>>> +                     ret);
>>>>>> +        return -EINVAL;
>>>>>> +    }
>>>>>> +
>>>>>> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
>>>>>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>>>>>> +                                             data_size));
>>>>>> +    if (ret != sizeof(data_size)) {
>>>>>> +        error_report("Failed to get migration buffer data size %d",
>>>>>> +                     ret);
>>>>>> +        return -EINVAL;
>>>>>> +    }
>>>>>> +
>>>>>> +    if (data_size > 0) {
>>>>>> +        void *buf = NULL;
>>>>>> +        bool buffer_mmaped = false;
>>>>>> +
>>>>>> +        if (region->mmaps) {
>>>>>> +            int i;
>>>>>> +
>>>>>> +            for (i = 0; i < region->nr_mmaps; i++) {
>>>>>> +                if ((data_offset >= region->mmaps[i].offset) &&
>>>>>> +                    (data_offset < region->mmaps[i].offset +
>>>>>> +                                   region->mmaps[i].size)) {
>>>>>> +                    buf = region->mmaps[i].mmap + (data_offset -
>>>>>> +                                                   region->mmaps[i].offset);    
>>>>>
>>>>> So you're expecting that data_offset is somewhere within the data
>>>>> area.  Why doesn't the data always simply start at the beginning of the
>>>>> data area?  ie. data_offset would coincide with the beginning of the
>>>>> mmap'able area (if supported) and be static.  Does this enable some
>>>>> functionality in the vendor driver?    
>>>>
>>>> Do you want to enforce that on the vendor driver?
>>>> From the feedback on the previous version I thought the vendor driver
>>>> should define data_offset within the region:
>>>> "I'd suggest that the vendor driver expose a read-only
>>>> data_offset that matches a sparse mmap capability entry should the
>>>> driver support mmap.  The user should always read or write data from the
>>>> vendor defined data_offset"
>>>>
>>>> This also adds flexibility for the vendor driver, which can define
>>>> different data_offsets for device data and the dirty page bitmap
>>>> within the same mmapped region.
>>>
>>> I agree, it adds flexibility, the protocol was not evident to me until
>>> I got here though.
>>>   
>>>>>  Does resume data need to be
>>>>> written from the same offset where it's read?    
>>>>
>>>> No, resume data should be written from the data_offset that the vendor
>>>> driver provided during resume.
> 
> A)
> 
>>> s/resume/save/?
> 
> B)
>  
>>> Or is this saying that on resume the vendor driver is requesting a
>>> specific block of data via data_offset?
>>
>> Correct.
> 
> Which one is correct?  Thanks,
> 

B is correct.
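
That is, on the resume side the user re-reads data_offset for each
chunk and writes the chunk where the vendor driver asked for it; a
minimal sketch (error handling trimmed):

static int resume_write_chunk(int fd, uint64_t region_off,
                              const void *data, uint64_t size)
{
    uint64_t data_offset = 0;

    /* The vendor driver picks where this chunk should land. */
    pread(fd, &data_offset, sizeof(data_offset), region_off +
          offsetof(struct vfio_device_migration_info, data_offset));

    /* Write the opaque chunk at the vendor-chosen offset... */
    pwrite(fd, data, size, region_off + data_offset);

    /* ...then write data_size to trigger the driver to consume it. */
    pwrite(fd, &size, sizeof(size), region_off +
           offsetof(struct vfio_device_migration_info, data_size));
    return 0;
}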

Thanks,
Kirti


> Alex
> 
>>> I think resume is going to be
>>> directed by the user, writing in the same order they received the
>>> data.  Thanks,



* Re: [Qemu-devel] [PATCH v4 01/13] vfio: KABI for migration interface
  2019-06-21 20:00           ` Alex Williamson
@ 2019-06-21 20:30             ` Kirti Wankhede
  2019-06-21 22:01               ` Alex Williamson
  0 siblings, 1 reply; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-21 20:30 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue


On 6/22/2019 1:30 AM, Alex Williamson wrote:
> On Sat, 22 Jun 2019 01:05:48 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 6/21/2019 8:33 PM, Alex Williamson wrote:
>>> On Fri, 21 Jun 2019 11:22:15 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>   
>>>> On 6/20/2019 10:48 PM, Alex Williamson wrote:  
>>>>> On Thu, 20 Jun 2019 20:07:29 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>     
>>>>>> - Defined MIGRATION region type and sub-type.
>>>>>> - Used 3 bits to define VFIO device states.
>>>>>>     Bit 0 => _RUNNING
>>>>>>     Bit 1 => _SAVING
>>>>>>     Bit 2 => _RESUMING
>>>>>>     Combination of these bits defines VFIO device's state during migration
>>>>>>     _STOPPED => All bits 0 indicates VFIO device stopped.
>>>>>>     _RUNNING => Normal VFIO device running state.
>>>>>>     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but start
>>>>>>                           saving state of device i.e. pre-copy state
>>>>>>     _SAVING  => vCPUs are stopped, VFIO device should be stopped, and
>>>>>>                           save device state, i.e. stop-n-copy state
>>>>>>     _RESUMING => VFIO device resuming state.
>>>>>>     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING bits are set
>>>>>> - Defined vfio_device_migration_info structure which will be placed at 0th
>>>>>>   offset of migration region to get/set VFIO device related information.
>>>>>>   Defined members of structure and usage on read/write access:
>>>>>>     * device_state: (read/write)
>>>>>>         To convey VFIO device state to be transitioned to. Only 3 bits are used
>>>>>>         as of now.
>>>>>>     * pending bytes: (read only)
>>>>>>         To get pending bytes yet to be migrated for VFIO device.
>>>>>>     * data_offset: (read only)
>>>>>>         To get data offset in migration region from where data exists
>>>>>>         during _SAVING and from where data should be written by user
>>>>>>         space application during _RESUMING state
>>>>>>     * data_size: (read/write)
>>>>>>         To get and set size of data copied in migration region during _SAVING
>>>>>>         and _RESUMING state.
>>>>>>     * start_pfn, page_size, total_pfns: (write only)
>>>>>>         To get bitmap of dirty pages from vendor driver from given
>>>>>>         start address for total_pfns.
>>>>>>     * copied_pfns: (read only)
>>>>>>         To get number of pfns bitmap copied in migration region.
>>>>>>         Vendor driver should copy the bitmap with bits set only for
>>>>>>         pages to be marked dirty in migration region. Vendor driver
>>>>>>         should return 0 if there are 0 pages dirty in requested
>>>>>>         range. Vendor driver should return -1 to mark all pages in the section
>>>>>>         as dirty
>>>>>>
>>>>>> Migration region looks like:
>>>>>>  ------------------------------------------------------------------
>>>>>> |vfio_device_migration_info|    data section                      |
>>>>>> |                          |     ///////////////////////////////  |
>>>>>>  ------------------------------------------------------------------
>>>>>>  ^                              ^                              ^
>>>>>>  offset 0-trapped part        data_offset                 data_size
>>>>>>
>>>>>> The data section always follows the vfio_device_migration_info
>>>>>> structure in the region, so data_offset will always be non-0.
>>>>>> The offset from where data is copied is decided by the kernel driver;
>>>>>> the data section can be trapped or mapped depending on how the kernel
>>>>>> driver defines it. If mmapped, then data_offset should be page
>>>>>> aligned, whereas the initial section which contains the
>>>>>> vfio_device_migration_info structure might not end at an offset which
>>>>>> is page aligned.
>>>>>>
>>>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>>>>> ---
>>>>>>  linux-headers/linux/vfio.h | 71 ++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>  1 file changed, 71 insertions(+)
>>>>>>
>>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>>>>>> index 24f505199f83..274ec477eb82 100644
>>>>>> --- a/linux-headers/linux/vfio.h
>>>>>> +++ b/linux-headers/linux/vfio.h
>>>>>> @@ -372,6 +372,77 @@ struct vfio_region_gfx_edid {
>>>>>>   */
>>>>>>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
>>>>>>  
>>>>>> +/* Migration region type and sub-type */
>>>>>> +#define VFIO_REGION_TYPE_MIGRATION	        (2)
>>>>>> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
>>>>>> +
>>>>>> +/**
>>>>>> + * Structure vfio_device_migration_info is placed at 0th offset of
>>>>>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
>>>>>> + * information. Field accesses from this structure are only supported at their
>>>>>> + * native width and alignment, otherwise should return error.
>>>>>> + *
>>>>>> + * device_state: (read/write)
>>>>>> + *      To indicate to the vendor driver the state the VFIO device should
>>>>>> + *      be transitioned to. If the device state transition fails, a write
>>>>>> + *      to this field returns an error. It consists of 3 bits:
>>>>>> + *      - If bit 0 is set, it indicates the _RUNNING state. When it is
>>>>>> + *        cleared, that indicates the _STOPPED state. When the device is
>>>>>> + *        changed to _STOPPED, the driver should stop the device before the
>>>>>> + *        write returns.
>>>>>> + *      - If bit 1 set, indicates _SAVING state.
>>>>>> + *      - If bit 2 set, indicates _RESUMING state.
>>>>>> + *
>>>>>> + * pending bytes: (read only)
>>>>>> + *      Read pending bytes yet to be migrated from vendor driver
>>>>>> + *
>>>>>> + * data_offset: (read only)
>>>>>> + *      User application should read data_offset in migration region from where
>>>>>> + *      user application should read data during _SAVING state or write data
>>>>>> + *      during _RESUMING state.
>>>>>> + *
>>>>>> + * data_size: (read/write)
>>>>>> + *      User application should read data_size to know data copied in migration
>>>>>> + *      region during _SAVING state and write size of data copied in migration
>>>>>> + *      region during _RESUMING state.
>>>>>> + *
>>>>>> + * start_pfn: (write only)
>>>>>> + *      Start address pfn to get bitmap of dirty pages from vendor driver during
>>>>>> + *      _SAVING state.
>>>>>> + *
>>>>>> + * page_size: (write only)
>>>>>> + *      User application should write the page_size of pfn.
>>>>>> + *
>>>>>> + * total_pfns: (write only)
>>>>>> + *      Total pfn count from start_pfn for which dirty bitmap is requested.
>>>>>> + *
>>>>>> + * copied_pfns: (read only)
>>>>>> + *      pfn count for which dirty bitmap is copied to migration region.
>>>>>> + *      Vendor driver should copy the bitmap with bits set only for pages to be
>>>>>> + *      marked dirty in migration region.
>>>>>> + *      Vendor driver should return 0 if there are 0 pages dirty in requested
>>>>>> + *      range.
>>>>>> + *      Vendor driver should return -1 to mark all pages in the section as
>>>>>> + *      dirty.    
>>>>>
>>>>> Is the protocol that the user writes start_pfn/page_size/total_pfns in
>>>>> any order and then the read of copied_pfns is what triggers the
>>>>> snapshot?    
>>>>
>>>> Yes.
>>>>  
>>>>>  Are start_pfn/page_size/total_pfns sticky such that a user
>>>>> can write them once and get repeated refreshes of the dirty bitmap by
>>>>> re-reading copied_pfns?    
>>>>
>>>> Yes, and that bitmap should be for the given range (from start_pfn till
>>>> start_pfn + total_pfns).
>>>> Re-reading of copied_pfns is to handle the case where it might be
>>>> possible that the vendor driver reserved an area for the bitmap < the
>>>> total bitmap size for the range (start_pfn to start_pfn + total_pfns);
>>>> then the user will have to iterate till copied_pfns == total_pfns or
>>>> till copied_pfns == 0 (that is, there are no pages dirty in the rest of
>>>> the range)
>>>
>>> So reading copied_pfns triggers the data range to be updated, but the
>>> caller cannot assume it to be synchronous and uses total_pfns to poll
>>> that the update is complete?  How does the vendor driver differentiate
>>> the user polling for the previous update to finish versus requesting a
>>> new update?
>>>   
>>
>> A write to start_pfn/page_size/total_pfns followed by a read of
>> copied_pfns indicates a new update, whereas a sequential read of
>> copied_pfns indicates polling for the previous update.
> 
> Hmm, this seems to contradict the answer to my question above where I
> ask if the write fields are sticky so a user can trigger a refresh via
> copied_pfns.

Sorry, how does it contradict? Pasting it again below:
>>>>>  Are start_pfn/page_size/total_pfns sticky such that a user
>>>>> can write them once and get repeated refreshes of the dirty bitmap by
>>>>> re-reading copied_pfns?
>>>>
>>>> Yes, and that bitmap should be for the given range (from start_pfn till
>>>> start_pfn + total_pfns).
>>>> Re-reading of copied_pfns is to handle the case where it might be
>>>> possible that the vendor driver reserved an area for the bitmap < the
>>>> total bitmap size for the range (start_pfn to start_pfn + total_pfns);
>>>> then the user will have to iterate till copied_pfns == total_pfns or
>>>> till copied_pfns == 0 (that is, there are no pages dirty in the rest of
>>>> the range)


>  Does it really make sense that this is asynchronous?  Are
> we going to need to specify polling intervals and completion eventfds?

No, a read of copied_pfns is trapped and the vendor driver should block
till the bitmap is copied to the data section; the vendor driver can
keep track of the number of bitmap pfns already given to the user.
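
In other words, the read is synchronous from the user's point of view.
A kernel-side sketch of what that implies for the vendor driver (all
names and the surrounding driver state are illustrative, not from any
real driver):

struct my_vdev {                       /* illustrative driver state */
    uint64_t start_pfn;
    uint64_t total_pfns;
    uint64_t pfns_reported;            /* pfns reported so far */
    uint64_t bitmap_capacity_pfns;     /* bitmap area capacity */
};

/* Handler for a trapped read of copied_pfns: block until the bitmap
 * for the next chunk of the requested range is in the data section,
 * then report how many pfns it covers. */
static int64_t my_vdev_read_copied_pfns(struct my_vdev *vdev)
{
    uint64_t left = vdev->total_pfns - vdev->pfns_reported;
    uint64_t chunk = left < vdev->bitmap_capacity_pfns ?
                     left : vdev->bitmap_capacity_pfns;

    if (chunk == 0)
        return 0;                      /* whole range already reported */

    /* Synchronous by construction: copy the bitmap bits before the
     * read returns, so the user never observes a partial update. */
    my_vdev_copy_bitmap(vdev, vdev->start_pfn + vdev->pfns_reported,
                        chunk);        /* made-up helper */
    vdev->pfns_reported += chunk;
    return chunk;
}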

Thanks,
Kirti

> data_size is synchronous, right?  Thanks,
> 
> Alex
> 
>>>>>  What's the advantage to returning -1 versus
>>>>> returning copied_pfns == total_pfns?
>>>>>     
>>>>
>>>> If all bits in bitmap are 1, then return -1, that is, all pages in the
>>>> given range to be marked dirty.
>>>>
>>>> If all bits in bitmap are 0, then return 0, that is, no page to be
>>>> marked dirty in given range or rest of the range.
>>>>
>>>> Otherwise the vendor driver should return copied_pfns == total_pfns and
>>>> provide a bitmap for total_pfns, which means that the bitmap copied for
>>>> the given range contains information for all pages, where some bits are
>>>> 0s and some are 1s.
>>>
>>> Given that the vendor driver can indicate zero dirty pfns and all dirty
>>> pfns, I interpreted copied_pfns as a synchronous operation where the
>>> return value could indicate the number of dirty pages within the
>>> requested range.
>>>   
>>>>> If the user then wants to switch back to reading device migration
>>>>> state, is it a read of data_size that switches the data area back to
>>>>> making that address space available?     
>>>>
>>>> No, it's not just read(data_size); before that there is a
>>>> read(data_offset). If the vendor driver wants to have different
>>>> sub-regions for device data and the dirty page bitmap, it should return
>>>> the corresponding offset on read(data_offset).
>>>
>>> The dynamic use of data_offset was not at all evident to me until I got
>>> further into the QEMU series.  The usage model needs to be well
>>> specified in the linux header.  I infer this behavior is such that the
>>> vendor driver can effectively identity map portions of device memory
>>> and the user will restore to the same offset.  I suppose this is a
>>> valid approach but it seems specifically tuned to devices which allow
>>> full direct mapping, whereas many devices have more device memory than
>>> is directly map'able and state beyond simple device memory.  Does this
>>> model unnecessarily burden such devices?  It is a nice feature that
>>> the data range can contain both mmap'd sections and trapped sections
>>> and by adjusting data_offset the vendor driver can select which is
>>> currently being used, but we really need to formalize all these details.
>>>   
>>>>> In each case, is it the user's
>>>>> responsibility to consume all the data provided before triggering the
>>>>> next data area? For example, if I ask for a range of dirty bitmap, the
>>>>> vendor driver will provide that range and clear it, such that the
>>>>> pages are considered clean regardless of whether the user consumed the
>>>>> data area.      
>>>>
>>>> Yes.
>>>>  
>>>>> Likewise if the user asks for data_size, that would be
>>>>> deducted from pending_bytes regardless of the user reading the data
>>>>> area.     
>>>>
>>>> User should read data before deducting data_size from pending_bytes.  
>>>
>>> The user deducts data_size from pending_bytes?  pending_bytes is
>>> read-only, how does this work?
>>
>> pending_bytes is read-only from the migration region. The user should
>> read device data while pending_bytes > 0. How would the user decide
>> whether to iterate? The user will have to check if the previously read
>> pending_bytes - data_size is still > 0; if yes, then iterate. Before
>> iterating, it is the user's responsibility to read data from the data
>> section.
>>
>>>   
>>>> From vendor driver point of view, data_size will be deducted from
>>>> pending_bytes once data is copied to data region.  
>>>
>>> If the data is entirely from an mmap'd range, how does the vendor
>>> driver know when the data is copied?
>>>   
>>>>> Are there any read side-effects to pending_bytes?    
>>>>
>>>> No, it's a query to the vendor driver about pending bytes yet to be
>>>> migrated/read from the vendor driver.
>>>>  
>>>>>  Are there
>>>>> read side-effects to the data area on SAVING?    
>>>>
>>>> No.  
>>>
>>> So the vendor driver must make an assumption somewhere in the usage
>>> protocol that it's the user's responsibility; this needs to be
>>> specified.
>>>   
>>
>> Ok.
>>
>>>>>  Are there write
>>>>> side-effects on RESUMING, or is it only the write of data_size that
>>>>> triggers the buffer to be consumed?    
>>>>
>>>> It's the write(data_size) that triggers the buffer to be consumed. If
>>>> the region is mmapped, then the data is already copied to the region;
>>>> if it's trapped, then the following writes from data_offset are the
>>>> data to be consumed.
>>>>  
>>>>>  Is it the user's responsibility to
>>>>> write only full "packets" on RESUMING?  For example if the SAVING side
>>>>> provides data_size X, that full data_size X must be written to the
>>>>> RESUMING side, the user cannot write half of it to the data area on the
>>>>> RESUMING side, write data_size with X/2, write the second half, and
>>>>> again write X/2.  IOW, the data_size "packet" is indivisible at the
>>>>> point of resuming.
>>>>>     
>>>>
>>>> If source and destination are compatible or of the same driver version,
>>>> then if the user is reading data_size X at the source/SAVING side, the
>>>> destination should be able to consume data_size X at the
>>>> restoring/RESUMING side. Then why should the user write X/2 and iterate?
>>>
>>> Because users do things we don't expect ;)  Maybe they decide to chunk
>>> the data into smaller packets over the network, but the receiving side
>>> would rather write the packet immediately rather than queuing it.
>>> OTOH, does it necessarily matter so long as data_size is written on
>>> completion of a full "packet"?
>>>   
>>
>> Doesn't matter. As long as data is written in the same order as it was
>> read, size doesn't matter.
>>
>>>>> What are the ordering requirements?  Must the user write data_size
>>>>> packets in the same order that they're read, or is it the vendor
>>>>> driver's responsibility to include sequence information and allow
>>>>> restore in any order?
>>>>>     
>>>>
>>>> For the user, data is opaque. The user should write data in the same
>>>> order as it was received.
>>>
>>> Let's make sure that's specified.
>>>   
>>
>> Ok.
>>
>> Thanks,
>> Kirti
>>
>>>>>> + */
>>>>>> +
>>>>>> +struct vfio_device_migration_info {
>>>>>> +        __u32 device_state;         /* VFIO device state */
>>>>>> +#define VFIO_DEVICE_STATE_STOPPED   (0)    
>>>>>
>>>>> We need to be careful with how this is used if we want to leave the
>>>>> possibility of using the remaining 29 bits of this register.  Maybe we
>>>>> want to define VFIO_DEVICE_STATE_MASK and be sure that we only do
>>>>> read-modify-write ops within the mask (ex. set_bit and clear_bit
>>>>> helpers).    
>>>>
>>>> Makes sense, I'll do changes in next iteration.
>>>>  
>>>>>  Also, above we define STOPPED to indicate simply
>>>>> not-RUNNING, but here it seems STOPPED means not-RUNNING, not-SAVING,
>>>>> and not-RESUMING.
>>>>>     
>>>>
>>>> That's correct.
>>>>  
>>>>>> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
>>>>>> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
>>>>>> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
>>>>>> +#define VFIO_DEVICE_STATE_INVALID   (VFIO_DEVICE_STATE_SAVING | \
>>>>>> +                                     VFIO_DEVICE_STATE_RESUMING)
>>>>>> +        __u32 reserved;
>>>>>> +        __u64 pending_bytes;
>>>>>> +        __u64 data_offset;    
>>>>>
>>>>> Placing the data more than 4GB into the region seems a bit absurd, so
>>>>> this could probably be a __u32 and take the place of the reserved field.
>>>>>     
>>>>
>>>> Is there a maximum limit on VFIO region size?
>>>> There isn't any such limit, right? A vendor driver can define a region of
>>>> any size and then place the data section anywhere in the region. I prefer
>>>> to keep it __u64.
>>>
>>> We have a single file descriptor for all accesses to the device, which
>>> gives us quite a bit of per device address space.  As I mention above,
>>> it wasn't clear to me that data_offset is used dynamically until I got
>>> further into the series, so it seemed strange to me that we'd choose
>>> such a large offset, but given my new understanding I agree it requires
>>> a __u64 currently.  Thanks,
>>>
>>> Alex
>>>   
>>>>>> +        __u64 data_size;
>>>>>> +        __u64 start_pfn;
>>>>>> +        __u64 page_size;
>>>>>> +        __u64 total_pfns;
>>>>>> +        __s64 copied_pfns;    
>>>>>
>>>>> If this is signed so that we can get -1 then the user could
>>>>> theoretically specify total_pfns that we can't represent in
>>>>> copied_pfns.  Probably best to use unsigned and specify ~0 rather than
>>>>> -1.
>>>>>     
>>>>
>>>> Ok.
>>>>  
>>>>> Overall this looks like a good interface, but we need to more
>>>>> thoroughly define the protocol with the data area and set expectations
>>>>> we're placing on the user and vendor driver.  There should be no usage
>>>>> assumptions, it should all be spelled out.  Thanks,
>>>>>    
>>>>
>>>> Thanks for your feedback. I'll update comments above to be more specific.
>>>>
>>>> Thanks,
>>>> Kirti
>>>>  
>>>>> Alex
>>>>>     
>>>>>> +} __attribute__((packed));
>>>>>> +
>>>>>>  /*
>>>>>>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>>>>>>   * which allows direct access to non-MSIX registers which happened to be within    
>>>>>     
>>>   
> 
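To pin down the _RESUMING side discussed above: the user stages the opaque
data in the data section, and the write of data_size is what commits it. A
minimal user-side sketch for one packet, assuming the trapped (pread/pwrite)
path and the vfio_device_migration_info layout proposed in this series
(load_one_packet() is an illustrative name, not code from the patches):

    #include <errno.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <linux/vfio.h>   /* with this series' migration additions */

    static int load_one_packet(int device_fd, uint64_t region_fd_offset,
                               const void *data, uint64_t size)
    {
        uint64_t data_offset;

        /* the vendor driver tells us where it wants this block written */
        if (pread(device_fd, &data_offset, sizeof(data_offset),
                  region_fd_offset +
                  offsetof(struct vfio_device_migration_info, data_offset))
            != sizeof(data_offset))
            return -EINVAL;

        /* stage the opaque data, in the same order it was saved */
        if (pwrite(device_fd, data, size, region_fd_offset + data_offset)
            != (ssize_t)size)
            return -EINVAL;

        /* commit: this write triggers consumption, and is the point where
         * the vendor driver may fail a sanity check */
        if (pwrite(device_fd, &size, sizeof(size),
                   region_fd_offset +
                   offsetof(struct vfio_device_migration_info, data_size))
            != sizeof(size))
            return -EINVAL;

        return 0;
    }
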


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH v4 08/13] vfio: Add save state functions to SaveVMHandlers
  2019-06-21 20:07             ` Kirti Wankhede
@ 2019-06-21 20:32               ` Alex Williamson
  2019-06-21 21:05                 ` Kirti Wankhede
  0 siblings, 1 reply; 64+ messages in thread
From: Alex Williamson @ 2019-06-21 20:32 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Sat, 22 Jun 2019 01:37:47 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 6/22/2019 1:32 AM, Alex Williamson wrote:
> > On Sat, 22 Jun 2019 01:08:40 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 6/21/2019 8:46 PM, Alex Williamson wrote:  
> >>> On Fri, 21 Jun 2019 12:08:26 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>     
> >>>> On 6/21/2019 12:55 AM, Alex Williamson wrote:    
> >>>>> On Thu, 20 Jun 2019 20:07:36 +0530
> >>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>       
> >>>>>> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
> >>>>>> functions. These functions handle the pre-copy and stop-and-copy phases.
> >>>>>>
> >>>>>> In _SAVING|_RUNNING device state or pre-copy phase:
> >>>>>> - read pending_bytes
> >>>>>> - read data_offset - indicates kernel driver to write data to staging
> >>>>>>   buffer which is mmapped.      
> >>>>>
> >>>>> Why is data_offset the trigger rather than data_size?  It seems that
> >>>>> data_offset can't really change dynamically since it might be mmap'd,
> >>>>> so it seems unnatural to bother re-reading it.
> >>>>>       
> >>>>
> >>>> The vendor driver can change data_offset; it can have a different
> >>>> data_offset for device data and the dirty pages bitmap.
> >>>>    
> >>>>>> - read data_size - amount of data in bytes written by vendor driver in migration
> >>>>>>   region.
> >>>>>> - if data section is trapped, pread() number of bytes in data_size, from
> >>>>>>   data_offset.
> >>>>>> - if data section is mmaped, read mmaped buffer of size data_size.
> >>>>>> - Write data packet to file stream as below:
> >>>>>> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> >>>>>> VFIO_MIG_FLAG_END_OF_STATE }
> >>>>>>
> >>>>>> In _SAVING device state or stop-and-copy phase
> >>>>>> a. read config space of device and save to migration file stream. This
> >>>>>>    doesn't need to be from vendor driver. Any other special config state
> >>>>>>    from driver can be saved as data in following iteration.
> >>>>>> b. read pending_bytes - indicates kernel driver to write data to staging
> >>>>>>    buffer which is mmapped.      
> >>>>>
> >>>>> Is it pending_bytes or data_offset that triggers the write out of
> >>>>> data?  Why pending_bytes vs data_size?  I was interpreting
> >>>>> pending_bytes as the total data size while data_size is the size
> >>>>> available to read now, so assumed data_size would be more closely
> >>>>> aligned to making the data available.
> >>>>>       
> >>>>
> >>>> Sorry, that's my mistake while editing; it's read data_offset as in the
> >>>> above case.
> >>>>    
> >>>>>> c. read data_size - amount of data in bytes written by vendor driver in
> >>>>>>    migration region.
> >>>>>> d. if data section is trapped, pread() from data_offset of size data_size.
> >>>>>> e. if data section is mmaped, read mmaped buffer of size data_size.      
> >>>>>
> >>>>> Should this read as "pread() from data_offset of data_size, or
> >>>>> optionally if mmap is supported on the data area, read data_size from
> >>>>> start of mapped buffer"?  IOW, pread should always work.  Same in
> >>>>> previous section.
> >>>>>       
> >>>>
> >>>> ok. I'll update.
> >>>>    
> >>>>>> f. Write data packet as below:
> >>>>>>    {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> >>>>>> g. iterate through steps b to f until (pending_bytes > 0)      
> >>>>>
> >>>>> s/until/while/      
> >>>>
> >>>> Ok.
> >>>>    
> >>>>>       
> >>>>>> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
> >>>>>>
> >>>>>> .save_live_iterate runs outside the iothread lock in the migration case, which
> >>>>>> could race with asynchronous call to get dirty page list causing data corruption
> >>>>>> in mapped migration region. Mutex added here to serialize migration
> >>>>>> buffer read operation.
> >>>>>
> >>>>> Would we be ahead to use different offsets within the region for device
> >>>>> data vs dirty bitmap to avoid this?
> >>>>>      
> >>>>
> >>>> Lock will still be required to serialize the read/write operations on
> >>>> vfio_device_migration_info structure in the region.
> >>>>
> >>>>    
> >>>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >>>>>> ---
> >>>>>>  hw/vfio/migration.c | 212 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>  1 file changed, 212 insertions(+)
> >>>>>>
> >>>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >>>>>> index fe0887c27664..0a2f30872316 100644
> >>>>>> --- a/hw/vfio/migration.c
> >>>>>> +++ b/hw/vfio/migration.c
> >>>>>> @@ -107,6 +107,111 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
> >>>>>>      return 0;
> >>>>>>  }
> >>>>>>  
> >>>>>> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
> >>>>>> +{
> >>>>>> +    VFIOMigration *migration = vbasedev->migration;
> >>>>>> +    VFIORegion *region = &migration->region.buffer;
> >>>>>> +    uint64_t data_offset = 0, data_size = 0;
> >>>>>> +    int ret;
> >>>>>> +
> >>>>>> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> >>>>>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> >>>>>> +                                             data_offset));
> >>>>>> +    if (ret != sizeof(data_offset)) {
> >>>>>> +        error_report("Failed to get migration buffer data offset %d",
> >>>>>> +                     ret);
> >>>>>> +        return -EINVAL;
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
> >>>>>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> >>>>>> +                                             data_size));
> >>>>>> +    if (ret != sizeof(data_size)) {
> >>>>>> +        error_report("Failed to get migration buffer data size %d",
> >>>>>> +                     ret);
> >>>>>> +        return -EINVAL;
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    if (data_size > 0) {
> >>>>>> +        void *buf = NULL;
> >>>>>> +        bool buffer_mmaped = false;
> >>>>>> +
> >>>>>> +        if (region->mmaps) {
> >>>>>> +            int i;
> >>>>>> +
> >>>>>> +            for (i = 0; i < region->nr_mmaps; i++) {
> >>>>>> +                if ((data_offset >= region->mmaps[i].offset) &&
> >>>>>> +                    (data_offset < region->mmaps[i].offset +
> >>>>>> +                                   region->mmaps[i].size)) {
> >>>>>> +                    buf = region->mmaps[i].mmap + (data_offset -
> >>>>>> +                                                   region->mmaps[i].offset);      
> >>>>>
> >>>>> So you're expecting that data_offset is somewhere within the data
> >>>>> area.  Why doesn't the data always simply start at the beginning of the
> >>>>> data area?  ie. data_offset would coincide with the beginning of the
> >>>>> mmap'able area (if supported) and be static.  Does this enable some
> >>>>> functionality in the vendor driver?      
> >>>>
> >>>> Do you want to enforce that on the vendor driver?
> >>>> From the feedback on previous version I thought vendor driver should
> >>>> define data_offset within the region
> >>>> "I'd suggest that the vendor driver expose a read-only
> >>>> data_offset that matches a sparse mmap capability entry should the
> >>>> driver support mmap.  The user should always read or write data from the
> >>>> vendor defined data_offset"
> >>>>
> >>>> This also adds flexibility to vendor driver such that vendor driver can
> >>>> define different data_offset for device data and dirty page bitmap
> >>>> within same mmaped region.    
> >>>
> >>> I agree, it adds flexibility, the protocol was not evident to me until
> >>> I got here though.
> >>>     
> >>>>>  Does resume data need to be
> >>>>> written from the same offset where it's read?      
> >>>>
> >>>> No, resume data should be written from the data_offset that vendor
> >>>> driver provided during resume.    
> > 
> > A)
> >   
> >>> s/resume/save/?  
> > 
> > B)
> >    
> >>> Or is this saying that on resume that the vendor driver is requesting a
> >>> specific block of data via data_offset?     
> >>
> >> Correct.  
> > 
> > Which one is correct?  Thanks,
> >   
> 
> B is correct.

Shouldn't data_offset be stored in the migration stream then so we can
at least verify that source and target are in sync?  I'm not getting a
sense that this protocol involves any sort of sanity or integrity
testing on the vendor driver end, the user can just feed garbage into
the device on resume and watch the results :-\  Thanks,

Alex
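
Putting the save-side steps from the commit message together, one user-side
pre-copy iteration could be sketched roughly as below (trapped path only,
pread() return values unchecked for brevity; the framing constants and the
migration_info layout are the ones proposed in this series):

    /* one iteration in _SAVING|_RUNNING; returns bytes emitted, 0 if the
     * vendor driver reports nothing pending */
    static ssize_t save_one_iteration(QEMUFile *f, int fd, uint64_t base,
                                      void *buf, uint64_t buf_size)
    {
        uint64_t pending, data_offset, data_size;

        pread(fd, &pending, sizeof(pending), base +
              offsetof(struct vfio_device_migration_info, pending_bytes));
        if (!pending)
            return 0;

        /* reading data_offset asks the vendor driver to stage the data */
        pread(fd, &data_offset, sizeof(data_offset), base +
              offsetof(struct vfio_device_migration_info, data_offset));
        pread(fd, &data_size, sizeof(data_size), base +
              offsetof(struct vfio_device_migration_info, data_size));
        if (data_size > buf_size)
            return -EINVAL;

        pread(fd, buf, data_size, base + data_offset);

        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
        qemu_put_be64(f, data_size);
        qemu_put_buffer(f, buf, data_size);
        qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);

        return data_size;
    }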


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH v4 08/13] vfio: Add save state functions to SaveVMHandlers
  2019-06-21 20:32               ` Alex Williamson
@ 2019-06-21 21:05                 ` Kirti Wankhede
  2019-06-21 22:13                   ` Alex Williamson
  0 siblings, 1 reply; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-21 21:05 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue



On 6/22/2019 2:02 AM, Alex Williamson wrote:
> On Sat, 22 Jun 2019 01:37:47 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 6/22/2019 1:32 AM, Alex Williamson wrote:
>>> On Sat, 22 Jun 2019 01:08:40 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>   
>>>> On 6/21/2019 8:46 PM, Alex Williamson wrote:  
>>>>> On Fri, 21 Jun 2019 12:08:26 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>     
>>>>>> On 6/21/2019 12:55 AM, Alex Williamson wrote:    
>>>>>>> On Thu, 20 Jun 2019 20:07:36 +0530
>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>       
>>>>>>>> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
> >>>>>>>> functions. These functions handle the pre-copy and stop-and-copy phases.
>>>>>>>>
>>>>>>>> In _SAVING|_RUNNING device state or pre-copy phase:
>>>>>>>> - read pending_bytes
>>>>>>>> - read data_offset - indicates kernel driver to write data to staging
>>>>>>>>   buffer which is mmapped.      
>>>>>>>
>>>>>>> Why is data_offset the trigger rather than data_size?  It seems that
>>>>>>> data_offset can't really change dynamically since it might be mmap'd,
>>>>>>> so it seems unnatural to bother re-reading it.
>>>>>>>       
>>>>>>
> >>>>>> The vendor driver can change data_offset; it can have a different
> >>>>>> data_offset for device data and the dirty pages bitmap.
>>>>>>    
>>>>>>>> - read data_size - amount of data in bytes written by vendor driver in migration
>>>>>>>>   region.
>>>>>>>> - if data section is trapped, pread() number of bytes in data_size, from
>>>>>>>>   data_offset.
>>>>>>>> - if data section is mmaped, read mmaped buffer of size data_size.
>>>>>>>> - Write data packet to file stream as below:
>>>>>>>> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
>>>>>>>> VFIO_MIG_FLAG_END_OF_STATE }
>>>>>>>>
>>>>>>>> In _SAVING device state or stop-and-copy phase
>>>>>>>> a. read config space of device and save to migration file stream. This
>>>>>>>>    doesn't need to be from vendor driver. Any other special config state
>>>>>>>>    from driver can be saved as data in following iteration.
>>>>>>>> b. read pending_bytes - indicates kernel driver to write data to staging
>>>>>>>>    buffer which is mmapped.      
>>>>>>>
>>>>>>> Is it pending_bytes or data_offset that triggers the write out of
>>>>>>> data?  Why pending_bytes vs data_size?  I was interpreting
>>>>>>> pending_bytes as the total data size while data_size is the size
>>>>>>> available to read now, so assumed data_size would be more closely
>>>>>>> aligned to making the data available.
>>>>>>>       
>>>>>>
> >>>>>> Sorry, that's my mistake while editing; it's read data_offset as in the
> >>>>>> above case.
>>>>>>    
>>>>>>>> c. read data_size - amount of data in bytes written by vendor driver in
>>>>>>>>    migration region.
>>>>>>>> d. if data section is trapped, pread() from data_offset of size data_size.
>>>>>>>> e. if data section is mmaped, read mmaped buffer of size data_size.      
>>>>>>>
>>>>>>> Should this read as "pread() from data_offset of data_size, or
>>>>>>> optionally if mmap is supported on the data area, read data_size from
>>>>>>> start of mapped buffer"?  IOW, pread should always work.  Same in
>>>>>>> previous section.
>>>>>>>       
>>>>>>
>>>>>> ok. I'll update.
>>>>>>    
>>>>>>>> f. Write data packet as below:
>>>>>>>>    {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
>>>>>>>> g. iterate through steps b to f until (pending_bytes > 0)      
>>>>>>>
>>>>>>> s/until/while/      
>>>>>>
>>>>>> Ok.
>>>>>>    
>>>>>>>       
>>>>>>>> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
>>>>>>>>
>>>>>>>> .save_live_iterate runs outside the iothread lock in the migration case, which
>>>>>>>> could race with asynchronous call to get dirty page list causing data corruption
> >>>>>>>> in mapped migration region. Mutex added here to serialize migration
> >>>>>>>> buffer read operation.
>>>>>>>
>>>>>>> Would we be ahead to use different offsets within the region for device
>>>>>>> data vs dirty bitmap to avoid this?
>>>>>>>      
>>>>>>
>>>>>> Lock will still be required to serialize the read/write operations on
>>>>>> vfio_device_migration_info structure in the region.
>>>>>>
>>>>>>    
>>>>>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>>>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>>>>>>> ---
>>>>>>>>  hw/vfio/migration.c | 212 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>  1 file changed, 212 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>>>>>> index fe0887c27664..0a2f30872316 100644
>>>>>>>> --- a/hw/vfio/migration.c
>>>>>>>> +++ b/hw/vfio/migration.c
>>>>>>>> @@ -107,6 +107,111 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
>>>>>>>>      return 0;
>>>>>>>>  }
>>>>>>>>  
>>>>>>>> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
>>>>>>>> +{
>>>>>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>>>>>> +    VFIORegion *region = &migration->region.buffer;
>>>>>>>> +    uint64_t data_offset = 0, data_size = 0;
>>>>>>>> +    int ret;
>>>>>>>> +
>>>>>>>> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
>>>>>>>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>>>>>>>> +                                             data_offset));
>>>>>>>> +    if (ret != sizeof(data_offset)) {
>>>>>>>> +        error_report("Failed to get migration buffer data offset %d",
>>>>>>>> +                     ret);
>>>>>>>> +        return -EINVAL;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
>>>>>>>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>>>>>>>> +                                             data_size));
>>>>>>>> +    if (ret != sizeof(data_size)) {
>>>>>>>> +        error_report("Failed to get migration buffer data size %d",
>>>>>>>> +                     ret);
>>>>>>>> +        return -EINVAL;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    if (data_size > 0) {
>>>>>>>> +        void *buf = NULL;
>>>>>>>> +        bool buffer_mmaped = false;
>>>>>>>> +
>>>>>>>> +        if (region->mmaps) {
>>>>>>>> +            int i;
>>>>>>>> +
>>>>>>>> +            for (i = 0; i < region->nr_mmaps; i++) {
>>>>>>>> +                if ((data_offset >= region->mmaps[i].offset) &&
>>>>>>>> +                    (data_offset < region->mmaps[i].offset +
>>>>>>>> +                                   region->mmaps[i].size)) {
>>>>>>>> +                    buf = region->mmaps[i].mmap + (data_offset -
>>>>>>>> +                                                   region->mmaps[i].offset);      
>>>>>>>
>>>>>>> So you're expecting that data_offset is somewhere within the data
>>>>>>> area.  Why doesn't the data always simply start at the beginning of the
>>>>>>> data area?  ie. data_offset would coincide with the beginning of the
>>>>>>> mmap'able area (if supported) and be static.  Does this enable some
>>>>>>> functionality in the vendor driver?      
>>>>>>
> >>>>>> Do you want to enforce that on the vendor driver?
>>>>>> From the feedback on previous version I thought vendor driver should
>>>>>> define data_offset within the region
>>>>>> "I'd suggest that the vendor driver expose a read-only
>>>>>> data_offset that matches a sparse mmap capability entry should the
> >>>>>> driver support mmap.  The user should always read or write data from the
>>>>>> vendor defined data_offset"
>>>>>>
>>>>>> This also adds flexibility to vendor driver such that vendor driver can
>>>>>> define different data_offset for device data and dirty page bitmap
>>>>>> within same mmaped region.    
>>>>>
>>>>> I agree, it adds flexibility, the protocol was not evident to me until
>>>>> I got here though.
>>>>>     
>>>>>>>  Does resume data need to be
>>>>>>> written from the same offset where it's read?      
>>>>>>
>>>>>> No, resume data should be written from the data_offset that vendor
>>>>>> driver provided during resume.    
>>>
>>> A)
>>>   
>>>>> s/resume/save/?  
>>>
>>> B)
>>>    
>>>>> Or is this saying that on resume that the vendor driver is requesting a
>>>>> specific block of data via data_offset?     
>>>>
>>>> Correct.  
>>>
>>> Which one is correct?  Thanks,
>>>   
>>
>> B is correct.
> 
> Shouldn't data_offset be stored in the migration stream then so we can
> at least verify that source and target are in sync? 

Why? data_offset is an offset within the migration region; it has nothing
to do with the data stream. While resuming, the vendor driver can ask for
data at a different offset in the migration region.

> I'm not getting a
> sense that this protocol involves any sort of sanity or integrity
> testing on the vendor driver end, the user can just feed garbage into
> the device on resume and watch the results :-\  Thanks,
>

The vendor driver should be able to do sanity and integrity checks within
its opaque data. If that sanity check fails, it returns failure for the
access on the field in the migration region structure.

Thanks,
Kirti

> Alex
> 


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH v4 01/13] vfio: KABI for migration interface
  2019-06-21 20:30             ` Kirti Wankhede
@ 2019-06-21 22:01               ` Alex Williamson
  2019-06-24 15:00                 ` Kirti Wankhede
  0 siblings, 1 reply; 64+ messages in thread
From: Alex Williamson @ 2019-06-21 22:01 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Sat, 22 Jun 2019 02:00:08 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:
> On 6/22/2019 1:30 AM, Alex Williamson wrote:
> > On Sat, 22 Jun 2019 01:05:48 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 6/21/2019 8:33 PM, Alex Williamson wrote:  
> >>> On Fri, 21 Jun 2019 11:22:15 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>     
> >>>> On 6/20/2019 10:48 PM, Alex Williamson wrote:    
> >>>>> On Thu, 20 Jun 2019 20:07:29 +0530
> >>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>       
> >>>>>> - Defined MIGRATION region type and sub-type.
> >>>>>> - Used 3 bits to define VFIO device states.
> >>>>>>     Bit 0 => _RUNNING
> >>>>>>     Bit 1 => _SAVING
> >>>>>>     Bit 2 => _RESUMING
> >>>>>>     Combination of these bits defines VFIO device's state during migration
> >>>>>>     _STOPPED => All bits 0 indicates VFIO device stopped.
> >>>>>>     _RUNNING => Normal VFIO device running state.
> >>>>>>     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but start
> >>>>>>                           saving state of device i.e. pre-copy state
> >>>>>>     _SAVING  => vCPUs are stopped, VFIO device should be stopped, and
> >>>>>>                           save device state, i.e. stop-and-copy state
> >>>>>>     _RESUMING => VFIO device resuming state.
> >>>>>>     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING bits are set
> >>>>>> - Defined vfio_device_migration_info structure which will be placed at 0th
> >>>>>>   offset of migration region to get/set VFIO device related information.
> >>>>>>   Defined members of structure and usage on read/write access:
> >>>>>>     * device_state: (read/write)
> >>>>>>         To convey VFIO device state to be transitioned to. Only 3 bits are used
> >>>>>>         as of now.
> >>>>>>     * pending bytes: (read only)
> >>>>>>         To get pending bytes yet to be migrated for VFIO device.
> >>>>>>     * data_offset: (read only)
> >>>>>>         To get data offset in migration from where data exists during _SAVING
> >>>>>>         and from where data should be written by user space application during
> >>>>>>          _RESUMING state
> >>>>>>     * data_size: (read/write)
> >>>>>>         To get and set size of data copied in migration region during _SAVING
> >>>>>>         and _RESUMING state.
> >>>>>>     * start_pfn, page_size, total_pfns: (write only)
> >>>>>>         To get bitmap of dirty pages from vendor driver from given
> >>>>>>         start address for total_pfns.
> >>>>>>     * copied_pfns: (read only)
> >>>>>>         To get number of pfns bitmap copied in migration region.
> >>>>>>         Vendor driver should copy the bitmap with bits set only for
> >>>>>>         pages to be marked dirty in migration region. Vendor driver
> >>>>>>         should return 0 if there are 0 pages dirty in requested
> >>>>>>         range. Vendor driver should return -1 to mark all pages in the section
> >>>>>>         as dirty
> >>>>>>
> >>>>>> Migration region looks like:
> >>>>>>  ------------------------------------------------------------------
> >>>>>> |vfio_device_migration_info|    data section                      |
> >>>>>> |                          |     ///////////////////////////////  |
> >>>>>>  ------------------------------------------------------------------
> >>>>>>  ^                              ^                              ^
> >>>>>>  offset 0-trapped part        data_offset                 data_size
> >>>>>>
> >>>>>> Data section always follows the vfio_device_migration_info
> >>>>>> structure in the region, so data_offset will always be non-0.
> >>>>>> Offset from where data is copied is decided by kernel driver, data
> >>>>>> section can be trapped or mapped depending on how kernel driver
> >>>>>> defines data section. If mmapped, then data_offset should be page
> >>>>>> aligned, whereas the initial section, which contains the
> >>>>>> vfio_device_migration_info structure, might not end at an offset that
> >>>>>> is page aligned.
> >>>>>>
> >>>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >>>>>> ---
> >>>>>>  linux-headers/linux/vfio.h | 71 ++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>  1 file changed, 71 insertions(+)
> >>>>>>
> >>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >>>>>> index 24f505199f83..274ec477eb82 100644
> >>>>>> --- a/linux-headers/linux/vfio.h
> >>>>>> +++ b/linux-headers/linux/vfio.h
> >>>>>> @@ -372,6 +372,77 @@ struct vfio_region_gfx_edid {
> >>>>>>   */
> >>>>>>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
> >>>>>>  
> >>>>>> +/* Migration region type and sub-type */
> >>>>>> +#define VFIO_REGION_TYPE_MIGRATION	        (2)
> >>>>>> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> >>>>>> +
> >>>>>> +/**
> >>>>>> + * Structure vfio_device_migration_info is placed at 0th offset of
> >>>>>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> >>>>>> + * information. Field accesses from this structure are only supported at their
> >>>>>> + * native width and alignment, otherwise should return error.
> >>>>>> + *
> >>>>>> + * device_state: (read/write)
> >>>>>> + *      To indicate to the vendor driver the state the VFIO device should
> >>>>>> + *      be transitioned to. If the device state transition fails, a write to
> >>>>>> + *      this field returns an error.
> >>>>>> + *      It consists of 3 bits:
> >>>>>> + *      - If bit 0 set, indicates _RUNNING state. When it is reset, that
> >>>>>> + *        indicates _STOPPED state. When the device is changed to _STOPPED,
> >>>>>> + *        the driver should stop the device before the write returns.
> >>>>>> + *      - If bit 1 set, indicates _SAVING state.
> >>>>>> + *      - If bit 2 set, indicates _RESUMING state.
> >>>>>> + *
> >>>>>> + * pending bytes: (read only)
> >>>>>> + *      Read pending bytes yet to be migrated from vendor driver
> >>>>>> + *
> >>>>>> + * data_offset: (read only)
> >>>>>> + *      User application should read data_offset in migration region from where
> >>>>>> + *      user application should read data during _SAVING state or write data
> >>>>>> + *      during _RESUMING state.
> >>>>>> + *
> >>>>>> + * data_size: (read/write)
> >>>>>> + *      User application should read data_size to know data copied in migration
> >>>>>> + *      region during _SAVING state and write size of data copied in migration
> >>>>>> + *      region during _RESUMING state.
> >>>>>> + *
> >>>>>> + * start_pfn: (write only)
> >>>>>> + *      Start address pfn to get bitmap of dirty pages from vendor driver during
> >>>>>> + *      _SAVING state.
> >>>>>> + *
> >>>>>> + * page_size: (write only)
> >>>>>> + *      User application should write the page_size of pfn.
> >>>>>> + *
> >>>>>> + * total_pfns: (write only)
> >>>>>> + *      Total pfn count from start_pfn for which dirty bitmap is requested.
> >>>>>> + *
> >>>>>> + * copied_pfns: (read only)
> >>>>>> + *      pfn count for which dirty bitmap is copied to migration region.
> >>>>>> + *      Vendor driver should copy the bitmap with bits set only for pages to be
> >>>>>> + *      marked dirty in migration region.
> >>>>>> + *      Vendor driver should return 0 if there are 0 pages dirty in requested
> >>>>>> + *      range.
> >>>>>> + *      Vendor driver should return -1 to mark all pages in the section as
> >>>>>> + *      dirty.      
> >>>>>
> >>>>> Is the protocol that the user writes start_pfn/page_size/total_pfns in
> >>>>> any order and then the read of copied_pfns is what triggers the
> >>>>> snapshot?      
> >>>>
> >>>> Yes.
> >>>>    
> >>>>>  Are start_pfn/page_size/total_pfns sticky such that a user
> >>>>> can write them once and get repeated refreshes of the dirty bitmap by
> >>>>> re-reading copied_pfns?      
> >>>>
> >>>> Yes, and that bitmap should be for the given range (from start_pfn till
> >>>> start_pfn + total_pfns).
> >>>> Re-reading of copied_pfns is to handle the case where it might be
> >>>> possible that the vendor driver reserved an area for the bitmap < total
> >>>> bitmap size for the range (start_pfn to start_pfn + total_pfns); then the
> >>>> user will have to iterate till copied_pfns == total_pfns or till
> >>>> copied_pfns == 0 (that is, there are no pages dirty in the rest of the
> >>>> range)
> >>>
> >>> So reading copied_pfns triggers the data range to be updated, but the
> >>> caller cannot assume it to be synchronous and uses total_pfns to poll
> >>> that the update is complete?  How does the vendor driver differentiate
> >>> the user polling for the previous update to finish versus requesting a
> >>> new update?
> >>>     
> >>
> >> A write on start_pfn/page_size/total_pfns followed by a read on
> >> copied_pfns indicates a new update, whereas a sequential read on
> >> copied_pfns indicates polling for the previous update.
> > 
> > Hmm, this seems to contradict the answer to my question above where I
> > ask if the write fields are sticky so a user can trigger a refresh via
> > copied_pfns.  
> 
> Sorry, how does it contradict? Pasting it again below:
> >>>>>  Are start_pfn/page_size/total_pfns sticky such that a user
> >>>>> can write them once and get repeated refreshes of the dirty bitmap by
> >>>>> re-reading copied_pfns?  
> >>>>
> >>>> Yes, and that bitmap should be for the given range (from start_pfn till
> >>>> start_pfn + total_pfns).
> >>>> Re-reading of copied_pfns is to handle the case where it might be
> >>>> possible that the vendor driver reserved an area for the bitmap < total
> >>>> bitmap size for the range (start_pfn to start_pfn + total_pfns); then the
> >>>> user will have to iterate till copied_pfns == total_pfns or till
> >>>> copied_pfns == 0 (that is, there are no pages dirty in the rest of the
> >>>> range)

Sorry, I guess I misinterpreted again.  So the vendor driver can return
copied_pfns < total_pfns if it has a buffer limitation, not as an
indication of its background progress in writing out the bitmap.  Just
as a proof of concept, let's say the vendor driver has a 1 bit buffer
and I write 0 to start_pfn and 3 to total_pfns.  I read copied_pfns,
which returns 1, so I read data_offset to find where this 1 bit is
located and then read my bit from that location.  This is the dirty
state of the first pfn.  I read copied_pfns again and it reports 2, so
I again read data_offset to find where the data is located, and it's my
job to remember that I've already read 1 bit, so 2 means there's only 1
bit available and it's the second pfn.  I read the bit.  I again read
copied_pfns, which now reports 3; I read data_offset to find the
location of the data, remember that I've already read 2 bits, and so
read my bit for the 3rd pfn.  This seems rather clumsy.

Now that copied_pfns == total_pfns, what happens if I read copied_pfns
again?  This is actually what I thought I was asking previously.

Should we expose the pfn buffer size and fault on writes larger than that
size, requiring the user to iterate start_pfn themselves?  Are there
any operations where the user can assume data_offset is constant?  Thanks,

Alex
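
For reference, the loop under discussion could be sketched on the user side
as below. It assumes the semantics as described so far (copied_pfns is
cumulative and the data area holds only the newly copied bits), which is
exactly the part still being questioned, so treat this as an illustration
of the bookkeeping problem rather than a settled protocol:

    /* query the dirty bitmap for [start_pfn, start_pfn + total_pfns) */
    static int sync_dirty_bitmap(int fd, uint64_t base, uint64_t start_pfn,
                                 uint64_t page_size, uint64_t total_pfns)
    {
        uint64_t done = 0;
        int64_t copied;

        /* arm the request; the three writes may come in any order */
        pwrite(fd, &start_pfn, sizeof(start_pfn), base +
               offsetof(struct vfio_device_migration_info, start_pfn));
        pwrite(fd, &page_size, sizeof(page_size), base +
               offsetof(struct vfio_device_migration_info, page_size));
        pwrite(fd, &total_pfns, sizeof(total_pfns), base +
               offsetof(struct vfio_device_migration_info, total_pfns));

        do {
            /* this read triggers (or continues) the snapshot */
            pread(fd, &copied, sizeof(copied), base +
                  offsetof(struct vfio_device_migration_info, copied_pfns));
            if (copied == 0)
                break;                  /* rest of the range is clean */
            if (copied == -1)
                break;                  /* treat the whole range as dirty */
            /*
             * Read the (copied - done) newly available bits from
             * data_offset and fold them into the caller's bitmap; the
             * caller has to remember how many bits it already consumed,
             * which is the clumsy bookkeeping noted above.
             */
            done = copied;
        } while (done < total_pfns);

        return 0;
    }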


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH v4 08/13] vfio: Add save state functions to SaveVMHandlers
  2019-06-21 21:05                 ` Kirti Wankhede
@ 2019-06-21 22:13                   ` Alex Williamson
  2019-06-24 14:31                     ` Kirti Wankhede
  0 siblings, 1 reply; 64+ messages in thread
From: Alex Williamson @ 2019-06-21 22:13 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Sat, 22 Jun 2019 02:35:02 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 6/22/2019 2:02 AM, Alex Williamson wrote:
> > On Sat, 22 Jun 2019 01:37:47 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 6/22/2019 1:32 AM, Alex Williamson wrote:  
> >>> On Sat, 22 Jun 2019 01:08:40 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>     
> >>>> On 6/21/2019 8:46 PM, Alex Williamson wrote:    
> >>>>> On Fri, 21 Jun 2019 12:08:26 +0530
> >>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>       
> >>>>>> On 6/21/2019 12:55 AM, Alex Williamson wrote:      
> >>>>>>> On Thu, 20 Jun 2019 20:07:36 +0530
> >>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>>>         
> >>>>>>>> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
> >>>>>>>> functions. These functions handle the pre-copy and stop-and-copy phases.
> >>>>>>>>
> >>>>>>>> In _SAVING|_RUNNING device state or pre-copy phase:
> >>>>>>>> - read pending_bytes
> >>>>>>>> - read data_offset - indicates kernel driver to write data to staging
> >>>>>>>>   buffer which is mmapped.        
> >>>>>>>
> >>>>>>> Why is data_offset the trigger rather than data_size?  It seems that
> >>>>>>> data_offset can't really change dynamically since it might be mmap'd,
> >>>>>>> so it seems unnatural to bother re-reading it.
> >>>>>>>         
> >>>>>>
> >>>>>> The vendor driver can change data_offset; it can have a different
> >>>>>> data_offset for device data and the dirty pages bitmap.
> >>>>>>      
> >>>>>>>> - read data_size - amount of data in bytes written by vendor driver in migration
> >>>>>>>>   region.
> >>>>>>>> - if data section is trapped, pread() number of bytes in data_size, from
> >>>>>>>>   data_offset.
> >>>>>>>> - if data section is mmaped, read mmaped buffer of size data_size.
> >>>>>>>> - Write data packet to file stream as below:
> >>>>>>>> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> >>>>>>>> VFIO_MIG_FLAG_END_OF_STATE }
> >>>>>>>>
> >>>>>>>> In _SAVING device state or stop-and-copy phase
> >>>>>>>> a. read config space of device and save to migration file stream. This
> >>>>>>>>    doesn't need to be from vendor driver. Any other special config state
> >>>>>>>>    from driver can be saved as data in following iteration.
> >>>>>>>> b. read pending_bytes - indicates kernel driver to write data to staging
> >>>>>>>>    buffer which is mmapped.        
> >>>>>>>
> >>>>>>> Is it pending_bytes or data_offset that triggers the write out of
> >>>>>>> data?  Why pending_bytes vs data_size?  I was interpreting
> >>>>>>> pending_bytes as the total data size while data_size is the size
> >>>>>>> available to read now, so assumed data_size would be more closely
> >>>>>>> aligned to making the data available.
> >>>>>>>         
> >>>>>>
> >>>>>> Sorry, that's my mistake while editing; it's read data_offset as in the
> >>>>>> above case.
> >>>>>>      
> >>>>>>>> c. read data_size - amount of data in bytes written by vendor driver in
> >>>>>>>>    migration region.
> >>>>>>>> d. if data section is trapped, pread() from data_offset of size data_size.
> >>>>>>>> e. if data section is mmaped, read mmaped buffer of size data_size.        
> >>>>>>>
> >>>>>>> Should this read as "pread() from data_offset of data_size, or
> >>>>>>> optionally if mmap is supported on the data area, read data_size from
> >>>>>>> start of mapped buffer"?  IOW, pread should always work.  Same in
> >>>>>>> previous section.
> >>>>>>>         
> >>>>>>
> >>>>>> ok. I'll update.
> >>>>>>      
> >>>>>>>> f. Write data packet as below:
> >>>>>>>>    {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> >>>>>>>> g. iterate through steps b to f until (pending_bytes > 0)        
> >>>>>>>
> >>>>>>> s/until/while/        
> >>>>>>
> >>>>>> Ok.
> >>>>>>      
> >>>>>>>         
> >>>>>>>> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
> >>>>>>>>
> >>>>>>>> .save_live_iterate runs outside the iothread lock in the migration case, which
> >>>>>>>> could race with asynchronous call to get dirty page list causing data corruption
> >>>>>>>> in mapped migration region. Mutex added here to serialize migration
> >>>>>>>> buffer read operation.
> >>>>>>>
> >>>>>>> Would we be ahead to use different offsets within the region for device
> >>>>>>> data vs dirty bitmap to avoid this?
> >>>>>>>        
> >>>>>>
> >>>>>> Lock will still be required to serialize the read/write operations on
> >>>>>> vfio_device_migration_info structure in the region.
> >>>>>>
> >>>>>>      
> >>>>>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>>>>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >>>>>>>> ---
> >>>>>>>>  hw/vfio/migration.c | 212 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>>>  1 file changed, 212 insertions(+)
> >>>>>>>>
> >>>>>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >>>>>>>> index fe0887c27664..0a2f30872316 100644
> >>>>>>>> --- a/hw/vfio/migration.c
> >>>>>>>> +++ b/hw/vfio/migration.c
> >>>>>>>> @@ -107,6 +107,111 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
> >>>>>>>>      return 0;
> >>>>>>>>  }
> >>>>>>>>  
> >>>>>>>> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
> >>>>>>>> +{
> >>>>>>>> +    VFIOMigration *migration = vbasedev->migration;
> >>>>>>>> +    VFIORegion *region = &migration->region.buffer;
> >>>>>>>> +    uint64_t data_offset = 0, data_size = 0;
> >>>>>>>> +    int ret;
> >>>>>>>> +
> >>>>>>>> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> >>>>>>>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> >>>>>>>> +                                             data_offset));
> >>>>>>>> +    if (ret != sizeof(data_offset)) {
> >>>>>>>> +        error_report("Failed to get migration buffer data offset %d",
> >>>>>>>> +                     ret);
> >>>>>>>> +        return -EINVAL;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
> >>>>>>>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> >>>>>>>> +                                             data_size));
> >>>>>>>> +    if (ret != sizeof(data_size)) {
> >>>>>>>> +        error_report("Failed to get migration buffer data size %d",
> >>>>>>>> +                     ret);
> >>>>>>>> +        return -EINVAL;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    if (data_size > 0) {
> >>>>>>>> +        void *buf = NULL;
> >>>>>>>> +        bool buffer_mmaped = false;
> >>>>>>>> +
> >>>>>>>> +        if (region->mmaps) {
> >>>>>>>> +            int i;
> >>>>>>>> +
> >>>>>>>> +            for (i = 0; i < region->nr_mmaps; i++) {
> >>>>>>>> +                if ((data_offset >= region->mmaps[i].offset) &&
> >>>>>>>> +                    (data_offset < region->mmaps[i].offset +
> >>>>>>>> +                                   region->mmaps[i].size)) {
> >>>>>>>> +                    buf = region->mmaps[i].mmap + (data_offset -
> >>>>>>>> +                                                   region->mmaps[i].offset);        
> >>>>>>>
> >>>>>>> So you're expecting that data_offset is somewhere within the data
> >>>>>>> area.  Why doesn't the data always simply start at the beginning of the
> >>>>>>> data area?  ie. data_offset would coincide with the beginning of the
> >>>>>>> mmap'able area (if supported) and be static.  Does this enable some
> >>>>>>> functionality in the vendor driver?        
> >>>>>>
> >>>>>> Do you want to enforce that on the vendor driver?
> >>>>>> From the feedback on previous version I thought vendor driver should
> >>>>>> define data_offset within the region
> >>>>>> "I'd suggest that the vendor driver expose a read-only
> >>>>>> data_offset that matches a sparse mmap capability entry should the
> >>>>>> driver support mmap.  The user should always read or write data from the
> >>>>>> vendor defined data_offset"
> >>>>>>
> >>>>>> This also adds flexibility to vendor driver such that vendor driver can
> >>>>>> define different data_offset for device data and dirty page bitmap
> >>>>>> within same mmaped region.      
> >>>>>
> >>>>> I agree, it adds flexibility, the protocol was not evident to me until
> >>>>> I got here though.
> >>>>>       
> >>>>>>>  Does resume data need to be
> >>>>>>> written from the same offset where it's read?        
> >>>>>>
> >>>>>> No, resume data should be written from the data_offset that vendor
> >>>>>> driver provided during resume.      
> >>>
> >>> A)
> >>>     
> >>>>> s/resume/save/?    
> >>>
> >>> B)
> >>>      
> >>>>> Or is this saying that on resume that the vendor driver is requesting a
> >>>>> specific block of data via data_offset?       
> >>>>
> >>>> Correct.    
> >>>
> >>> Which one is correct?  Thanks,
> >>>     
> >>
> >> B is correct.  
> > 
> > Shouldn't data_offset be stored in the migration stream then so we can
> > at least verify that source and target are in sync?   
> 
> Why? data_offset is an offset within the migration region; it has nothing
> to do with the data stream. While resuming, the vendor driver can ask for
> data at a different offset in the migration region.

So the data is opaque and the sequencing is opaque, the user should
have no expectation that there's any relationship between where the
data was read from while saving versus where the target device is
requesting the next block be written while resuming.  We have a data
blob and a size and we do what we're told.

> > I'm not getting a
> > sense that this protocol involves any sort of sanity or integrity
> > testing on the vendor driver end, the user can just feed garbage into
> > the device on resume and watch the results :-\  Thanks,
> >  
> 
> The vendor driver should be able to do sanity and integrity checks within
> its opaque data. If that sanity check fails, it returns failure for the
> access on the field in the migration region structure.

Would that be a synchronous failure on the write of data_size, which
should result in the device_state moving to invalid?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH v4 04/13] vfio: Add migration region initialization and finalize function
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 04/13] vfio: Add migration region initialization and finalize function Kirti Wankhede
@ 2019-06-24 14:00   ` Cornelia Huck
  2019-06-27 14:56     ` Kirti Wankhede
  0 siblings, 1 reply; 64+ messages in thread
From: Cornelia Huck @ 2019-06-24 14:00 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: yulei.zhang, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, Zhengxiao.zx, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, yan.y.zhao, changpeng.liu, Ken.Xue

On Thu, 20 Jun 2019 20:07:32 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> - Migration functions are implemented for VFIO_DEVICE_TYPE_PCI device in this
>   patch series.
> - Whether a VFIO device supports migration or not is decided based on the
>   migration region query. If the migration region query is successful then
>   migration is supported, else migration is blocked.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/migration.c           | 137 ++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |  14 +++++
>  3 files changed, 152 insertions(+), 1 deletion(-)
>  create mode 100644 hw/vfio/migration.c

(...)

> +static int vfio_migration_region_init(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    Object *obj = NULL;
> +    int ret = -EINVAL;
> +
> +    if (!migration) {
> +        return ret;
> +    }
> +
> +    /* Migration support added for PCI device only */
> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +        obj = vfio_pci_get_object(vbasedev);
> +    }

Hm... what about instead including an (optional) callback in
VFIODeviceOps that returns the object embedding the VFIODevice? No need
to adapt this code if we introduce support for a non-pci device, and the
callback function also allows supporting migration in a more
fine-grained way than by device type.
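
A possible shape for that suggestion, with vfio_get_object() as a purely
illustrative name:

    struct VFIODeviceOps {
        /* ... existing callbacks ... */
        /* optional: return the Object that embeds this VFIODevice */
        Object *(*vfio_get_object)(VFIODevice *vbasedev);
    };

vfio_migration_region_init() would then need no type switch:

    Object *obj = vbasedev->ops->vfio_get_object ?
                  vbasedev->ops->vfio_get_object(vbasedev) : NULL;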

> +
> +    if (!obj) {
> +        return ret;
> +    }
> +
> +    ret = vfio_region_setup(obj, vbasedev, &migration->region.buffer,
> +                            migration->region.index, "migration");
> +    if (ret) {
> +        error_report("Failed to setup VFIO migration region %d: %s",
> +                      migration->region.index, strerror(-ret));
> +        goto err;
> +    }
> +
> +    if (!migration->region.buffer.size) {
> +        ret = -EINVAL;
> +        error_report("Invalid region size of VFIO migration region %d: %s",
> +                     migration->region.index, strerror(-ret));
> +        goto err;
> +    }
> +
> +    return 0;
> +
> +err:
> +    vfio_migration_region_exit(vbasedev);
> +    return ret;
> +}

(...)


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH v4 08/13] vfio: Add save state functions to SaveVMHandlers
  2019-06-21 22:13                   ` Alex Williamson
@ 2019-06-24 14:31                     ` Kirti Wankhede
  0 siblings, 0 replies; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-24 14:31 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

<snip>
>>>>>>>>>> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
>>>>>>>>>> +{
>>>>>>>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>>>>>>>> +    VFIORegion *region = &migration->region.buffer;
>>>>>>>>>> +    uint64_t data_offset = 0, data_size = 0;
>>>>>>>>>> +    int ret;
>>>>>>>>>> +
>>>>>>>>>> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
>>>>>>>>>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>>>>>>>>>> +                                             data_offset));
>>>>>>>>>> +    if (ret != sizeof(data_offset)) {
>>>>>>>>>> +        error_report("Failed to get migration buffer data offset %d",
>>>>>>>>>> +                     ret);
>>>>>>>>>> +        return -EINVAL;
>>>>>>>>>> +    }
>>>>>>>>>> +
>>>>>>>>>> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
>>>>>>>>>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>>>>>>>>>> +                                             data_size));
>>>>>>>>>> +    if (ret != sizeof(data_size)) {
>>>>>>>>>> +        error_report("Failed to get migration buffer data size %d",
>>>>>>>>>> +                     ret);
>>>>>>>>>> +        return -EINVAL;
>>>>>>>>>> +    }
>>>>>>>>>> +
>>>>>>>>>> +    if (data_size > 0) {
>>>>>>>>>> +        void *buf = NULL;
>>>>>>>>>> +        bool buffer_mmaped = false;
>>>>>>>>>> +
>>>>>>>>>> +        if (region->mmaps) {
>>>>>>>>>> +            int i;
>>>>>>>>>> +
>>>>>>>>>> +            for (i = 0; i < region->nr_mmaps; i++) {
>>>>>>>>>> +                if ((data_offset >= region->mmaps[i].offset) &&
>>>>>>>>>> +                    (data_offset < region->mmaps[i].offset +
>>>>>>>>>> +                                   region->mmaps[i].size)) {
>>>>>>>>>> +                    buf = region->mmaps[i].mmap + (data_offset -
>>>>>>>>>> +                                                   region->mmaps[i].offset);        
>>>>>>>>>
>>>>>>>>> So you're expecting that data_offset is somewhere within the data
>>>>>>>>> area.  Why doesn't the data always simply start at the beginning of the
>>>>>>>>> data area?  ie. data_offset would coincide with the beginning of the
>>>>>>>>> mmap'able area (if supported) and be static.  Does this enable some
>>>>>>>>> functionality in the vendor driver?        
>>>>>>>>
>>>>>>>> Do you want to enforce that on the vendor driver?
>>>>>>>> From the feedback on previous version I thought vendor driver should
>>>>>>>> define data_offset within the region
>>>>>>>> "I'd suggest that the vendor driver expose a read-only
>>>>>>>> data_offset that matches a sparse mmap capability entry should the
>>>>>>>> driver support mmap.  The user should always read or write data from the
>>>>>>>> vendor defined data_offset"
>>>>>>>>
>>>>>>>> This also adds flexibility to vendor driver such that vendor driver can
>>>>>>>> define different data_offset for device data and dirty page bitmap
>>>>>>>> within same mmaped region.      
>>>>>>>
>>>>>>> I agree, it adds flexibility, the protocol was not evident to me until
>>>>>>> I got here though.
>>>>>>>       
>>>>>>>>>  Does resume data need to be
>>>>>>>>> written from the same offset where it's read?        
>>>>>>>>
>>>>>>>> No, resume data should be written from the data_offset that vendor
>>>>>>>> driver provided during resume.      
>>>>>
>>>>> A)
>>>>>     
>>>>>>> s/resume/save/?    
>>>>>
>>>>> B)
>>>>>      
>>>>>>> Or is this saying that on resume that the vendor driver is requesting a
>>>>>>> specific block of data via data_offset?       
>>>>>>
>>>>>> Correct.    
>>>>>
>>>>> Which one is correct?  Thanks,
>>>>>     
>>>>
>>>> B is correct.  
>>>
>>> Shouldn't data_offset be stored in the migration stream then so we can
>>> at least verify that source and target are in sync?   
>>
>> Why? data_offset is offset within migration region, nothing to do with
>> data stream. While resuming vendor driver can ask data at different
>> offset in migration region.
> 
> So the data is opaque and the sequencing is opaque, the user should
> have no expectation that there's any relationship between where the
> data was read from while saving versus where the target device is
> requesting the next block be written while resuming.  We have a data
> blob and a size and we do what we're told.
> 

That's correct.

>>> I'm not getting a
>>> sense that this protocol involves any sort of sanity or integrity
>>> testing on the vendor driver end, the user can just feed garbage into
>>> the device on resume and watch the results :-\  Thanks,
>>>  
>>
>> The vendor driver should be able to do sanity and integrity checks within
>> its opaque data. If that sanity check fails, it returns failure for the
>> access on the field in the migration region structure.
> 
> Would that be a synchronous failure on the write of data_size, which
> should result in the device_state moving to invalid?  Thanks,
> 

If the data section of the migration region is mapped, then on a write to
data_size the vendor driver should read the staging buffer, validate the
data, and return sizeof(data_size) on success or an error (< 0). If the
data section is trapped, then the writes to the data section should return
accordingly on receiving the data. On error, migration/restore would fail.
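
A rough vendor-side sketch of that rule (all names below are hypothetical,
this is not code from any driver):

    /* handle a userspace write to the data_size field while _RESUMING */
    static ssize_t my_mdev_data_size_write(struct my_mdev *mdev,
                                           const char __user *ubuf)
    {
        u64 data_size;

        if (copy_from_user(&data_size, ubuf, sizeof(data_size)))
            return -EFAULT;

        if (data_size > mdev->staging_size)
            return -EINVAL;

        /* the staging buffer was filled via mmap or trapped writes */
        if (!my_mdev_state_is_sane(mdev, mdev->staging, data_size))
            return -EINVAL;    /* sanity failed: resume must not proceed */

        my_mdev_apply_state(mdev, mdev->staging, data_size);
        return sizeof(data_size);    /* full-width access succeeded */
    }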

Thanks,
Kirti


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH v4 01/13] vfio: KABI for migration interface
  2019-06-21 22:01               ` Alex Williamson
@ 2019-06-24 15:00                 ` Kirti Wankhede
  2019-06-24 15:25                   ` Alex Williamson
  0 siblings, 1 reply; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-24 15:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue



On 6/22/2019 3:31 AM, Alex Williamson wrote:
> On Sat, 22 Jun 2019 02:00:08 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>> On 6/22/2019 1:30 AM, Alex Williamson wrote:
>>> On Sat, 22 Jun 2019 01:05:48 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>   
>>>> On 6/21/2019 8:33 PM, Alex Williamson wrote:  
>>>>> On Fri, 21 Jun 2019 11:22:15 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>     
>>>>>> On 6/20/2019 10:48 PM, Alex Williamson wrote:    
>>>>>>> On Thu, 20 Jun 2019 20:07:29 +0530
>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>       
>>>>>>>> - Defined MIGRATION region type and sub-type.
>>>>>>>> - Used 3 bits to define VFIO device states.
>>>>>>>>     Bit 0 => _RUNNING
>>>>>>>>     Bit 1 => _SAVING
>>>>>>>>     Bit 2 => _RESUMING
>>>>>>>>     Combination of these bits defines VFIO device's state during migration
>>>>>>>>     _STOPPED => All bits 0 indicates VFIO device stopped.
>>>>>>>>     _RUNNING => Normal VFIO device running state.
>>>>>>>>     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but start
>>>>>>>>                           saving state of device i.e. pre-copy state
> >>>>>>>>     _SAVING  => vCPUs are stopped, VFIO device should be stopped, and
> >>>>>>>>                           save device state, i.e. stop-and-copy state
>>>>>>>>     _RESUMING => VFIO device resuming state.
>>>>>>>>     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING bits are set
>>>>>>>> - Defined vfio_device_migration_info structure which will be placed at 0th
>>>>>>>>   offset of migration region to get/set VFIO device related information.
>>>>>>>>   Defined members of structure and usage on read/write access:
>>>>>>>>     * device_state: (read/write)
>>>>>>>>         To convey VFIO device state to be transitioned to. Only 3 bits are used
>>>>>>>>         as of now.
>>>>>>>>     * pending bytes: (read only)
>>>>>>>>         To get pending bytes yet to be migrated for VFIO device.
>>>>>>>>     * data_offset: (read only)
> >>>>>>>>         To get data offset in migration from where data exists during _SAVING
>>>>>>>>         and from where data should be written by user space application during
>>>>>>>>          _RESUMING state
>>>>>>>>     * data_size: (read/write)
>>>>>>>>         To get and set size of data copied in migration region during _SAVING
>>>>>>>>         and _RESUMING state.
>>>>>>>>     * start_pfn, page_size, total_pfns: (write only)
>>>>>>>>         To get bitmap of dirty pages from vendor driver from given
>>>>>>>>         start address for total_pfns.
>>>>>>>>     * copied_pfns: (read only)
>>>>>>>>         To get number of pfns bitmap copied in migration region.
>>>>>>>>         Vendor driver should copy the bitmap with bits set only for
>>>>>>>>         pages to be marked dirty in migration region. Vendor driver
>>>>>>>>         should return 0 if there are 0 pages dirty in requested
>>>>>>>>         range. Vendor driver should return -1 to mark all pages in the section
>>>>>>>>         as dirty
>>>>>>>>
>>>>>>>> Migration region looks like:
>>>>>>>>  ------------------------------------------------------------------
>>>>>>>> |vfio_device_migration_info|    data section                      |
>>>>>>>> |                          |     ///////////////////////////////  |
>>>>>>>>  ------------------------------------------------------------------
>>>>>>>>  ^                              ^                              ^
>>>>>>>>  offset 0-trapped part        data_offset                 data_size
>>>>>>>>
> >>>>>>>> Data section always follows the vfio_device_migration_info
> >>>>>>>> structure in the region, so data_offset will always be non-0.
>>>>>>>> Offset from where data is copied is decided by kernel driver, data
>>>>>>>> section can be trapped or mapped depending on how kernel driver
>>>>>>>> defines data section. If mmapped, then data_offset should be page
>>>>>>>> aligned, where as initial section which contain
>>>>>>>> vfio_device_migration_info structure might not end at offset which
>>>>>>>> is page aligned.
>>>>>>>>
>>>>>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>>>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>>>>>>> ---
>>>>>>>>  linux-headers/linux/vfio.h | 71 ++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>  1 file changed, 71 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>>>>>>>> index 24f505199f83..274ec477eb82 100644
>>>>>>>> --- a/linux-headers/linux/vfio.h
>>>>>>>> +++ b/linux-headers/linux/vfio.h
>>>>>>>> @@ -372,6 +372,77 @@ struct vfio_region_gfx_edid {
>>>>>>>>   */
>>>>>>>>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
>>>>>>>>  
>>>>>>>> +/* Migration region type and sub-type */
>>>>>>>> +#define VFIO_REGION_TYPE_MIGRATION	        (2)
>>>>>>>> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * Structure vfio_device_migration_info is placed at 0th offset of
>>>>>>>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
>>>>>>>> + * information. Field accesses from this structure are only supported at their
>>>>>>>> + * native width and alignment; otherwise accesses should return an error.
>>>>>>>> + *
>>>>>>>> + * device_state: (read/write)
>>>>>>>> + *      To indicate to the vendor driver the state the VFIO device should be
>>>>>>>> + *      transitioned to. If the state transition fails, a write to this field
>>>>>>>> + *      returns an error. It consists of 3 bits:
>>>>>>>> + *      - If bit 0 is set, it indicates the _RUNNING state. When it is cleared,
>>>>>>>> + *        that indicates the _STOPPED state. When the device is changed to
>>>>>>>> + *        _STOPPED, the driver should stop the device before the write returns.
>>>>>>>> + *      - If bit 1 set, indicates _SAVING state.
>>>>>>>> + *      - If bit 2 set, indicates _RESUMING state.
>>>>>>>> + *
>>>>>>>> + * pending bytes: (read only)
>>>>>>>> + *      Read pending bytes yet to be migrated from vendor driver
>>>>>>>> + *
>>>>>>>> + * data_offset: (read only)
>>>>>>>> + *      User application should read data_offset in migration region from where
>>>>>>>> + *      user application should read data during _SAVING state or write data
>>>>>>>> + *      during _RESUMING state.
>>>>>>>> + *
>>>>>>>> + * data_size: (read/write)
>>>>>>>> + *      User application should read data_size to know data copied in migration
>>>>>>>> + *      region during _SAVING state and write size of data copied in migration
>>>>>>>> + *      region during _RESUMING state.
>>>>>>>> + *
>>>>>>>> + * start_pfn: (write only)
>>>>>>>> + *      Start address pfn to get bitmap of dirty pages from vendor driver during
>>>>>>>> + *      _SAVING state.
>>>>>>>> + *
>>>>>>>> + * page_size: (write only)
>>>>>>>> + *      User application should write the page_size of pfn.
>>>>>>>> + *
>>>>>>>> + * total_pfns: (write only)
>>>>>>>> + *      Total pfn count from start_pfn for which dirty bitmap is requested.
>>>>>>>> + *
>>>>>>>> + * copied_pfns: (read only)
>>>>>>>> + *      pfn count for which dirty bitmap is copied to migration region.
>>>>>>>> + *      Vendor driver should copy the bitmap with bits set only for pages to be
>>>>>>>> + *      marked dirty in migration region.
>>>>>>>> + *      Vendor driver should return 0 if there are 0 pages dirty in requested
>>>>>>>> + *      range.
>>>>>>>> + *      Vendor driver should return -1 to mark all pages in the section as
>>>>>>>> + *      dirty.      
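
For reference, the layout these field descriptions imply is roughly the
following (a reconstruction for readability; the struct definition itself
is trimmed from the quoted diff, and the reserved padding field is an
assumption):

    struct vfio_device_migration_info {
            __u32 device_state;   /* _RUNNING / _SAVING / _RESUMING bits */
            __u32 reserved;       /* assumed padding for 64-bit alignment */
            __u64 pending_bytes;  /* (read)  bytes still to be migrated */
            __u64 data_offset;    /* (read)  offset of data in the region */
            __u64 data_size;      /* (r/w)   size of data in the data section */
            __u64 start_pfn;      /* (write) first pfn of requested bitmap */
            __u64 page_size;      /* (write) page size the pfns refer to */
            __u64 total_pfns;     /* (write) number of pfns requested */
            __u64 copied_pfns;    /* (read)  pfns whose bitmap is available */
    };
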
>>>>>>>
>>>>>>> Is the protocol that the user writes start_pfn/page_size/total_pfns in
>>>>>>> any order and then the read of copied_pfns is what triggers the
>>>>>>> snapshot?      
>>>>>>
>>>>>> Yes.
>>>>>>    
>>>>>>>  Are start_pfn/page_size/total_pfns sticky such that a user
>>>>>>> can write them once and get repeated refreshes of the dirty bitmap by
>>>>>>> re-reading copied_pfns?      
>>>>>>
>>>>>> Yes, and that bitmap should be for the given range (from start_pfn till
>>>>>> start_pfn + total_pfns).
>>>>>> Re-reading of copied_pfns handles the case where the vendor driver
>>>>>> reserved an area for the bitmap that is smaller than the total bitmap
>>>>>> size for the range (start_pfn to start_pfn + total_pfns); the user then
>>>>>> has to iterate till copied_pfns == total_pfns or till copied_pfns == 0
>>>>>> (that is, there are no dirty pages in the rest of the range)
>>>>>
>>>>> So reading copied_pfns triggers the data range to be updated, but the
>>>>> caller cannot assume it to be synchronous and uses total_pfns to poll
>>>>> that the update is complete?  How does the vendor driver differentiate
>>>>> the user polling for the previous update to finish versus requesting a
>>>>> new update?
>>>>>     
>>>>
>>>> A write to start_pfn/page_size/total_pfns followed by a read of
>>>> copied_pfns indicates a new update, whereas a sequential read of
>>>> copied_pfns indicates polling for the previous update.
>>>
>>> Hmm, this seems to contradict the answer to my question above where I
>>> ask if the write fields are sticky so a user can trigger a refresh via
>>> copied_pfns.  
>>
>> Sorry, how does it contradict? Pasting it again below:
>>>>>>>  Are start_pfn/page_size/total_pfns sticky such that a user
>>>>>>> can write them once and get repeated refreshes of the dirty bitmap by
>>>>>>> re-reading copied_pfns?  
>>>>>>
>>>>>> Yes, and that bitmap should be for the given range (from start_pfn till
>>>>>> start_pfn + total_pfns).
>>>>>> Re-reading of copied_pfns handles the case where the vendor driver
>>>>>> reserved an area for the bitmap that is smaller than the total bitmap
>>>>>> size for the range (start_pfn to start_pfn + total_pfns); the user then
>>>>>> has to iterate till copied_pfns == total_pfns or till copied_pfns == 0
>>>>>> (that is, there are no dirty pages in the rest of the range)
> 
> Sorry, I guess I misinterpreted again.  So the vendor driver can return
> copied_pfns < total_pfns if it has a buffer limitation, not as an
> indication of its background progress in writing out the bitmap.  Just
> as a proof of concept, let's say the vendor driver has a 1 bit buffer
> and I write 0 to start_pfn and 3 to total_pfns.  I read copied_pfns,
> which returns 1, so I read data_offset to find where this 1 bit is
> located and then read my bit from that location.  This is the dirty
> state of the first pfn.  I read copied_pfns again and it reports 2,

It should report 1 to indicate it has data for one pfn.

> I again read data_offset to find where the data is located, and it's my
> job to remember that I've already read 1 bit, so 2 means there's only 1
> bit available and it's the second pfn.

No.
Here 'I' means the user application, right?
The user application knows for how many pfns it has already received the
bitmap, i.e. see 'count' in function vfio_get_dirty_page_list().

Here copied_pfns is the number of pfns for which the bitmap is available
in the buffer. The start address for that bitmap is then calculated by
the user application as:
((start_pfn + count) * page_size)

Then QEMU calls:

cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
                                       (start_pfn + count) * page_size,
                                        copied_pfns);
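
A minimal sketch of that whole iteration on the user application side,
assuming hypothetical region_read()/region_write()/region_read_buf()
helpers for accessing the trapped fields (the real code is
vfio_get_dirty_page_list(); only cpu_physical_memory_set_dirty_lebitmap()
here is an actual QEMU call):

    uint64_t count = 0, copied_pfns, data_offset;

    /* Sticky parameters: written once for the whole requested range */
    region_write(region, START_PFN_OFFSET, start_pfn);
    region_write(region, PAGE_SIZE_OFFSET, page_size);
    region_write(region, TOTAL_PFNS_OFFSET, total_pfns);

    do {
        /* The read triggers/polls the bitmap for the remaining range */
        copied_pfns = region_read(region, COPIED_PFNS_OFFSET);
        if (copied_pfns == 0) {
            break;              /* no dirty pages in the rest of the range */
        }

        data_offset = region_read(region, DATA_OFFSET_OFFSET);
        region_read_buf(region, data_offset, buf,
                        BITS_TO_LONGS(copied_pfns) * sizeof(long));

        cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
                                               (start_pfn + count) * page_size,
                                               copied_pfns);
        count += copied_pfns;
    } while (count < total_pfns);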

>  I read the bit.  I again read
> copied_pfns, which now reports 3, I read data_offset to find the
> location of the data, I remember that I've already read 2 bits, so I
> read my bit into the 3rd pfn.  This seems rather clumsy.
>

Hope above explanation helps.

> Now that copied_pfns == total_pfns, what happens if I read copied_pfns
> again?  This is actually what I thought I was asking previously.
> 

It should return 0.

> Should we expose the pfn buffer size and fault on writes of larger than that
> size, requiring the user to iterate start_pfn themselves?

Who should fault, the vendor driver or the user application?

Here the vendor driver is writing data to the data section.
In the steps in this patch-set, the user application is incrementing
start_pfn by adding the copied_pfns count.

>  Are there
> any operations where the user can assume data_offset is constant?  Thanks,
> 

We introduced data_offset precisely to avoid such an assumption; better
not to rely on it anywhere.

Thanks,
Kirti


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH v4 01/13] vfio: KABI for migration interface
  2019-06-24 15:00                 ` Kirti Wankhede
@ 2019-06-24 15:25                   ` Alex Williamson
  2019-06-24 18:52                     ` Kirti Wankhede
  0 siblings, 1 reply; 64+ messages in thread
From: Alex Williamson @ 2019-06-24 15:25 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Mon, 24 Jun 2019 20:30:08 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 6/22/2019 3:31 AM, Alex Williamson wrote:
> > On Sat, 22 Jun 2019 02:00:08 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:  
> >> On 6/22/2019 1:30 AM, Alex Williamson wrote:  
> >>> On Sat, 22 Jun 2019 01:05:48 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>     
> >>>> On 6/21/2019 8:33 PM, Alex Williamson wrote:    
> >>>>> On Fri, 21 Jun 2019 11:22:15 +0530
> >>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>       
> >>>>>> On 6/20/2019 10:48 PM, Alex Williamson wrote:      
> >>>>>>> On Thu, 20 Jun 2019 20:07:29 +0530
> >>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>>>         
> >>>>>>>> - Defined MIGRATION region type and sub-type.
> >>>>>>>> - Used 3 bits to define VFIO device states.
> >>>>>>>>     Bit 0 => _RUNNING
> >>>>>>>>     Bit 1 => _SAVING
> >>>>>>>>     Bit 2 => _RESUMING
> >>>>>>>>     Combination of these bits defines VFIO device's state during migration
> >>>>>>>>     _STOPPED => All bits 0 indicates VFIO device stopped.
> >>>>>>>>     _RUNNING => Normal VFIO device running state.
> >>>>>>>>     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but start
> >>>>>>>>                           saving state of device i.e. pre-copy state
> >>>>>>>>     _SAVING  => vCPUs are stopped, VFIO device should be stopped and
> >>>>>>>>                           save device state, i.e. stop-and-copy state
> >>>>>>>>     _RESUMING => VFIO device resuming state.
> >>>>>>>>     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING bits are set
> >>>>>>>> - Defined vfio_device_migration_info structure which will be placed at 0th
> >>>>>>>>   offset of migration region to get/set VFIO device related information.
> >>>>>>>>   Defined members of structure and usage on read/write access:
> >>>>>>>>     * device_state: (read/write)
> >>>>>>>>         To convey VFIO device state to be transitioned to. Only 3 bits are used
> >>>>>>>>         as of now.
> >>>>>>>>     * pending bytes: (read only)
> >>>>>>>>         To get pending bytes yet to be migrated for VFIO device.
> >>>>>>>>     * data_offset: (read only)
> >>>>>>>>         To get data offset in migration from where data exist during _SAVING
> >>>>>>>>         and from where data should be written by user space application during
> >>>>>>>>          _RESUMING state
> >>>>>>>>     * data_size: (read/write)
> >>>>>>>>         To get and set size of data copied in migration region during _SAVING
> >>>>>>>>         and _RESUMING state.
> >>>>>>>>     * start_pfn, page_size, total_pfns: (write only)
> >>>>>>>>         To get bitmap of dirty pages from vendor driver from given
> >>>>>>>>         start address for total_pfns.
> >>>>>>>>     * copied_pfns: (read only)
> >>>>>>>>         To get number of pfns bitmap copied in migration region.
> >>>>>>>>         Vendor driver should copy the bitmap with bits set only for
> >>>>>>>>         pages to be marked dirty in migration region. Vendor driver
> >>>>>>>>         should return 0 if there are 0 pages dirty in requested
> >>>>>>>>         range. Vendor driver should return -1 to mark all pages in the section
> >>>>>>>>         as dirty
> >>>>>>>>
> >>>>>>>> Migration region looks like:
> >>>>>>>>  ------------------------------------------------------------------
> >>>>>>>> |vfio_device_migration_info|    data section                      |
> >>>>>>>> |                          |     ///////////////////////////////  |
> >>>>>>>>  ------------------------------------------------------------------
> >>>>>>>>  ^                              ^                              ^
> >>>>>>>>  offset 0-trapped part        data_offset                 data_size
> >>>>>>>>
> >>>>>>>> The vfio_device_migration_info structure is always followed by the
> >>>>>>>> data section in the region, so data_offset will always be non-0.
> >>>>>>>> The offset from where data is copied is decided by the kernel driver;
> >>>>>>>> the data section can be trapped or mapped depending on how the kernel
> >>>>>>>> driver defines it. If mmapped, then data_offset should be page
> >>>>>>>> aligned, whereas the initial section which contains the
> >>>>>>>> vfio_device_migration_info structure might not end at a page-aligned
> >>>>>>>> offset.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>>>>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >>>>>>>> ---
> >>>>>>>>  linux-headers/linux/vfio.h | 71 ++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>>>  1 file changed, 71 insertions(+)
> >>>>>>>>
> >>>>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >>>>>>>> index 24f505199f83..274ec477eb82 100644
> >>>>>>>> --- a/linux-headers/linux/vfio.h
> >>>>>>>> +++ b/linux-headers/linux/vfio.h
> >>>>>>>> @@ -372,6 +372,77 @@ struct vfio_region_gfx_edid {
> >>>>>>>>   */
> >>>>>>>>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
> >>>>>>>>  
> >>>>>>>> +/* Migration region type and sub-type */
> >>>>>>>> +#define VFIO_REGION_TYPE_MIGRATION	        (2)
> >>>>>>>> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> >>>>>>>> +
> >>>>>>>> +/**
> >>>>>>>> + * Structure vfio_device_migration_info is placed at 0th offset of
> >>>>>>>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> >>>>>>>> + * information. Field accesses from this structure are only supported at their
> >>>>>>>> + * native width and alignment; otherwise accesses should return an error.
> >>>>>>>> + *
> >>>>>>>> + * device_state: (read/write)
> >>>>>>>> + *      To indicate to the vendor driver the state the VFIO device should be
> >>>>>>>> + *      transitioned to. If the state transition fails, a write to this field
> >>>>>>>> + *      returns an error. It consists of 3 bits:
> >>>>>>>> + *      - If bit 0 is set, it indicates the _RUNNING state. When it is cleared,
> >>>>>>>> + *        that indicates the _STOPPED state. When the device is changed to
> >>>>>>>> + *        _STOPPED, the driver should stop the device before the write returns.
> >>>>>>>> + *      - If bit 1 set, indicates _SAVING state.
> >>>>>>>> + *      - If bit 2 set, indicates _RESUMING state.
> >>>>>>>> + *
> >>>>>>>> + * pending bytes: (read only)
> >>>>>>>> + *      Read pending bytes yet to be migrated from vendor driver
> >>>>>>>> + *
> >>>>>>>> + * data_offset: (read only)
> >>>>>>>> + *      User application should read data_offset in migration region from where
> >>>>>>>> + *      user application should read data during _SAVING state or write data
> >>>>>>>> + *      during _RESUMING state.
> >>>>>>>> + *
> >>>>>>>> + * data_size: (read/write)
> >>>>>>>> + *      User application should read data_size to know data copied in migration
> >>>>>>>> + *      region during _SAVING state and write size of data copied in migration
> >>>>>>>> + *      region during _RESUMING state.
> >>>>>>>> + *
> >>>>>>>> + * start_pfn: (write only)
> >>>>>>>> + *      Start address pfn to get bitmap of dirty pages from vendor driver during
> >>>>>>>> + *      _SAVING state.
> >>>>>>>> + *
> >>>>>>>> + * page_size: (write only)
> >>>>>>>> + *      User application should write the page_size of pfn.
> >>>>>>>> + *
> >>>>>>>> + * total_pfns: (write only)
> >>>>>>>> + *      Total pfn count from start_pfn for which dirty bitmap is requested.
> >>>>>>>> + *
> >>>>>>>> + * copied_pfns: (read only)
> >>>>>>>> + *      pfn count for which dirty bitmap is copied to migration region.
> >>>>>>>> + *      Vendor driver should copy the bitmap with bits set only for pages to be
> >>>>>>>> + *      marked dirty in migration region.
> >>>>>>>> + *      Vendor driver should return 0 if there are 0 pages dirty in requested
> >>>>>>>> + *      range.
> >>>>>>>> + *      Vendor driver should return -1 to mark all pages in the section as
> >>>>>>>> + *      dirty.        
> >>>>>>>
> >>>>>>> Is the protocol that the user writes start_pfn/page_size/total_pfns in
> >>>>>>> any order and then the read of copied_pfns is what triggers the
> >>>>>>> snapshot?        
> >>>>>>
> >>>>>> Yes.
> >>>>>>      
> >>>>>>>  Are start_pfn/page_size/total_pfns sticky such that a user
> >>>>>>> can write them once and get repeated refreshes of the dirty bitmap by
> >>>>>>> re-reading copied_pfns?        
> >>>>>>
> >>>>>> Yes, and that bitmap should be for the given range (from start_pfn till
> >>>>>> start_pfn + total_pfns).
> >>>>>> Re-reading of copied_pfns handles the case where the vendor driver
> >>>>>> reserved an area for the bitmap that is smaller than the total bitmap
> >>>>>> size for the range (start_pfn to start_pfn + total_pfns); the user then
> >>>>>> has to iterate till copied_pfns == total_pfns or till copied_pfns == 0
> >>>>>> (that is, there are no dirty pages in the rest of the range)
> >>>>>
> >>>>> So reading copied_pfns triggers the data range to be updated, but the
> >>>>> caller cannot assume it to be synchronous and uses total_pfns to poll
> >>>>> that the update is complete?  How does the vendor driver differentiate
> >>>>> the user polling for the previous update to finish versus requesting a
> >>>>> new update?
> >>>>>       
> >>>>
> >>>>>> A write to start_pfn/page_size/total_pfns followed by a read of
> >>>>>> copied_pfns indicates a new update, whereas a sequential read of
> >>>>>> copied_pfns indicates polling for the previous update.
> >>>
> >>> Hmm, this seems to contradict the answer to my question above where I
> >>> ask if the write fields are sticky so a user can trigger a refresh via
> >>> copied_pfns.    
> >>
> >> Sorry, how does it contradict? Pasting it again below:
> >>>>>>>  Are start_pfn/page_size/total_pfns sticky such that a user
> >>>>>>> can write them once and get repeated refreshes of the dirty bitmap by
> >>>>>>> re-reading copied_pfns?    
> >>>>>>
> >>>>>> Yes, and that bitmap should be for the given range (from start_pfn till
> >>>>>> start_pfn + total_pfns).
> >>>>>> Re-reading of copied_pfns handles the case where the vendor driver
> >>>>>> reserved an area for the bitmap that is smaller than the total bitmap
> >>>>>> size for the range (start_pfn to start_pfn + total_pfns); the user then
> >>>>>> has to iterate till copied_pfns == total_pfns or till copied_pfns == 0
> >>>>>> (that is, there are no dirty pages in the rest of the range)
> > 
> > Sorry, I guess I misinterpreted again.  So the vendor driver can return
> > copied_pfns < total_pfns if it has a buffer limitation, not as an
> > indication of its background progress in writing out the bitmap.  Just
> > as a proof of concept, let's say the vendor driver has a 1 bit buffer
> > and I write 0 to start_pfn and 3 to total_pfns.  I read copied_pfns,
> > which returns 1, so I read data_offset to find where this 1 bit is
> > located and then read my bit from that location.  This is the dirty
> > state of the first pfn.  I read copied_pfns again and it reports 2,  
> 
> It should report 1 to indicate it has data for one pfn.
> 
> > I again read data_offset to find where the data is located, and it's my
> > job to remember that I've already read 1 bit, so 2 means there's only 1
> > bit available and it's the second pfn.  
> 
> No.
> Here 'I' means the user application, right?

Yes

> The user application knows for how many pfns it has already received the
> bitmap, i.e. see 'count' in function vfio_get_dirty_page_list().
> 
> Here copied_pfns is the number of pfns for which the bitmap is available
> in the buffer. The start address for that bitmap is then calculated by
> the user application as:
> ((start_pfn + count) * page_size)
> 
> Then QEMU calls:
> 
> cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
>                                        (start_pfn + count) * page_size,
>                                         copied_pfns);
> 
> >  I read the bit.  I again read
> > copied_pfns, which now reports 3, I read data_offset to find the
> > location of the data, I remember that I've already read 2 bits, so I
> > read my bit into the 3rd pfn.  This seems rather clumsy.
> >  
> 
> Hope above explanation helps.

Still seems rather clumsy: the knowledge of which bit(s) are available
in the buffer can only come from foreknowledge of which bits have
already been read.  That seems error-prone for both the user and the
vendor driver to stay in sync.

> > Now that copied_pfns == total_pfns, what happens if I read copied_pfns
> > again?  This is actually what I thought I was asking previously.
> >   
> 
> It should return 0.

Are we assuming no new pages have been dirtied?  What if pages have
been dirtied?

> > Should we expose the pfn buffer size and fault on writes of larger than that
> > size, requiring the user to iterate start_pfn themselves?  
> 
> Who should fault, the vendor driver or the user application?
> 
> Here the vendor driver is writing data to the data section.
> In the steps in this patch-set, the user application is incrementing
> start_pfn by adding the copied_pfns count.

The user app is writing total_pfns to get a range, correct?  The vendor
driver could return errno on that write if total_pfns exceeds the
available buffer size.
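
A sketch of that suggestion on the vendor-driver side (the handler and
state-structure names here are illustrative, not from any posted patch):

    /* In the vendor driver's write handler for total_pfns */
    static int handle_total_pfns_write(struct my_mdev_state *mstate,
                                       u64 total_pfns)
    {
        /* pfns whose bitmap fits in the buffer reserved by the driver */
        u64 max_pfns = mstate->bitmap_buf_size * BITS_PER_BYTE;

        if (total_pfns > max_pfns)
            return -ENOSPC;     /* user must retry with a smaller range */

        mstate->total_pfns = total_pfns;
        return 0;
    }
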

> >  Are there
> > any operations where the user can assume data_offset is constant?  Thanks,
> >   
> 
> We introduced data_offset precisely to avoid such an assumption; better
> not to rely on it anywhere.

I agree it gives you flexibility; it'll be interesting to see how GVT-g
would make use of it, but it also seems very cumbersome for the app,
especially if we choose to implement this iterative read approach
above.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH v4 01/13] vfio: KABI for migration interface
  2019-06-24 15:25                   ` Alex Williamson
@ 2019-06-24 18:52                     ` Kirti Wankhede
  2019-06-24 19:01                       ` Alex Williamson
  0 siblings, 1 reply; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-24 18:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue



On 6/24/2019 8:55 PM, Alex Williamson wrote:
> On Mon, 24 Jun 2019 20:30:08 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 6/22/2019 3:31 AM, Alex Williamson wrote:
>>> On Sat, 22 Jun 2019 02:00:08 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:  
>>>> On 6/22/2019 1:30 AM, Alex Williamson wrote:  
>>>>> On Sat, 22 Jun 2019 01:05:48 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>     
>>>>>> On 6/21/2019 8:33 PM, Alex Williamson wrote:    
>>>>>>> On Fri, 21 Jun 2019 11:22:15 +0530
>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>       
>>>>>>>> On 6/20/2019 10:48 PM, Alex Williamson wrote:      
>>>>>>>>> On Thu, 20 Jun 2019 20:07:29 +0530
>>>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>>>         
>>>>>>>>>> - Defined MIGRATION region type and sub-type.
>>>>>>>>>> - Used 3 bits to define VFIO device states.
>>>>>>>>>>     Bit 0 => _RUNNING
>>>>>>>>>>     Bit 1 => _SAVING
>>>>>>>>>>     Bit 2 => _RESUMING
>>>>>>>>>>     Combination of these bits defines VFIO device's state during migration
>>>>>>>>>>     _STOPPED => All bits 0 indicates VFIO device stopped.
>>>>>>>>>>     _RUNNING => Normal VFIO device running state.
>>>>>>>>>>     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but start
>>>>>>>>>>                           saving state of device i.e. pre-copy state
>>>>>>>>>>     _SAVING  => vCPUs are stopped, VFIO device should be stopped and
>>>>>>>>>>                           save device state, i.e. stop-and-copy state
>>>>>>>>>>     _RESUMING => VFIO device resuming state.
>>>>>>>>>>     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING bits are set
>>>>>>>>>> - Defined vfio_device_migration_info structure which will be placed at 0th
>>>>>>>>>>   offset of migration region to get/set VFIO device related information.
>>>>>>>>>>   Defined members of structure and usage on read/write access:
>>>>>>>>>>     * device_state: (read/write)
>>>>>>>>>>         To convey VFIO device state to be transitioned to. Only 3 bits are used
>>>>>>>>>>         as of now.
>>>>>>>>>>     * pending bytes: (read only)
>>>>>>>>>>         To get pending bytes yet to be migrated for VFIO device.
>>>>>>>>>>     * data_offset: (read only)
>>>>>>>>>>         To get data offset in migration from where data exist during _SAVING
>>>>>>>>>>         and from where data should be written by user space application during
>>>>>>>>>>          _RESUMING state
>>>>>>>>>>     * data_size: (read/write)
>>>>>>>>>>         To get and set size of data copied in migration region during _SAVING
>>>>>>>>>>         and _RESUMING state.
>>>>>>>>>>     * start_pfn, page_size, total_pfns: (write only)
>>>>>>>>>>         To get bitmap of dirty pages from vendor driver from given
>>>>>>>>>>         start address for total_pfns.
>>>>>>>>>>     * copied_pfns: (read only)
>>>>>>>>>>         To get number of pfns bitmap copied in migration region.
>>>>>>>>>>         Vendor driver should copy the bitmap with bits set only for
>>>>>>>>>>         pages to be marked dirty in migration region. Vendor driver
>>>>>>>>>>         should return 0 if there are 0 pages dirty in requested
>>>>>>>>>>         range. Vendor driver should return -1 to mark all pages in the section
>>>>>>>>>>         as dirty
>>>>>>>>>>
>>>>>>>>>> Migration region looks like:
>>>>>>>>>>  ------------------------------------------------------------------
>>>>>>>>>> |vfio_device_migration_info|    data section                      |
>>>>>>>>>> |                          |     ///////////////////////////////  |
>>>>>>>>>>  ------------------------------------------------------------------
>>>>>>>>>>  ^                              ^                              ^
>>>>>>>>>>  offset 0-trapped part        data_offset                 data_size
>>>>>>>>>>
>>>>>>>>>> The vfio_device_migration_info structure is always followed by the
>>>>>>>>>> data section in the region, so data_offset will always be non-0.
>>>>>>>>>> The offset from where data is copied is decided by the kernel driver;
>>>>>>>>>> the data section can be trapped or mapped depending on how the kernel
>>>>>>>>>> driver defines it. If mmapped, then data_offset should be page
>>>>>>>>>> aligned, whereas the initial section which contains the
>>>>>>>>>> vfio_device_migration_info structure might not end at a page-aligned
>>>>>>>>>> offset.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>>>>>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>>>>>>>>> ---
>>>>>>>>>>  linux-headers/linux/vfio.h | 71 ++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>  1 file changed, 71 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>>>>>>>>>> index 24f505199f83..274ec477eb82 100644
>>>>>>>>>> --- a/linux-headers/linux/vfio.h
>>>>>>>>>> +++ b/linux-headers/linux/vfio.h
>>>>>>>>>> @@ -372,6 +372,77 @@ struct vfio_region_gfx_edid {
>>>>>>>>>>   */
>>>>>>>>>>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
>>>>>>>>>>  
>>>>>>>>>> +/* Migration region type and sub-type */
>>>>>>>>>> +#define VFIO_REGION_TYPE_MIGRATION	        (2)
>>>>>>>>>> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>> + * Structure vfio_device_migration_info is placed at 0th offset of
>>>>>>>>>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
>>>>>>>>>> + * information. Field accesses from this structure are only supported at their
>>>>>>>>>> + * native width and alignment; otherwise accesses should return an error.
>>>>>>>>>> + *
>>>>>>>>>> + * device_state: (read/write)
>>>>>>>>>> + *      To indicate to the vendor driver the state the VFIO device should be
>>>>>>>>>> + *      transitioned to. If the state transition fails, a write to this field
>>>>>>>>>> + *      returns an error. It consists of 3 bits:
>>>>>>>>>> + *      - If bit 0 is set, it indicates the _RUNNING state. When it is cleared,
>>>>>>>>>> + *        that indicates the _STOPPED state. When the device is changed to
>>>>>>>>>> + *        _STOPPED, the driver should stop the device before the write returns.
>>>>>>>>>> + *      - If bit 1 set, indicates _SAVING state.
>>>>>>>>>> + *      - If bit 2 set, indicates _RESUMING state.
>>>>>>>>>> + *
>>>>>>>>>> + * pending bytes: (read only)
>>>>>>>>>> + *      Read pending bytes yet to be migrated from vendor driver
>>>>>>>>>> + *
>>>>>>>>>> + * data_offset: (read only)
>>>>>>>>>> + *      User application should read data_offset in migration region from where
>>>>>>>>>> + *      user application should read data during _SAVING state or write data
>>>>>>>>>> + *      during _RESUMING state.
>>>>>>>>>> + *
>>>>>>>>>> + * data_size: (read/write)
>>>>>>>>>> + *      User application should read data_size to know data copied in migration
>>>>>>>>>> + *      region during _SAVING state and write size of data copied in migration
>>>>>>>>>> + *      region during _RESUMING state.
>>>>>>>>>> + *
>>>>>>>>>> + * start_pfn: (write only)
>>>>>>>>>> + *      Start address pfn to get bitmap of dirty pages from vendor driver during
>>>>>>>>>> + *      _SAVING state.
>>>>>>>>>> + *
>>>>>>>>>> + * page_size: (write only)
>>>>>>>>>> + *      User application should write the page_size of pfn.
>>>>>>>>>> + *
>>>>>>>>>> + * total_pfns: (write only)
>>>>>>>>>> + *      Total pfn count from start_pfn for which dirty bitmap is requested.
>>>>>>>>>> + *
>>>>>>>>>> + * copied_pfns: (read only)
>>>>>>>>>> + *      pfn count for which dirty bitmap is copied to migration region.
>>>>>>>>>> + *      Vendor driver should copy the bitmap with bits set only for pages to be
>>>>>>>>>> + *      marked dirty in migration region.
>>>>>>>>>> + *      Vendor driver should return 0 if there are 0 pages dirty in requested
>>>>>>>>>> + *      range.
>>>>>>>>>> + *      Vendor driver should return -1 to mark all pages in the section as
>>>>>>>>>> + *      dirty.        
>>>>>>>>>
>>>>>>>>> Is the protocol that the user writes start_pfn/page_size/total_pfns in
>>>>>>>>> any order and then the read of copied_pfns is what triggers the
>>>>>>>>> snapshot?        
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>      
>>>>>>>>>  Are start_pfn/page_size/total_pfns sticky such that a user
>>>>>>>>> can write them once and get repeated refreshes of the dirty bitmap by
>>>>>>>>> re-reading copied_pfns?        
>>>>>>>>
>>>>>>>> Yes, and that bitmap should be for the given range (from start_pfn till
>>>>>>>> start_pfn + total_pfns).
>>>>>>>> Re-reading of copied_pfns handles the case where the vendor driver
>>>>>>>> reserved an area for the bitmap that is smaller than the total bitmap
>>>>>>>> size for the range (start_pfn to start_pfn + total_pfns); the user then
>>>>>>>> has to iterate till copied_pfns == total_pfns or till copied_pfns == 0
>>>>>>>> (that is, there are no dirty pages in the rest of the range)
>>>>>>>
>>>>>>> So reading copied_pfns triggers the data range to be updated, but the
>>>>>>> caller cannot assume it to be synchronous and uses total_pfns to poll
>>>>>>> that the update is complete?  How does the vendor driver differentiate
>>>>>>> the user polling for the previous update to finish versus requesting a
>>>>>>> new update?
>>>>>>>       
>>>>>>
>>>>>> A write to start_pfn/page_size/total_pfns followed by a read of
>>>>>> copied_pfns indicates a new update, whereas a sequential read of
>>>>>> copied_pfns indicates polling for the previous update.
>>>>>
>>>>> Hmm, this seems to contradict the answer to my question above where I
>>>>> ask if the write fields are sticky so a user can trigger a refresh via
>>>>> copied_pfns.    
>>>>
>>>> Sorry, how does it contradict? Pasting it again below:
>>>>>>>>>  Are start_pfn/page_size/total_pfns sticky such that a user
>>>>>>>>> can write them once and get repeated refreshes of the dirty bitmap by
>>>>>>>>> re-reading copied_pfns?    
>>>>>>>>
>>>>>>>> Yes, and that bitmap should be for the given range (from start_pfn till
>>>>>>>> start_pfn + total_pfns).
>>>>>>>> Re-reading of copied_pfns handles the case where the vendor driver
>>>>>>>> reserved an area for the bitmap that is smaller than the total bitmap
>>>>>>>> size for the range (start_pfn to start_pfn + total_pfns); the user then
>>>>>>>> has to iterate till copied_pfns == total_pfns or till copied_pfns == 0
>>>>>>>> (that is, there are no dirty pages in the rest of the range)
>>>
>>> Sorry, I guess I misinterpreted again.  So the vendor driver can return
>>> copied_pfns < total_pfns if it has a buffer limitation, not as an
>>> indication of its background progress in writing out the bitmap.  Just
>>> as a proof of concept, let's say the vendor driver has a 1 bit buffer
>>> and I write 0 to start_pfn and 3 to total_pfns.  I read copied_pfns,
>>> which returns 1, so I read data_offset to find where this 1 bit is
>>> located and then read my bit from that location.  This is the dirty
>>> state of the first pfn.  I read copied_pfns again and it reports 2,  
>>
>> It should report 1 to indicate it has data for one pfn.
>>
>>> I again read data_offset to find where the data is located, and it's my
>>> job to remember that I've already read 1 bit, so 2 means there's only 1
>>> bit available and it's the second pfn.  
>>
>> No.
>> Here 'I' means the user application, right?
> 
> Yes
> 
>> The user application knows for how many pfns it has already received the
>> bitmap, i.e. see 'count' in function vfio_get_dirty_page_list().
>>
>> Here copied_pfns is the number of pfns for which the bitmap is available
>> in the buffer. The start address for that bitmap is then calculated by
>> the user application as:
>> ((start_pfn + count) * page_size)
>>
>> Then QEMU calls:
>>
>> cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
>>                                        (start_pfn + count) * page_size,
>>                                         copied_pfns);
>>
>>>  I read the bit.  I again read
>>> copied_pfns, which now reports 3, I read data_offset to find the
>>> location of the data, I remember that I've already read 2 bits, so I
>>> read my bit into the 3rd pfn.  This seems rather clumsy.
>>>  
>>
>> Hope above explanation helps.
> 
> Still seems rather clumsy: the knowledge of which bit(s) are available
> in the buffer can only come from foreknowledge of which bits have
> already been read.  That seems error-prone for both the user and the
> vendor driver to stay in sync.
> 
>>> Now that copied_pfns == total_pfns, what happens if I read copied_pfns
>>> again?  This is actually what I thought I was asking previously.
>>>   
>>
>> It should return 0.
> 
> Are we assuming no new pages have been dirtied?  What if pages have
> been dirtied?
> 
>>> Should we expose the pfn buffer size and fault on writes of larger than that
>>> size, requiring the user to iterate start_pfn themselves?  
>>
>> Who should fault, the vendor driver or the user application?
>>
>> Here the vendor driver is writing data to the data section.
>> In the steps in this patch-set, the user application is incrementing
>> start_pfn by adding the copied_pfns count.
> 
> The user app is writing total_pfns to get a range, correct?  The vendor
> driver could return errno on that write if total_pfns exceeds the
> available buffer size.
> 

OK. If the vendor driver returns an error, will the user application
then retry with a smaller size?

Thanks,
Kirti


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device
  2019-06-21  9:22         ` Kirti Wankhede
  2019-06-21 10:45           ` Yan Zhao
@ 2019-06-24 19:00           ` Dr. David Alan Gilbert
  2019-06-26  0:43             ` Yan Zhao
  1 sibling, 1 reply; 64+ messages in thread
From: Dr. David Alan Gilbert @ 2019-06-24 19:00 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx@Alibaba-inc.com, Tian, Kevin, Liu, Yi L, cjia,
	eskultet, Yang, Ziye, cohuck, shuangtai.tst, qemu-devel, Wang,
	Zhi A, mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Yan Zhao, Liu, Changpeng, Ken.Xue

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> 
> 
> On 6/21/2019 2:16 PM, Yan Zhao wrote:
> > On Fri, Jun 21, 2019 at 04:02:50PM +0800, Kirti Wankhede wrote:
> >>
> >>
> >> On 6/21/2019 6:54 AM, Yan Zhao wrote:
> >>> On Fri, Jun 21, 2019 at 08:25:18AM +0800, Yan Zhao wrote:
> >>>> On Thu, Jun 20, 2019 at 10:37:28PM +0800, Kirti Wankhede wrote:
> >>>>> Add migration support for VFIO device
> >>>>>
> >>>>> This Patch set include patches as below:
> >>>>> - Define KABI for VFIO device for migration support.
> >>>>> - Added save and restore functions for PCI configuration space
> >>>>> - Generic migration functionality for VFIO device.
> >>>>>   * This patch set adds functionality only for PCI devices, but can be
> >>>>>     extended to other VFIO devices.
> >>>>>   * Added all the basic functions required for pre-copy, stop-and-copy and
> >>>>>     resume phases of migration.
> >>>>>   * Added state change notifier and from that notifier function, VFIO
> >>>>>     device's state changed is conveyed to VFIO device driver.
> >>>>>   * During save setup phase and resume/load setup phase, migration region
> >>>>>     is queried and is used to read/write VFIO device data.
> >>>>>   * .save_live_pending and .save_live_iterate are implemented to use QEMU's
> >>>>>     functionality of iteration during pre-copy phase.
> >>>>>   * In .save_live_complete_precopy, that is in stop-and-copy phase,
> >>>>>     iteration to read data from VFIO device driver is implemented till pending
> >>>>>     bytes returned by driver are not zero.
> >>>>>   * Added function to get dirty pages bitmap for the pages which are used by
> >>>>>     driver.
> >>>>> - Add vfio_listerner_log_sync to mark dirty pages.
> >>>>> - Make VFIO PCI device migration capable. If migration region is not provided by
> >>>>>   driver, migration is blocked.
> >>>>>
> >>>>> Below is the flow of state change for live migration where states in brackets
> >>>>> represent VM state, migration state and VFIO device state as:
> >>>>>     (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)
> >>>>>
> >>>>> Live migration save path:
> >>>>>         QEMU normal running state
> >>>>>         (RUNNING, _NONE, _RUNNING)
> >>>>>                         |
> >>>>>     migrate_init spawns migration_thread.
> >>>>>     (RUNNING, _SETUP, _RUNNING|_SAVING)
> >>>>>     Migration thread then calls each device's .save_setup()
> >>>>>                         |
> >>>>>     (RUNNING, _ACTIVE, _RUNNING|_SAVING)
> >>>>>     If device is active, get pending bytes by .save_live_pending()
> >>>>>     if pending bytes >= threshold_size,  call save_live_iterate()
> >>>>>     Data of VFIO device for pre-copy phase is copied.
> >>>>>     Iterate till pending bytes converge and are less than threshold
> >>>>>                         |
> >>>>>     On migration completion, vCPUs stops and calls .save_live_complete_precopy
> >>>>>     for each active device. VFIO device is then transitioned in
> >>>>>      _SAVING state.
> >>>>>     (FINISH_MIGRATE, _DEVICE, _SAVING)
> >>>>>     For VFIO device, iterate in  .save_live_complete_precopy  until
> >>>>>     pending data is 0.
> >>>>>     (FINISH_MIGRATE, _DEVICE, _STOPPED)
> >>>>
> >>>> I suggest we also register to VMStateDescription, whose .pre_save
> >>>> handler would get called after .save_live_complete_precopy in the
> >>>> pre-copy-only case, and will be called before .save_live_iterate in
> >>>> the post-copy-enabled case.
> >>>> In the .pre_save handler, we can save all device state which must be
> >>>> copied after device stop in source vm and before device start in target vm.
> >>>>
> >>> hi
> >>> to better describe this idea:
> >>>
> >>> in pre-copy only case, the flow is
> >>>
> >>> start migration --> .save_live_iterate (several round) -> stop source vm
> >>> --> .save_live_complete_precopy --> .pre_save  -->start target vm
> >>> -->migration complete
> >>>
> >>>
> >>> in post-copy enabled case, the flow is
> >>>
> >>> start migration --> .save_live_iterate (several round) --> start post copy --> 
> >>> stop source vm --> .pre_save --> start target vm --> .save_live_iterate (several round) 
> >>> -->migration complete
> >>>
> >>> Therefore, we should put saving of device state in .pre_save interface
> >>> rather than in .save_live_complete_precopy. 
> >>> The device state includes pci config data, page tables, register state, etc.
> >>>
> >>> The .save_live_iterate and .save_live_complete_precopy should only deal
> >>> with saving dirty memory.
> >>>
> >>
> >> Vendor driver can decide when to save device state depending on the VFIO
> >> device state set by user. Vendor driver doesn't have to depend on which
> >> callback function QEMU or user application calls. In pre-copy case,
> >> save_live_complete_precopy sets VFIO device state to
> >> VFIO_DEVICE_STATE_SAVING which means vCPUs are stopped and vendor driver
> >> should save all device state.
> >>
> > When post-copy stops the vCPUs and the VFIO device, the vendor driver
> > only needs to provide device state. But how does the vendor driver know
> > that, if no extra interface or no extra device state is provided?
> > 
> 
> .save_live_complete_postcopy interface for post-copy will get called,
> right?

That happens at the very end; I think the question here is for something
that gets called at the point we stop iteratively sending RAM, send the
device states and then start sending RAM on demand to the destination
as it's running. Typically we send a small set of device state
(registers etc) at this point.

I guess there are two different postcopy cases that we need to think
about:
  a) Where the VFIO device doesn't support postcopy - it just gets
  migrated like any other device, so all its RAM must get sent
  before we flip into postcopy mode.

  b) Where the VFIO device does support postcopy - where the pages
  get sent on demand.

(b) may be tricky depending on whether your hardware can fault
on pages of your RAM that are needed but not yet transferred; but
if you can, that would make life a lot more practical on really
big VFIO devices.

Dave

> Thanks,
> Kirti
> 
> >>>
> >>> I know the current implementation does not support post-copy, but at least
> >>> it should not require a huge change when we decide to enable it in future.
> >>>
> >>
> >> .has_postcopy and .save_live_complete_postcopy need to be implemented to
> >> support post-copy. I think .save_live_complete_postcopy should be
> >> similar to vfio_save_complete_precopy.
> >>
> >> Thanks,
> >> Kirti
> >>
> >>> Thanks
> >>> Yan
> >>>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH v4 01/13] vfio: KABI for migration interface
  2019-06-24 18:52                     ` Kirti Wankhede
@ 2019-06-24 19:01                       ` Alex Williamson
  2019-06-25 15:20                         ` Kirti Wankhede
  0 siblings, 1 reply; 64+ messages in thread
From: Alex Williamson @ 2019-06-24 19:01 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Tue, 25 Jun 2019 00:22:16 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 6/24/2019 8:55 PM, Alex Williamson wrote:
> > On Mon, 24 Jun 2019 20:30:08 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 6/22/2019 3:31 AM, Alex Williamson wrote:  
> >>> On Sat, 22 Jun 2019 02:00:08 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:    
> >>>> On 6/22/2019 1:30 AM, Alex Williamson wrote:    
> >>>>> On Sat, 22 Jun 2019 01:05:48 +0530
> >>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>       
> >>>>>> On 6/21/2019 8:33 PM, Alex Williamson wrote:      
> >>>>>>> On Fri, 21 Jun 2019 11:22:15 +0530
> >>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>>>         
> >>>>>>>> On 6/20/2019 10:48 PM, Alex Williamson wrote:        
> >>>>>>>>> On Thu, 20 Jun 2019 20:07:29 +0530
> >>>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>>>>>           
> >>>>>>>>>> - Defined MIGRATION region type and sub-type.
> >>>>>>>>>> - Used 3 bits to define VFIO device states.
> >>>>>>>>>>     Bit 0 => _RUNNING
> >>>>>>>>>>     Bit 1 => _SAVING
> >>>>>>>>>>     Bit 2 => _RESUMING
> >>>>>>>>>>     Combination of these bits defines VFIO device's state during migration
> >>>>>>>>>>     _STOPPED => All bits 0 indicates VFIO device stopped.
> >>>>>>>>>>     _RUNNING => Normal VFIO device running state.
> >>>>>>>>>>     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but start
> >>>>>>>>>>                           saving state of device i.e. pre-copy state
> >>>>>>>>>>     _SAVING  => vCPUs are stopped, VFIO device should be stopped and
> >>>>>>>>>>                           save device state, i.e. stop-and-copy state
> >>>>>>>>>>     _RESUMING => VFIO device resuming state.
> >>>>>>>>>>     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING bits are set
> >>>>>>>>>> - Defined vfio_device_migration_info structure which will be placed at 0th
> >>>>>>>>>>   offset of migration region to get/set VFIO device related information.
> >>>>>>>>>>   Defined members of structure and usage on read/write access:
> >>>>>>>>>>     * device_state: (read/write)
> >>>>>>>>>>         To convey VFIO device state to be transitioned to. Only 3 bits are used
> >>>>>>>>>>         as of now.
> >>>>>>>>>>     * pending bytes: (read only)
> >>>>>>>>>>         To get pending bytes yet to be migrated for VFIO device.
> >>>>>>>>>>     * data_offset: (read only)
> >>>>>>>>>>         To get data offset in migration from where data exist during _SAVING
> >>>>>>>>>>         and from where data should be written by user space application during
> >>>>>>>>>>          _RESUMING state
> >>>>>>>>>>     * data_size: (read/write)
> >>>>>>>>>>         To get and set size of data copied in migration region during _SAVING
> >>>>>>>>>>         and _RESUMING state.
> >>>>>>>>>>     * start_pfn, page_size, total_pfns: (write only)
> >>>>>>>>>>         To get bitmap of dirty pages from vendor driver from given
> >>>>>>>>>>         start address for total_pfns.
> >>>>>>>>>>     * copied_pfns: (read only)
> >>>>>>>>>>         To get number of pfns bitmap copied in migration region.
> >>>>>>>>>>         Vendor driver should copy the bitmap with bits set only for
> >>>>>>>>>>         pages to be marked dirty in migration region. Vendor driver
> >>>>>>>>>>         should return 0 if there are 0 pages dirty in requested
> >>>>>>>>>>         range. Vendor driver should return -1 to mark all pages in the section
> >>>>>>>>>>         as dirty
> >>>>>>>>>>
> >>>>>>>>>> Migration region looks like:
> >>>>>>>>>>  ------------------------------------------------------------------
> >>>>>>>>>> |vfio_device_migration_info|    data section                      |
> >>>>>>>>>> |                          |     ///////////////////////////////  |
> >>>>>>>>>>  ------------------------------------------------------------------
> >>>>>>>>>>  ^                              ^                              ^
> >>>>>>>>>>  offset 0-trapped part        data_offset                 data_size
> >>>>>>>>>>
> >>>>>>>>>> The vfio_device_migration_info structure is always followed by the
> >>>>>>>>>> data section in the region, so data_offset will always be non-0.
> >>>>>>>>>> The offset from where data is copied is decided by the kernel driver;
> >>>>>>>>>> the data section can be trapped or mapped depending on how the kernel
> >>>>>>>>>> driver defines it. If mmapped, then data_offset should be page
> >>>>>>>>>> aligned, whereas the initial section which contains the
> >>>>>>>>>> vfio_device_migration_info structure might not end at a page-aligned
> >>>>>>>>>> offset.
> >>>>>>>>>>
> >>>>>>>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>>>>>>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >>>>>>>>>> ---
> >>>>>>>>>>  linux-headers/linux/vfio.h | 71 ++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>>>>>  1 file changed, 71 insertions(+)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >>>>>>>>>> index 24f505199f83..274ec477eb82 100644
> >>>>>>>>>> --- a/linux-headers/linux/vfio.h
> >>>>>>>>>> +++ b/linux-headers/linux/vfio.h
> >>>>>>>>>> @@ -372,6 +372,77 @@ struct vfio_region_gfx_edid {
> >>>>>>>>>>   */
> >>>>>>>>>>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
> >>>>>>>>>>  
> >>>>>>>>>> +/* Migration region type and sub-type */
> >>>>>>>>>> +#define VFIO_REGION_TYPE_MIGRATION	        (2)
> >>>>>>>>>> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> >>>>>>>>>> +
> >>>>>>>>>> +/**
> >>>>>>>>>> + * Structure vfio_device_migration_info is placed at 0th offset of
> >>>>>>>>>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> >>>>>>>>>> + * information. Field accesses from this structure are only supported at their
> >>>>>>>>>> + * native width and alignment; otherwise accesses should return an error.
> >>>>>>>>>> + *
> >>>>>>>>>> + * device_state: (read/write)
> >>>>>>>>>> + *      To indicate to the vendor driver the state the VFIO device should be
> >>>>>>>>>> + *      transitioned to. If the state transition fails, a write to this field
> >>>>>>>>>> + *      returns an error. It consists of 3 bits:
> >>>>>>>>>> + *      - If bit 0 is set, it indicates the _RUNNING state. When it is cleared,
> >>>>>>>>>> + *        that indicates the _STOPPED state. When the device is changed to
> >>>>>>>>>> + *        _STOPPED, the driver should stop the device before the write returns.
> >>>>>>>>>> + *      - If bit 1 set, indicates _SAVING state.
> >>>>>>>>>> + *      - If bit 2 set, indicates _RESUMING state.
> >>>>>>>>>> + *
> >>>>>>>>>> + * pending bytes: (read only)
> >>>>>>>>>> + *      Read pending bytes yet to be migrated from vendor driver
> >>>>>>>>>> + *
> >>>>>>>>>> + * data_offset: (read only)
> >>>>>>>>>> + *      User application should read data_offset in migration region from where
> >>>>>>>>>> + *      user application should read data during _SAVING state or write data
> >>>>>>>>>> + *      during _RESUMING state.
> >>>>>>>>>> + *
> >>>>>>>>>> + * data_size: (read/write)
> >>>>>>>>>> + *      User application should read data_size to know data copied in migration
> >>>>>>>>>> + *      region during _SAVING state and write size of data copied in migration
> >>>>>>>>>> + *      region during _RESUMING state.
> >>>>>>>>>> + *
> >>>>>>>>>> + * start_pfn: (write only)
> >>>>>>>>>> + *      Start address pfn to get bitmap of dirty pages from vendor driver during
> >>>>>>>>>> + *      _SAVING state.
> >>>>>>>>>> + *
> >>>>>>>>>> + * page_size: (write only)
> >>>>>>>>>> + *      User application should write the page_size of pfn.
> >>>>>>>>>> + *
> >>>>>>>>>> + * total_pfns: (write only)
> >>>>>>>>>> + *      Total pfn count from start_pfn for which dirty bitmap is requested.
> >>>>>>>>>> + *
> >>>>>>>>>> + * copied_pfns: (read only)
> >>>>>>>>>> + *      pfn count for which dirty bitmap is copied to migration region.
> >>>>>>>>>> + *      Vendor driver should copy the bitmap with bits set only for pages to be
> >>>>>>>>>> + *      marked dirty in migration region.
> >>>>>>>>>> + *      Vendor driver should return 0 if there are 0 pages dirty in requested
> >>>>>>>>>> + *      range.
> >>>>>>>>>> + *      Vendor driver should return -1 to mark all pages in the section as
> >>>>>>>>>> + *      dirty.          
> >>>>>>>>>
> >>>>>>>>> Is the protocol that the user writes start_pfn/page_size/total_pfns in
> >>>>>>>>> any order and then the read of copied_pfns is what triggers the
> >>>>>>>>> snapshot?          
> >>>>>>>>
> >>>>>>>> Yes.
> >>>>>>>>        
> >>>>>>>>>  Are start_pfn/page_size/total_pfns sticky such that a user
> >>>>>>>>> can write them once and get repeated refreshes of the dirty bitmap by
> >>>>>>>>> re-reading copied_pfns?          
> >>>>>>>>
> >>>>>>>> Yes, and that bitmap should be for the given range (from start_pfn till
> >>>>>>>> start_pfn + total_pfns).
> >>>>>>>> Re-reading of copied_pfns handles the case where the vendor driver
> >>>>>>>> reserved an area for the bitmap that is smaller than the total bitmap
> >>>>>>>> size for the range (start_pfn to start_pfn + total_pfns); the user then
> >>>>>>>> has to iterate till copied_pfns == total_pfns or till copied_pfns == 0
> >>>>>>>> (that is, there are no dirty pages in the rest of the range)
> >>>>>>>
> >>>>>>> So reading copied_pfns triggers the data range to be updated, but the
> >>>>>>> caller cannot assume it to be synchronous and uses total_pfns to poll
> >>>>>>> that the update is complete?  How does the vendor driver differentiate
> >>>>>>> the user polling for the previous update to finish versus requesting a
> >>>>>>> new update?
> >>>>>>>         
> >>>>>>
> >>>>>> A write to start_pfn/page_size/total_pfns followed by a read of
> >>>>>> copied_pfns indicates a new update, whereas a sequential read of
> >>>>>> copied_pfns indicates polling for the previous update.
> >>>>>
> >>>>> Hmm, this seems to contradict the answer to my question above where I
> >>>>> ask if the write fields are sticky so a user can trigger a refresh via
> >>>>> copied_pfns.      
> >>>>
> >>>> Sorry, how does it contradict? Pasting it again below:
> >>>>>>>>>  Are start_pfn/page_size/total_pfns sticky such that a user
> >>>>>>>>> can write them once and get repeated refreshes of the dirty bitmap by
> >>>>>>>>> re-reading copied_pfns?      
> >>>>>>>>
> >>>>>>>> Yes, and that bitmap should be for the given range (from start_pfn till
> >>>>>>>> start_pfn + total_pfns).
> >>>>>>>> Re-reading of copied_pfns handles the case where the vendor driver
> >>>>>>>> reserved an area for the bitmap that is smaller than the total bitmap
> >>>>>>>> size for the range (start_pfn to start_pfn + total_pfns); the user then
> >>>>>>>> has to iterate till copied_pfns == total_pfns or till copied_pfns == 0
> >>>>>>>> (that is, there are no dirty pages in the rest of the range)
> >>>
> >>> Sorry, I guess I misinterpreted again.  So the vendor driver can return
> >>> copied_pfns < total_pfns if it has a buffer limitation, not as an
> >>> indication of its background progress in writing out the bitmap.  Just
> >>> as a proof of concept, let's say the vendor driver has a 1 bit buffer
> >>> and I write 0 to start_pfn and 3 to total_pfns.  I read copied_pfns,
> >>> which returns 1, so I read data_offset to find where this 1 bit is
> >>> located and then read my bit from that location.  This is the dirty
> >>> state of the first pfn.  I read copied_pfns again and it reports 2,    
> >>
> >> It should report 1 to indicate it has data for one pfn.
> >>  
> >>> I again read data_offset to find where the data is located, and it's my
> >>> job to remember that I've already read 1 bit, so 2 means there's only 1
> >>> bit available and it's the second pfn.    
> >>
> >> No.
> >> Here 'I' means the user application, right?
> > 
> > Yes
> >   
> >> The user application knows for how many pfns it has already received the
> >> bitmap, i.e. see 'count' in function vfio_get_dirty_page_list().
> >>
> >> Here copied_pfns is the number of pfns for which the bitmap is available
> >> in the buffer. The start address for that bitmap is then calculated by
> >> the user application as:
> >> ((start_pfn + count) * page_size)
> >>
> >> Then QEMU calls:
> >>
> >> cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
> >>                                        (start_pfn + count) * page_size,
> >>                                         copied_pfns);
> >>  
> >>>  I read the bit.  I again read
> >>> copied_pfns, which now reports 3, I read data_offset to find the
> >>> location of the data, I remember that I've already read 2 bits, so I
> >>> read my bit into the 3rd pfn.  This seems rather clumsy.
> >>>    
> >>
> >> Hope above explanation helps.  
> > 
> > Still seems rather clumsy; the knowledge of which bit(s) are available
> > in the buffer can only be known by foreknowledge of which bits have
> > already been read.  That seems error prone for both the user and the
> > vendor driver to stay in sync.
> >   
> >>> Now that copied_pfns == total_pfns, what happens if I read copied_pfns
> >>> again?  This is actually what I thought I was asking previously.
> >>>     
> >>
> >> It should return 0.  
> > 
> > Are we assuming no new pages have been dirtied?  What if pages have
> > been dirtied?
> >   
> >>> Should we expose the pfn buffer size and fault on writes larger than that
> >>> size, requiring the user to iterate start_pfn themselves?    
> >>
> >> Who should fault, vendor driver or user application?
> >>
> >> Here Vendor driver is writing data to data section.
> >> In the steps in this patch-set, user application is incrementing
> >> start_pfn by adding copied_pfn count.  
> > 
> > The user app is writing total_pfns to get a range, correct?  The vendor
> > driver could return errno on that write if total_pfns exceeds the
> > available buffer size.
> >   
> 
> OK. If the vendor driver returns an error, will the user application then
> retry with a smaller size?

I think we'd need to improve the header to indicate the available size,
it would seem unreasonable to me to require the user to guess how much
is available.  Thanks,

Alex
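
For illustration, here is a minimal sketch of the chunked loop this
protocol implies, assuming the v4 field layout described in this thread
plus a hypothetical read-only field, bitmap_buffer_pfns, carrying the
size hint suggested above (neither the struct below nor that field is
the actual KABI):

    /* Sketch only; assumes <stdint.h>, <stddef.h>, <unistd.h>.
     * Field layout assumed from the v4 discussion. */
    struct mig_info {
        uint32_t device_state;
        uint32_t reserved;
        uint64_t pending_bytes;
        uint64_t data_offset;
        uint64_t data_size;
        uint64_t start_pfn;
        uint64_t page_size;
        uint64_t total_pfns;
        uint64_t copied_pfns;
        uint64_t bitmap_buffer_pfns;   /* hypothetical size hint */
    };

    /* User-side loop; error handling elided for brevity */
    void get_dirty_bitmap(int fd, off_t info, uint64_t start_pfn,
                          uint64_t total_pfns, uint64_t page_size)
    {
        uint64_t buf_pfns, copied, done = 0;

        pread(fd, &buf_pfns, sizeof(buf_pfns),
              info + offsetof(struct mig_info, bitmap_buffer_pfns));

        while (done < total_pfns) {
            uint64_t pfn = start_pfn + done;
            uint64_t chunk = total_pfns - done;

            if (chunk > buf_pfns)
                chunk = buf_pfns;      /* never exceed the buffer */

            pwrite(fd, &pfn, sizeof(pfn),
                   info + offsetof(struct mig_info, start_pfn));
            pwrite(fd, &page_size, sizeof(page_size),
                   info + offsetof(struct mig_info, page_size));
            pwrite(fd, &chunk, sizeof(chunk),
                   info + offsetof(struct mig_info, total_pfns));
            /* this read triggers the snapshot for the current chunk */
            pread(fd, &copied, sizeof(copied),
                  info + offsetof(struct mig_info, copied_pfns));
            if (copied == 0)
                break;                 /* nothing dirty in the rest */
            /* read data_offset, then the bitmap, mark pages dirty... */
            done += copied;
        }
    }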


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH v4 08/13] vfio: Add save state functions to SaveVMHandlers
  2019-06-21  0:31   ` Yan Zhao
@ 2019-06-25  3:30     ` Yan Zhao
  2019-06-28  8:50       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 64+ messages in thread
From: Yan Zhao @ 2019-06-25  3:30 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pasic, Tian, Kevin, Liu, Yi L, cjia, Ken.Xue, eskultet, Yang,
	Ziye, qemu-devel, Zhengxiao.zx, shuangtai.tst, dgilbert,
	mlevitsk, yulei.zhang, aik, alex.williamson, eauger, cohuck,
	jonathan.davies, felipe, Liu, Changpeng, Wang, Zhi A

On Fri, Jun 21, 2019 at 08:31:53AM +0800, Yan Zhao wrote:
> On Thu, Jun 20, 2019 at 10:37:36PM +0800, Kirti Wankhede wrote:
> > Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
> > functions. These functions handle the pre-copy and stop-and-copy phases.
> > 
> > In _SAVING|_RUNNING device state or pre-copy phase:
> > - read pending_bytes
> > - read data_offset - instructs the kernel driver to write data to the staging
> >   buffer, which is mmapped.
> > - read data_size - the amount of data in bytes written by the vendor driver
> >   in the migration region.
> > - if data section is trapped, pread() number of bytes in data_size, from
> >   data_offset.
> > - if data section is mmaped, read mmaped buffer of size data_size.
> > - Write data packet to file stream as below:
> > {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> > VFIO_MIG_FLAG_END_OF_STATE }
> > 
> > In _SAVING device state or stop-and-copy phase
> > a. read config space of device and save to migration file stream. This
> >    doesn't need to be from vendor driver. Any other special config state
> >    from driver can be saved as data in following iteration.
> > b. read pending_bytes - instructs the kernel driver to write data to the
> >    staging buffer, which is mmapped.
> > c. read data_size - amount of data in bytes written by vendor driver in
> >    migration region.
> > d. if data section is trapped, pread() from data_offset of size data_size.
> > e. if data section is mmaped, read mmaped buffer of size data_size.
> > f. Write data packet as below:
> >    {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> > g. iterate through steps b to f while (pending_bytes > 0)
> > h. Write {VFIO_MIG_FLAG_END_OF_STATE}
> > 
> > .save_live_iterate runs outside the iothread lock in the migration case, which
> > could race with the asynchronous call to get the dirty page list, causing data
> > corruption in the mapped migration region. A mutex is added here to serialize
> > migration buffer read operations.
> > 
> > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > ---
> >  hw/vfio/migration.c | 212 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 212 insertions(+)
> > 
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index fe0887c27664..0a2f30872316 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -107,6 +107,111 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
> >      return 0;
> >  }
> >  
> > +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
> > +{
> > +    VFIOMigration *migration = vbasedev->migration;
> > +    VFIORegion *region = &migration->region.buffer;
> > +    uint64_t data_offset = 0, data_size = 0;
> > +    int ret;
> > +
> > +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> > +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> > +                                             data_offset));
> > +    if (ret != sizeof(data_offset)) {
> > +        error_report("Failed to get migration buffer data offset %d",
> > +                     ret);
> > +        return -EINVAL;
> > +    }
> > +
> > +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
> > +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> > +                                             data_size));
> > +    if (ret != sizeof(data_size)) {
> > +        error_report("Failed to get migration buffer data size %d",
> > +                     ret);
> > +        return -EINVAL;
> > +    }
> > +
> How big is data_size?
> If this size is too big, it may take too much time and block others.
> 
> > +    if (data_size > 0) {
> > +        void *buf = NULL;
> > +        bool buffer_mmaped = false;
> > +
> > +        if (region->mmaps) {
> > +            int i;
> > +
> > +            for (i = 0; i < region->nr_mmaps; i++) {
> > +                if ((data_offset >= region->mmaps[i].offset) &&
> > +                    (data_offset < region->mmaps[i].offset +
> > +                                   region->mmaps[i].size)) {
> > +                    buf = region->mmaps[i].mmap + (data_offset -
> > +                                                   region->mmaps[i].offset);
> > +                    buffer_mmaped = true;
> > +                    break;
> > +                }
> > +            }
> > +        }
> > +
> > +        if (!buffer_mmaped) {
> > +            buf = g_malloc0(data_size);
> > +            ret = pread(vbasedev->fd, buf, data_size,
> > +                        region->fd_offset + data_offset);
> > +            if (ret != data_size) {
> > +                error_report("Failed to get migration data %d", ret);
> > +                g_free(buf);
> > +                return -EINVAL;
> > +            }
> > +        }
> > +
> > +        qemu_put_be64(f, data_size);
> > +        qemu_put_buffer(f, buf, data_size);
> > +
> > +        if (!buffer_mmaped) {
> > +            g_free(buf);
> > +        }
> > +        migration->pending_bytes -= data_size;
> > +    } else {
> > +        qemu_put_be64(f, data_size);
> > +    }
> > +
> > +    ret = qemu_file_get_error(f);
> > +
> > +    return data_size;
> > +}
> > +
> > +static int vfio_update_pending(VFIODevice *vbasedev)
> > +{
> > +    VFIOMigration *migration = vbasedev->migration;
> > +    VFIORegion *region = &migration->region.buffer;
> > +    uint64_t pending_bytes = 0;
> > +    int ret;
> > +
> > +    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
> > +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> > +                                             pending_bytes));
> > +    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
> > +        error_report("Failed to get pending bytes %d", ret);
> > +        migration->pending_bytes = 0;
> > +        return (ret < 0) ? ret : -EINVAL;
> > +    }
> > +
> > +    migration->pending_bytes = pending_bytes;
> > +    return 0;
> > +}
> > +
> > +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> > +{
> > +    VFIODevice *vbasedev = opaque;
> > +
> > +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
> > +
> > +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> > +        vfio_pci_save_config(vbasedev, f);
> > +    }
> > +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> > +
> > +    return qemu_file_get_error(f);
> > +}
> > +
> >  /* ---------------------------------------------------------------------- */
> >  
> >  static int vfio_save_setup(QEMUFile *f, void *opaque)
> > @@ -163,9 +268,116 @@ static void vfio_save_cleanup(void *opaque)
> >      }
> >  }
> >  
> > +static void vfio_save_pending(QEMUFile *f, void *opaque,
> > +                              uint64_t threshold_size,
> > +                              uint64_t *res_precopy_only,
> > +                              uint64_t *res_compatible,
> > +                              uint64_t *res_postcopy_only)
> > +{
> > +    VFIODevice *vbasedev = opaque;
> > +    VFIOMigration *migration = vbasedev->migration;
> > +    int ret;
> > +
> > +    ret = vfio_update_pending(vbasedev);
> > +    if (ret) {
> > +        return;
> > +    }
> > +
> > +    if (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING) {
> > +        *res_precopy_only += migration->pending_bytes;
> > +    } else {
> > +        *res_postcopy_only += migration->pending_bytes;
> > +    }
By definition,
- res_precopy_only is for data which must be migrated in the precopy phase
  or in the stopped state, in other words - before the target VM starts
- res_postcopy_only is for data which must be migrated in the postcopy phase
  or in the stopped state, in other words - after the source VM stops
So we can only determine the data type by the nature of the data, i.e. if
it is device state data which must be copied after the source VM stops and
before the target VM starts, it belongs to res_precopy_only.

It is not right to determine the data type by the current device state.

Thanks
Yan

> > +    *res_compatible += 0;
> > +}
> > +
> > +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> > +{
> > +    VFIODevice *vbasedev = opaque;
> > +    VFIOMigration *migration = vbasedev->migration;
> > +    int ret;
> > +
> > +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> > +
> > +    qemu_mutex_lock(&migration->lock);
> > +    ret = vfio_save_buffer(f, vbasedev);
> > +    qemu_mutex_unlock(&migration->lock);
> > +
> > +    if (ret < 0) {
> > +        error_report("vfio_save_buffer failed %s",
> > +                     strerror(errno));
> > +        return ret;
> > +    }
> > +
> > +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> > +
> > +    ret = qemu_file_get_error(f);
> > +    if (ret) {
> > +        return ret;
> > +    }
> > +
> > +    return ret;
> > +}
> > +
> > +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> > +{
> > +    VFIODevice *vbasedev = opaque;
> > +    VFIOMigration *migration = vbasedev->migration;
> > +    int ret;
> > +
> > +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_SAVING);
> > +    if (ret) {
> > +        error_report("Failed to set state STOP and SAVING");
> > +        return ret;
> > +    }
> > +
> > +    ret = vfio_save_device_config_state(f, opaque);
> > +    if (ret) {
> > +        return ret;
> > +    }
> > +
> > +    ret = vfio_update_pending(vbasedev);
> > +    if (ret) {
> > +        return ret;
> > +    }
> > +
> > +    while (migration->pending_bytes > 0) {
> > +        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> > +        ret = vfio_save_buffer(f, vbasedev);
> > +        if (ret < 0) {
> > +            error_report("Failed to save buffer");
> > +            return ret;
> > +        } else if (ret == 0) {
> > +            break;
> > +        }
> > +
> > +        ret = vfio_update_pending(vbasedev);
> > +        if (ret) {
> > +            return ret;
> > +        }
> > +    }
> > +
> > +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> > +
> > +    ret = qemu_file_get_error(f);
> > +    if (ret) {
> > +        return ret;
> > +    }
> > +
> > +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOPPED);
> > +    if (ret) {
> > +        error_report("Failed to set state STOPPED");
> > +        return ret;
> > +    }
> > +    return ret;
> > +}
> > +
> >  static SaveVMHandlers savevm_vfio_handlers = {
> >      .save_setup = vfio_save_setup,
> >      .save_cleanup = vfio_save_cleanup,
> > +    .save_live_pending = vfio_save_pending,
> > +    .save_live_iterate = vfio_save_iterate,
> > +    .save_live_complete_precopy = vfio_save_complete_precopy,
> >  };
> >  
> >  /* ---------------------------------------------------------------------- */
> > -- 
> > 2.7.0
> > 
> 
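
A sketch of vfio_save_pending reworked along the lines Yan suggests,
classifying by the nature of the data rather than by the current device
state (illustrative only, reusing this series' helpers):

    static void vfio_save_pending(QEMUFile *f, void *opaque,
                                  uint64_t threshold_size,
                                  uint64_t *res_precopy_only,
                                  uint64_t *res_compatible,
                                  uint64_t *res_postcopy_only)
    {
        VFIODevice *vbasedev = opaque;
        VFIOMigration *migration = vbasedev->migration;

        if (vfio_update_pending(vbasedev)) {
            return;
        }

        /*
         * Device state must reach the destination before the target VM
         * starts, so it is precopy-only by nature, independent of
         * whether the device currently happens to be _RUNNING.
         */
        *res_precopy_only += migration->pending_bytes;
    }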


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH v4 05/13] vfio: Add VM state change handler to know state of VM
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 05/13] vfio: Add VM state change handler to know state of VM Kirti Wankhede
@ 2019-06-25 10:29   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 64+ messages in thread
From: Dr. David Alan Gilbert @ 2019-06-25 10:29 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	yulei.zhang, cohuck, shuangtai.tst, qemu-devel, zhi.a.wang,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, yan.y.zhao, changpeng.liu, Ken.Xue

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> VM state change handler gets called on change in VM's state. This is used to set
> VFIO device state to _RUNNING.
> VM state change handler, migration state change handler and log_sync listener
> are called asynchronously, which sometimes leads to data corruption in the
> migration region. Initialised a mutex that is used to serialize operations on
> the migration data region during saving state.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>


> ---
>  hw/vfio/migration.c           | 45 +++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |  4 ++++
>  2 files changed, 49 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index ba58d9253d26..15af218c23d1 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -77,6 +77,41 @@ err:
>      return ret;
>  }
>  
> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    int ret = 0;
> +
> +    ret = pwrite(vbasedev->fd, &state, sizeof(state),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              device_state));
> +    if (ret < 0) {
> +        error_report("Failed to set migration state %d %s",
> +                     ret, strerror(errno));

Please include the device name/id in errors; it just makes it easier to
figure out the cause when someone sends me a log.

Other than that;


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> +        return ret;
> +    }
> +
> +    vbasedev->device_state = state;
> +    return 0;
> +}
> +
> +static void vfio_vmstate_change(void *opaque, int running, RunState state)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    if ((vbasedev->vm_running != running) && running) {
> +        int ret;
> +
> +        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING);
> +        if (ret) {
> +            error_report("Failed to set state RUNNING");
> +        }
> +    }
> +
> +    vbasedev->vm_running = running;
> +}
> +
>  static int vfio_migration_init(VFIODevice *vbasedev,
>                                 struct vfio_region_info *info)
>  {
> @@ -91,6 +126,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>          return ret;
>      }
>  
> +    qemu_mutex_init(&vbasedev->migration->lock);
> +
> +    vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
> +                                                          vbasedev);
> +
>      return 0;
>  }
>  
> @@ -127,11 +167,16 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
>          return;
>      }
>  
> +    if (vbasedev->vm_state) {
> +        qemu_del_vm_change_state_handler(vbasedev->vm_state);
> +    }
> +
>      if (vbasedev->migration_blocker) {
>          migrate_del_blocker(vbasedev->migration_blocker);
>          error_free(vbasedev->migration_blocker);
>      }
>  
> +    qemu_mutex_destroy(&vbasedev->migration->lock);
>      vfio_migration_region_exit(vbasedev);
>      g_free(vbasedev->migration);
>  }
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 1374a03470d8..f2392e97fa57 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -29,6 +29,7 @@
>  #ifdef CONFIG_LINUX
>  #include <linux/vfio.h>
>  #endif
> +#include "sysemu/sysemu.h"
>  
>  #define VFIO_MSG_PREFIX "vfio %s: "
>  
> @@ -129,6 +130,9 @@ typedef struct VFIODevice {
>      unsigned int flags;
>      VFIOMigration *migration;
>      Error *migration_blocker;
> +    uint32_t device_state;
> +    VMChangeStateEntry *vm_state;
> +    int vm_running;
>  } VFIODevice;
>  
>  struct VFIODeviceOps {
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
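
A sketch of the error report with the device identity included, as
requested above (assuming VFIODevice's existing 'name' field):

    if (ret < 0) {
        error_report("%s: Failed to set migration state %d %s",
                     vbasedev->name, ret, strerror(errno));
        return ret;
    }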


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH v4 01/13] vfio: KABI for migration interface
  2019-06-24 19:01                       ` Alex Williamson
@ 2019-06-25 15:20                         ` Kirti Wankhede
  0 siblings, 0 replies; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-25 15:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang, Ken.Xue,
	Zhengxiao.zx, shuangtai.tst, qemu-devel, dgilbert, pasic, aik,
	eauger, cohuck, jonathan.davies, felipe, mlevitsk, changpeng.liu,
	zhi.a.wang, yan.y.zhao



On 6/25/2019 12:31 AM, Alex Williamson wrote:
> On Tue, 25 Jun 2019 00:22:16 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 6/24/2019 8:55 PM, Alex Williamson wrote:
>>> On Mon, 24 Jun 2019 20:30:08 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>   
>>>> On 6/22/2019 3:31 AM, Alex Williamson wrote:  
>>>>> On Sat, 22 Jun 2019 02:00:08 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:    
>>>>>> On 6/22/2019 1:30 AM, Alex Williamson wrote:    
>>>>>>> On Sat, 22 Jun 2019 01:05:48 +0530
>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>       
>>>>>>>> On 6/21/2019 8:33 PM, Alex Williamson wrote:      
>>>>>>>>> On Fri, 21 Jun 2019 11:22:15 +0530
>>>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>>>         
>>>>>>>>>> On 6/20/2019 10:48 PM, Alex Williamson wrote:        
>>>>>>>>>>> On Thu, 20 Jun 2019 20:07:29 +0530
>>>>>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>>>>>           
>>>>>>>>>>>> - Defined MIGRATION region type and sub-type.
>>>>>>>>>>>> - Used 3 bits to define VFIO device states.
>>>>>>>>>>>>     Bit 0 => _RUNNING
>>>>>>>>>>>>     Bit 1 => _SAVING
>>>>>>>>>>>>     Bit 2 => _RESUMING
>>>>>>>>>>>>     Combination of these bits defines VFIO device's state during migration
>>>>>>>>>>>>     _STOPPED => All bits 0 indicates VFIO device stopped.
>>>>>>>>>>>>     _RUNNING => Normal VFIO device running state.
>>>>>>>>>>>>     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but start
>>>>>>>>>>>>                           saving state of device i.e. pre-copy state
> >>>>>>>>>>>>     _SAVING  => vCPUs are stopped, VFIO device should be stopped, and
>>>>>>>>>>>>                           save device state,i.e. stop-n-copy state
>>>>>>>>>>>>     _RESUMING => VFIO device resuming state.
>>>>>>>>>>>>     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING bits are set
>>>>>>>>>>>> - Defined vfio_device_migration_info structure which will be placed at 0th
>>>>>>>>>>>>   offset of migration region to get/set VFIO device related information.
>>>>>>>>>>>>   Defined members of structure and usage on read/write access:
>>>>>>>>>>>>     * device_state: (read/write)
>>>>>>>>>>>>         To convey VFIO device state to be transitioned to. Only 3 bits are used
>>>>>>>>>>>>         as of now.
>>>>>>>>>>>>     * pending bytes: (read only)
>>>>>>>>>>>>         To get pending bytes yet to be migrated for VFIO device.
>>>>>>>>>>>>     * data_offset: (read only)
>>>>>>>>>>>>         To get data offset in migration from where data exist during _SAVING
>>>>>>>>>>>>         and from where data should be written by user space application during
>>>>>>>>>>>>          _RESUMING state
>>>>>>>>>>>>     * data_size: (read/write)
>>>>>>>>>>>>         To get and set size of data copied in migration region during _SAVING
>>>>>>>>>>>>         and _RESUMING state.
>>>>>>>>>>>>     * start_pfn, page_size, total_pfns: (write only)
>>>>>>>>>>>>         To get bitmap of dirty pages from vendor driver from given
>>>>>>>>>>>>         start address for total_pfns.
>>>>>>>>>>>>     * copied_pfns: (read only)
>>>>>>>>>>>>         To get number of pfns bitmap copied in migration region.
>>>>>>>>>>>>         Vendor driver should copy the bitmap with bits set only for
>>>>>>>>>>>>         pages to be marked dirty in migration region. Vendor driver
>>>>>>>>>>>>         should return 0 if there are 0 pages dirty in requested
>>>>>>>>>>>>         range. Vendor driver should return -1 to mark all pages in the section
>>>>>>>>>>>>         as dirty
>>>>>>>>>>>>
>>>>>>>>>>>> Migration region looks like:
>>>>>>>>>>>>  ------------------------------------------------------------------
>>>>>>>>>>>> |vfio_device_migration_info|    data section                      |
>>>>>>>>>>>> |                          |     ///////////////////////////////  |
>>>>>>>>>>>>  ------------------------------------------------------------------
>>>>>>>>>>>>  ^                              ^                              ^
>>>>>>>>>>>>  offset 0-trapped part        data_offset                 data_size
>>>>>>>>>>>>
> >>>>>>>>>>>> The data section always follows the vfio_device_migration_info
> >>>>>>>>>>>> structure in the region, so data_offset will always be non-zero.
> >>>>>>>>>>>> The offset from where data is copied is decided by the kernel
> >>>>>>>>>>>> driver; the data section can be trapped or mapped depending on
> >>>>>>>>>>>> how the kernel driver defines it. If mmapped, data_offset should
> >>>>>>>>>>>> be page aligned, whereas the initial section containing the
> >>>>>>>>>>>> vfio_device_migration_info structure might not end at an offset
> >>>>>>>>>>>> which is page aligned.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>>>>>>>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>  linux-headers/linux/vfio.h | 71 ++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>  1 file changed, 71 insertions(+)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>>>>>>>>>>>> index 24f505199f83..274ec477eb82 100644
>>>>>>>>>>>> --- a/linux-headers/linux/vfio.h
>>>>>>>>>>>> +++ b/linux-headers/linux/vfio.h
>>>>>>>>>>>> @@ -372,6 +372,77 @@ struct vfio_region_gfx_edid {
>>>>>>>>>>>>   */
>>>>>>>>>>>>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
>>>>>>>>>>>>  
>>>>>>>>>>>> +/* Migration region type and sub-type */
>>>>>>>>>>>> +#define VFIO_REGION_TYPE_MIGRATION	        (2)
>>>>>>>>>>>> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
>>>>>>>>>>>> +
>>>>>>>>>>>> +/**
>>>>>>>>>>>> + * Structure vfio_device_migration_info is placed at 0th offset of
>>>>>>>>>>>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
>>>>>>>>>>>> + * information. Field accesses from this structure are only supported at their
> >>>>>>>>>>>> + * native width and alignment; otherwise the access should return an error.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * device_state: (read/write)
> >>>>>>>>>>>> + *      To indicate to the vendor driver the state the VFIO device should be
> >>>>>>>>>>>> + *      transitioned to. If the device state transition fails, a write to this
> >>>>>>>>>>>> + *      field returns an error.
>>>>>>>>>>>> + *      It consists of 3 bits:
> >>>>>>>>>>>> + *      - If bit 0 is set, it indicates the _RUNNING state. When it is cleared,
> >>>>>>>>>>>> + *        that indicates the _STOPPED state. When the device is changed to
> >>>>>>>>>>>> + *        _STOPPED, the driver should stop the device before the write returns.
> >>>>>>>>>>>> + *      - If bit 1 is set, it indicates the _SAVING state.
> >>>>>>>>>>>> + *      - If bit 2 is set, it indicates the _RESUMING state.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * pending bytes: (read only)
>>>>>>>>>>>> + *      Read pending bytes yet to be migrated from vendor driver
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * data_offset: (read only)
> >>>>>>>>>>>> + *      The user application should read data_offset in the migration region;
> >>>>>>>>>>>> + *      it gives the offset from where data is read during the _SAVING state
> >>>>>>>>>>>> + *      or written during the _RESUMING state.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * data_size: (read/write)
>>>>>>>>>>>> + *      User application should read data_size to know data copied in migration
>>>>>>>>>>>> + *      region during _SAVING state and write size of data copied in migration
>>>>>>>>>>>> + *      region during _RESUMING state.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * start_pfn: (write only)
> >>>>>>>>>>>> + *      Start address pfn from which to get the bitmap of dirty pages from the
> >>>>>>>>>>>> + *      vendor driver during the _SAVING state.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * page_size: (write only)
>>>>>>>>>>>> + *      User application should write the page_size of pfn.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * total_pfns: (write only)
>>>>>>>>>>>> + *      Total pfn count from start_pfn for which dirty bitmap is requested.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * copied_pfns: (read only)
>>>>>>>>>>>> + *      pfn count for which dirty bitmap is copied to migration region.
>>>>>>>>>>>> + *      Vendor driver should copy the bitmap with bits set only for pages to be
>>>>>>>>>>>> + *      marked dirty in migration region.
>>>>>>>>>>>> + *      Vendor driver should return 0 if there are 0 pages dirty in requested
>>>>>>>>>>>> + *      range.
>>>>>>>>>>>> + *      Vendor driver should return -1 to mark all pages in the section as
>>>>>>>>>>>> + *      dirty.          
>>>>>>>>>>>
>>>>>>>>>>> Is the protocol that the user writes start_pfn/page_size/total_pfns in
>>>>>>>>>>> any order and then the read of copied_pfns is what triggers the
>>>>>>>>>>> snapshot?          
>>>>>>>>>>
>>>>>>>>>> Yes.
>>>>>>>>>>        
>>>>>>>>>>>  Are start_pfn/page_size/total_pfns sticky such that a user
>>>>>>>>>>> can write them once and get repeated refreshes of the dirty bitmap by
>>>>>>>>>>> re-reading copied_pfns?          
>>>>>>>>>>
> >>>>>>>>>> Yes and that bitmap should be for the given range (from start_pfn till
> >>>>>>>>>> start_pfn + total_pfns).
> >>>>>>>>>> Re-reading copied_pfns handles the case where the vendor driver's
> >>>>>>>>>> reserved bitmap area is smaller than the total bitmap size for the
> >>>>>>>>>> range (start_pfn to start_pfn + total_pfns); the user will then have
> >>>>>>>>>> to iterate till copied_pfns == total_pfns or till copied_pfns == 0
> >>>>>>>>>> (that is, there are no pages dirty in the rest of the range)
>>>>>>>>>
>>>>>>>>> So reading copied_pfns triggers the data range to be updated, but the
>>>>>>>>> caller cannot assume it to be synchronous and uses total_pfns to poll
>>>>>>>>> that the update is complete?  How does the vendor driver differentiate
>>>>>>>>> the user polling for the previous update to finish versus requesting a
>>>>>>>>> new update?
>>>>>>>>>         
>>>>>>>>
> >>>>>>>> A write to start_pfn/page_size/total_pfns, then a read of copied_pfns
> >>>>>>>> indicates a new update, whereas a sequential read of copied_pfns
> >>>>>>>> indicates polling for the previous update.
>>>>>>>
>>>>>>> Hmm, this seems to contradict the answer to my question above where I
>>>>>>> ask if the write fields are sticky so a user can trigger a refresh via
>>>>>>> copied_pfns.      
>>>>>>
> >>>>>> Sorry, how does it contradict? Pasting it again below:
>>>>>>>>>>>  Are start_pfn/page_size/total_pfns sticky such that a user
>>>>>>>>>>> can write them once and get repeated refreshes of the dirty bitmap by
>>>>>>>>>>> re-reading copied_pfns?      
>>>>>>>>>>
> >>>>>>>>>> Yes and that bitmap should be for the given range (from start_pfn till
> >>>>>>>>>> start_pfn + total_pfns).
> >>>>>>>>>> Re-reading copied_pfns handles the case where the vendor driver's
> >>>>>>>>>> reserved bitmap area is smaller than the total bitmap size for the
> >>>>>>>>>> range (start_pfn to start_pfn + total_pfns); the user will then have
> >>>>>>>>>> to iterate till copied_pfns == total_pfns or till copied_pfns == 0
> >>>>>>>>>> (that is, there are no pages dirty in the rest of the range)
>>>>>
>>>>> Sorry, I guess I misinterpreted again.  So the vendor driver can return
>>>>> copied_pfns < total_pfns if it has a buffer limitation, not as an
>>>>> indication of its background progress in writing out the bitmap.  Just
>>>>> as a proof of concept, let's say the vendor driver has a 1 bit buffer
>>>>> and I write 0 to start_pfn and 3 to total_pfns.  I read copied_pfns,
>>>>> which returns 1, so I read data_offset to find where this 1 bit is
>>>>> located and then read my bit from that location.  This is the dirty
>>>>> state of the first pfn.  I read copied_pfns again and it reports 2,    
>>>>
> >>>> It should report 1 to indicate it has data for one pfn.
>>>>  
>>>>> I again read data_offset to find where the data is located, and it's my
>>>>> job to remember that I've already read 1 bit, so 2 means there's only 1
>>>>> bit available and it's the second pfn.    
>>>>
>>>> No.
>>>> Here 'I' means User application, right?  
>>>
>>> Yes
>>>   
> >>>> The user application knows for how many pfns it has already received the
> >>>> bitmap, i.e. see 'count' in function vfio_get_dirty_page_list().
>>>>
>>>> Here copied_pfns is the number of pfns for which bitmap is available in
>>>> buffer. Start address for that bitmap is then calculated by user
>>>> application as :
>>>> ((start_pfn + count) * page_size)
>>>>
>>>> Then QEMU calls:
>>>>
>>>> cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
>>>>                                        (start_pfn + count) * page_size,
>>>>                                         copied_pfns);
>>>>  
>>>>>  I read the bit.  I again read
>>>>> copied_pfns, which now reports 3, I read data_offset to find the
>>>>> location of the data, I remember that I've already read 2 bits, so I
>>>>> read my bit into the 3rd pfn.  This seems rather clumsy.
>>>>>    
>>>>
>>>> Hope above explanation helps.  
>>>
> >>> Still seems rather clumsy; the knowledge of which bit(s) are available
>>> in the buffer can only be known by foreknowledge of which bits have
>>> already been read.  That seems error prone for both the user and the
>>> vendor driver to stay in sync.
>>>   
>>>>> Now that copied_pfns == total_pfns, what happens if I read copied_pfns
>>>>> again?  This is actually what I thought I was asking previously.
>>>>>     
>>>>
>>>> It should return 0.  
>>>
>>> Are we assuming no new pages have been dirtied?  What if pages have
>>> been dirtied?
>>>   
> >>>>> Should we expose the pfn buffer size and fault on writes larger than that
>>>>> size, requiring the user to iterate start_pfn themselves?    
>>>>
>>>> Who should fault, vendor driver or user application?
>>>>
>>>> Here Vendor driver is writing data to data section.
>>>> In the steps in this patch-set, user application is incrementing
>>>> start_pfn by adding copied_pfn count.  
>>>
>>> The user app is writing total_pfns to get a range, correct?  The vendor
>>> driver could return errno on that write if total_pfns exceeds the
>>> available buffer size.
>>>   
>>
>> OK. If the vendor driver returns an error, will the user application then
>> retry with a smaller size?
> 
> I think we'd need to improve the header to indicate the available size,
> it would seem unreasonable to me to require the user to guess how much
> is available.  Thanks,
> 

Instead of returning an error on the write to total_pfns, how about writing
an updated start_pfn and total_pfns on each iteration, as below:

count = 0
while (total_pfns > 0) {
    write(start_pfn + count)
    write(page_size)
    write(total_pfns)
    read(copied_pfns)
    read(data_offset)
    read bitmap from data_offset for copied_pfns and mark pages dirty
    count += copied_pfns
    total_pfns -= copied_pfns
}

Thanks,
Kirti
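
A worked example of that loop, assuming the vendor driver's bitmap buffer
holds 8 pfns and the caller asks for start_pfn = 0, total_pfns = 20:

    iteration 1: write start_pfn 0,  total_pfns 20 -> copied_pfns 8
    iteration 2: write start_pfn 8,  total_pfns 12 -> copied_pfns 8
    iteration 3: write start_pfn 16, total_pfns 4  -> copied_pfns 4
    total_pfns reaches 0 and the loop exits

so each iteration re-anchors the window, and the vendor driver never has
to remember which bits the user has already consumed.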


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH v4 10/13] vfio: Add function to get dirty page list
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 10/13] vfio: Add function to get dirty page list Kirti Wankhede
@ 2019-06-26  0:40   ` Yan Zhao
  0 siblings, 0 replies; 64+ messages in thread
From: Yan Zhao @ 2019-06-26  0:40 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	yulei.zhang, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, qemu-devel,
	felipe, jonathan.davies, Liu, Changpeng, Ken.Xue

On Thu, Jun 20, 2019 at 10:37:38PM +0800, Kirti Wankhede wrote:
> Dirty page tracking (.log_sync) is part of the RAM copying state, where the
> vendor driver provides the bitmap of pages which are dirtied by the vendor
> driver, through the migration region; as part of the RAM copy, those pages
> get copied to the file stream.
> 
> To get dirty page bitmap:
> - write start address, page_size and pfn count.
> - read count of pfns copied.
>     - Vendor driver should return 0 if driver doesn't have any page to
>       report dirty in given range.
>     - Vendor driver should return -1 to mark all pages dirty for given range.
> - read data_offset, where vendor driver has written bitmap.
> - read bitmap from the region or the mmaped part of the region. This copy is
>   iterated till the page bitmap for all requested pfns is copied.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c           | 119 ++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |   2 +
>  2 files changed, 121 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index e4895f91761d..68775b5dec11 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -228,6 +228,125 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>      return qemu_file_get_error(f);
>  }
>  
> +void vfio_get_dirty_page_list(VFIODevice *vbasedev,
> +                              uint64_t start_pfn,
> +                              uint64_t pfn_count,
> +                              uint64_t page_size)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    uint64_t count = 0;
> +    int64_t copied_pfns = 0;
> +    int ret;
> +
> +    qemu_mutex_lock(&migration->lock);
> +    ret = pwrite(vbasedev->fd, &start_pfn, sizeof(start_pfn),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              start_pfn));
> +    if (ret < 0) {
> +        error_report("Failed to set dirty pages start address %d %s",
> +                ret, strerror(errno));
> +        goto dpl_unlock;
> +    }
> +
> +    ret = pwrite(vbasedev->fd, &page_size, sizeof(page_size),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              page_size));
> +    if (ret < 0) {
> +        error_report("Failed to set dirty page size %d %s",
> +                ret, strerror(errno));
> +        goto dpl_unlock;
> +    }
> +
> +    ret = pwrite(vbasedev->fd, &pfn_count, sizeof(pfn_count),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              total_pfns));
> +    if (ret < 0) {
> +        error_report("Failed to set dirty page total pfns %d %s",
> +                ret, strerror(errno));
> +        goto dpl_unlock;
> +    }
> +
> +    do {
> +        uint64_t bitmap_size, data_offset = 0;
> +        void *buf = NULL;
> +        bool buffer_mmaped = false;
> +
> +        /* Read copied dirty pfns */
> +        ret = pread(vbasedev->fd, &copied_pfns, sizeof(copied_pfns),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             copied_pfns));
> +        if (ret < 0) {
> +            error_report("Failed to get dirty pages bitmap count %d %s",
> +                    ret, strerror(errno));
> +            goto dpl_unlock;
> +        }
> +
> +        if (copied_pfns == 0) {
> +            /*
> +             * copied_pfns could be 0 if driver doesn't have any page to
> +             * report dirty in given range
> +             */
> +            break;
This copied_pfns is the dirty page count in which range?
If it is fetched each iteration, why break here rather than continue?
Consider a big region with pfn_count that is now broken into several
smaller subregions: copied_pfns being 0 in the first subregion doesn't
mean copied_pfns is 0 in all the remaining subregions.

> +        } else if (copied_pfns == -1) {
> +            /* Mark all pages dirty for this range */
> +            cpu_physical_memory_set_dirty_range(start_pfn * page_size,
> +                                                pfn_count * page_size,
> +                                                DIRTY_MEMORY_MIGRATION);
> +            break;
> +        }
> +
> +        bitmap_size = (BITS_TO_LONGS(copied_pfns) + 1) * sizeof(unsigned long);
> +
> +        ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data_offset));
> +        if (ret != sizeof(data_offset)) {
> +            error_report("Failed to get migration buffer data offset %d",
> +                         ret);
> +            goto dpl_unlock;
> +        }
> +
> +        if (region->mmaps) {
> +            int i;
> +            for (i = 0; i < region->nr_mmaps; i++) {
> > +                if ((data_offset >= region->mmaps[i].offset) &&
> +                    (data_offset < region->mmaps[i].offset +
> +                                   region->mmaps[i].size)) {
> +                    buf = region->mmaps[i].mmap + (data_offset -
> +                                                   region->mmaps[i].offset);
> +                    buffer_mmaped = true;
> +                    break;
> +                }
> +            }
> +        }
> +
> +        if (!buffer_mmaped) {
> +            buf = g_malloc0(bitmap_size);
> +
> +            ret = pread(vbasedev->fd, buf, bitmap_size,
> +                        region->fd_offset + data_offset);
> +            if (ret != bitmap_size) {
> +                error_report("Failed to get dirty pages bitmap %d", ret);
> +                g_free(buf);
> +                goto dpl_unlock;
> +            }
> +        }
> +
> +        cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
> +                                               (start_pfn + count) * page_size,
> +                                                copied_pfns);
> +        count +=  copied_pfns;
> +
Here also: why is it count += copied_pfns?

> +        if (!buffer_mmaped) {
> +            g_free(buf);
> +        }
> +    } while (count < pfn_count);
> +
> +dpl_unlock:
> +    qemu_mutex_unlock(&migration->lock);
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 1d26e6be8d48..423d6dbccace 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -224,5 +224,7 @@ int vfio_spapr_remove_window(VFIOContainer *container,
>  
>  int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
>  void vfio_migration_finalize(VFIODevice *vbasedev);
> +void vfio_get_dirty_page_list(VFIODevice *vbasedev, uint64_t start_pfn,
> +                               uint64_t pfn_count, uint64_t page_size);
>  
>  #endif /* HW_VFIO_VFIO_COMMON_H */
> -- 
> 2.7.0
> 
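
For reference, the semantics this series appears to assume for
copied_pfns, sketched as a comment (my reading of the KABI thread, not
a normative statement):

    /*
     * copied_pfns == 0  : no pages dirty anywhere in the *remaining*
     *                     range, so the loop may stop;
     * copied_pfns == -1 : treat every page in the range as dirty;
     * copied_pfns  > 0  : bitmap chunk for the next copied_pfns pages,
     *                     starting at pfn (start_pfn + count), hence
     *                     count += copied_pfns each iteration.
     *
     * If a driver instead meant "0 dirty in this sub-chunk only", a
     * separate end-of-range sentinel would be needed and the loop
     * would have to continue rather than break.
     */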


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device
  2019-06-24 19:00           ` Dr. David Alan Gilbert
@ 2019-06-26  0:43             ` Yan Zhao
  2019-06-28  9:44               ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 64+ messages in thread
From: Yan Zhao @ 2019-06-26  0:43 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Zhengxiao.zx@Alibaba-inc.com, Tian, Kevin, Liu, Yi L, cjia,
	eskultet, Yang, Ziye, qemu-devel, cohuck, shuangtai.tst,
	alex.williamson, Wang, Zhi A, mlevitsk, pasic, aik,
	Kirti Wankhede, eauger, felipe, jonathan.davies, Liu, Changpeng,
	Ken.Xue

On Tue, Jun 25, 2019 at 03:00:24AM +0800, Dr. David Alan Gilbert wrote:
> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> > 
> > 
> > On 6/21/2019 2:16 PM, Yan Zhao wrote:
> > > On Fri, Jun 21, 2019 at 04:02:50PM +0800, Kirti Wankhede wrote:
> > >>
> > >>
> > >> On 6/21/2019 6:54 AM, Yan Zhao wrote:
> > >>> On Fri, Jun 21, 2019 at 08:25:18AM +0800, Yan Zhao wrote:
> > >>>> On Thu, Jun 20, 2019 at 10:37:28PM +0800, Kirti Wankhede wrote:
> > >>>>> Add migration support for VFIO device
> > >>>>>
> > >>>>> This Patch set include patches as below:
> > >>>>> - Define KABI for VFIO device for migration support.
> > >>>>> - Added save and restore functions for PCI configuration space
> > >>>>> - Generic migration functionality for VFIO device.
> > >>>>>   * This patch set adds functionality only for PCI devices, but can be
> > >>>>>     extended to other VFIO devices.
> > >>>>>   * Added all the basic functions required for pre-copy, stop-and-copy and
> > >>>>>     resume phases of migration.
> > >>>>>   * Added state change notifier and from that notifier function, VFIO
> > >>>>>     device's state changed is conveyed to VFIO device driver.
> > >>>>>   * During save setup phase and resume/load setup phase, migration region
> > >>>>>     is queried and is used to read/write VFIO device data.
> > >>>>>   * .save_live_pending and .save_live_iterate are implemented to use QEMU's
> > >>>>>     functionality of iteration during pre-copy phase.
> > >>>>>   * In .save_live_complete_precopy, that is in stop-and-copy phase,
> > >>>>>     iteration to read data from VFIO device driver is implemented till pending
> > >>>>>     bytes returned by driver are not zero.
> > >>>>>   * Added function to get dirty pages bitmap for the pages which are used by
> > >>>>>     driver.
> > >>>>> - Add vfio_listerner_log_sync to mark dirty pages.
> > >>>>> - Make VFIO PCI device migration capable. If migration region is not provided by
> > >>>>>   driver, migration is blocked.
> > >>>>>
> > >>>>> Below is the flow of state change for live migration where states in brackets
> > >>>>> represent VM state, migration state and VFIO device state as:
> > >>>>>     (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)
> > >>>>>
> > >>>>> Live migration save path:
> > >>>>>         QEMU normal running state
> > >>>>>         (RUNNING, _NONE, _RUNNING)
> > >>>>>                         |
> > >>>>>     migrate_init spawns migration_thread.
> > >>>>>     (RUNNING, _SETUP, _RUNNING|_SAVING)
> > >>>>>     Migration thread then calls each device's .save_setup()
> > >>>>>                         |
> > >>>>>     (RUNNING, _ACTIVE, _RUNNING|_SAVING)
> > >>>>>     If device is active, get pending bytes by .save_live_pending()
> > >>>>>     if pending bytes >= threshold_size,  call save_live_iterate()
> > >>>>>     Data of VFIO device for pre-copy phase is copied.
> > >>>>>     Iterate till pending bytes converge and are less than threshold
> > >>>>>                         |
> > >>>>>     On migration completion, vCPUs stops and calls .save_live_complete_precopy
> > >>>>>     for each active device. VFIO device is then transitioned in
> > >>>>>      _SAVING state.
> > >>>>>     (FINISH_MIGRATE, _DEVICE, _SAVING)
> > >>>>>     For VFIO device, iterate in  .save_live_complete_precopy  until
> > >>>>>     pending data is 0.
> > >>>>>     (FINISH_MIGRATE, _DEVICE, _STOPPED)
> > >>>>
> > >>>> I suggest we also register to VMStateDescription, whose .pre_save
> > >>>> handler would get called after .save_live_complete_precopy in pre-copy
> > >>>> only case, and will be called before .save_live_iterate in the
> > >>>> post-copy enabled case.
> > >>>> In the .pre_save handler, we can save all device state which must be
> > >>>> copied after device stop in source vm and before device start in target vm.
> > >>>>
> > >>> hi
> > >>> to better describe this idea:
> > >>>
> > >>> in pre-copy only case, the flow is
> > >>>
> > >>> start migration --> .save_live_iterate (several rounds) -> stop source vm
> > >>> --> .save_live_complete_precopy --> .pre_save  -->start target vm
> > >>> -->migration complete
> > >>>
> > >>>
> > >>> in post-copy enabled case, the flow is
> > >>>
> > >>> start migration --> .save_live_iterate (several rounds) --> start post copy -->
> > >>> stop source vm --> .pre_save --> start target vm --> .save_live_iterate (several rounds)
> > >>> -->migration complete
> > >>>
> > >>> Therefore, we should put saving of device state in .pre_save interface
> > >>> rather than in .save_live_complete_precopy. 
> > >>> The device state includes pci config data, page tables, register state, etc.
> > >>>
> > >>> The .save_live_iterate and .save_live_complete_precopy should only deal
> > >>> with saving dirty memory.
> > >>>
> > >>
> > >> Vendor driver can decide when to save device state depending on the VFIO
> > >> device state set by user. Vendor driver doesn't have to depend on which
> > >> callback function QEMU or user application calls. In pre-copy case,
> > >> save_live_complete_precopy sets VFIO device state to
> > >> VFIO_DEVICE_STATE_SAVING which means vCPUs are stopped and vendor driver
> > >> should save all device state.
> > >>
> > > When post-copy stops vCPUs and the VFIO device, the vendor driver only
> > > needs to provide device state. But how does the vendor driver know that,
> > > if no extra interface or no extra device state is provided?
> > > 
> > 
> > .save_live_complete_postcopy interface for post-copy will get called,
> > right?
> 
> That happens at the very end; I think the question here is for something
> that gets called at the point we stop iteratively sending RAM, send the
> device states and then start sending RAM on demand to the destination
> as it's running. Typically we send a small set of device state
> (registers etc) at this point.
> 
> I guess there's two different postcopy cases that we need to think
> about:
>   a) Where the VFIO device doesn't support postcopy - it just gets
> >   migrated like any other device, so all its RAM must get sent
>   before we flip into postcopy mode.
> 
>   b) Where the VFIO device does support postcopy - where the pages
>   get sent on demand.
> 
> (b) may be tricky depending on whether your hardware can fault
> on pages of its RAM that are needed but not yet transferred; but
> if it can, that would make life a lot more practical on really
> big VFIO devices.
> 
> Dave
>
hi Dave,
So do you think it is good to abstract the device state data and save it
in the .pre_save callback?

Thanks
Yan

> > Thanks,
> > Kirti
> > 
> > >>>
> > >>> I know the current implementation does not support post-copy, but at least
> > >>> it should not require a huge change when we decide to enable it in the future.
> > >>>
> > >>
> > >> .has_postcopy and .save_live_complete_postcopy need to be implemented to
> > >> support post-copy. I think .save_live_complete_postcopy should be
> > >> similar to vfio_save_complete_precopy.
> > >>
> > >> Thanks,
> > >> Kirti
> > >>
> > >>> Thanks
> > >>> Yan
> > >>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH v4 07/13] vfio: Register SaveVMHandlers for VFIO device
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 07/13] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
@ 2019-06-27 10:01   ` Dr. David Alan Gilbert
  2019-06-27 14:31     ` Kirti Wankhede
  0 siblings, 1 reply; 64+ messages in thread
From: Dr. David Alan Gilbert @ 2019-06-27 10:01 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	yulei.zhang, cohuck, shuangtai.tst, qemu-devel, zhi.a.wang,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, yan.y.zhao, changpeng.liu, Ken.Xue

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> Define flags to be used as delimiters in the migration file stream.
> Added .save_setup and .save_cleanup functions. Mapped & unmapped the migration
> region from these functions at the source during the saving or pre-copy phase.
> Set the VFIO device state depending on the VM's state. During live migration,
> the VM is running when .save_setup is called, so the _SAVING | _RUNNING state
> is set for the VFIO device. During save-restore, the VM is paused, so the
> _SAVING state is set for the VFIO device.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 75 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 7f9858e6c995..fe0887c27664 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -22,6 +22,17 @@
>  #include "exec/ram_addr.h"
>  #include "pci.h"
>  
> +/*
> + * Flags used as delimiter:
> + * 0xffffffff => MSB 32-bit all 1s
> + * 0xef10     => emulated (virtual) function IO
> + * 0x0000     => 16-bits reserved for flags
> + */
> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> +
>  static void vfio_migration_region_exit(VFIODevice *vbasedev)
>  {
>      VFIOMigration *migration = vbasedev->migration;
> @@ -96,6 +107,69 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
>      return 0;
>  }
>  
> +/* ---------------------------------------------------------------------- */
> +
> +static int vfio_save_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> +
> +    if (migration->region.buffer.mmaps) {
> +        qemu_mutex_lock_iothread();
> +        ret = vfio_region_mmap(&migration->region.buffer);
> +        qemu_mutex_unlock_iothread();
> +        if (ret) {
> +            error_report("Failed to mmap VFIO migration region %d: %s",
> +                         migration->region.index, strerror(-ret));
> +            return ret;
> +        }
> +    }
> +
> +    if (vbasedev->vm_running) {
> +        ret = vfio_migration_set_state(vbasedev,
> +                         VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING);
> +        if (ret) {
> +            error_report("Failed to set state RUNNING and SAVING");
> +            return ret;
> +        }
> +    } else {
> +        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_SAVING);
> +        if (ret) {
> +            error_report("Failed to set state STOP and SAVING");
> +            return ret;
> +        }
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
> +static void vfio_save_cleanup(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    if (migration->region.buffer.mmaps) {
> +        vfio_region_unmap(&migration->region.buffer);
> +    }
> +}
> +
> +static SaveVMHandlers savevm_vfio_handlers = {
> +    .save_setup = vfio_save_setup,
> +    .save_cleanup = vfio_save_cleanup,
> +};
> +
> +/* ---------------------------------------------------------------------- */
> +
>  static void vfio_vmstate_change(void *opaque, int running, RunState state)
>  {
>      VFIODevice *vbasedev = opaque;
> @@ -169,7 +243,7 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>      }
>  
>      qemu_mutex_init(&vbasedev->migration->lock);
> -
> +    register_savevm_live(NULL, "vfio", -1, 1, &savevm_vfio_handlers, vbasedev);

Does this work OK with multiple devices?
I think I'd expected you to pass a DeviceState as the first parameter
for a real device like vfio.
'ram' and 'block' don't need to because they iterate over all RAM
devices inside their save_setup's and similar handlers;  for vfio I'd
expect it to be per-device.

Dave

>      vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>                                                            vbasedev);
>  
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
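
A sketch of the per-device registration being suggested, assuming the
QEMU 4.x register_savevm_live() signature that still takes a DeviceState
(as the NULL argument above shows) and a 'dev' back-pointer on
VFIODevice:

    /* in vfio_migration_init(): one registration per device, letting
     * savevm derive a unique instance id from the DeviceState instead
     * of passing NULL */
    register_savevm_live(vbasedev->dev, "vfio", -1, 1,
                         &savevm_vfio_handlers, vbasedev);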


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH v4 06/13] vfio: Add migration state change notifier
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 06/13] vfio: Add migration state change notifier Kirti Wankhede
@ 2019-06-27 10:33   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 64+ messages in thread
From: Dr. David Alan Gilbert @ 2019-06-27 10:33 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	yulei.zhang, cohuck, shuangtai.tst, qemu-devel, zhi.a.wang,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, yan.y.zhao, changpeng.liu, Ken.Xue

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> Added migration state change notifier to get notification on migration state
> change. These states are translated to VFIO device state and conveyed to vendor
> driver.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>

Just a general thought: be careful when following migration states,
for example 'device' and 'pre_switchover'; think through all the
different states and how you might get a 'failed' anywhere.

Dave
> ---
>  hw/vfio/migration.c           | 49 +++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |  1 +
>  2 files changed, 50 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 15af218c23d1..7f9858e6c995 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -112,6 +112,48 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state)
>      vbasedev->vm_running = running;
>  }
>  
> +static void vfio_migration_state_notifier(Notifier *notifier, void *data)
> +{
> +    MigrationState *s = data;
> +    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
> +    int ret;
> +
> +    switch (s->state) {
> +    case MIGRATION_STATUS_ACTIVE:
> +        if (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING) {
> +            if (vbasedev->vm_running) {
> +                ret = vfio_migration_set_state(vbasedev,
> +                          VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING);
> +                if (ret) {
> +                    error_report("Failed to set state RUNNING and SAVING");
> +                }
> +            } else {
> +                ret = vfio_migration_set_state(vbasedev,
> +                                               VFIO_DEVICE_STATE_SAVING);
> +                if (ret) {
> +                    error_report("Failed to set state STOP and SAVING");
> +                }
> +            }
> +        } else {
> +            ret = vfio_migration_set_state(vbasedev,
> +                                           VFIO_DEVICE_STATE_RESUMING);
> +            if (ret) {
> +                error_report("Failed to set state RESUMING");
> +            }
> +        }
> +        return;
> +
> +    case MIGRATION_STATUS_CANCELLING:
> +    case MIGRATION_STATUS_CANCELLED:
> +    case MIGRATION_STATUS_FAILED:
> +        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING);
> +        if (ret) {
> +            error_report("Failed to set state RUNNING");
> +        }
> +        return;
> +    }
> +}
> +
>  static int vfio_migration_init(VFIODevice *vbasedev,
>                                 struct vfio_region_info *info)
>  {
> @@ -131,6 +173,9 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>      vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>                                                            vbasedev);
>  
> +    vbasedev->migration_state.notify = vfio_migration_state_notifier;
> +    add_migration_state_change_notifier(&vbasedev->migration_state);
> +
>      return 0;
>  }
>  
> @@ -167,6 +212,10 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
>          return;
>      }
>  
> +    if (vbasedev->migration_state.notify) {
> +        remove_migration_state_change_notifier(&vbasedev->migration_state);
> +    }
> +
>      if (vbasedev->vm_state) {
>          qemu_del_vm_change_state_handler(vbasedev->vm_state);
>      }
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index f2392e97fa57..1d26e6be8d48 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -133,6 +133,7 @@ typedef struct VFIODevice {
>      uint32_t device_state;
>      VMChangeStateEntry *vm_state;
>      int vm_running;
> +    Notifier migration_state;
>  } VFIODevice;
>  
>  struct VFIODeviceOps {
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Qemu-devel] [PATCH v4 07/13] vfio: Register SaveVMHandlers for VFIO device
  2019-06-27 10:01   ` Dr. David Alan Gilbert
@ 2019-06-27 14:31     ` Kirti Wankhede
  0 siblings, 0 replies; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-27 14:31 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	yulei.zhang, cohuck, shuangtai.tst, qemu-devel, zhi.a.wang,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, yan.y.zhao, changpeng.liu, Ken.Xue



On 6/27/2019 3:31 PM, Dr. David Alan Gilbert wrote:
> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
>> Define flags to be used as delimiters in the migration file stream.
>> Added .save_setup and .save_cleanup functions, which map and unmap the
>> migration region at the source during the saving/pre-copy phase.
>> Set the VFIO device state depending on the VM's state: during live migration
>> the VM is running when .save_setup is called, so the _SAVING | _RUNNING state
>> is set for the VFIO device; during save-restore the VM is paused, so the
>> _SAVING state is set for the VFIO device.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  hw/vfio/migration.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 75 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 7f9858e6c995..fe0887c27664 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -22,6 +22,17 @@
>>  #include "exec/ram_addr.h"
>>  #include "pci.h"
>>  
>> +/*
>> + * Flags used as delimiter:
>> + * 0xffffffff => MSB 32-bit all 1s
>> + * 0xef10     => emulated (virtual) function IO
>> + * 0x0000     => 16-bits reserved for flags
>> + */
>> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
>> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
>> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
>> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
>> +
>>  static void vfio_migration_region_exit(VFIODevice *vbasedev)
>>  {
>>      VFIOMigration *migration = vbasedev->migration;
>> @@ -96,6 +107,69 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
>>      return 0;
>>  }
>>  
>> +/* ---------------------------------------------------------------------- */
>> +
>> +static int vfio_save_setup(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret;
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>> +
>> +    if (migration->region.buffer.mmaps) {
>> +        qemu_mutex_lock_iothread();
>> +        ret = vfio_region_mmap(&migration->region.buffer);
>> +        qemu_mutex_unlock_iothread();
>> +        if (ret) {
>> +            error_report("Failed to mmap VFIO migration region %d: %s",
>> +                         migration->region.index, strerror(-ret));
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    if (vbasedev->vm_running) {
>> +        ret = vfio_migration_set_state(vbasedev,
>> +                         VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING);
>> +        if (ret) {
>> +            error_report("Failed to set state RUNNING and SAVING");
>> +            return ret;
>> +        }
>> +    } else {
>> +        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_SAVING);
>> +        if (ret) {
>> +            error_report("Failed to set state STOP and SAVING");
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static void vfio_save_cleanup(void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    if (migration->region.buffer.mmaps) {
>> +        vfio_region_unmap(&migration->region.buffer);
>> +    }
>> +}
>> +
>> +static SaveVMHandlers savevm_vfio_handlers = {
>> +    .save_setup = vfio_save_setup,
>> +    .save_cleanup = vfio_save_cleanup,
>> +};
>> +
>> +/* ---------------------------------------------------------------------- */
>> +
>>  static void vfio_vmstate_change(void *opaque, int running, RunState state)
>>  {
>>      VFIODevice *vbasedev = opaque;
>> @@ -169,7 +243,7 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>>      }
>>  
>>      qemu_mutex_init(&vbasedev->migration->lock);
>> -
>> +    register_savevm_live(NULL, "vfio", -1, 1, &savevm_vfio_handlers, vbasedev);
> 
> Does this work OK with multiple devices?

Yes. Tested with multiple vGPU devices.

> I think I'd expected you to pass a DeviceState as the first parameter
> for a real device like vfio.
> 'ram' and 'block' don't need to because they iterate over all RAM
> devices inside their save_setup's and similar handlers;  for vfio I'd
> expect it to be per-device.

I do see the handlers called per-device. I'll look into passing a DeviceState
as the first parameter.

Thanks,
Kirti
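
For reference, a sketch of what the call might become with a device argument;
how to reach the DeviceState from a VFIODevice is exactly the open question
here, so the vfio_pci_get_object() route below is an assumption (PCI-only, as
in patch 04):

    Object *obj = vfio_pci_get_object(vbasedev);  /* assumes a PCI device */

    register_savevm_live(DEVICE(obj), "vfio", -1, 1,
                         &savevm_vfio_handlers, vbasedev);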

> 
> Dave
> 
>>      vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>>                                                            vbasedev);
>>  
>> -- 
>> 2.7.0
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 



* Re: [Qemu-devel] [PATCH v4 04/13] vfio: Add migration region initialization and finalize function
  2019-06-24 14:00   ` Cornelia Huck
@ 2019-06-27 14:56     ` Kirti Wankhede
  0 siblings, 0 replies; 64+ messages in thread
From: Kirti Wankhede @ 2019-06-27 14:56 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang, qemu-devel,
	Zhengxiao.zx, shuangtai.tst, dgilbert, zhi.a.wang, mlevitsk,
	pasic, aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue



On 6/24/2019 7:30 PM, Cornelia Huck wrote:
> On Thu, 20 Jun 2019 20:07:32 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> - Migration functions are implemented for VFIO_DEVICE_TYPE_PCI device in this
>>   patch series.
>> - Whether a VFIO device supports migration is decided based on the migration
>>   region query. If the migration region query is successful then migration is
>>   supported, else migration is blocked.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  hw/vfio/Makefile.objs         |   2 +-
>>  hw/vfio/migration.c           | 137 ++++++++++++++++++++++++++++++++++++++++++
>>  include/hw/vfio/vfio-common.h |  14 +++++
>>  3 files changed, 152 insertions(+), 1 deletion(-)
>>  create mode 100644 hw/vfio/migration.c
> 
> (...)
> 
>> +static int vfio_migration_region_init(VFIODevice *vbasedev)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    Object *obj = NULL;
>> +    int ret = -EINVAL;
>> +
>> +    if (!migration) {
>> +        return ret;
>> +    }
>> +
>> +    /* Migration support added for PCI device only */
>> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
>> +        obj = vfio_pci_get_object(vbasedev);
>> +    }
> 
> Hm... what about instead including an (optional) callback in
> VFIODeviceOps that returns the object embedding the VFIODevice? No need
> to adapt this code if we introduce support for a non-pci device, and the
> callback function also allows supporting migration in a more
> fine-grained way than by device type.
> 

That's a good suggestion. I'm incorporating this change in the next version.

Thanks,
Kirti
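
A sketch of how the suggested callback might look; the member name
vfio_get_object is hypothetical:

    /* in struct VFIODeviceOps: */
    Object *(*vfio_get_object)(VFIODevice *vbasedev);

    /* in vfio_migration_region_init(), replacing the per-type check: */
    if (vbasedev->ops && vbasedev->ops->vfio_get_object) {
        obj = vbasedev->ops->vfio_get_object(vbasedev);
    }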


>> +
>> +    if (!obj) {
>> +        return ret;
>> +    }
>> +
>> +    ret = vfio_region_setup(obj, vbasedev, &migration->region.buffer,
>> +                            migration->region.index, "migration");
>> +    if (ret) {
>> +        error_report("Failed to setup VFIO migration region %d: %s",
>> +                      migration->region.index, strerror(-ret));
>> +        goto err;
>> +    }
>> +
>> +    if (!migration->region.buffer.size) {
>> +        ret = -EINVAL;
>> +        error_report("Invalid region size of VFIO migration region %d: %s",
>> +                     migration->region.index, strerror(-ret));
>> +        goto err;
>> +    }
>> +
>> +    return 0;
>> +
>> +err:
>> +    vfio_migration_region_exit(vbasedev);
>> +    return ret;
>> +}
> 
> (...)
> 



* Re: [Qemu-devel] [PATCH v4 08/13] vfio: Add save state functions to SaveVMHandlers
  2019-06-25  3:30     ` Yan Zhao
@ 2019-06-28  8:50       ` Dr. David Alan Gilbert
  2019-06-28 21:16         ` Yan Zhao
  0 siblings, 1 reply; 64+ messages in thread
From: Dr. David Alan Gilbert @ 2019-06-28  8:50 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pasic, Tian, Kevin, Liu, Yi L, cjia, Ken.Xue, eskultet, Yang,
	Ziye, qemu-devel, Zhengxiao.zx@Alibaba-inc.com, shuangtai.tst,
	alex.williamson, mlevitsk, yulei.zhang, aik, Kirti Wankhede,
	eauger, cohuck, jonathan.davies, felipe, Liu, Changpeng, Wang,
	Zhi A

* Yan Zhao (yan.y.zhao@intel.com) wrote:
> On Fri, Jun 21, 2019 at 08:31:53AM +0800, Yan Zhao wrote:
> > On Thu, Jun 20, 2019 at 10:37:36PM +0800, Kirti Wankhede wrote:
> > > Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
> > > functions. These functions handle the pre-copy and stop-and-copy phases.
> > > 
> > > In _SAVING|_RUNNING device state or pre-copy phase:
> > > - read pending_bytes
> > > - read data_offset - indicates to the kernel driver to write data to the
> > >   staging buffer, which is mmapped.
> > > - read data_size - amount of data in bytes written by vendor driver in migration
> > >   region.
> > > - if data section is trapped, pread() number of bytes in data_size, from
> > >   data_offset.
> > > - if data section is mmaped, read mmaped buffer of size data_size.
> > > - Write data packet to file stream as below:
> > > {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> > > VFIO_MIG_FLAG_END_OF_STATE }
> > > 
> > > In _SAVING device state or stop-and-copy phase
> > > a. read config space of device and save to migration file stream. This
> > >    doesn't need to be from vendor driver. Any other special config state
> > >    from driver can be saved as data in following iteration.
> > > b. read pending_bytes - indicates to the kernel driver to write data to
> > >    the staging buffer, which is mmapped.
> > > c. read data_size - amount of data in bytes written by vendor driver in
> > >    migration region.
> > > d. if data section is trapped, pread() from data_offset of size data_size.
> > > e. if data section is mmaped, read mmaped buffer of size data_size.
> > > f. Write data packet as below:
> > >    {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> > > g. iterate through steps b to f while (pending_bytes > 0)
> > > h. Write {VFIO_MIG_FLAG_END_OF_STATE}
> > > 
> > > .save_live_iterate runs outside the iothread lock in the migration case, which
> > > could race with an asynchronous call to get the dirty page list, causing data
> > > corruption in the mapped migration region. A mutex is added here to serialize
> > > migration buffer read operations.
> > > 
> > > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > > ---
> > >  hw/vfio/migration.c | 212 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 212 insertions(+)
> > > 
> > > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > > index fe0887c27664..0a2f30872316 100644
> > > --- a/hw/vfio/migration.c
> > > +++ b/hw/vfio/migration.c
> > > @@ -107,6 +107,111 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
> > >      return 0;
> > >  }
> > >  
> > > +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
> > > +{
> > > +    VFIOMigration *migration = vbasedev->migration;
> > > +    VFIORegion *region = &migration->region.buffer;
> > > +    uint64_t data_offset = 0, data_size = 0;
> > > +    int ret;
> > > +
> > > +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> > > +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> > > +                                             data_offset));
> > > +    if (ret != sizeof(data_offset)) {
> > > +        error_report("Failed to get migration buffer data offset %d",
> > > +                     ret);
> > > +        return -EINVAL;
> > > +    }
> > > +
> > > +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
> > > +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> > > +                                             data_size));
> > > +    if (ret != sizeof(data_size)) {
> > > +        error_report("Failed to get migration buffer data size %d",
> > > +                     ret);
> > > +        return -EINVAL;
> > > +    }
> > > +
> > How big is the data_size?
> > If this size is too big, it may take too much time and block others.
> > 
> > > +    if (data_size > 0) {
> > > +        void *buf = NULL;
> > > +        bool buffer_mmaped = false;
> > > +
> > > +        if (region->mmaps) {
> > > +            int i;
> > > +
> > > +            for (i = 0; i < region->nr_mmaps; i++) {
> > > +                if ((data_offset >= region->mmaps[i].offset) &&
> > > +                    (data_offset < region->mmaps[i].offset +
> > > +                                   region->mmaps[i].size)) {
> > > +                    buf = region->mmaps[i].mmap + (data_offset -
> > > +                                                   region->mmaps[i].offset);
> > > +                    buffer_mmaped = true;
> > > +                    break;
> > > +                }
> > > +            }
> > > +        }
> > > +
> > > +        if (!buffer_mmaped) {
> > > +            buf = g_malloc0(data_size);
> > > +            ret = pread(vbasedev->fd, buf, data_size,
> > > +                        region->fd_offset + data_offset);
> > > +            if (ret != data_size) {
> > > +                error_report("Failed to get migration data %d", ret);
> > > +                g_free(buf);
> > > +                return -EINVAL;
> > > +            }
> > > +        }
> > > +
> > > +        qemu_put_be64(f, data_size);
> > > +        qemu_put_buffer(f, buf, data_size);
> > > +
> > > +        if (!buffer_mmaped) {
> > > +            g_free(buf);
> > > +        }
> > > +        migration->pending_bytes -= data_size;
> > > +    } else {
> > > +        qemu_put_be64(f, data_size);
> > > +    }
> > > +
> > > +    ret = qemu_file_get_error(f);
> > > +
> > > +    return data_size;
> > > +}
> > > +
> > > +static int vfio_update_pending(VFIODevice *vbasedev)
> > > +{
> > > +    VFIOMigration *migration = vbasedev->migration;
> > > +    VFIORegion *region = &migration->region.buffer;
> > > +    uint64_t pending_bytes = 0;
> > > +    int ret;
> > > +
> > > +    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
> > > +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> > > +                                             pending_bytes));
> > > +    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
> > > +        error_report("Failed to get pending bytes %d", ret);
> > > +        migration->pending_bytes = 0;
> > > +        return (ret < 0) ? ret : -EINVAL;
> > > +    }
> > > +
> > > +    migration->pending_bytes = pending_bytes;
> > > +    return 0;
> > > +}
> > > +
> > > +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> > > +{
> > > +    VFIODevice *vbasedev = opaque;
> > > +
> > > +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
> > > +
> > > +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> > > +        vfio_pci_save_config(vbasedev, f);
> > > +    }
> > > +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> > > +
> > > +    return qemu_file_get_error(f);
> > > +}
> > > +
> > >  /* ---------------------------------------------------------------------- */
> > >  
> > >  static int vfio_save_setup(QEMUFile *f, void *opaque)
> > > @@ -163,9 +268,116 @@ static void vfio_save_cleanup(void *opaque)
> > >      }
> > >  }
> > >  
> > > +static void vfio_save_pending(QEMUFile *f, void *opaque,
> > > +                              uint64_t threshold_size,
> > > +                              uint64_t *res_precopy_only,
> > > +                              uint64_t *res_compatible,
> > > +                              uint64_t *res_postcopy_only)
> > > +{
> > > +    VFIODevice *vbasedev = opaque;
> > > +    VFIOMigration *migration = vbasedev->migration;
> > > +    int ret;
> > > +
> > > +    ret = vfio_update_pending(vbasedev);
> > > +    if (ret) {
> > > +        return;
> > > +    }
> > > +
> > > +    if (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING) {
> > > +        *res_precopy_only += migration->pending_bytes;
> > > +    } else {
> > > +        *res_postcopy_only += migration->pending_bytes;
> > > +    }
> by definition,
> - res_precopy_only is for data which must be migrated in precopy phase
>    or in stopped state, in other words - before target vm start
> - res_postcopy_only is for data which must be migrated in postcopy phase
>   or in stopped state, in other words - after source vm stop
> So, we can only determine the data type by the nature of the data, i.e.
> if it is device state data which must be copied after the source vm stops and
> before the target vm starts, it belongs to res_precopy_only.
> 
> It is not right to determine the data type by the current device state.

Right; you can determine it by whether postcopy is *enabled* or not.
However, since this isn't ready for postcopy yet anyway, just add it to
res_postcopy_only all the time;  then you can come back to postcopy
later.

Dave

> Thanks
> Yan
> 
> > > +    *res_compatible += 0;
> > > +}
> > > +
> > > +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> > > +{
> > > +    VFIODevice *vbasedev = opaque;
> > > +    VFIOMigration *migration = vbasedev->migration;
> > > +    int ret;
> > > +
> > > +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> > > +
> > > +    qemu_mutex_lock(&migration->lock);
> > > +    ret = vfio_save_buffer(f, vbasedev);
> > > +    qemu_mutex_unlock(&migration->lock);
> > > +
> > > +    if (ret < 0) {
> > > +        error_report("vfio_save_buffer failed %s",
> > > +                     strerror(errno));
> > > +        return ret;
> > > +    }
> > > +
> > > +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> > > +
> > > +    ret = qemu_file_get_error(f);
> > > +    if (ret) {
> > > +        return ret;
> > > +    }
> > > +
> > > +    return ret;
> > > +}
> > > +
> > > +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> > > +{
> > > +    VFIODevice *vbasedev = opaque;
> > > +    VFIOMigration *migration = vbasedev->migration;
> > > +    int ret;
> > > +
> > > +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_SAVING);
> > > +    if (ret) {
> > > +        error_report("Failed to set state STOP and SAVING");
> > > +        return ret;
> > > +    }
> > > +
> > > +    ret = vfio_save_device_config_state(f, opaque);
> > > +    if (ret) {
> > > +        return ret;
> > > +    }
> > > +
> > > +    ret = vfio_update_pending(vbasedev);
> > > +    if (ret) {
> > > +        return ret;
> > > +    }
> > > +
> > > +    while (migration->pending_bytes > 0) {
> > > +        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> > > +        ret = vfio_save_buffer(f, vbasedev);
> > > +        if (ret < 0) {
> > > +            error_report("Failed to save buffer");
> > > +            return ret;
> > > +        } else if (ret == 0) {
> > > +            break;
> > > +        }
> > > +
> > > +        ret = vfio_update_pending(vbasedev);
> > > +        if (ret) {
> > > +            return ret;
> > > +        }
> > > +    }
> > > +
> > > +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> > > +
> > > +    ret = qemu_file_get_error(f);
> > > +    if (ret) {
> > > +        return ret;
> > > +    }
> > > +
> > > +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOPPED);
> > > +    if (ret) {
> > > +        error_report("Failed to set state STOPPED");
> > > +        return ret;
> > > +    }
> > > +    return ret;
> > > +}
> > > +
> > >  static SaveVMHandlers savevm_vfio_handlers = {
> > >      .save_setup = vfio_save_setup,
> > >      .save_cleanup = vfio_save_cleanup,
> > > +    .save_live_pending = vfio_save_pending,
> > > +    .save_live_iterate = vfio_save_iterate,
> > > +    .save_live_complete_precopy = vfio_save_complete_precopy,
> > >  };
> > >  
> > >  /* ---------------------------------------------------------------------- */
> > > -- 
> > > 2.7.0
> > > 
> > 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Qemu-devel] [PATCH v4 08/13] vfio: Add save state functions to SaveVMHandlers
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 08/13] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
  2019-06-20 19:25   ` Alex Williamson
  2019-06-21  0:31   ` Yan Zhao
@ 2019-06-28  9:09   ` Dr. David Alan Gilbert
  2 siblings, 0 replies; 64+ messages in thread
From: Dr. David Alan Gilbert @ 2019-06-28  9:09 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	yulei.zhang, cohuck, shuangtai.tst, qemu-devel, zhi.a.wang,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, yan.y.zhao, changpeng.liu, Ken.Xue

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
> functions. These functions handle the pre-copy and stop-and-copy phases.
> 
> In _SAVING|_RUNNING device state or pre-copy phase:
> - read pending_bytes
> - read data_offset - indicates to the kernel driver to write data to the
>   staging buffer, which is mmapped.
> - read data_size - amount of data in bytes written by vendor driver in migration
>   region.
> - if data section is trapped, pread() number of bytes in data_size, from
>   data_offset.
> - if data section is mmaped, read mmaped buffer of size data_size.
> - Write data packet to file stream as below:
> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> VFIO_MIG_FLAG_END_OF_STATE }
> 
> In _SAVING device state or stop-and-copy phase
> a. read config space of device and save to migration file stream. This
>    doesn't need to be from vendor driver. Any other special config state
>    from driver can be saved as data in following iteration.
> b. read pending_bytes - indicates to the kernel driver to write data to
>    the staging buffer, which is mmapped.
> c. read data_size - amount of data in bytes written by vendor driver in
>    migration region.
> d. if data section is trapped, pread() from data_offset of size data_size.
> e. if data section is mmaped, read mmaped buffer of size data_size.
> f. Write data packet as below:
>    {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> g. iterate through steps b to f while (pending_bytes > 0)
> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
> 
> .save_live_iterate runs outside the iothread lock in the migration case, which
> could race with an asynchronous call to get the dirty page list, causing data
> corruption in the mapped migration region. A mutex is added here to serialize
> migration buffer read operations.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c | 212 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 212 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index fe0887c27664..0a2f30872316 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -107,6 +107,111 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
>      return 0;
>  }
>  
> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    uint64_t data_offset = 0, data_size = 0;
> +    int ret;
> +
> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data_offset));
> +    if (ret != sizeof(data_offset)) {
> +        error_report("Failed to get migration buffer data offset %d",
> +                     ret);
> +        return -EINVAL;
> +    }

It feels like you need a helper function, something so that you can do
something like:

       if (!vfio_dev_read(vbasedev, &data_offset, sizeof(data_offset),
                          region->fd_offset + offsetof(struct vfio_device_migration_info,
                                                data_offset),
                          "data offset")) {
           return -EINVAL;
       }
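
A minimal sketch of such a helper, following the shape suggested above (name
and signature hypothetical):

    static bool vfio_dev_read(VFIODevice *vbasedev, void *val, size_t count,
                              off_t off, const char *name)
    {
        ssize_t ret = pread(vbasedev->fd, val, count, off);

        if (ret != (ssize_t)count) {
            error_report("Failed to read migration %s (pread returned %zd)",
                         name, ret);
            return false;
        }
        return true;
    }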

> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data_size));
> +    if (ret != sizeof(data_size)) {
> +        error_report("Failed to get migration buffer data size %d",
> +                     ret);
> +        return -EINVAL;
> +    }
> +
> +    if (data_size > 0) {
> +        void *buf = NULL;
> +        bool buffer_mmaped = false;
> +
> +        if (region->mmaps) {
> +            int i;
> +
> +            for (i = 0; i < region->nr_mmaps; i++) {
> +                if ((data_offset >= region->mmaps[i].offset) &&
> +                    (data_offset < region->mmaps[i].offset +
> +                                   region->mmaps[i].size)) {
> +                    buf = region->mmaps[i].mmap + (data_offset -
> +                                                   region->mmaps[i].offset);
> +                    buffer_mmaped = true;
> +                    break;
> +                }
> +            }
> +        }
> +
> +        if (!buffer_mmaped) {
> +            buf = g_malloc0(data_size);
> +            ret = pread(vbasedev->fd, buf, data_size,
> +                        region->fd_offset + data_offset);
> +            if (ret != data_size) {
> +                error_report("Failed to get migration data %d", ret);
> +                g_free(buf);
> +                return -EINVAL;
> +            }
> +        }
> +
> +        qemu_put_be64(f, data_size);
> +        qemu_put_buffer(f, buf, data_size);
> +
> +        if (!buffer_mmaped) {
> +            g_free(buf);
> +        }
> +        migration->pending_bytes -= data_size;
> +    } else {
> +        qemu_put_be64(f, data_size);
> +    }
> +
> +    ret = qemu_file_get_error(f);

You're ignoring that return value; it's not that important to check for
errors on the saving side - although you should if you're looping on data,
so that you fail quickly; it's more of an issue on the load side.
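
A sketch of the quick-fail variant, propagating the stream error instead of
returning data_size once the QEMUFile is in error:

    ret = qemu_file_get_error(f);
    if (ret) {
        return ret;
    }

    return data_size;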

> +    return data_size;
> +}
> +
> +static int vfio_update_pending(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    uint64_t pending_bytes = 0;
> +    int ret;
> +
> +    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             pending_bytes));
> +    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
> +        error_report("Failed to get pending bytes %d", ret);
> +        migration->pending_bytes = 0;
> +        return (ret < 0) ? ret : -EINVAL;
> +    }
> +
> +    migration->pending_bytes = pending_bytes;
> +    return 0;
> +}
> +
> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
> +
> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +        vfio_pci_save_config(vbasedev, f);
> +    }
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    return qemu_file_get_error(f);
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -163,9 +268,116 @@ static void vfio_save_cleanup(void *opaque)
>      }
>  }
>  
> +static void vfio_save_pending(QEMUFile *f, void *opaque,
> +                              uint64_t threshold_size,
> +                              uint64_t *res_precopy_only,
> +                              uint64_t *res_compatible,
> +                              uint64_t *res_postcopy_only)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    ret = vfio_update_pending(vbasedev);
> +    if (ret) {
> +        return;
> +    }
> +
> +    if (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING) {
> +        *res_precopy_only += migration->pending_bytes;
> +    } else {
> +        *res_postcopy_only += migration->pending_bytes;
> +    }
> +    *res_compatible += 0;
> +}
> +
> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +
> +    qemu_mutex_lock(&migration->lock);
> +    ret = vfio_save_buffer(f, vbasedev);
> +    qemu_mutex_unlock(&migration->lock);
> +
> +    if (ret < 0) {
> +        error_report("vfio_save_buffer failed %s",
> +                     strerror(errno));
> +        return ret;
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return ret;
> +}
> +
> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_SAVING);
> +    if (ret) {
> +        error_report("Failed to set state STOP and SAVING");
> +        return ret;
> +    }
> +
> +    ret = vfio_save_device_config_state(f, opaque);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_update_pending(vbasedev);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    while (migration->pending_bytes > 0) {
> +        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +        ret = vfio_save_buffer(f, vbasedev);
> +        if (ret < 0) {
> +            error_report("Failed to save buffer");
> +            return ret;
> +        } else if (ret == 0) {
> +            break;
> +        }
> +
> +        ret = vfio_update_pending(vbasedev);
> +        if (ret) {
> +            return ret;
> +        }
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOPPED);
> +    if (ret) {
> +        error_report("Failed to set state STOPPED");
> +        return ret;
> +    }
> +    return ret;
> +}
> +
>  static SaveVMHandlers savevm_vfio_handlers = {
>      .save_setup = vfio_save_setup,
>      .save_cleanup = vfio_save_cleanup,
> +    .save_live_pending = vfio_save_pending,
> +    .save_live_iterate = vfio_save_iterate,
> +    .save_live_complete_precopy = vfio_save_complete_precopy,
>  };
>  
>  /* ---------------------------------------------------------------------- */
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Qemu-devel] [PATCH v4 09/13] vfio: Add load state functions to SaveVMHandlers
  2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 09/13] vfio: Add load " Kirti Wankhede
@ 2019-06-28  9:18   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 64+ messages in thread
From: Dr. David Alan Gilbert @ 2019-06-28  9:18 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	yulei.zhang, cohuck, shuangtai.tst, qemu-devel, zhi.a.wang,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, yan.y.zhao, changpeng.liu, Ken.Xue

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> During _RESUMING device state:
> - If the vendor driver defines a mappable region, mmap the migration region.
> - Load config state.
> - For each data packet, until VFIO_MIG_FLAG_END_OF_STATE is reached:
>     - read data_size from the packet, read a buffer of data_size
>     - read data_offset, which is where QEMU should write the data;
>         if the region is mmaped, write data of data_size to the mmaped region.
>     - write data_size.
>         In the case of an mmapped region, the write to data_size indicates to
>         the kernel driver that the data has been written to the staging buffer.
>     - if the region is trapped, pwrite() data of data_size at data_offset.
> - Repeat the above until VFIO_MIG_FLAG_END_OF_STATE.
> - Unmap the migration region.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 153 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 0a2f30872316..e4895f91761d 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -212,6 +212,22 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
>      return qemu_file_get_error(f);
>  }
>  
> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +        vfio_pci_load_config(vbasedev, f);
> +    }
> +
> +    if (qemu_get_be64(f) != VFIO_MIG_FLAG_END_OF_STATE) {
> +        error_report("Wrong end of block while loading device config space");

You might find it useful to print the value in the error.
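
e.g. a sketch:

    uint64_t data = qemu_get_be64(f);

    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
        error_report("Wrong end of block while loading device config space "
                     "0x%"PRIx64, data);
        return -EINVAL;
    }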

> +        return -EINVAL;
> +    }
> +
> +    return qemu_file_get_error(f);
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -372,12 +388,149 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>      return ret;
>  }
>  
> +static int vfio_load_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret = 0;
> +
> +    if (migration->region.buffer.mmaps) {
> +        ret = vfio_region_mmap(&migration->region.buffer);
> +        if (ret) {
> +            error_report("Failed to mmap VFIO migration region %d: %s",
> +                         migration->region.index, strerror(-ret));
> +            return ret;
> +        }
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING);
> +    if (ret) {
> +        error_report("Failed to set state RESUMING");
> +    }
> +    return ret;
> +}
> +
> +static int vfio_load_cleanup(void *opaque)
> +{
> +    vfio_save_cleanup(opaque);
> +    return 0;
> +}
> +
> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret = 0;
> +    uint64_t data, data_size;
> +
> +    data = qemu_get_be64(f);
> +    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
> +        switch (data) {
> +        case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
> +        {
> +            ret = vfio_load_device_config_state(f, opaque);
> +            if (ret) {
> +                return ret;
> +            }
> +            break;
> +        }
> +        case VFIO_MIG_FLAG_DEV_SETUP_STATE:
> +        {
> +            data = qemu_get_be64(f);
> +            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
> +                return ret;
> +            } else {
> +                error_report("SETUP STATE: EOS not found 0x%lx", data);

Please use PRIx64 for uint64_t (I know it's painful, but it's portable).
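
i.e. something like:

    error_report("SETUP STATE: EOS not found 0x%"PRIx64, data);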

> +                return -EINVAL;
> +            }
> +            break;
> +        }
> +        case VFIO_MIG_FLAG_DEV_DATA_STATE:
> +        {
> +            VFIORegion *region = &migration->region.buffer;
> +            void *buf = NULL;
> +            bool buffer_mmaped = false;
> +            uint64_t data_offset = 0;
> +
> +            data_size = qemu_get_be64(f);
> +            if (data_size == 0) {
> +                break;
> +            }
> +
> +            ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> +                        region->fd_offset +
> +                        offsetof(struct vfio_device_migration_info,
> +                        data_offset));
> +            if (ret != sizeof(data_offset)) {
> +                error_report("Failed to get migration buffer data offset %d",
> +                              ret);
> +                return -EINVAL;
> +            }
> +
> +            if (region->mmaps) {
> +                int i;
> +
> +                for (i = 0; i < region->nr_mmaps; i++) {
> +                    if (region->mmaps[i].mmap &&
> +                        (data_offset >= region->mmaps[i].offset) &&
> +                        (data_offset < region->mmaps[i].offset +
> +                                       region->mmaps[i].size)) {
> +                        buf = region->mmaps[i].mmap + (data_offset -
> +                                                      region->mmaps[i].offset);

What checks that the read data_size fits in the region?
[Treat the incoming stream as broken/malicious until proven otherwise]

> +                        buffer_mmaped = true;
> +                        break;
> +                    }
> +                }

I think you had some code like this on the save side?  Perhaps a
'find_region' helper would be good?
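
Together with the bounds question above, a hypothetical find_region-style
helper might look like this; it returns a pointer only when the whole
[data_offset, data_offset + data_size) range fits inside one mmap:

    static void *vfio_region_find_mmap(VFIORegion *region,
                                       uint64_t data_offset,
                                       uint64_t data_size)
    {
        int i;

        for (i = 0; i < region->nr_mmaps; i++) {
            if (region->mmaps[i].mmap &&
                data_offset >= region->mmaps[i].offset &&
                data_offset + data_size <=
                    region->mmaps[i].offset + region->mmaps[i].size) {
                return region->mmaps[i].mmap +
                       (data_offset - region->mmaps[i].offset);
            }
        }
        return NULL;  /* not fully covered by any mmap: fall back to pread */
    }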

> +            }
> +
> +            if (!buffer_mmaped) {
> +                buf = g_malloc0(data_size);
> +            }

Given that data_size could be quite large, you might want to use
g_try_malloc0 and check the return value;  especially since 'data_size'
is read from the stream.
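
For instance (a sketch):

    buf = g_try_malloc0(data_size);
    if (!buf) {
        error_report("Failed to allocate %"PRIu64" bytes for migration data",
                     data_size);
        return -ENOMEM;
    }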

> +            qemu_get_buffer(f, buf, data_size);
> +
> +            ret = pwrite(vbasedev->fd, &data_size, sizeof(data_size),
> +                         region->fd_offset +
> +                       offsetof(struct vfio_device_migration_info, data_size));
> +            if (ret != sizeof(data_size)) {
> +                error_report("Failed to set migration buffer data size %d",
> +                             ret);
> +                return -EINVAL;
> +            }
> +
> +            if (!buffer_mmaped) {
> +                ret = pwrite(vbasedev->fd, buf, data_size,
> +                             region->fd_offset + data_offset);
> +                g_free(buf);
> +
> +                if (ret != data_size) {
> +                    error_report("Failed to set migration buffer %d", ret);
> +                    return -EINVAL;
> +                }
> +            }
> +            break;
> +        }
> +        }
> +
> +        ret = qemu_file_get_error(f);
> +        if (ret) {
> +            return ret;
> +        }
> +        data = qemu_get_be64(f);
> +    }
> +
> +    return ret;
> +}
> +
>  static SaveVMHandlers savevm_vfio_handlers = {
>      .save_setup = vfio_save_setup,
>      .save_cleanup = vfio_save_cleanup,
>      .save_live_pending = vfio_save_pending,
>      .save_live_iterate = vfio_save_iterate,
>      .save_live_complete_precopy = vfio_save_complete_precopy,
> +    .load_setup = vfio_load_setup,
> +    .load_cleanup = vfio_load_cleanup,
> +    .load_state = vfio_load_state,
>  };
>  
>  /* ---------------------------------------------------------------------- */
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device
  2019-06-26  0:43             ` Yan Zhao
@ 2019-06-28  9:44               ` Dr. David Alan Gilbert
  2019-06-28 21:28                 ` Yan Zhao
  0 siblings, 1 reply; 64+ messages in thread
From: Dr. David Alan Gilbert @ 2019-06-28  9:44 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Zhengxiao.zx@Alibaba-inc.com, Tian, Kevin, Liu, Yi L, cjia,
	eskultet, Yang, Ziye, qemu-devel, cohuck, shuangtai.tst,
	alex.williamson, Wang, Zhi A, mlevitsk, pasic, aik,
	Kirti Wankhede, eauger, felipe, jonathan.davies, Liu, Changpeng,
	Ken.Xue

* Yan Zhao (yan.y.zhao@intel.com) wrote:
> On Tue, Jun 25, 2019 at 03:00:24AM +0800, Dr. David Alan Gilbert wrote:
> > * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> > > 
> > > 
> > > On 6/21/2019 2:16 PM, Yan Zhao wrote:
> > > > On Fri, Jun 21, 2019 at 04:02:50PM +0800, Kirti Wankhede wrote:
> > > >>
> > > >>
> > > >> On 6/21/2019 6:54 AM, Yan Zhao wrote:
> > > >>> On Fri, Jun 21, 2019 at 08:25:18AM +0800, Yan Zhao wrote:
> > > >>>> On Thu, Jun 20, 2019 at 10:37:28PM +0800, Kirti Wankhede wrote:
> > > >>>>> Add migration support for VFIO device
> > > >>>>>
> > > >>>>> This Patch set include patches as below:
> > > >>>>> - Define KABI for VFIO device for migration support.
> > > >>>>> - Added save and restore functions for PCI configuration space
> > > >>>>> - Generic migration functionality for VFIO device.
> > > >>>>>   * This patch set adds functionality only for PCI devices, but can be
> > > >>>>>     extended to other VFIO devices.
> > > >>>>>   * Added all the basic functions required for pre-copy, stop-and-copy and
> > > >>>>>     resume phases of migration.
> > > >>>>>   * Added state change notifier and from that notifier function, VFIO
> > > >>>>>     device's state changed is conveyed to VFIO device driver.
> > > >>>>>   * During save setup phase and resume/load setup phase, migration region
> > > >>>>>     is queried and is used to read/write VFIO device data.
> > > >>>>>   * .save_live_pending and .save_live_iterate are implemented to use QEMU's
> > > >>>>>     functionality of iteration during pre-copy phase.
> > > >>>>>   * In .save_live_complete_precopy, that is in stop-and-copy phase,
> > > >>>>>     iteration to read data from VFIO device driver is implemented till pending
> > > >>>>>     bytes returned by driver are not zero.
> > > >>>>>   * Added function to get dirty pages bitmap for the pages which are used by
> > > >>>>>     driver.
> > > >>>>> - Add vfio_listerner_log_sync to mark dirty pages.
> > > >>>>> - Make VFIO PCI device migration capable. If migration region is not provided by
> > > >>>>>   driver, migration is blocked.
> > > >>>>>
> > > >>>>> Below is the flow of state change for live migration where states in brackets
> > > >>>>> represent VM state, migration state and VFIO device state as:
> > > >>>>>     (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)
> > > >>>>>
> > > >>>>> Live migration save path:
> > > >>>>>         QEMU normal running state
> > > >>>>>         (RUNNING, _NONE, _RUNNING)
> > > >>>>>                         |
> > > >>>>>     migrate_init spawns migration_thread.
> > > >>>>>     (RUNNING, _SETUP, _RUNNING|_SAVING)
> > > >>>>>     Migration thread then calls each device's .save_setup()
> > > >>>>>                         |
> > > >>>>>     (RUNNING, _ACTIVE, _RUNNING|_SAVING)
> > > >>>>>     If device is active, get pending bytes by .save_live_pending()
> > > >>>>>     if pending bytes >= threshold_size,  call save_live_iterate()
> > > >>>>>     Data of VFIO device for pre-copy phase is copied.
> > > >>>>>     Iterate till pending bytes converge and are less than threshold
> > > >>>>>                         |
> > > >>>>>     On migration completion, vCPUs stops and calls .save_live_complete_precopy
> > > >>>>>     for each active device. VFIO device is then transitioned in
> > > >>>>>      _SAVING state.
> > > >>>>>     (FINISH_MIGRATE, _DEVICE, _SAVING)
> > > >>>>>     For VFIO device, iterate in  .save_live_complete_precopy  until
> > > >>>>>     pending data is 0.
> > > >>>>>     (FINISH_MIGRATE, _DEVICE, _STOPPED)
> > > >>>>
> > > >>>> I suggest we also register a VMStateDescription, whose .pre_save
> > > >>>> handler would get called after .save_live_complete_precopy in the pre-copy
> > > >>>> only case, and will be called before .save_live_iterate in the post-copy
> > > >>>> enabled case.
> > > >>>> In the .pre_save handler, we can save all device state which must be
> > > >>>> copied after the device stops in the source vm and before the device
> > > >>>> starts in the target vm.
> > > >>>>
> > > >>> hi
> > > >>> to better describe this idea:
> > > >>>
> > > >>> in pre-copy only case, the flow is
> > > >>>
> > > >>> start migration --> .save_live_iterate (several rounds) --> stop source vm
> > > >>> --> .save_live_complete_precopy --> .pre_save --> start target vm
> > > >>> --> migration complete
> > > >>>
> > > >>>
> > > >>> in post-copy enabled case, the flow is
> > > >>>
> > > >>> start migration --> .save_live_iterate (several rounds) --> start post copy -->
> > > >>> stop source vm --> .pre_save --> start target vm --> .save_live_iterate (several rounds)
> > > >>> --> migration complete
> > > >>>
> > > >>> Therefore, we should put the saving of device state in the .pre_save
> > > >>> interface rather than in .save_live_complete_precopy.
> > > >>> The device state includes pci config data, page tables, register state, etc.
> > > >>>
> > > >>> The .save_live_iterate and .save_live_complete_precopy should only deal
> > > >>> with saving dirty memory.
> > > >>>
> > > >>
> > > >> The vendor driver can decide when to save device state depending on the VFIO
> > > >> device state set by the user. The vendor driver doesn't have to depend on
> > > >> which callback function QEMU or the user application calls. In the pre-copy
> > > >> case, save_live_complete_precopy sets the VFIO device state to
> > > >> VFIO_DEVICE_STATE_SAVING, which means the vCPUs are stopped and the vendor
> > > >> driver should save all device state.
> > > >>
> > > > When post-copy stops the vCPUs and the vfio device, the vendor driver only
> > > > needs to provide device state. But how does the vendor driver know that, if
> > > > no extra interface or extra device state is provided?
> > > > 
> > > 
> > > .save_live_complete_postcopy interface for post-copy will get called,
> > > right?
> > 
> > That happens at the very end; I think the question here is for something
> > that gets called at the point we stop iteratively sending RAM, send the
> > device states and then start sending RAM on demand to the destination
> > as it's running. Typically we send a small set of device state
> > (registers etc) at this point.
> > 
> > I guess there are two different postcopy cases that we need to think
> > about:
> >   a) Where the VFIO device doesn't support postcopy - it just gets
> >   migrated like any other device, so all its RAM must get sent
> >   before we flip into postcopy mode.
> > 
> >   b) Where the VFIO device does support postcopy - where the pages
> >   get sent on demand.
> > 
> > (b) may be tricky depending on whether your hardware can fault
> > on pages of your RAM that are needed but not yet transferred; but
> > if you can, that would make life a lot more practical on really
> > big VFIO devices.
> > 
> > Dave
> >
> hi Dave,
> so do you think it is good to abstract device state data and save it in
> .pre_save callback?

I'm not sure we have a vmsd/pre_save in this setup?  If we did then it's
a bit confusing because I don't think we have any other iterative device
that also has a vmsd.

I'd have to test it, but I think you might get the devices
->save_live_complete_precopy called at the right point just before
postcopy switchover.  It's worth looking.

Dave

> Thanks
> Yan
> 
> > > Thanks,
> > > Kirti
> > > 
> > > >>>
> > > >>> I know the current implementation does not support post-copy, but at least
> > > >>> it should not require a huge change when we decide to enable it in the future.
> > > >>>
> > > >>
> > > >> .has_postcopy and .save_live_complete_postcopy need to be implemented to
> > > >> support post-copy. I think .save_live_complete_postcopy should be
> > > >> similar to vfio_save_complete_precopy.
> > > >>
> > > >> Thanks,
> > > >> Kirti
> > > >>
> > > >>> Thanks
> > > >>> Yan
> > > >>>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Qemu-devel] [PATCH v4 08/13] vfio: Add save state functions to SaveVMHandlers
  2019-06-28  8:50       ` Dr. David Alan Gilbert
@ 2019-06-28 21:16         ` Yan Zhao
  0 siblings, 0 replies; 64+ messages in thread
From: Yan Zhao @ 2019-06-28 21:16 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: pasic, Tian, Kevin, Liu, Yi L, cjia, Ken.Xue, eskultet, Yang,
	Ziye, qemu-devel, Zhengxiao.zx@Alibaba-inc.com, shuangtai.tst,
	alex.williamson, mlevitsk, yulei.zhang, aik, Kirti Wankhede,
	eauger, cohuck, jonathan.davies, felipe, Liu, Changpeng, Wang,
	Zhi A

On Fri, Jun 28, 2019 at 04:50:30PM +0800, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > On Fri, Jun 21, 2019 at 08:31:53AM +0800, Yan Zhao wrote:
> > > On Thu, Jun 20, 2019 at 10:37:36PM +0800, Kirti Wankhede wrote:
> > > > Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
> > > > functions. These functions handle the pre-copy and stop-and-copy phases.
> > > > 
> > > > In _SAVING|_RUNNING device state or pre-copy phase:
> > > > - read pending_bytes
> > > > - read data_offset - indicates to the kernel driver to write data to the
> > > >   staging buffer, which is mmapped.
> > > > - read data_size - amount of data in bytes written by vendor driver in migration
> > > >   region.
> > > > - if data section is trapped, pread() number of bytes in data_size, from
> > > >   data_offset.
> > > > - if data section is mmaped, read mmaped buffer of size data_size.
> > > > - Write data packet to file stream as below:
> > > > {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> > > > VFIO_MIG_FLAG_END_OF_STATE }
> > > > 
> > > > In _SAVING device state or stop-and-copy phase
> > > > a. read config space of device and save to migration file stream. This
> > > >    doesn't need to be from vendor driver. Any other special config state
> > > >    from driver can be saved as data in following iteration.
> > > > b. read pending_bytes - indicates to the kernel driver to write data to
> > > >    the staging buffer, which is mmapped.
> > > > c. read data_size - amount of data in bytes written by vendor driver in
> > > >    migration region.
> > > > d. if data section is trapped, pread() from data_offset of size data_size.
> > > > e. if data section is mmaped, read mmaped buffer of size data_size.
> > > > f. Write data packet as below:
> > > >    {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> > > > g. iterate through steps b to f while (pending_bytes > 0)
> > > > h. Write {VFIO_MIG_FLAG_END_OF_STATE}
> > > > 
> > > > .save_live_iterate runs outside the iothread lock in the migration case, which
> > > > could race with an asynchronous call to get the dirty page list, causing data
> > > > corruption in the mapped migration region. A mutex is added here to serialize
> > > > migration buffer read operations.
> > > > 
> > > > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > > > ---
> > > >  hw/vfio/migration.c | 212 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 212 insertions(+)
> > > > 
> > > > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > > > index fe0887c27664..0a2f30872316 100644
> > > > --- a/hw/vfio/migration.c
> > > > +++ b/hw/vfio/migration.c
> > > > @@ -107,6 +107,111 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
> > > >      return 0;
> > > >  }
> > > >  
> > > > +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
> > > > +{
> > > > +    VFIOMigration *migration = vbasedev->migration;
> > > > +    VFIORegion *region = &migration->region.buffer;
> > > > +    uint64_t data_offset = 0, data_size = 0;
> > > > +    int ret;
> > > > +
> > > > +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> > > > +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> > > > +                                             data_offset));
> > > > +    if (ret != sizeof(data_offset)) {
> > > > +        error_report("Failed to get migration buffer data offset %d",
> > > > +                     ret);
> > > > +        return -EINVAL;
> > > > +    }
> > > > +
> > > > +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
> > > > +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> > > > +                                             data_size));
> > > > +    if (ret != sizeof(data_size)) {
> > > > +        error_report("Failed to get migration buffer data size %d",
> > > > +                     ret);
> > > > +        return -EINVAL;
> > > > +    }
> > > > +
> > > How big is the data_size?
> > > If this size is too big, it may take too much time and block others.
> > > 
> > > > +    if (data_size > 0) {
> > > > +        void *buf = NULL;
> > > > +        bool buffer_mmaped = false;
> > > > +
> > > > +        if (region->mmaps) {
> > > > +            int i;
> > > > +
> > > > +            for (i = 0; i < region->nr_mmaps; i++) {
> > > > +                if ((data_offset >= region->mmaps[i].offset) &&
> > > > +                    (data_offset < region->mmaps[i].offset +
> > > > +                                   region->mmaps[i].size)) {
> > > > +                    buf = region->mmaps[i].mmap + (data_offset -
> > > > +                                                   region->mmaps[i].offset);
> > > > +                    buffer_mmaped = true;
> > > > +                    break;
> > > > +                }
> > > > +            }
> > > > +        }
> > > > +
> > > > +        if (!buffer_mmaped) {
> > > > +            buf = g_malloc0(data_size);
> > > > +            ret = pread(vbasedev->fd, buf, data_size,
> > > > +                        region->fd_offset + data_offset);
> > > > +            if (ret != data_size) {
> > > > +                error_report("Failed to get migration data %d", ret);
> > > > +                g_free(buf);
> > > > +                return -EINVAL;
> > > > +            }
> > > > +        }
> > > > +
> > > > +        qemu_put_be64(f, data_size);
> > > > +        qemu_put_buffer(f, buf, data_size);
> > > > +
> > > > +        if (!buffer_mmaped) {
> > > > +            g_free(buf);
> > > > +        }
> > > > +        migration->pending_bytes -= data_size;
> > > > +    } else {
> > > > +        qemu_put_be64(f, data_size);
> > > > +    }
> > > > +
> > > > +    ret = qemu_file_get_error(f);
> > > > +
> > > > +    return data_size;
> > > > +}
> > > > +
> > > > +static int vfio_update_pending(VFIODevice *vbasedev)
> > > > +{
> > > > +    VFIOMigration *migration = vbasedev->migration;
> > > > +    VFIORegion *region = &migration->region.buffer;
> > > > +    uint64_t pending_bytes = 0;
> > > > +    int ret;
> > > > +
> > > > +    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
> > > > +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> > > > +                                             pending_bytes));
> > > > +    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
> > > > +        error_report("Failed to get pending bytes %d", ret);
> > > > +        migration->pending_bytes = 0;
> > > > +        return (ret < 0) ? ret : -EINVAL;
> > > > +    }
> > > > +
> > > > +    migration->pending_bytes = pending_bytes;
> > > > +    return 0;
> > > > +}
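
The offsets used by the pread() calls above come from offsetof() on the
migration region header. A simplified sketch of that header follows; the
authoritative layout is in the KABI patch of this series, so the field order
shown here is an assumption:

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch of struct vfio_device_migration_info; field order is an
     * assumption based on the fields this patch accesses. */
    struct vfio_device_migration_info_sketch {
        uint32_t device_state;   /* _RUNNING / _SAVING / _RESUMING bits */
        uint32_t reserved;
        uint64_t pending_bytes;  /* device data still to be read */
        uint64_t data_offset;    /* offset of the data window in the region */
        uint64_t data_size;      /* bytes currently available to read */
    };

    /* region->fd_offset + offsetof(...) then yields the file offset
     * polled by vfio_update_pending() and vfio_save_buffer(). */
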
> > > > +
> > > > +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> > > > +{
> > > > +    VFIODevice *vbasedev = opaque;
> > > > +
> > > > +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
> > > > +
> > > > +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> > > > +        vfio_pci_save_config(vbasedev, f);
> > > > +    }
> > > > +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> > > > +
> > > > +    return qemu_file_get_error(f);
> > > > +}
> > > > +
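
The matching load side would walk the same flag framing. A sketch, assuming
QEMU's QEMUFile API; vfio_pci_load_config() is an assumed counterpart of
vfio_pci_save_config() from the PCI save/load patch:

    /* Sketch of the load-side framing loop. */
    static int vfio_load_state_sketch(QEMUFile *f, VFIODevice *vbasedev)
    {
        uint64_t data = qemu_get_be64(f);

        while (data != VFIO_MIG_FLAG_END_OF_STATE) {
            switch (data) {
            case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
                vfio_pci_load_config(vbasedev, f);  /* assumed helper */
                break;
            default:
                return -EINVAL;                     /* unknown section */
            }
            data = qemu_get_be64(f);
        }
        return qemu_file_get_error(f);
    }
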
> > > >  /* ---------------------------------------------------------------------- */
> > > >  
> > > >  static int vfio_save_setup(QEMUFile *f, void *opaque)
> > > > @@ -163,9 +268,116 @@ static void vfio_save_cleanup(void *opaque)
> > > >      }
> > > >  }
> > > >  
> > > > +static void vfio_save_pending(QEMUFile *f, void *opaque,
> > > > +                              uint64_t threshold_size,
> > > > +                              uint64_t *res_precopy_only,
> > > > +                              uint64_t *res_compatible,
> > > > +                              uint64_t *res_postcopy_only)
> > > > +{
> > > > +    VFIODevice *vbasedev = opaque;
> > > > +    VFIOMigration *migration = vbasedev->migration;
> > > > +    int ret;
> > > > +
> > > > +    ret = vfio_update_pending(vbasedev);
> > > > +    if (ret) {
> > > > +        return;
> > > > +    }
> > > > +
> > > > +    if (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING) {
> > > > +        *res_precopy_only += migration->pending_bytes;
> > > > +    } else {
> > > > +        *res_postcopy_only += migration->pending_bytes;
> > > > +    }
> > by definition,
> > - res_precopy_only is for data which must be migrated in the precopy phase
> >   or in the stopped state, in other words, before the target VM starts
> > - res_postcopy_only is for data which must be migrated in the postcopy phase
> >   or in the stopped state, in other words, after the source VM stops
> > So we can only determine the data type by the nature of the data, i.e.
> > if it is device state data which must be copied after the source VM stops
> > and before the target VM starts, it belongs to res_precopy_only.
> > 
> > It is not right to determine the data type from the current device state.
> 
> Right; you can determine it by whether postcopy is *enabled* or not.
> However, since this isn't ready for postcopy yet anyway, just add it to
> res_postcopy_only all the time;  then you can come back to postcopy
hi Dave,
do you mean "add it to res_precopy_only all the time" here?

Thanks
Yan


> later.
> 
> Dave
> 
> > Thanks
> > Yan
> > 
> > > > +    *res_compatible += 0;
> > > > +}
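
To make the classification above concrete: since VFIO device data must reach
the destination before the target VM starts, the suggestion amounts to always
accounting it as precopy-only, independent of the current device state. A
sketch of the proposed change (not the code as posted):

    /* Device data must be migrated before the target VM starts, so by
     * its nature it is precopy-only, whatever the device state is. */
    *res_precopy_only += migration->pending_bytes;
    *res_compatible += 0;
    *res_postcopy_only += 0;
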
> > > > +
> > > > +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> > > > +{
> > > > +    VFIODevice *vbasedev = opaque;
> > > > +    VFIOMigration *migration = vbasedev->migration;
> > > > +    int ret;
> > > > +
> > > > +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> > > > +
> > > > +    qemu_mutex_lock(&migration->lock);
> > > > +    ret = vfio_save_buffer(f, vbasedev);
> > > > +    qemu_mutex_unlock(&migration->lock);
> > > > +
> > > > +    if (ret < 0) {
> > > > +        error_report("vfio_save_buffer failed %s",
> > > > +                     strerror(-ret));
> > > > +        return ret;
> > > > +    }
> > > > +
> > > > +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> > > > +
> > > > +    ret = qemu_file_get_error(f);
> > > > +    if (ret) {
> > > > +        return ret;
> > > > +    }
> > > > +
> > > > +    return 0;
> > > > +}
> > > > +
> > > > +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> > > > +{
> > > > +    VFIODevice *vbasedev = opaque;
> > > > +    VFIOMigration *migration = vbasedev->migration;
> > > > +    int ret;
> > > > +
> > > > +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_SAVING);
> > > > +    if (ret) {
> > > > +        error_report("Failed to set state STOP and SAVING");
> > > > +        return ret;
> > > > +    }
> > > > +
> > > > +    ret = vfio_save_device_config_state(f, opaque);
> > > > +    if (ret) {
> > > > +        return ret;
> > > > +    }
> > > > +
> > > > +    ret = vfio_update_pending(vbasedev);
> > > > +    if (ret) {
> > > > +        return ret;
> > > > +    }
> > > > +
> > > > +    while (migration->pending_bytes > 0) {
> > > > +        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> > > > +        ret = vfio_save_buffer(f, vbasedev);
> > > > +        if (ret < 0) {
> > > > +            error_report("Failed to save buffer");
> > > > +            return ret;
> > > > +        } else if (ret == 0) {
> > > > +            break;
> > > > +        }
> > > > +
> > > > +        ret = vfio_update_pending(vbasedev);
> > > > +        if (ret) {
> > > > +            return ret;
> > > > +        }
> > > > +    }
> > > > +
> > > > +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> > > > +
> > > > +    ret = qemu_file_get_error(f);
> > > > +    if (ret) {
> > > > +        return ret;
> > > > +    }
> > > > +
> > > > +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOPPED);
> > > > +    if (ret) {
> > > > +        error_report("Failed to set state STOPPED");
> > > > +        return ret;
> > > > +    }
> > > > +    return ret;
> > > > +}
> > > > +
> > > >  static SaveVMHandlers savevm_vfio_handlers = {
> > > >      .save_setup = vfio_save_setup,
> > > >      .save_cleanup = vfio_save_cleanup,
> > > > +    .save_live_pending = vfio_save_pending,
> > > > +    .save_live_iterate = vfio_save_iterate,
> > > > +    .save_live_complete_precopy = vfio_save_complete_precopy,
> > > >  };
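
For context, these handlers are wired up when the device is realized. A
minimal sketch, assuming the register_savevm_live() prototype of this era;
the "vfio" idstr and the instance/version ids are assumptions:

    /* Sketch: register the iterative migration handlers. */
    register_savevm_live(NULL, "vfio", -1, 1,
                         &savevm_vfio_handlers, vbasedev);
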
> > > >  
> > > >  /* ---------------------------------------------------------------------- */
> > > > -- 
> > > > 2.7.0
> > > > 
> > > 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device
  2019-06-28  9:44               ` Dr. David Alan Gilbert
@ 2019-06-28 21:28                 ` Yan Zhao
  0 siblings, 0 replies; 64+ messages in thread
From: Yan Zhao @ 2019-06-28 21:28 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Zhengxiao.zx@Alibaba-inc.com, Tian, Kevin, Liu, Yi L, cjia,
	eskultet, Yang, Ziye, qemu-devel, cohuck, shuangtai.tst,
	alex.williamson, Wang, Zhi A, mlevitsk, pasic, aik,
	Kirti Wankhede, eauger, felipe, jonathan.davies, Liu, Changpeng,
	Ken.Xue

On Fri, Jun 28, 2019 at 05:44:47PM +0800, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > On Tue, Jun 25, 2019 at 03:00:24AM +0800, Dr. David Alan Gilbert wrote:
> > > * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> > > > 
> > > > 
> > > > On 6/21/2019 2:16 PM, Yan Zhao wrote:
> > > > > On Fri, Jun 21, 2019 at 04:02:50PM +0800, Kirti Wankhede wrote:
> > > > >>
> > > > >>
> > > > >> On 6/21/2019 6:54 AM, Yan Zhao wrote:
> > > > >>> On Fri, Jun 21, 2019 at 08:25:18AM +0800, Yan Zhao wrote:
> > > > >>>> On Thu, Jun 20, 2019 at 10:37:28PM +0800, Kirti Wankhede wrote:
> > > > >>>>> Add migration support for VFIO device
> > > > >>>>>
> > > > >>>>> [patch list snipped]
> > > > >>>>>
> > > > >>>>> Below is the flow of state change for live migration where states in brackets
> > > > >>>>> represent VM state, migration state and VFIO device state as:
> > > > >>>>>     (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)
> > > > >>>>>
> > > > >>>>> Live migration save path:
> > > > >>>>>         QEMU normal running state
> > > > >>>>>         (RUNNING, _NONE, _RUNNING)
> > > > >>>>>                         |
> > > > >>>>>     migrate_init spawns migration_thread.
> > > > >>>>>     (RUNNING, _SETUP, _RUNNING|_SAVING)
> > > > >>>>>     Migration thread then calls each device's .save_setup()
> > > > >>>>>                         |
> > > > >>>>>     (RUNNING, _ACTIVE, _RUNNING|_SAVING)
> > > > >>>>>     If device is active, get pending bytes by .save_live_pending()
> > > > >>>>>     if pending bytes >= threshold_size,  call save_live_iterate()
> > > > >>>>>     Data of VFIO device for pre-copy phase is copied.
> > > > >>>>>     Iterate till pending bytes converge and are less than threshold
> > > > >>>>>                         |
> > > > >>>>>     On migration completion, vCPUs stops and calls .save_live_complete_precopy
> > > > >>>>>     for each active device. VFIO device is then transitioned in
> > > > >>>>>      _SAVING state.
> > > > >>>>>     (FINISH_MIGRATE, _DEVICE, _SAVING)
> > > > >>>>>     For VFIO device, iterate in  .save_live_complete_precopy  until
> > > > >>>>>     pending data is 0.
> > > > >>>>>     (FINISH_MIGRATE, _DEVICE, _STOPPED)
> > > > >>>>
> > > > >>>> I suggest we also register a VMStateDescription, whose .pre_save
> > > > >>>> handler would get called after .save_live_complete_precopy in the
> > > > >>>> pre-copy only case, and before .save_live_iterate in the post-copy
> > > > >>>> enabled case.
> > > > >>>> In the .pre_save handler, we can save all device state which must be
> > > > >>>> copied after the device stops in the source VM and before the device
> > > > >>>> starts in the target VM.
> > > > >>>>
> > > > >>> hi
> > > > >>> to better describe this idea:
> > > > >>>
> > > > >>> in pre-copy only case, the flow is
> > > > >>>
> > > > >>> start migration --> .save_live_iterate (several rounds) --> stop source vm
> > > > >>> --> .save_live_complete_precopy --> .pre_save --> start target vm
> > > > >>> --> migration complete
> > > > >>>
> > > > >>>
> > > > >>> in post-copy enabled case, the flow is
> > > > >>>
> > > > >>> start migration --> .save_live_iterate (several rounds) --> start post-copy -->
> > > > >>> stop source vm --> .pre_save --> start target vm --> .save_live_iterate (several rounds)
> > > > >>> --> migration complete
> > > > >>>
> > > > >>> Therefore, we should put the saving of device state in the .pre_save
> > > > >>> interface rather than in .save_live_complete_precopy.
> > > > >>> The device state includes PCI config data, page tables, register state, etc.
> > > > >>>
> > > > >>> The .save_live_iterate and .save_live_complete_precopy should only deal
> > > > >>> with saving dirty memory.
> > > > >>>
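
To make this suggestion concrete, a minimal sketch of registering a
VMStateDescription alongside the SaveVMHandlers, assuming QEMU's vmstate API;
the names and the empty field list are illustrative:

    /* .pre_save runs once the source VM is stopped, in both the
     * precopy and postcopy flows, which is where stop-time device
     * state (registers, page tables, ...) would be saved. */
    static int vfio_mig_pre_save(void *opaque)
    {
        /* read stop-time state from the vendor driver here */
        return 0;
    }

    static const VMStateDescription vmstate_vfio_device = {
        .name = "vfio-device",
        .version_id = 1,
        .minimum_version_id = 1,
        .pre_save = vfio_mig_pre_save,
        .fields = (VMStateField[]) {
            VMSTATE_END_OF_LIST()
        },
    };
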
> > > > >>
> > > > >> The vendor driver can decide when to save device state depending on the
> > > > >> VFIO device state set by the user. The vendor driver doesn't have to
> > > > >> depend on which callback function QEMU or the user application calls.
> > > > >> In the pre-copy case, save_live_complete_precopy sets the VFIO device
> > > > >> state to VFIO_DEVICE_STATE_SAVING, which means the vCPUs are stopped
> > > > >> and the vendor driver should save all device state.
> > > > >>
> > > > > When post-copy stops the vCPUs and the VFIO device, the vendor driver
> > > > > only needs to provide device state. But how does the vendor driver know
> > > > > that, if no extra interface or no extra device state is provided?
> > > > > 
> > > > 
> > > > .save_live_complete_postcopy interface for post-copy will get called,
> > > > right?
> > > 
> > > That happens at the very end; I think the question here is about something
> > > that gets called at the point we stop iteratively sending RAM, send the
> > > device states and then start sending RAM on demand to the destination
> > > as it's running. Typically we send a small set of device state
> > > (registers etc) at this point.
> > > 
> > > I guess there's two different postcopy cases that we need to think
> > > about:
> > >   a) Where the VFIO device doesn't support postcopy - it just gets
> > >   migrated like any other device, so all its RAM must get sent
> > >   before we flip into postcopy mode.
> > > 
> > >   b) Where the VFIO device does support postcopy - where the pages
> > >   get sent on demand.
> > > 
> > > (b) may be tricky depending on whether your hardware can fault
> > > on pages of your RAM that are needed but not yet transferred; but
> > > if it can, that would make life a lot more practical on really
> > > big VFIO devices.
> > > 
> > > Dave
> > >
> > hi Dave,
> > so do you think it is a good idea to abstract the device state data and
> > save it in the .pre_save callback?
> 
> I'm not sure we have a vmsd/pre_save in this setup. If we did, then it's
> a bit confusing, because I don't think we have any other iterative device
> that also has a vmsd.
Yes, I tried it. It's OK to register SaveVMHandlers and a VMStateDescription
at the same time.

> 
> I'd have to test it, but I think you might get the device's
> ->save_live_complete_precopy called at the right point just before
> postcopy switchover.  It's worth looking.
> 
If an iterative device supports postcopy, then its save_live_complete_precopy
would not get called before the postcopy switchover.
However, postcopy may need to save device-state-only data (not memory) at that
time. That's the reason I think we should also register a
VMStateDescription, as its .pre_save handler would get called
at that time.

Thanks
Yan

> Dave
> 
> > Thanks
> > Yan
> > 
> > > > Thanks,
> > > > Kirti
> > > > 
> > > > >>>
> > > > >>> I know current implementation does not support post-copy. but at least
> > > > >>> it should not require huge change when we decide to enable it in future.
> > > > >>>
> > > > >>
> > > > >> .has_postcopy and .save_live_complete_postcopy need to be implemented to
> > > > >> support post-copy. I think .save_live_complete_postcopy should be
> > > > >> similar to vfio_save_complete_precopy.
> > > > >>
> > > > >> Thanks,
> > > > >> Kirti
> > > > >>
> > > > >>> Thanks
> > > > >>> Yan
> > > > >>>
> > > --
> > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



end of thread, other threads:[~2019-06-28 21:36 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-20 14:37 [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device Kirti Wankhede
2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 01/13] vfio: KABI for migration interface Kirti Wankhede
2019-06-20 17:18   ` Alex Williamson
2019-06-21  5:52     ` Kirti Wankhede
2019-06-21 15:03       ` Alex Williamson
2019-06-21 19:35         ` Kirti Wankhede
2019-06-21 20:00           ` Alex Williamson
2019-06-21 20:30             ` Kirti Wankhede
2019-06-21 22:01               ` Alex Williamson
2019-06-24 15:00                 ` Kirti Wankhede
2019-06-24 15:25                   ` Alex Williamson
2019-06-24 18:52                     ` Kirti Wankhede
2019-06-24 19:01                       ` Alex Williamson
2019-06-25 15:20                         ` Kirti Wankhede
2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 02/13] vfio: Add function to unmap VFIO region Kirti Wankhede
2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 03/13] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
2019-06-21  0:12   ` Yan Zhao
2019-06-21  6:44     ` Kirti Wankhede
2019-06-21  7:50       ` Yan Zhao
2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 04/13] vfio: Add migration region initialization and finalize function Kirti Wankhede
2019-06-24 14:00   ` Cornelia Huck
2019-06-27 14:56     ` Kirti Wankhede
2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 05/13] vfio: Add VM state change handler to know state of VM Kirti Wankhede
2019-06-25 10:29   ` Dr. David Alan Gilbert
2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 06/13] vfio: Add migration state change notifier Kirti Wankhede
2019-06-27 10:33   ` Dr. David Alan Gilbert
2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 07/13] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
2019-06-27 10:01   ` Dr. David Alan Gilbert
2019-06-27 14:31     ` Kirti Wankhede
2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 08/13] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
2019-06-20 19:25   ` Alex Williamson
2019-06-21  6:38     ` Kirti Wankhede
2019-06-21 15:16       ` Alex Williamson
2019-06-21 19:38         ` Kirti Wankhede
2019-06-21 20:02           ` Alex Williamson
2019-06-21 20:07             ` Kirti Wankhede
2019-06-21 20:32               ` Alex Williamson
2019-06-21 21:05                 ` Kirti Wankhede
2019-06-21 22:13                   ` Alex Williamson
2019-06-24 14:31                     ` Kirti Wankhede
2019-06-21  0:31   ` Yan Zhao
2019-06-25  3:30     ` Yan Zhao
2019-06-28  8:50       ` Dr. David Alan Gilbert
2019-06-28 21:16         ` Yan Zhao
2019-06-28  9:09   ` Dr. David Alan Gilbert
2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 09/13] vfio: Add load " Kirti Wankhede
2019-06-28  9:18   ` Dr. David Alan Gilbert
2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 10/13] vfio: Add function to get dirty page list Kirti Wankhede
2019-06-26  0:40   ` Yan Zhao
2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 11/13] vfio: Add vfio_listerner_log_sync to mark dirty pages Kirti Wankhede
2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 12/13] vfio: Make vfio-pci device migration capable Kirti Wankhede
2019-06-20 14:37 ` [Qemu-devel] [PATCH v4 13/13] vfio: Add trace events in migration code path Kirti Wankhede
2019-06-20 18:50   ` Dr. David Alan Gilbert
2019-06-21  5:54     ` Kirti Wankhede
2019-06-21  0:25 ` [Qemu-devel] [PATCH v4 00/13] Add migration support for VFIO device Yan Zhao
2019-06-21  1:24   ` Yan Zhao
2019-06-21  8:02     ` Kirti Wankhede
2019-06-21  8:46       ` Yan Zhao
2019-06-21  9:22         ` Kirti Wankhede
2019-06-21 10:45           ` Yan Zhao
2019-06-24 19:00           ` Dr. David Alan Gilbert
2019-06-26  0:43             ` Yan Zhao
2019-06-28  9:44               ` Dr. David Alan Gilbert
2019-06-28 21:28                 ` Yan Zhao
