* [Qemu-devel] [PATCH v3 0/5] Add migration support for VFIO device
@ 2019-02-19 21:23 Kirti Wankhede
  2019-02-19 21:23 ` [Qemu-devel] [PATCH v3 1/5] VFIO KABI for migration interface Kirti Wankhede
                   ` (5 more replies)
  0 siblings, 6 replies; 17+ messages in thread
From: Kirti Wankhede @ 2019-02-19 21:23 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, yulei.zhang, qemu-devel, Kirti Wankhede

Add migration support for VFIO device

This patch set includes the following patches:
- Define KABI for VFIO device for migration support.
- Added save and restore functions for PCI configuration space
- Generic migration functionality for VFIO device.
  * This patch set adds functionality only for PCI devices, but can be
    extended to other VFIO devices.
  * Added all the basic functions required for pre-copy, stop-and-copy and
    resume phases of migration.
  * Added a state change notifier; from that notifier function, the VFIO
    device's state change is conveyed to the VFIO device driver.
  * During save setup phase and resume/load setup phase, migration region
    is queried and is used to read/write VFIO device data.
  * .save_live_pending and .save_live_iterate are implemented to use QEMU's
    functionality of iteration during pre-copy phase.
  * In .save_live_complete_precopy, that is, in the stop-and-copy phase,
    iteration to read data from the VFIO device driver is implemented until
    the pending bytes returned by the driver reach zero.
  * Added a function to get the dirty pages bitmap for the pages which are
    in use by the driver.
- Add vfio_listerner_log_sync to mark dirty pages.
- Make VFIO PCI device migration capable. If migration region is not provided by
  driver, migration is blocked.

Below is the flow of state change for live migration where states in brackets
represent VM state, migration state and VFIO device state as:
    (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)

Live migration save path:
        QEMU normal running state
        (RUNNING, _NONE, _RUNNING)
                        |
    migrate_init spawns migration_thread.
    (RUNNING, _SETUP, _RUNNING|_SAVING)
    Migration thread then calls each device's .save_setup()
                        |
    (RUNNING, _ACTIVE, _RUNNING|_SAVING)
    If the device is active, get pending bytes via .save_live_pending()
    If pending bytes >= threshold_size, call .save_live_iterate()
    VFIO device data for the pre-copy phase is copied.
    Iterate until pending bytes converge and fall below the threshold
                        |
    migration_completion() stops vCPUs and calls .save_live_complete_precopy
    for each active device. The VFIO device is then transitioned to the
    _SAVING state.
    (FINISH_MIGRATE, _DEVICE, _SAVING)
    For the VFIO device, iterate in .save_live_complete_precopy until
    pending data is 0.
    (FINISH_MIGRATE, _DEVICE, 0)
                        |
    (FINISH_MIGRATE, _COMPLETED, 0)
    Migration thread schedules the cleanup bottom half and exits

Live migration resume path:
    Incoming migration calls .load_setup for each device
    (RESTORE_VM, _ACTIVE, 0)
                        |
    For each device, .load_state is called for that device section data
                        |
    At the end, .load_cleanup is called for each device and vCPUs are started.
    (RUNNING, _NONE, _RUNNING)
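
For orientation, the flow above maps onto the SaveVMHandlers callbacks that
patch 3 of this series registers (restated here for illustration; the
callback names are QEMU's, the vfio_* implementations are added in patch 3):

    static SaveVMHandlers savevm_vfio_handlers = {
        .save_setup = vfio_save_setup,            /* (_SETUP) save setup   */
        .save_live_pending = vfio_save_pending,   /* query pending bytes   */
        .save_live_iterate = vfio_save_iterate,   /* pre-copy iterations   */
        .save_live_complete_precopy = vfio_save_complete_precopy,
                                                  /* stop-and-copy phase   */
        .load_state = vfio_load_state,            /* resume/load path      */
    };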

Note that:
- Migration post copy is not supported.
- VFIO device driver version compatibility is not addressed in this series.

v2 -> v3:
- Removed enum of VFIO device states. Defined VFIO device state with 2 bits.
- Re-structured vfio_device_migration_info to keep it minimal and defined the
  action taken on read and write access to its members.

v1 -> v2:
- Defined MIGRATION region type and sub-type which should be used with region
  type capability.
- Re-structured vfio_device_migration_info. This structure will be placed at 0th
  offset of migration region.
- Replaced ioctl with read/write for trapped part of migration region.
- Added both type of access support, trapped or mmapped, for data section of the
  region.
- Moved PCI device functions to pci file.
- Added iteration to get the dirty page bitmap until the bitmap for all
  requested pages is copied.

Thanks,
Kirti

Kirti Wankhede (5):
  VFIO KABI for migration interface
  Add save and load functions for VFIO PCI devices
  Add migration functions for VFIO devices
  Add vfio_listerner_log_sync to mark dirty pages
  Make vfio-pci device migration capable.

 hw/vfio/Makefile.objs         |   2 +-
 hw/vfio/common.c              |  31 ++
 hw/vfio/migration.c           | 714 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                 | 117 ++++++-
 hw/vfio/pci.h                 |  29 ++
 include/hw/vfio/vfio-common.h |  20 ++
 linux-headers/linux/vfio.h    |  65 ++++
 7 files changed, 971 insertions(+), 7 deletions(-)
 create mode 100644 hw/vfio/migration.c

-- 
2.7.0


* [Qemu-devel] [PATCH v3 1/5] VFIO KABI for migration interface
  2019-02-19 21:23 [Qemu-devel] [PATCH v3 0/5] Add migration support for VFIO device Kirti Wankhede
@ 2019-02-19 21:23 ` Kirti Wankhede
  2019-02-21 22:23   ` Alex Williamson
  2019-02-19 21:23 ` [Qemu-devel] [PATCH v3 2/5] Add save and load functions for VFIO PCI devices Kirti Wankhede
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Kirti Wankhede @ 2019-02-19 21:23 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, yulei.zhang, qemu-devel, Kirti Wankhede

- Defined MIGRATION region type and sub-type.
- Used 2 bits to define VFIO device states.
    Bit 0 => 0/1 => _STOPPED/_RUNNING
    Bit 1 => 0/1 => _RESUMING/_SAVING
    Combination of these bits defines VFIO device's state during migration
    _RUNNING => Normal VFIO device running state.
    _STOPPED => VFIO device stopped.
    _SAVING | _RUNNING => vCPUs are running and the VFIO device is running, but
                          the device state is being saved, i.e. pre-copy state
    _SAVING | _STOPPED => vCPUs are stopped, the VFIO device should be stopped
                          and its state saved, i.e. stop-and-copy state
    _RESUMING => VFIO device resuming state.
- Defined the vfio_device_migration_info structure, which will be placed at
  the 0th offset of the migration region to get/set VFIO device related
  information. Defined the members of the structure and their usage on
  read/write access:
    * device_state: (write only)
        To convey the VFIO device state to be transitioned to.
    * pending_bytes: (read only)
        To get the number of bytes yet to be migrated for the VFIO device.
    * data_offset: (read/write)
        To get or set the offset in the migration region from where data is
        read or written during the _SAVING and _RESUMING states.
    * data_size: (write only)
        To convey the size of the data copied into the migration region during
        the _RESUMING state.
    * start_pfn, page_size, total_pfns: (write only)
        To request the bitmap of dirty pages from the vendor driver for
        total_pfns starting from the given start address.
    * copied_pfns: (read only)
        To get the number of pfns for which the dirty bitmap was copied into
        the migration region. The vendor driver should copy the bitmap with
        bits set only for pages to be marked dirty in the migration region,
        and should return 0 if there are no dirty pages in the requested
        range.

Migration region looks like:
 ------------------------------------------------------------------
|vfio_device_migration_info|    data section                      |
|                          |     ///////////////////////////////  |
 ------------------------------------------------------------------
 ^                              ^                              ^
 offset 0-trapped part        data.offset                 data.size

The data section always follows the vfio_device_migration_info
structure in the region, so data.offset will always be non-0.
The offset from which data is copied is decided by the kernel
driver; the data section can be trapped or mmapped depending on how
the kernel driver defines it. If mmapped, data.offset should be page
aligned, whereas the initial section, which contains the
vfio_device_migration_info structure, might not end at a page-aligned
offset.
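
As an illustration, a user application drives the trapped part of the region
with pread/pwrite at the structure's field offsets. Below is a minimal
sketch, assuming the vfio_device_migration_info layout proposed in this
patch; device_fd and region_fd_offset are placeholder names for the VFIO
device fd and the region offset obtained via VFIO_DEVICE_GET_REGION_INFO:

    #include <stddef.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <linux/vfio.h>   /* patched header defining the structure */

    static int set_device_state(int device_fd, uint64_t region_fd_offset,
                                uint32_t state)
    {
        /* The write traps into the vendor driver, which performs the
         * requested state transition before the write returns. */
        ssize_t ret = pwrite(device_fd, &state, sizeof(state),
                             region_fd_offset +
                             offsetof(struct vfio_device_migration_info,
                                      device_state));
        return ret == sizeof(state) ? 0 : -1;
    }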

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 linux-headers/linux/vfio.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 12a7b1dc53c8..1b12a9b95e00 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -368,6 +368,71 @@ struct vfio_region_gfx_edid {
  */
 #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
 
+/* Migration region type and sub-type */
+#define VFIO_REGION_TYPE_MIGRATION	        (2)
+#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
+
+/**
+ * Structure vfio_device_migration_info is placed at the 0th offset of the
+ * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
+ * information. Field accesses from this structure are only supported at their
+ * native width and alignment; otherwise the access should return an error.
+ *
+ * device_state: (write only)
+ *      To indicate to the vendor driver the state the VFIO device should be
+ *      transitioned to. If the state transition fails, the write to this field
+ *      returns an error. It consists of 2 bits:
+ *      - If bit 0 is set, it indicates the _RUNNING state. When it is cleared,
+ *        that indicates the _STOPPED state. When the device is changed to
+ *        _STOPPED, the driver should stop the device before the write returns.
+ *      - If bit 1 is set, it indicates the _SAVING state. When it is cleared,
+ *        that indicates the _RESUMING state.
+ *
+ * pending_bytes: (read only)
+ *      Reads the number of bytes yet to be migrated from the vendor driver.
+ *
+ * data_offset: (read/write)
+ *      During the _SAVING state, the user application reads data_offset to
+ *      learn from where in the migration region to read data.
+ *      During the _RESUMING state, the user application writes data_offset to
+ *      convey where in the migration region it has written data.
+ *
+ * data_size: (write only)
+ *      The user application writes the size of the data copied into the
+ *      migration region during the _RESUMING state.
+ *
+ * start_pfn: (write only)
+ *      Start address pfn for which the bitmap of dirty pages is requested
+ *      from the vendor driver during the _SAVING state.
+ *
+ * page_size: (write only)
+ *      The user application writes the page size corresponding to the pfns.
+ *
+ * total_pfns: (write only)
+ *      Total pfn count from start_pfn for which dirty bitmap is requested.
+ *
+ * copied_pfns: (read only)
+ *      The pfn count for which the dirty bitmap was copied to the migration
+ *      region. The vendor driver should copy the bitmap with bits set only
+ *      for pages to be marked dirty in the migration region.
+ *      The vendor driver should return 0 if there are no dirty pages in the
+ *      requested range.
+ */
+
+struct vfio_device_migration_info {
+        __u32 device_state;         /* VFIO device state */
+#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
+#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
+        __u32 reserved;
+        __u64 pending_bytes;
+        __u64 data_offset;
+        __u64 data_size;
+        __u64 start_pfn;
+        __u64 page_size;
+        __u64 total_pfns;
+        __u64 copied_pfns;
+} __attribute__((packed));
+
 /*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within
-- 
2.7.0


* [Qemu-devel] [PATCH v3 2/5] Add save and load functions for VFIO PCI devices
  2019-02-19 21:23 [Qemu-devel] [PATCH v3 0/5] Add migration support for VFIO device Kirti Wankhede
  2019-02-19 21:23 ` [Qemu-devel] [PATCH v3 1/5] VFIO KABI for migration interface Kirti Wankhede
@ 2019-02-19 21:23 ` Kirti Wankhede
  2019-02-19 21:23 ` [Qemu-devel] [PATCH v3 3/5] Add migration functions for VFIO devices Kirti Wankhede
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 17+ messages in thread
From: Kirti Wankhede @ 2019-02-19 21:23 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, yulei.zhang, qemu-devel, Kirti Wankhede

These functions save and restore PCI device specific data, i.e. the config
space of the PCI device.
Save and restore were tested with the MSI and MSI-X interrupt types.
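
For reference, the device-data stream these functions produce and consume is,
in order (a sketch derived from the code below, not a formal format
definition):

    be32 x 6 : BAR 0..5 (PCI_BASE_ADDRESS_0 + i * 4)
    be32     : interrupt type (vdev->interrupt)
    if VFIO_INT_MSI:
        be32 : MSI address low
        be32 : MSI address high (0 if the device is not 64-bit MSI capable)
        be32 : MSI data
    if VFIO_INT_MSIX:
        be16 : MSI-X enable and maskall bits (PCI_MSIX_FLAGS + 1)
        ...  : msix_save() payload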

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/pci.c | 106 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.h |  29 ++++++++++++++++
 2 files changed, 135 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index dd12f363915d..e87a8a03d3f3 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1237,6 +1237,112 @@ void vfio_pci_write_config(PCIDevice *pdev,
     }
 }
 
+void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+    PCIDevice *pdev = &vdev->pdev;
+    int i;
+
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        uint32_t bar;
+
+        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
+        qemu_put_be32(f, bar);
+    }
+
+    qemu_put_be32(f, vdev->interrupt);
+    if (vdev->interrupt == VFIO_INT_MSI) {
+        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
+        bool msi_64bit;
+
+        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+                                            2);
+        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
+
+        msi_addr_lo = pci_default_read_config(pdev,
+                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
+        qemu_put_be32(f, msi_addr_lo);
+
+        if (msi_64bit) {
+            msi_addr_hi = pci_default_read_config(pdev,
+                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
+                                             4);
+        }
+        qemu_put_be32(f, msi_addr_hi);
+
+        msi_data = pci_default_read_config(pdev,
+                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
+                2);
+        qemu_put_be32(f, msi_data);
+    } else if (vdev->interrupt == VFIO_INT_MSIX) {
+        uint16_t offset;
+
+        /* save enable bit and maskall bit */
+        offset = pci_default_read_config(pdev,
+                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
+        qemu_put_be16(f, offset);
+        msix_save(pdev, f);
+    }
+}
+
+void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+    PCIDevice *pdev = &vdev->pdev;
+    uint32_t pci_cmd, interrupt_type;
+    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
+    bool msi_64bit;
+    int i;
+
+    /* restore PCI BAR configuration */
+    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
+    vfio_pci_write_config(pdev, PCI_COMMAND,
+                        pci_cmd & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        uint32_t bar = qemu_get_be32(f);
+
+        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
+    }
+    vfio_pci_write_config(pdev, PCI_COMMAND,
+                          pci_cmd | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
+
+    interrupt_type = qemu_get_be32(f);
+
+    if (interrupt_type == VFIO_INT_MSI) {
+        /* restore msi configuration */
+        msi_flags = pci_default_read_config(pdev,
+                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
+        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
+
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+                              msi_flags & ~PCI_MSI_FLAGS_ENABLE, 2);
+
+        msi_addr_lo = qemu_get_be32(f);
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
+                              msi_addr_lo, 4);
+
+        msi_addr_hi = qemu_get_be32(f);
+        if (msi_64bit) {
+            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
+                                  msi_addr_hi, 4);
+        }
+        msi_data = qemu_get_be32(f);
+        vfio_pci_write_config(pdev,
+                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
+                msi_data, 2);
+
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
+    } else if (interrupt_type == VFIO_INT_MSIX) {
+        uint16_t offset = qemu_get_be16(f);
+
+        /* load enable bit and maskall bit */
+        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
+                              offset, 2);
+        msix_load(pdev, f);
+    }
+}
+
 /*
  * Interrupt setup
  */
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index b1ae4c07549a..77d3223481b4 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -20,6 +20,7 @@
 #include "qemu/queue.h"
 #include "qemu/timer.h"
 
+#ifdef CONFIG_LINUX
 #define PCI_ANY_ID (~0)
 
 struct VFIOPCIDevice;
@@ -199,4 +200,32 @@ void vfio_display_reset(VFIOPCIDevice *vdev);
 int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
 void vfio_display_finalize(VFIOPCIDevice *vdev);
 
+void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f);
+void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f);
+
+static inline Object *vfio_pci_get_object(VFIODevice *vbasedev)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+
+    return OBJECT(vdev);
+}
+
+#else
+static inline void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+    g_assert(false);
+}
+
+static inline void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+    g_assert(false);
+}
+
+static inline Object *vfio_pci_get_object(VFIODevice *vbasedev)
+{
+    return NULL;
+}
+
+#endif
+
 #endif /* HW_VFIO_VFIO_PCI_H */
-- 
2.7.0


* [Qemu-devel] [PATCH v3 3/5] Add migration functions for VFIO devices
  2019-02-19 21:23 [Qemu-devel] [PATCH v3 0/5] Add migration support for VFIO device Kirti Wankhede
  2019-02-19 21:23 ` [Qemu-devel] [PATCH v3 1/5] VFIO KABI for migration interface Kirti Wankhede
  2019-02-19 21:23 ` [Qemu-devel] [PATCH v3 2/5] Add save and load functions for VFIO PCI devices Kirti Wankhede
@ 2019-02-19 21:23 ` Kirti Wankhede
  2019-02-21 22:38   ` Alex Williamson
  2019-02-19 21:23 ` [Qemu-devel] [PATCH v3 4/5] Add vfio_listerner_log_sync to mark dirty pages Kirti Wankhede
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Kirti Wankhede @ 2019-02-19 21:23 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, yulei.zhang, qemu-devel, Kirti Wankhede

- Migration functions are implemented for the VFIO_DEVICE_TYPE_PCI device.
- Added SaveVMHandlers and implemented all basic functions required for live
  migration.
- Added a VM state change handler to know the running or stopped state of the
  VM.
- Added a migration state change notifier to get notification on migration
  state change. This state is translated to the VFIO device state and conveyed
  to the vendor driver.
- Whether a VFIO device supports migration is decided based on the migration
  region query. If the region query is successful, migration is supported;
  otherwise migration is blocked.
- Structure vfio_device_migration_info is mapped at the 0th offset of the
  migration region and should always be trapped by the VFIO device's driver.
  Added both types of access support, trapped or mmapped, for the data section
  of the region.
- To save device state, read pending_bytes and data_offset through the
  vfio_device_migration_info structure, then copy data from the region
  accordingly.
- To restore device state, write data_offset and data_size in the structure
  and write data in the region.
- To get the dirty page bitmap, write the start address and pfn count, then
  read the count of pfns copied and accordingly read those from the trapped
  part or the mmapped part of the region. This copy is iterated until the page
  bitmap for all requested pfns is copied.
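
In pseudo-C, one save-side iteration against the region looks roughly as
follows (a minimal sketch with error handling omitted; device_fd and
region_fd_offset are placeholder names, and the structure layout is the one
from patch 1):

    uint64_t pending, data_offset;

    /* 1. Ask the vendor driver how much device data is queued. */
    pread(device_fd, &pending, sizeof(pending),
          region_fd_offset + offsetof(struct vfio_device_migration_info,
                                      pending_bytes));

    /* 2. Learn where in the region the driver staged the data. */
    pread(device_fd, &data_offset, sizeof(data_offset),
          region_fd_offset + offsetof(struct vfio_device_migration_info,
                                      data_offset));

    /* 3. Copy the data out: directly from the mmapped section if
     *    data_offset falls inside one, otherwise with pread() on the
     *    trapped part of the region. */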

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/Makefile.objs         |   2 +-
 hw/vfio/migration.c           | 714 ++++++++++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h |  20 ++
 3 files changed, 735 insertions(+), 1 deletion(-)
 create mode 100644 hw/vfio/migration.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index abad8b818c9b..36033d1437c5 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,4 +1,4 @@
-obj-y += common.o spapr.o
+obj-y += common.o spapr.o migration.o
 obj-$(CONFIG_VFIO_PCI) += pci.o pci-quirks.o display.o
 obj-$(CONFIG_VFIO_CCW) += ccw.o
 obj-$(CONFIG_VFIO_PLATFORM) += platform.o
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
new file mode 100644
index 000000000000..d7b6d972c043
--- /dev/null
+++ b/hw/vfio/migration.c
@@ -0,0 +1,714 @@
+/*
+ * Migration support for VFIO devices
+ *
+ * Copyright NVIDIA, Inc. 2018
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <linux/vfio.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "cpu.h"
+#include "migration/migration.h"
+#include "migration/qemu-file.h"
+#include "migration/register.h"
+#include "migration/blocker.h"
+#include "migration/misc.h"
+#include "qapi/error.h"
+#include "exec/ramlist.h"
+#include "exec/ram_addr.h"
+#include "pci.h"
+
+/*
+ * Flags used as delimiter:
+ * 0xffffffff => MSB 32-bit all 1s
+ * 0xef10     => emulated (virtual) function IO
+ * 0x0000     => 16-bits reserved for flags
+ */
+#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
+#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
+#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
+#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
+
+static void vfio_migration_region_exit(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+
+    if (!migration) {
+        return;
+    }
+
+    if (migration->region.buffer.size) {
+        vfio_region_exit(&migration->region.buffer);
+        vfio_region_finalize(&migration->region.buffer);
+    }
+}
+
+static int vfio_migration_region_init(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    Object *obj = NULL;
+    int ret = -EINVAL;
+
+    if (!migration) {
+        return ret;
+    }
+
+    /* Migration support added for PCI device only */
+    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
+        obj = vfio_pci_get_object(vbasedev);
+    }
+
+    if (!obj) {
+        return ret;
+    }
+
+    ret = vfio_region_setup(obj, vbasedev, &migration->region.buffer,
+                            migration->region.index, "migration");
+    if (ret) {
+        error_report("Failed to setup VFIO migration region %d: %s",
+                      migration->region.index, strerror(-ret));
+        goto err;
+    }
+
+    if (!migration->region.buffer.size) {
+        ret = -EINVAL;
+        error_report("Invalid region size of VFIO migration region %d: %s",
+                     migration->region.index, strerror(-ret));
+        goto err;
+    }
+
+    if (migration->region.buffer.mmaps) {
+        ret = vfio_region_mmap(&migration->region.buffer);
+        if (ret) {
+            error_report("Failed to mmap VFIO migration region %d: %s",
+                         migration->region.index, strerror(-ret));
+            goto err;
+        }
+    }
+
+    return 0;
+
+err:
+    vfio_migration_region_exit(vbasedev);
+    return ret;
+}
+
+static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region.buffer;
+    int ret = 0;
+
+    ret = pwrite(vbasedev->fd, &state, sizeof(state),
+                 region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                              device_state));
+    if (ret < 0) {
+        error_report("Failed to set migration state %d %s",
+                     ret, strerror(errno));
+        return ret;
+    }
+
+    vbasedev->device_state = state;
+    return 0;
+}
+
+void vfio_get_dirty_page_list(VFIODevice *vbasedev,
+                              uint64_t start_pfn,
+                              uint64_t pfn_count,
+                              uint64_t page_size)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region.buffer;
+    uint64_t count = 0, copied_pfns = 0;
+    int ret;
+
+    ret = pwrite(vbasedev->fd, &start_pfn, sizeof(start_pfn),
+                 region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                              start_pfn));
+    if (ret < 0) {
+        error_report("Failed to set dirty pages start address %d %s",
+                ret, strerror(errno));
+        return;
+    }
+
+    ret = pwrite(vbasedev->fd, &page_size, sizeof(page_size),
+                 region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                              page_size));
+    if (ret < 0) {
+        error_report("Failed to set dirty page size %d %s",
+                ret, strerror(errno));
+        return;
+    }
+
+    ret = pwrite(vbasedev->fd, &pfn_count, sizeof(pfn_count),
+                 region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                              total_pfns));
+    if (ret < 0) {
+        error_report("Failed to set dirty page total pfns %d %s",
+                ret, strerror(errno));
+        return;
+    }
+
+    do {
+        uint64_t bitmap_size;
+        void *buf = NULL;
+        bool buffer_mmaped = false;
+
+        /* Read copied_pfns */
+        ret = pread(vbasedev->fd, &copied_pfns, sizeof(copied_pfns),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             copied_pfns));
+        if (ret < 0) {
+            error_report("Failed to get dirty pages bitmap count %d %s",
+                    ret, strerror(errno));
+            return;
+        }
+
+        if (copied_pfns == 0) {
+            /*
+             * copied_pfns could be 0 if driver doesn't have any page to report
+             * dirty in given range
+             */
+            break;
+        }
+
+        bitmap_size = (BITS_TO_LONGS(copied_pfns) + 1) * sizeof(unsigned long);
+
+        if (region->mmaps) {
+            int i;
+
+            for (i = 0; i < region->nr_mmaps; i++) {
+                if (region->mmaps[i].size >= bitmap_size) {
+                    buf = region->mmaps[i].mmap;
+                    buffer_mmaped = true;
+                    break;
+                }
+            }
+        }
+
+        if (!buffer_mmaped) {
+            buf = g_malloc0(bitmap_size);
+
+            ret = pread(vbasedev->fd, buf, bitmap_size,
+                        region->fd_offset +
+                        sizeof(struct vfio_device_migration_info) + 1);
+            if (ret != bitmap_size) {
+                error_report("Failed to get dirty pages bitmap %d", ret);
+                g_free(buf);
+                return;
+            }
+        }
+
+        cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
+                                               (start_pfn + count) * page_size,
+                                                copied_pfns);
+        count +=  copied_pfns;
+
+        if (!buffer_mmaped) {
+            g_free(buf);
+        }
+    } while (count < pfn_count);
+}
+
+static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
+
+    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
+        vfio_pci_save_config(vbasedev, f);
+    }
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    return qemu_file_get_error(f);
+}
+
+static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
+        vfio_pci_load_config(vbasedev, f);
+    }
+
+    if (qemu_get_be64(f) != VFIO_MIG_FLAG_END_OF_STATE) {
+        error_report("Wrong end of block while loading device config space");
+        return -EINVAL;
+    }
+
+    return qemu_file_get_error(f);
+}
+
+/* ---------------------------------------------------------------------- */
+
+static int vfio_save_setup(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    int ret;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
+
+    if (vbasedev->vm_running) {
+        ret = vfio_migration_set_state(vbasedev,
+                         VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING);
+        if (ret) {
+            error_report("Failed to set state RUNNING and SAVING");
+        }
+    } else {
+        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_SAVING);
+        if (ret) {
+            error_report("Failed to set state STOP and SAVING");
+        }
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    return 0;
+}
+
+static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region.buffer;
+    uint64_t data_offset = 0, data_size = 0;
+    int ret;
+
+    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             data_offset));
+    if (ret != sizeof(data_offset)) {
+        error_report("Failed to get migration buffer data offset %d",
+                     ret);
+        return -EINVAL;
+    }
+
+    if (migration->pending_bytes) {
+        void *buf = NULL;
+        bool buffer_mmaped = false;
+
+        if (region->mmaps) {
+            int i;
+
+            for (i = 0; i < region->nr_mmaps; i++) {
+                if ((data_offset >= region->mmaps[i].offset) &&
+                    (data_offset < region->mmaps[i].offset +
+                                   region->mmaps[i].size)) {
+                    uint64_t region_data_size = region->mmaps[i].size -
+                                        (data_offset - region->mmaps[i].offset);
+
+                    buf = region->mmaps[i].mmap;
+                    buffer_mmaped = true;
+
+                    if (migration->pending_bytes > region_data_size) {
+                        data_size = region_data_size;
+                    } else {
+                        data_size = migration->pending_bytes;
+                    }
+                    break;
+                }
+            }
+        }
+
+        if (!buffer_mmaped) {
+            uint64_t region_data_size = region->size - data_offset;
+
+            if (migration->pending_bytes > region_data_size) {
+                data_size = region_data_size;
+            } else {
+                data_size = migration->pending_bytes;
+            }
+
+            buf = g_malloc0(data_size);
+            ret = pread(vbasedev->fd, buf, data_size,
+                        region->fd_offset + data_offset);
+            if (ret != data_size) {
+                error_report("Failed to get migration data %d", ret);
+                return -EINVAL;
+            }
+        }
+
+        qemu_put_be64(f, data_size);
+        qemu_put_buffer(f, buf, data_size);
+
+        if (!buffer_mmaped) {
+            g_free(buf);
+        }
+        migration->pending_bytes -= data_size;
+    } else {
+        qemu_put_be64(f, migration->pending_bytes);
+    }
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    return data_size;
+}
+
+static int vfio_save_iterate(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    int ret;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
+
+    ret = vfio_save_buffer(f, vbasedev);
+    if (ret < 0) {
+        error_report("vfio_save_buffer failed %s",
+                     strerror(errno));
+        return ret;
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    return ret;
+}
+
+static int vfio_update_pending(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region.buffer;
+    uint64_t pending_bytes = 0;
+    int ret;
+
+    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             pending_bytes));
+    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
+        error_report("Failed to get pending bytes %d", ret);
+        migration->pending_bytes = 0;
+        return (ret < 0) ? ret : -EINVAL;
+    }
+
+    migration->pending_bytes = pending_bytes;
+    return 0;
+}
+
+static void vfio_save_pending(QEMUFile *f, void *opaque,
+                              uint64_t threshold_size,
+                              uint64_t *res_precopy_only,
+                              uint64_t *res_compatible,
+                              uint64_t *res_postcopy_only)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    ret = vfio_update_pending(vbasedev);
+    if (ret) {
+        return;
+    }
+
+    if (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING) {
+        *res_precopy_only += migration->pending_bytes;
+    } else {
+        *res_postcopy_only += migration->pending_bytes;
+    }
+    *res_compatible += 0;
+}
+
+static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_SAVING);
+    if (ret) {
+        error_report("Failed to set state STOP and SAVING");
+        return ret;
+    }
+
+    ret = vfio_save_device_config_state(f, opaque);
+    if (ret) {
+        return ret;
+    }
+
+    ret = vfio_update_pending(vbasedev);
+    if (ret) {
+        return ret;
+    }
+
+    while (migration->pending_bytes > 0) {
+        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
+        ret = vfio_save_buffer(f, vbasedev);
+        if (ret < 0) {
+            error_report("Failed to save buffer");
+            return ret;
+        } else if (ret == 0) {
+            break;
+        }
+
+        if (migration->pending_bytes == 0) {
+            ret = vfio_update_pending(vbasedev);
+            if (ret) {
+                return ret;
+            }
+        }
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    ret = vfio_migration_set_state(vbasedev, 0);
+    if (ret) {
+        error_report("Failed to set state STOPPED");
+        return ret;
+    }
+    return ret;
+}
+
+static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
+{
+    VFIODevice *vbasedev = opaque;
+    int ret;
+    uint64_t data, data_size;
+
+    ret = vfio_migration_set_state(vbasedev, 0);
+    if (ret) {
+        error_report("Failed to set state RESUMING");
+        return ret;
+    }
+
+    data = qemu_get_be64(f);
+    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
+        if (data == VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
+            ret = vfio_load_device_config_state(f, opaque);
+            if (ret) {
+                return ret;
+            }
+        } else if (data == VFIO_MIG_FLAG_DEV_SETUP_STATE) {
+            data = qemu_get_be64(f);
+            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
+                return 0;
+            } else {
+                error_report("SETUP STATE: EOS not found 0x%"PRIx64, data);
+                return -EINVAL;
+            }
+        } else if (data == VFIO_MIG_FLAG_DEV_DATA_STATE) {
+            VFIOMigration *migration = vbasedev->migration;
+            VFIORegion *region = &migration->region.buffer;
+            void *buf = NULL;
+            bool buffer_mmaped = false;
+            uint64_t data_offset = 0;
+
+            data_size = qemu_get_be64(f);
+            if (data_size != 0) {
+                if (region->mmaps) {
+                    int i;
+
+                    for (i = 0; i < region->nr_mmaps; i++) {
+                        if (region->mmaps[i].mmap &&
+                            (region->mmaps[i].size >= data_size)) {
+                            buf = region->mmaps[i].mmap;
+                            data_offset = region->mmaps[i].offset;
+                            buffer_mmaped = true;
+                            break;
+                        }
+                    }
+                }
+
+                if (!buffer_mmaped) {
+                    buf = g_malloc0(data_size);
+                    data_offset = sizeof(struct vfio_device_migration_info) + 1;
+                }
+
+                qemu_get_buffer(f, buf, data_size);
+
+                ret = pwrite(vbasedev->fd, &data_offset, sizeof(data_offset),
+                             region->fd_offset +
+                      offsetof(struct vfio_device_migration_info, data_offset));
+                if (ret != sizeof(data_offset)) {
+                    error_report("Failed to set migration data offset %d",
+                                  ret);
+                    return -EINVAL;
+                }
+
+                ret = pwrite(vbasedev->fd, &data_size, sizeof(data_size),
+                             region->fd_offset +
+                        offsetof(struct vfio_device_migration_info, data_size));
+                if (ret != sizeof(data_size)) {
+                    error_report("Failed to set migration buffer data size %d",
+                                 ret);
+                    return -EINVAL;
+                }
+
+                if (!buffer_mmaped) {
+                    ret = pwrite(vbasedev->fd, buf, data_size,
+                                 region->fd_offset + data_offset);
+                    g_free(buf);
+
+                    if (ret != data_size) {
+                        error_report("Failed to set migration buffer %d", ret);
+                        return -EINVAL;
+                    }
+                }
+            }
+        }
+
+        ret = qemu_file_get_error(f);
+        if (ret) {
+            return ret;
+        }
+        data = qemu_get_be64(f);
+    }
+
+    return 0;
+}
+
+static SaveVMHandlers savevm_vfio_handlers = {
+    .save_setup = vfio_save_setup,
+    .save_live_pending = vfio_save_pending,
+    .save_live_iterate = vfio_save_iterate,
+    .save_live_complete_precopy = vfio_save_complete_precopy,
+    .load_state = vfio_load_state,
+};
+
+static void vfio_vmstate_change(void *opaque, int running, RunState state)
+{
+    VFIODevice *vbasedev = opaque;
+
+    if ((vbasedev->vm_running != running) && running) {
+        int ret;
+
+        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING);
+        if (ret) {
+            error_report("Failed to set state RUNNING");
+        }
+    }
+
+    vbasedev->vm_running = running;
+}
+
+static void vfio_migration_state_notifier(Notifier *notifier, void *data)
+{
+    MigrationState *s = data;
+    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
+    int ret;
+
+    switch (s->state) {
+    case MIGRATION_STATUS_ACTIVE:
+        if (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING) {
+            if (vbasedev->vm_running) {
+                ret = vfio_migration_set_state(vbasedev,
+                          VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING);
+                if (ret) {
+                    error_report("Failed to set state RUNNING and SAVING");
+                }
+            } else {
+                ret = vfio_migration_set_state(vbasedev,
+                                               VFIO_DEVICE_STATE_SAVING);
+                if (ret) {
+                    error_report("Failed to set state STOP and SAVING");
+                }
+            }
+        } else {
+            ret = vfio_migration_set_state(vbasedev, 0);
+            if (ret) {
+                error_report("Failed to set state RESUMING");
+            }
+        }
+        return;
+
+    case MIGRATION_STATUS_CANCELLING:
+    case MIGRATION_STATUS_CANCELLED:
+    case MIGRATION_STATUS_FAILED:
+        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING);
+        if (ret) {
+            error_report("Failed to set state RUNNING");
+        }
+        return;
+    }
+}
+
+static int vfio_migration_init(VFIODevice *vbasedev,
+                               struct vfio_region_info *info)
+{
+    int ret;
+
+    vbasedev->migration = g_new0(VFIOMigration, 1);
+    vbasedev->migration->region.index = info->index;
+
+    ret = vfio_migration_region_init(vbasedev);
+    if (ret) {
+        error_report("Failed to initialise migration region");
+        return ret;
+    }
+
+    register_savevm_live(NULL, "vfio", -1, 1, &savevm_vfio_handlers, vbasedev);
+    vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
+                                                          vbasedev);
+
+    vbasedev->migration_state.notify = vfio_migration_state_notifier;
+    add_migration_state_change_notifier(&vbasedev->migration_state);
+
+    return 0;
+}
+
+
+/* ---------------------------------------------------------------------- */
+
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
+{
+    struct vfio_region_info *info;
+    int ret;
+
+    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
+                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
+    if (ret) {
+        Error *local_err = NULL;
+
+        error_setg(&vbasedev->migration_blocker,
+                   "VFIO device doesn't support migration");
+        ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            error_free(vbasedev->migration_blocker);
+            return ret;
+        }
+    } else {
+        return vfio_migration_init(vbasedev, info);
+    }
+
+    return 0;
+}
+
+void vfio_migration_finalize(VFIODevice *vbasedev)
+{
+    if (!vbasedev->migration) {
+        return;
+    }
+
+    if (vbasedev->vm_state) {
+        qemu_del_vm_change_state_handler(vbasedev->vm_state);
+        remove_migration_state_change_notifier(&vbasedev->migration_state);
+    }
+
+    if (vbasedev->migration_blocker) {
+        migrate_del_blocker(vbasedev->migration_blocker);
+        error_free(vbasedev->migration_blocker);
+    }
+
+    vfio_migration_region_exit(vbasedev);
+    g_free(vbasedev->migration);
+}
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 7624c9f511c4..3b8c98b29baf 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -30,6 +30,7 @@
 #ifdef CONFIG_LINUX
 #include <linux/vfio.h>
 #endif
+#include "sysemu/sysemu.h"
 
 #define VFIO_MSG_PREFIX "vfio %s: "
 
@@ -58,6 +59,14 @@ typedef struct VFIORegion {
     uint8_t nr; /* cache the region number for debug */
 } VFIORegion;
 
+typedef struct VFIOMigration {
+    struct {
+        VFIORegion buffer;
+        uint32_t index;
+    } region;
+    uint64_t pending_bytes;
+} VFIOMigration;
+
 typedef struct VFIOAddressSpace {
     AddressSpace *as;
     QLIST_HEAD(, VFIOContainer) containers;
@@ -119,6 +128,12 @@ typedef struct VFIODevice {
     unsigned int num_irqs;
     unsigned int num_regions;
     unsigned int flags;
+    uint32_t device_state;
+    VMChangeStateEntry *vm_state;
+    int vm_running;
+    Notifier migration_state;
+    VFIOMigration *migration;
+    Error *migration_blocker;
 } VFIODevice;
 
 struct VFIODeviceOps {
@@ -198,4 +213,9 @@ int vfio_spapr_create_window(VFIOContainer *container,
 int vfio_spapr_remove_window(VFIOContainer *container,
                              hwaddr offset_within_address_space);
 
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
+void vfio_migration_finalize(VFIODevice *vbasedev);
+void vfio_get_dirty_page_list(VFIODevice *vbasedev, uint64_t start_pfn,
+                               uint64_t pfn_count, uint64_t page_size);
+
 #endif /* HW_VFIO_VFIO_COMMON_H */
-- 
2.7.0


* [Qemu-devel] [PATCH v3 4/5] Add vfio_listerner_log_sync to mark dirty pages
  2019-02-19 21:23 [Qemu-devel] [PATCH v3 0/5] Add migration support for VFIO device Kirti Wankhede
                   ` (2 preceding siblings ...)
  2019-02-19 21:23 ` [Qemu-devel] [PATCH v3 3/5] Add migration functions for VFIO devices Kirti Wankhede
@ 2019-02-19 21:23 ` Kirti Wankhede
  2019-02-19 21:23 ` [Qemu-devel] [PATCH v3 5/5] Make vfio-pci device migration capable Kirti Wankhede
  2019-02-20 10:22 ` [Qemu-devel] [PATCH v3 0/5] Add migration support for VFIO device Dr. David Alan Gilbert
  5 siblings, 0 replies; 17+ messages in thread
From: Kirti Wankhede @ 2019-02-19 21:23 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, yulei.zhang, qemu-devel, Kirti Wankhede

vfio_listerner_log_sync gets the list of dirty pages from the vendor driver
and marks those pages dirty.
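
The underlying request/response sequence against the migration region
(implemented by vfio_get_dirty_page_list() in patch 3) is roughly the
following sketch, with error handling omitted and device_fd/region_fd_offset
as placeholder names; the field offsets are taken from struct
vfio_device_migration_info in patch 1:

    uint64_t copied_pfns, count = 0;

    /* Describe the pfn range being queried. */
    pwrite(device_fd, &start_pfn, sizeof(start_pfn),
           region_fd_offset + offsetof(struct vfio_device_migration_info,
                                       start_pfn));
    /* ... likewise write page_size and pfn_count (total_pfns) ... */

    do {
        /* The vendor driver reports how many pfns' worth of bitmap it
         * copied into the region; 0 means no dirty pages in the range. */
        pread(device_fd, &copied_pfns, sizeof(copied_pfns),
              region_fd_offset + offsetof(struct vfio_device_migration_info,
                                          copied_pfns));
        if (copied_pfns == 0)
            break;
        /* Read the bitmap (mmapped or trapped part) and mark the pages
         * dirty, then advance. */
        count += copied_pfns;
    } while (count < pfn_count);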

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/common.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 4262b80c4450..84ba6808f7d0 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -36,6 +36,7 @@
 #include "sysemu/kvm.h"
 #include "trace.h"
 #include "qapi/error.h"
+#include "migration/migration.h"
 
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -698,9 +699,39 @@ static void vfio_listener_region_del(MemoryListener *listener,
     }
 }
 
+static void vfio_listerner_log_sync(MemoryListener *listener,
+                                    MemoryRegionSection *section)
+{
+    uint64_t start_addr, size, pfn_count;
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            if (vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) {
+                continue;
+            } else {
+                return;
+            }
+        }
+    }
+
+    start_addr = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+    size = int128_get64(section->size);
+    pfn_count = size >> TARGET_PAGE_BITS;
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            vfio_get_dirty_page_list(vbasedev, start_addr >> TARGET_PAGE_BITS,
+                                     pfn_count, TARGET_PAGE_SIZE);
+        }
+    }
+}
+
 static const MemoryListener vfio_memory_listener = {
     .region_add = vfio_listener_region_add,
     .region_del = vfio_listener_region_del,
+    .log_sync = vfio_listerner_log_sync,
 };
 
 static void vfio_listener_release(VFIOContainer *container)
-- 
2.7.0


* [Qemu-devel] [PATCH v3 5/5] Make vfio-pci device migration capable.
  2019-02-19 21:23 [Qemu-devel] [PATCH v3 0/5] Add migration support for VFIO device Kirti Wankhede
                   ` (3 preceding siblings ...)
  2019-02-19 21:23 ` [Qemu-devel] [PATCH v3 4/5] Add vfio_listerner_log_sync to mark dirty pages Kirti Wankhede
@ 2019-02-19 21:23 ` Kirti Wankhede
  2019-02-20 10:22 ` [Qemu-devel] [PATCH v3 0/5] Add migration support for VFIO device Dr. David Alan Gilbert
  5 siblings, 0 replies; 17+ messages in thread
From: Kirti Wankhede @ 2019-02-19 21:23 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, yulei.zhang, qemu-devel, Kirti Wankhede

Call the vfio_migration_probe() and vfio_migration_finalize() functions for
the vfio-pci device to enable migration for VFIO PCI devices.
Removed the vfio_pci_vmstate structure.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/pci.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e87a8a03d3f3..0fe42d146006 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2939,6 +2939,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     vdev->vbasedev.ops = &vfio_pci_ops;
     vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
     vdev->vbasedev.dev = &vdev->pdev.qdev;
+    vdev->vbasedev.device_state = 0;
 
     tmp = g_strdup_printf("%s/iommu_group", vdev->vbasedev.sysfsdev);
     len = readlink(tmp, group_path, sizeof(group_path));
@@ -3175,6 +3176,8 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         goto out_teardown;
     }
 
+    ret = vfio_migration_probe(&vdev->vbasedev, errp);
+
     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);
@@ -3213,6 +3216,7 @@ static void vfio_exitfn(PCIDevice *pdev)
 {
     VFIOPCIDevice *vdev = PCI_VFIO(pdev);
 
+    vdev->vbasedev.device_state = 0;
     vfio_unregister_req_notifier(vdev);
     vfio_unregister_err_notifier(vdev);
     pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
@@ -3222,6 +3226,7 @@ static void vfio_exitfn(PCIDevice *pdev)
     }
     vfio_teardown_msi(vdev);
     vfio_bars_exit(vdev);
+    vfio_migration_finalize(&vdev->vbasedev);
 }
 
 static void vfio_pci_reset(DeviceState *dev)
@@ -3328,11 +3333,6 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
-static const VMStateDescription vfio_pci_vmstate = {
-    .name = "vfio-pci",
-    .unmigratable = 1,
-};
-
 static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
@@ -3340,7 +3340,6 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 
     dc->reset = vfio_pci_reset;
     dc->props = vfio_pci_dev_properties;
-    dc->vmsd = &vfio_pci_vmstate;
     dc->desc = "VFIO-based PCI device assignment";
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
     pdc->realize = vfio_realize;
-- 
2.7.0


* Re: [Qemu-devel] [PATCH v3 0/5] Add migration support for VFIO device
  2019-02-19 21:23 [Qemu-devel] [PATCH v3 0/5] Add migration support for VFIO device Kirti Wankhede
                   ` (4 preceding siblings ...)
  2019-02-19 21:23 ` [Qemu-devel] [PATCH v3 5/5] Make vfio-pci device migration capable Kirti Wankhede
@ 2019-02-20 10:22 ` Dr. David Alan Gilbert
  2019-02-21  5:25   ` Kirti Wankhede
  5 siblings, 1 reply; 17+ messages in thread
From: Dr. David Alan Gilbert @ 2019-02-20 10:22 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, cjia, kevin.tian, ziye.yang, changpeng.liu,
	yi.l.liu, mlevitsk, eskultet, cohuck, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	zhi.a.wang, yan.y.zhao, yulei.zhang, qemu-devel

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> Add migration support for VFIO device

Hi Kirti,
  Can you explain how this compares and works with Yan Zhao's
set?
  These look like two incompatible solutions to me - if that's
the case we need to take a step back and figure out how to combine
them into one.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v3 0/5] Add migration support for VFIO device
  2019-02-20 10:22 ` [Qemu-devel] [PATCH v3 0/5] Add migration support for VFIO device Dr. David Alan Gilbert
@ 2019-02-21  5:25   ` Kirti Wankhede
  2019-02-21  5:52     ` Tian, Kevin
  0 siblings, 1 reply; 17+ messages in thread
From: Kirti Wankhede @ 2019-02-21  5:25 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: alex.williamson, cjia, kevin.tian, ziye.yang, changpeng.liu,
	yi.l.liu, mlevitsk, eskultet, cohuck, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	zhi.a.wang, yan.y.zhao, yulei.zhang, qemu-devel



On 2/20/2019 3:52 PM, Dr. David Alan Gilbert wrote:
> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
>> Add migration support for VFIO device
> 
> Hi Kirti,
>   Can you explain how this compares and works with Yan Zhao's
> set?

This patch set is incremental version of my previous patch set:
https://patchwork.ozlabs.org/cover/1000719/
This takes care of the feedbacks received on previous version.

This patch set is different than Yan Zhao's set.

Thanks,
Kirti

>   These look like two incompatible solutions to me - if that's
> the case we need to take a step back and figure out how to combine
> them into one.
> 
> Dave
> 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 


* Re: [Qemu-devel] [PATCH v3 0/5] Add migration support for VFIO device
  2019-02-21  5:25   ` Kirti Wankhede
@ 2019-02-21  5:52     ` Tian, Kevin
  2019-02-21  7:10       ` Neo Jia
  0 siblings, 1 reply; 17+ messages in thread
From: Tian, Kevin @ 2019-02-21  5:52 UTC (permalink / raw)
  To: Kirti Wankhede, Dr. David Alan Gilbert
  Cc: alex.williamson, cjia, Yang, Ziye, Liu, Changpeng, Liu, Yi L,
	mlevitsk, eskultet, cohuck, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, Wang, Zhi A, Zhao,
	Yan Y, yulei.zhang, qemu-devel

> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> Sent: Thursday, February 21, 2019 1:25 PM
> 
> On 2/20/2019 3:52 PM, Dr. David Alan Gilbert wrote:
> > * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> >> Add migration support for VFIO device
> >
> > Hi Kirti,
> >   Can you explain how this compares and works with Yan Zhao's
> > set?
> 
> This patch set is incremental version of my previous patch set:
> https://patchwork.ozlabs.org/cover/1000719/
> This takes care of the feedbacks received on previous version.
> 
> This patch set is different than Yan Zhao's set.
> 

I can help give some background about Yan's work:

There was a big gap between Kirti's last version and the overall review
comments, especially this one:
https://www.mail-archive.com/qemu-devel@nongnu.org/msg576652.html

Then there was no reply from Kirti whether she agreed with the comments
and was working on a new version.

Then we think we should jump in to keep the ball moving, based on
a fresh implementation according to recommended direction, i.e. focusing
on device state management instead of sticking to migration flow in kernel
API design.

and also more importantly we provided kernel side implementation based
on Intel GVT-g to give the whole picture of both user/kernel side changes.
That should give people a better understanding of how those new APIs
are expected to be used by Qemu, and to be implemented by vendor driver.

That is why Yan just shared her work.

Now it's great to see that Kirti is still actively working on this effort and is
also moving toward the right direction. Let's have a close look at two
implementations and then choose a cleaner one as base for future
enhancements. :-)

Thanks
Kevin


* Re: [Qemu-devel] [PATCH v3 0/5] Add migration support for VFIO device
  2019-02-21  5:52     ` Tian, Kevin
@ 2019-02-21  7:10       ` Neo Jia
  2019-02-27  1:51         ` Tian, Kevin
  0 siblings, 1 reply; 17+ messages in thread
From: Neo Jia @ 2019-02-21  7:10 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Kirti Wankhede, Dr. David Alan Gilbert, alex.williamson, Yang,
	Ziye, Liu, Changpeng, Liu, Yi L, mlevitsk, eskultet, cohuck,
	jonathan.davies, eauger, aik, pasic, felipe, Zhengxiao.zx,
	shuangtai.tst, Ken.Xue, Wang, Zhi A, Zhao, Yan Y, yulei.zhang,
	qemu-devel

On Thu, Feb 21, 2019 at 05:52:53AM +0000, Tian, Kevin wrote:
> > From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> > Sent: Thursday, February 21, 2019 1:25 PM
> > 
> > On 2/20/2019 3:52 PM, Dr. David Alan Gilbert wrote:
> > > * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> > >> Add migration support for VFIO device
> > >
> > > Hi Kirti,
> > >   Can you explain how this compares and works with Yan Zhao's
> > > set?
> > 
> > This patch set is an incremental version of my previous patch set:
> > https://patchwork.ozlabs.org/cover/1000719/
> > It takes care of the feedback received on the previous version.
> > 
> > This patch set is different from Yan Zhao's set.
> > 
> 
> I can help give some background about Yan's work:
> 
> There was a big gap between Kirti's last version and the overall review
> comments, especially this one:
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg576652.html

Hi Kevin,

> 
> Then there was no reply from Kirti on whether she agreed with the comments
> and was working on a new version.

Sorry, we should have acked those comments when we received them last time.

> 
> Then we thought we should jump in to keep the ball moving, based on
> a fresh implementation following the recommended direction, i.e. focusing
> on device state management instead of sticking to the migration flow in the
> kernel API design.
> 
> More importantly, we also provided a kernel-side implementation based
> on Intel GVT-g to give the whole picture of both user- and kernel-side changes.
> That should give people a better understanding of how those new APIs
> are expected to be used by QEMU, and to be implemented by a vendor driver.
> 
> That is why Yan just shared her work.

Really glad to see that the v2 version works for you guys; appreciate the
driver-side changes.

> 
> Now it's great to see that Kirti is still actively working on this effort and is
> also moving in the right direction. Let's have a close look at the two
> implementations and then choose the cleaner one as a base for future
> enhancements. :-)

Yes, the v3 has addressed all the comments / concerns raised on the v2; I think
we should take a look and keep moving.

Just a quick thought - would it be possible / better to have Kirti focus on the
QEMU patches and Yan take care of the GVT-g kernel driver side changes? This
would give us the best testing coverage. Hope I don't step on anybody's toes
here. ;-)

Thanks,
Neo

> 
> Thanks
> Kevin


* Re: [Qemu-devel] [PATCH v3 1/5] VFIO KABI for migration interface
  2019-02-19 21:23 ` [Qemu-devel] [PATCH v3 1/5] VFIO KABI for migration interface Kirti Wankhede
@ 2019-02-21 22:23   ` Alex Williamson
  2019-02-26 20:05     ` Kirti Wankhede
  0 siblings, 1 reply; 17+ messages in thread
From: Alex Williamson @ 2019-02-21 22:23 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, yulei.zhang, qemu-devel

On Wed, 20 Feb 2019 02:53:16 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> - Defined MIGRATION region type and sub-type.
> - Used 2 bits to define VFIO device states.
>     Bit 0 => 0/1 => _STOPPED/_RUNNING
>     Bit 1 => 0/1 => _RESUMING/_SAVING
>     Combination of these bits defines VFIO device's state during migration
>     _RUNNING => Normal VFIO device running state.
>     _STOPPED => VFIO device stopped.
>     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but start
>                           saving state of device i.e. pre-copy state
>     _SAVING | _STOPPED => vCPUs are stoppped, VFIO device should be stopped, and
>                           save device state,i.e. stop-n-copy state
>     _RESUMING => VFIO device resuming state.

Shouldn't we have a non-_RESUMING/_SAVING run state?  If these are
indicating direction of flow, maybe we need two bits:

  00b - None, normal runtime
  01b - Saving
  10b - Resuming
  11b - Invalid/reserved (maybe a Failed state indicator)

> - Defined vfio_device_migration_info structure which will be placed at 0th
>   offset of migration region to get/set VFIO device related information.
>   Defined members of structure and usage on read/write access:
>     * device_state: (write only)
>         To convey VFIO device state to be transitioned to.

Seems trivial and potentially useful to support read here, we have 30
(or maybe 29) bits yet to define.

>     * pending bytes: (read only)
>         To get pending bytes yet to be migrated for VFIO device
>     * data_offset: (read/write)
>         To get or set data offset in migration from where data exist
>         during _SAVING and _RESUMING state

What's the use case for writing this?

>     * data_size: (write only)
>         To convey size of data copied in migration region during _RESUMING
>         state

How to know how much is available for read?

>     * start_pfn, page_size, total_pfns: (write only)
>         To get bitmap of dirty pages from vendor driver from given
>         start address for total_pfns.

What would happen if a user wrote in 1MB for page size?  Is the vendor
driver expected to support arbitrary page sizes?  Are we only trying to
convey the page size and would that page size ever be other than
getpagesize()?

>     * copied_pfns: (read only)
>         To get number of pfns bitmap copied in migration region.
>         Vendor driver should copy the bitmap with bits set only for
>         pages to be marked dirty in migration region. Vendor driver
>         should return 0 if there are 0 pages dirty in requested
>         range.

This is useful, but I wonder if it's really a required feature for the
vendor driver.  For instance, with mdev IOMMU support we could wrap an
arbitrary PCI device as mdev, but we don't necessarily have dirty page
tracking.  Would a device need to report -1 here if it wanted to
indicate any page could be dirty if we only know how to collect the
state of the device itself for migration (ie. force the device to be
stopped first).
 
> Migration region looks like:
>  ------------------------------------------------------------------
> |vfio_device_migration_info|    data section                      |
> |                          |     ///////////////////////////////  |
>  ------------------------------------------------------------------
                             ^ what's this?

>  ^                              ^                              ^
>  offset 0-trapped part        data.offset                 data.size

Isn't data.size above really (data.offset + data.size)?  '.' vs '_'
inconsistency vs above.
 
> Data section is always followed by vfio_device_migration_info
> structure in the region, so data.offset will always be none-0.

This seems exactly backwards from the diagram, data section follows
vfio_device_migration_info.  Also, "non-zero".

> Offset from where data is copied is decided by kernel driver, data

But data_offset is listed as read-write.

> section can be trapped or mapped depending on how kernel driver
> defines data section. If mmapped, then data.offset should be page
> aligned, where as initial section which contain
> vfio_device_migration_info structure might not end at offset which
> is page aligned.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  linux-headers/linux/vfio.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 65 insertions(+)
> 
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 12a7b1dc53c8..1b12a9b95e00 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -368,6 +368,71 @@ struct vfio_region_gfx_edid {
>   */
>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
>  
> +/* Migration region type and sub-type */
> +#define VFIO_REGION_TYPE_MIGRATION	        (2)
> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> +
> +/**
> + * Structure vfio_device_migration_info is placed at 0th offset of
> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> + * information. Field accesses from this structure are only supported at their
> + * native width and alignment, otherwise should return error.
> + *
> + * device_state: (write only)
> + *      To indicate vendor driver the state VFIO device should be transitioned
> + *      to. If device state transition fails, write to this field return error.
> + *      It consists of 2 bits.
> + *      - If bit 0 set, indicates _RUNNING state. When its reset, that indicates
> + *        _STOPPED state. When device is changed to _STOPPED, driver should stop
> + *        device before write returns.
> + *      - If bit 1 set, indicates _SAVING state. When its reset, that indicates
> + *        _RESUMING state.
> + *
> + * pending bytes: (read only)
> + *      Read pending bytes yet to be migrated from vendor driver

Is this essentially a volatile value, changing based on data previously
copied and device activity?

> + *
> + * data_offset: (read/write)
> + *      User application should read data_offset in migration region from where
> + *      user application should read data during _SAVING state.
> + *      User application would write data_offset in migration region from where
> + *      user application is had written data during _RESUMING state.

Why wouldn't data_offset simply be fixed?  Do we really want to support
the user writing arbitrary offsets in the migration region?  Each
chunk can simply start at the kernel provided data_offset.

> + *
> + * data_size: (write only)
> + *      User application should write size of data copied in migration region
> + *      during _RESUMING state.

How much does the user read on _SAVING then?

> + *
> + * start_pfn: (write only)
> + *      Start address pfn to get bitmap of dirty pages from vendor driver duing
> + *      _SAVING state.
> + *
> + * page_size: (write only)
> + *      User application should write the page_size of pfn.
> + *
> + * total_pfns: (write only)
> + *      Total pfn count from start_pfn for which dirty bitmap is requested.

So effectively the area that begins at data_offset is dual purpose
(triple purpose vs Yan's, config, memory, and dirty bitmap) but the
protocol isn't entirely evident to me. I think we need to write it out
as Yan provided.  If I'm in the _SAVING state, do I simply read from
data_offset to min(data_size, region.size - data_offset)?  If that area
is mmap'd, how does the user indicate to the kernel to prepare the data
or that X bytes were acquired?  In the _RESUMING state, do I write from
data_offset to min(data_size, region.size - data_offset) and indicate
the write using data_size?

> + *
> + * copied_pfns: (read only)
> + *      pfn count for which dirty bitmap is copied to migration region.
> + *      Vendor driver should copy the bitmap with bits set only for pages to be
> + *      marked dirty in migration region.
> + *      Vendor driver should return 0 if there are 0 pages dirty in requested
> + *      range.
> + */
> +
> +struct vfio_device_migration_info {
> +        __u32 device_state;         /* VFIO device state */
> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> +        __u32 reserved;
> +        __u64 pending_bytes;
> +        __u64 data_offset;
> +        __u64 data_size;
> +        __u64 start_pfn;
> +        __u64 page_size;
> +        __u64 total_pfns;
> +        __u64 copied_pfns;
> +} __attribute__((packed));
> +

As you mentioned, and with Yan's version, we still need to figure out
something with compatibility and versioning.  Thanks,

Alex

>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within


* Re: [Qemu-devel] [PATCH v3 3/5] Add migration functions for VFIO devices
  2019-02-19 21:23 ` [Qemu-devel] [PATCH v3 3/5] Add migration functions for VFIO devices Kirti Wankhede
@ 2019-02-21 22:38   ` Alex Williamson
  2019-02-26 20:16     ` Kirti Wankhede
  0 siblings, 1 reply; 17+ messages in thread
From: Alex Williamson @ 2019-02-21 22:38 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, yulei.zhang, qemu-devel

On Wed, 20 Feb 2019 02:53:18 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> - Migration function are implemented for VFIO_DEVICE_TYPE_PCI device.
> - Added SaveVMHandlers and implemented all basic functions required for live
>   migration.
> - Added VM state change handler to know running or stopped state of VM.
> - Added migration state change notifier to get notification on migration state
>   change. This state is translated to VFIO device state and conveyed to vendor
>   driver.
> - VFIO device supports migration or not is decided based of migration region
>   query. If migration region query is successful then migration is supported
>   else migration is blocked.
> - Structure vfio_device_migration_info is mapped at 0th offset of migration
>   region and should always trapped by VFIO device's driver. Added both type of
>   access support, trapped or mmapped, for data section of the region.
> - To save device state, read pending_bytes and data_offset using structure
>   vfio_device_migration_info, accordingly copy data from the region.
> - To restore device state, write data_offset and data_size in the structure
>   and write data in the region.
> - To get dirty page bitmap, write start address and pfn count then read count of
>   pfns copied and accordingly read those from the rest of the region or mmaped
>   part of the region. This copy is iterated till page bitmap for all requested
>   pfns are copied.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/migration.c           | 714 ++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |  20 ++
>  3 files changed, 735 insertions(+), 1 deletion(-)
>  create mode 100644 hw/vfio/migration.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index abad8b818c9b..36033d1437c5 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -1,4 +1,4 @@
> -obj-y += common.o spapr.o
> +obj-y += common.o spapr.o migration.o
>  obj-$(CONFIG_VFIO_PCI) += pci.o pci-quirks.o display.o
>  obj-$(CONFIG_VFIO_CCW) += ccw.o
>  obj-$(CONFIG_VFIO_PLATFORM) += platform.o
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> new file mode 100644
> index 000000000000..d7b6d972c043
> --- /dev/null
> +++ b/hw/vfio/migration.c
> @@ -0,0 +1,714 @@
> +/*
> + * Migration support for VFIO devices
> + *
> + * Copyright NVIDIA, Inc. 2018
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include <linux/vfio.h>
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "cpu.h"
> +#include "migration/migration.h"
> +#include "migration/qemu-file.h"
> +#include "migration/register.h"
> +#include "migration/blocker.h"
> +#include "migration/misc.h"
> +#include "qapi/error.h"
> +#include "exec/ramlist.h"
> +#include "exec/ram_addr.h"
> +#include "pci.h"
> +
> +/*
> + * Flags used as delimiter:
> + * 0xffffffff => MSB 32-bit all 1s
> + * 0xef10     => emulated (virtual) function IO

:^)

> + * 0x0000     => 16-bits reserved for flags
> + */
> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> +
> +static void vfio_migration_region_exit(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    if (!migration) {
> +        return;
> +    }
> +
> +    if (migration->region.buffer.size) {
> +        vfio_region_exit(&migration->region.buffer);
> +        vfio_region_finalize(&migration->region.buffer);
> +    }
> +}
> +
> +static int vfio_migration_region_init(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    Object *obj = NULL;
> +    int ret = -EINVAL;
> +
> +    if (!migration) {
> +        return ret;
> +    }
> +
> +    /* Migration support added for PCI device only */
> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +        obj = vfio_pci_get_object(vbasedev);
> +    }
> +
> +    if (!obj) {
> +        return ret;
> +    }
> +
> +    ret = vfio_region_setup(obj, vbasedev, &migration->region.buffer,
> +                            migration->region.index, "migration");
> +    if (ret) {
> +        error_report("Failed to setup VFIO migration region %d: %s",
> +                      migration->region.index, strerror(-ret));
> +        goto err;
> +    }
> +
> +    if (!migration->region.buffer.size) {
> +        ret = -EINVAL;
> +        error_report("Invalid region size of VFIO migration region %d: %s",
> +                     migration->region.index, strerror(-ret));
> +        goto err;
> +    }
> +
> +    if (migration->region.buffer.mmaps) {
> +        ret = vfio_region_mmap(&migration->region.buffer);
> +        if (ret) {
> +            error_report("Failed to mmap VFIO migration region %d: %s",
> +                         migration->region.index, strerror(-ret));
> +            goto err;
> +        }

In patch 5 this eventually gets called from our realize function, so
we're forcing the kernel to allocate whatever backing store it might
use for these mmaps for the entire life of the device, in case maybe we
want to migrate it.  Like my reply to Yan, unless I'm misunderstanding
how drivers will back this mmap this seems really inefficient.

> +    }
> +
> +    return 0;
> +
> +err:
> +    vfio_migration_region_exit(vbasedev);
> +    return ret;
> +}
> +
> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    int ret = 0;
> +
> +    ret = pwrite(vbasedev->fd, &state, sizeof(state),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              device_state));
> +    if (ret < 0) {
> +        error_report("Failed to set migration state %d %s",
> +                     ret, strerror(errno));
> +        return ret;
> +    }
> +
> +    vbasedev->device_state = state;
> +    return 0;
> +}
> +
> +void vfio_get_dirty_page_list(VFIODevice *vbasedev,
> +                              uint64_t start_pfn,
> +                              uint64_t pfn_count,
> +                              uint64_t page_size)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    uint64_t count = 0, copied_pfns = 0;
> +    int ret;
> +
> +    ret = pwrite(vbasedev->fd, &start_pfn, sizeof(start_pfn),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              start_pfn));
> +    if (ret < 0) {
> +        error_report("Failed to set dirty pages start address %d %s",
> +                ret, strerror(errno));
> +        return;
> +    }
> +
> +    ret = pwrite(vbasedev->fd, &page_size, sizeof(page_size),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              page_size));
> +    if (ret < 0) {
> +        error_report("Failed to set dirty page size %d %s",
> +                ret, strerror(errno));
> +        return;
> +    }
> +
> +    ret = pwrite(vbasedev->fd, &pfn_count, sizeof(pfn_count),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              total_pfns));
> +    if (ret < 0) {
> +        error_report("Failed to set dirty page total pfns %d %s",
> +                ret, strerror(errno));
> +        return;
> +    }
> +
> +    do {
> +        uint64_t bitmap_size;
> +        void *buf = NULL;
> +        bool buffer_mmaped = false;
> +
> +        /* Read dirty_pfns.copied */
> +        ret = pread(vbasedev->fd, &copied_pfns, sizeof(copied_pfns),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             copied_pfns));
> +        if (ret < 0) {
> +            error_report("Failed to get dirty pages bitmap count %d %s",
> +                    ret, strerror(errno));
> +            return;
> +        }
> +
> +        if (copied_pfns == 0) {
> +            /*
> +             * copied_pfns could be 0 if driver doesn't have any page to report
> +             * dirty in given range
> +             */
> +            break;
> +        }
> +
> +        bitmap_size = (BITS_TO_LONGS(copied_pfns) + 1) * sizeof(unsigned long);
> +
> +        if (region->mmaps) {
> +            int i;
> +
> +            for (i = 0; i < region->nr_mmaps; i++) {
> +                if (region->mmaps[i].size >= bitmap_size) {
> +                    buf = region->mmaps[i].mmap;
> +                    buffer_mmaped = true;
> +                    break;
> +                }

Wait, the first sparse mmap region that's big enough to hold the bitmap
is necessarily the one where we read it?  Shouldn't we test that the
mmap range covers the correct offset?  This also makes me realize that
if a driver wanted the device memory mmap'd but not the dirty memory
backing, I don't think they can do it with this interface.  Perhaps
there should be separate data and dirty offset fields in the header so
that a driver can place them at separate offsets with different mmap
coverage if they desire.  Thanks,

Alex


* Re: [Qemu-devel] [PATCH v3 1/5] VFIO KABI for migration interface
  2019-02-21 22:23   ` Alex Williamson
@ 2019-02-26 20:05     ` Kirti Wankhede
  2019-03-01 13:36       ` Cornelia Huck
  2019-03-07 19:45       ` Alex Williamson
  0 siblings, 2 replies; 17+ messages in thread
From: Kirti Wankhede @ 2019-02-26 20:05 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, yulei.zhang, qemu-devel

Alex,

On 2/22/2019 3:53 AM, Alex Williamson wrote:
> On Wed, 20 Feb 2019 02:53:16 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> - Defined MIGRATION region type and sub-type.
>> - Used 2 bits to define VFIO device states.
>>     Bit 0 => 0/1 => _STOPPED/_RUNNING
>>     Bit 1 => 0/1 => _RESUMING/_SAVING
>>     Combination of these bits defines VFIO device's state during migration
>>     _RUNNING => Normal VFIO device running state.
>>     _STOPPED => VFIO device stopped.
>>     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but start
>>                           saving state of device i.e. pre-copy state
>>     _SAVING | _STOPPED => vCPUs are stoppped, VFIO device should be stopped, and
>>                           save device state,i.e. stop-n-copy state
>>     _RESUMING => VFIO device resuming state.
> 
> Shouldn't we have a non-_RESUMING/_SAVING run state?  If these are
> indicating direction of flow, maybe we need two bits:
> 
>   00b - None, normal runtime
>   01b - Saving
>   10b - Resuming
>   11b - Invalid/reserved (maybe a Failed state indicator)
> 

There have to be 2 more states:
_SAVING | _RUNNING => SAVING while device is RUNNING - pre-copy phase
_SAVING | _STOPPED => SAVING while device is STOPPED - stop-and-copy phase

So the 2 bits used in this patch are:
00b - _RESUMING
01b - _RUNNING - Normal Running
10b - _SAVING | _STOPPED - stop-and-copy phase
11b - _SAVING | _RUNNING - pre-copy phase
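
Expressed as a sketch in C, using the flag names from the proposed header
(the decode helper itself is purely illustrative, not part of the patch):

#include <stdint.h>

#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
#define VFIO_DEVICE_STATE_SAVING    (1 << 1)

/* Decode the two-bit device_state encoding listed above. */
static const char *device_state_name(uint32_t state)
{
    switch (state & 3) {
    case 0:
        return "_RESUMING";
    case VFIO_DEVICE_STATE_RUNNING:
        return "_RUNNING (normal running)";
    case VFIO_DEVICE_STATE_SAVING:
        return "_SAVING | _STOPPED (stop-and-copy)";
    default:
        return "_SAVING | _RUNNING (pre-copy)";
    }
}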


>> - Defined vfio_device_migration_info structure which will be placed at 0th
>>   offset of migration region to get/set VFIO device related information.
>>   Defined members of structure and usage on read/write access:
>>     * device_state: (write only)
>>         To convey VFIO device state to be transitioned to.
> 
> Seems trivial and potentially useful to support read here, we have 30
> (or maybe 29) bits yet to define.
> 

Ok, those bits can be used later.

>>     * pending bytes: (read only)
>>         To get pending bytes yet to be migrated for VFIO device
>>     * data_offset: (read/write)
>>         To get or set data offset in migration from where data exist
>>         during _SAVING and _RESUMING state
> 
> What's the use case for writing this?
> 

During resuming, the user space application (QEMU) writes data to the
migration region. At that time QEMU writes data_offset to indicate from
where the data was written, so that the vendor driver can read the data
from that offset. If the data section is mmapped, data_offset is the start
of the mmapped region, whereas if the data section is trapped, data_offset
is sizeof(struct vfio_device_migration_info) + 1, just after the
vfio_device_migration_info structure.

>>     * data_size: (write only)
>>         To convey size of data copied in migration region during _RESUMING
>>         state
> 
> How to know how much is available for read?

pending_bytes tells how much is still to be read.

> 
>>     * start_pfn, page_size, total_pfns: (write only)
>>         To get bitmap of dirty pages from vendor driver from given
>>         start address for total_pfns.
> 
> What would happen if a user wrote in 1MB for page size?  Is the vendor
> driver expected to support arbitrary page sizes?  Are we only trying to
> convey the page size and would that page size ever be other than
> getpagesize()?
> 
>>     * copied_pfns: (read only)
>>         To get number of pfns bitmap copied in migration region.
>>         Vendor driver should copy the bitmap with bits set only for
>>         pages to be marked dirty in migration region. Vendor driver
>>         should return 0 if there are 0 pages dirty in requested
>>         range.
> 
> This is useful, but I wonder if it's really a required feature for the
> vendor driver.  For instance, with mdev IOMMU support we could wrap an
> arbitrary PCI device as mdev, but we don't necessarily have dirty page
> tracking.  Would a device need to report -1 here if it wanted to
> indicate any page could be dirty if we only know how to collect the
> state of the device itself for migration (ie. force the device to be
> stopped first).
>  

Does that mean that if -1 is returned here, all pages in the section should
be marked dirty?

>> Migration region looks like:
>>  ------------------------------------------------------------------
>> |vfio_device_migration_info|    data section                      |
>> |                          |     ///////////////////////////////  |
>>  ------------------------------------------------------------------
>                              ^ what's this?
> 
>>  ^                              ^                              ^
>>  offset 0-trapped part        data.offset                 data.size
> 
> Isn't data.size above really (data.offset + data.size)?  '.' vs '_'
> inconsistency vs above.
>  

Yes, it should be '_'. I'll correct that.


>> Data section is always followed by vfio_device_migration_info
>> structure in the region, so data.offset will always be none-0.
> 
> This seems exactly backwards from the diagram, data section follows
> vfio_device_migration_info.  Also, "non-zero".
> 
>> Offset from where data is copied is decided by kernel driver, data
> 
> But data_offset is listed as read-write.
> 

data_offset is read during the _SAVING state so that QEMU knows from where
to read data.
data_offset is written during the _RESUMING state so that QEMU can convey
the offset in the migration region from which the vendor driver should
consume the input data.

>> section can be trapped or mapped depending on how kernel driver
>> defines data section. If mmapped, then data.offset should be page
>> aligned, where as initial section which contain
>> vfio_device_migration_info structure might not end at offset which
>> is page aligned.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  linux-headers/linux/vfio.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 65 insertions(+)
>>
>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>> index 12a7b1dc53c8..1b12a9b95e00 100644
>> --- a/linux-headers/linux/vfio.h
>> +++ b/linux-headers/linux/vfio.h
>> @@ -368,6 +368,71 @@ struct vfio_region_gfx_edid {
>>   */
>>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
>>  
>> +/* Migration region type and sub-type */
>> +#define VFIO_REGION_TYPE_MIGRATION	        (2)
>> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
>> +
>> +/**
>> + * Structure vfio_device_migration_info is placed at 0th offset of
>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
>> + * information. Field accesses from this structure are only supported at their
>> + * native width and alignment, otherwise should return error.
>> + *
>> + * device_state: (write only)
>> + *      To indicate vendor driver the state VFIO device should be transitioned
>> + *      to. If device state transition fails, write to this field return error.
>> + *      It consists of 2 bits.
>> + *      - If bit 0 set, indicates _RUNNING state. When its reset, that indicates
>> + *        _STOPPED state. When device is changed to _STOPPED, driver should stop
>> + *        device before write returns.
>> + *      - If bit 1 set, indicates _SAVING state. When its reset, that indicates
>> + *        _RESUMING state.
>> + *
>> + * pending bytes: (read only)
>> + *      Read pending bytes yet to be migrated from vendor driver
> 
> Is this essentially a volatile value, changing based on data previously
> copied and device activity?

Yes.

> 
>> + *
>> + * data_offset: (read/write)
>> + *      User application should read data_offset in migration region from where
>> + *      user application should read data during _SAVING state.
>> + *      User application would write data_offset in migration region from where
>> + *      user application is had written data during _RESUMING state.
> 
> Why wouldn't data_offset simply be fixed?  Do we really want to support
> the user writing arbitrary offsets in the migration region?  Each
> chunk can simply start at the kernel provided data_offset.
>

data_offset differs based on whether the region is trapped or mmapped. In
the case when the region is mmapped, QEMU writes data to the mapped region
and then writes the data_offset field, indicating that data is present in
the mmapped buffer. Read below for more detailed steps.

>> + *
>> + * data_size: (write only)
>> + *      User application should write size of data copied in migration region
>> + *      during _RESUMING state.
> 
> How much does the user read on _SAVING then?
> 

pending_bytes tells how many bytes should be read on _SAVING.

>> + *
>> + * start_pfn: (write only)
>> + *      Start address pfn to get bitmap of dirty pages from vendor driver duing
>> + *      _SAVING state.
>> + *
>> + * page_size: (write only)
>> + *      User application should write the page_size of pfn.
>> + *
>> + * total_pfns: (write only)
>> + *      Total pfn count from start_pfn for which dirty bitmap is requested.
> 
> So effectively the area that begins at data_offset is dual purpose
> (triple purpose vs Yan's, config, memory, and dirty bitmap) but the
> protocol isn't entirely evident to me. I think we need to write it out
> as Yan provided.  If I'm in the _SAVING state, do I simply read from
> data_offset to min(data_size, region.size - data_offset)?  If that area
> is mmap'd, how does the user indicate to the kernel to prepare the data
> or that X bytes were acquired?  In the _RESUMING state, do I write from
> data_offset to min(data_size, region.size - data_offset) and indicate
> the write using data_size?
> 

Let me list the steps for each state; hopefully that helps to answer all
the above questions.

In _SAVING|_RUNNING device state:
- read pending_bytes
- read data_offset - this indicates to the kernel driver to write data to
the staging buffer, which is mmapped.
- if the data section is trapped, pread() from data_offset up to
min(pending_bytes, remaining_region)
- if the data section is mmapped, read the mmapped buffer of size
min(pending_bytes, remaining_region)
- write the data packet to the file stream as below (see the sketch after
these steps):
{VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
VFIO_MIG_FLAG_END_OF_STATE}
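
To make that concrete, a rough sketch of one such pre-copy iteration in
QEMU-style C, for the trapped (non-mmapped) data section case -- the helper
name is hypothetical and error handling is omitted:

static void vfio_save_iteration_sketch(QEMUFile *f, VFIODevice *vbasedev)
{
    VFIORegion *region = &vbasedev->migration->region.buffer;
    uint64_t pending_bytes, data_offset, data_size;
    uint8_t *buf;

    /* read pending_bytes and data_offset from the header */
    pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
          region->fd_offset +
          offsetof(struct vfio_device_migration_info, pending_bytes));
    pread(vbasedev->fd, &data_offset, sizeof(data_offset),
          region->fd_offset +
          offsetof(struct vfio_device_migration_info, data_offset));

    /* read at most the part of the region past data_offset */
    data_size = MIN(pending_bytes, region->size - data_offset);
    buf = g_malloc(data_size);
    pread(vbasedev->fd, buf, data_size, region->fd_offset + data_offset);

    /* write the data packet to the migration stream */
    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
    qemu_put_be64(f, data_size);
    qemu_put_buffer(f, buf, data_size);
    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);

    g_free(buf);
}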


In _SAVING|_STOPPED device state:
a. read the config space of the device and save it to the migration file
stream. This doesn't need to come from the vendor driver. Any other
special config state from the driver can be saved as data in a following
iteration.
b. read pending_bytes - this indicates to the kernel driver to write data
to the staging buffer, which is mmapped.
c. if the data section is trapped, pread() from data_offset up to
min(pending_bytes, remaining_region)
d. if the data section is mmapped, read the mmapped buffer of size
min(pending_bytes, remaining_region)
e. write the data packet as below:
{VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
f. iterate through steps b to e while (pending_bytes > 0)
g. write {VFIO_MIG_FLAG_END_OF_STATE}


Dirty page tracking (.log_sync) is part of the RAM copying state: the
vendor driver provides, through the migration region, the bitmap of pages
dirtied by the device, and as part of the RAM copy those pages get copied
to the file stream.


In _RESUMING device state:
- load config state.
- for each data packet, until VFIO_MIG_FLAG_END_OF_STATE is reached (see
the sketch after these steps):
    - read data_size from the packet, then read a buffer of data_size;
        if the region is mmapped, write the data of data_size to the
mmapped region.
    - write data_offset and data_size.
        In the case of an mmapped region, the write to data_size indicates
to the kernel driver that the data has been written to the staging buffer.
    - if the region is trapped, pwrite() the data of data_size at
data_offset.

>> + *
>> + * copied_pfns: (read only)
>> + *      pfn count for which dirty bitmap is copied to migration region.
>> + *      Vendor driver should copy the bitmap with bits set only for pages to be
>> + *      marked dirty in migration region.
>> + *      Vendor driver should return 0 if there are 0 pages dirty in requested
>> + *      range.
>> + */
>> +
>> +struct vfio_device_migration_info {
>> +        __u32 device_state;         /* VFIO device state */
>> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
>> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
>> +        __u32 reserved;
>> +        __u64 pending_bytes;
>> +        __u64 data_offset;
>> +        __u64 data_size;
>> +        __u64 start_pfn;
>> +        __u64 page_size;
>> +        __u64 total_pfns;
>> +        __u64 copied_pfns;
>> +} __attribute__((packed));
>> +
> 
> As you mentioned, and with Yan's version, we still need to figure out
> something with compatibility and versioning.  Thanks,
> 

Yes, we still need to figure out compatibility and versioning.

Thanks,
Kirti

> Alex
> 
>>  /*
>>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>>   * which allows direct access to non-MSIX registers which happened to be within
> 


* Re: [Qemu-devel] [PATCH v3 3/5] Add migration functions for VFIO devices
  2019-02-21 22:38   ` Alex Williamson
@ 2019-02-26 20:16     ` Kirti Wankhede
  0 siblings, 0 replies; 17+ messages in thread
From: Kirti Wankhede @ 2019-02-26 20:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, yulei.zhang, qemu-devel



On 2/22/2019 4:08 AM, Alex Williamson wrote:
> On Wed, 20 Feb 2019 02:53:18 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> - Migration function are implemented for VFIO_DEVICE_TYPE_PCI device.
>> - Added SaveVMHandlers and implemented all basic functions required for live
>>   migration.
>> - Added VM state change handler to know running or stopped state of VM.
>> - Added migration state change notifier to get notification on migration state
>>   change. This state is translated to VFIO device state and conveyed to vendor
>>   driver.
>> - VFIO device supports migration or not is decided based of migration region
>>   query. If migration region query is successful then migration is supported
>>   else migration is blocked.
>> - Structure vfio_device_migration_info is mapped at 0th offset of migration
>>   region and should always trapped by VFIO device's driver. Added both type of
>>   access support, trapped or mmapped, for data section of the region.
>> - To save device state, read pending_bytes and data_offset using structure
>>   vfio_device_migration_info, accordingly copy data from the region.
>> - To restore device state, write data_offset and data_size in the structure
>>   and write data in the region.
>> - To get dirty page bitmap, write start address and pfn count then read count of
>>   pfns copied and accordingly read those from the rest of the region or mmaped
>>   part of the region. This copy is iterated till page bitmap for all requested
>>   pfns are copied.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  hw/vfio/Makefile.objs         |   2 +-
>>  hw/vfio/migration.c           | 714 ++++++++++++++++++++++++++++++++++++++++++
>>  include/hw/vfio/vfio-common.h |  20 ++
>>  3 files changed, 735 insertions(+), 1 deletion(-)
>>  create mode 100644 hw/vfio/migration.c
>>
>> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
>> index abad8b818c9b..36033d1437c5 100644
>> --- a/hw/vfio/Makefile.objs
>> +++ b/hw/vfio/Makefile.objs
>> @@ -1,4 +1,4 @@
>> -obj-y += common.o spapr.o
>> +obj-y += common.o spapr.o migration.o
>>  obj-$(CONFIG_VFIO_PCI) += pci.o pci-quirks.o display.o
>>  obj-$(CONFIG_VFIO_CCW) += ccw.o
>>  obj-$(CONFIG_VFIO_PLATFORM) += platform.o
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> new file mode 100644
>> index 000000000000..d7b6d972c043
>> --- /dev/null
>> +++ b/hw/vfio/migration.c
>> @@ -0,0 +1,714 @@
>> +/*
>> + * Migration support for VFIO devices
>> + *
>> + * Copyright NVIDIA, Inc. 2018
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2. See
>> + * the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include <linux/vfio.h>
>> +
>> +#include "hw/vfio/vfio-common.h"
>> +#include "cpu.h"
>> +#include "migration/migration.h"
>> +#include "migration/qemu-file.h"
>> +#include "migration/register.h"
>> +#include "migration/blocker.h"
>> +#include "migration/misc.h"
>> +#include "qapi/error.h"
>> +#include "exec/ramlist.h"
>> +#include "exec/ram_addr.h"
>> +#include "pci.h"
>> +
>> +/*
>> + * Flags used as delimiter:
>> + * 0xffffffff => MSB 32-bit all 1s
>> + * 0xef10     => emulated (virtual) function IO
> 
> :^)
> 
>> + * 0x0000     => 16-bits reserved for flags
>> + */
>> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
>> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
>> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
>> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
>> +
>> +static void vfio_migration_region_exit(VFIODevice *vbasedev)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    if (!migration) {
>> +        return;
>> +    }
>> +
>> +    if (migration->region.buffer.size) {
>> +        vfio_region_exit(&migration->region.buffer);
>> +        vfio_region_finalize(&migration->region.buffer);
>> +    }
>> +}
>> +
>> +static int vfio_migration_region_init(VFIODevice *vbasedev)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    Object *obj = NULL;
>> +    int ret = -EINVAL;
>> +
>> +    if (!migration) {
>> +        return ret;
>> +    }
>> +
>> +    /* Migration support added for PCI device only */
>> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
>> +        obj = vfio_pci_get_object(vbasedev);
>> +    }
>> +
>> +    if (!obj) {
>> +        return ret;
>> +    }
>> +
>> +    ret = vfio_region_setup(obj, vbasedev, &migration->region.buffer,
>> +                            migration->region.index, "migration");
>> +    if (ret) {
>> +        error_report("Failed to setup VFIO migration region %d: %s",
>> +                      migration->region.index, strerror(-ret));
>> +        goto err;
>> +    }
>> +
>> +    if (!migration->region.buffer.size) {
>> +        ret = -EINVAL;
>> +        error_report("Invalid region size of VFIO migration region %d: %s",
>> +                     migration->region.index, strerror(-ret));
>> +        goto err;
>> +    }
>> +
>> +    if (migration->region.buffer.mmaps) {
>> +        ret = vfio_region_mmap(&migration->region.buffer);
>> +        if (ret) {
>> +            error_report("Failed to mmap VFIO migration region %d: %s",
>> +                         migration->region.index, strerror(-ret));
>> +            goto err;
>> +        }
> 
> In patch 5 this eventually gets called from our realize function, so
> we're forcing the kernel to allocate whatever backing store it might
> use for these mmaps for the entire life of the device, in case maybe we
> want to migrate it.  Like my reply to Yan, unless I'm misunderstanding
> how drivers will back this mmap this seems really inefficient.
> 

Backing storage can be allocated on access, that is, from the fault
handler, but it will then remain allocated for the entire life of the
device.

The mmap can be moved to .save_setup and unmapped from .save_cleanup;
similarly, during resuming, .load_setup and .load_cleanup can be used
(a sketch follows). I'll update the patch in the next iteration.
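
Something along these lines, reusing the helpers this patch already uses
(only the placement is the proposed change; the vfio_region_unmap()
counterpart in .save_cleanup is an assumption):

static int vfio_save_setup(QEMUFile *f, void *opaque)
{
    VFIODevice *vbasedev = opaque;
    VFIORegion *region = &vbasedev->migration->region.buffer;

    /* defer the mmap, and any kernel backing store, to migration start */
    if (region->mmaps) {
        int ret = vfio_region_mmap(region);
        if (ret) {
            error_report("Failed to mmap VFIO migration region: %s",
                         strerror(-ret));
            return ret;
        }
    }
    return 0;
}

static void vfio_save_cleanup(void *opaque)
{
    VFIODevice *vbasedev = opaque;

    /* release the mmaps once migration is done */
    vfio_region_unmap(&vbasedev->migration->region.buffer);
}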

>> +    }
>> +
>> +    return 0;
>> +
>> +err:
>> +    vfio_migration_region_exit(vbasedev);
>> +    return ret;
>> +}
>> +
>> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIORegion *region = &migration->region.buffer;
>> +    int ret = 0;
>> +
>> +    ret = pwrite(vbasedev->fd, &state, sizeof(state),
>> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                              device_state));
>> +    if (ret < 0) {
>> +        error_report("Failed to set migration state %d %s",
>> +                     ret, strerror(errno));
>> +        return ret;
>> +    }
>> +
>> +    vbasedev->device_state = state;
>> +    return 0;
>> +}
>> +
>> +void vfio_get_dirty_page_list(VFIODevice *vbasedev,
>> +                              uint64_t start_pfn,
>> +                              uint64_t pfn_count,
>> +                              uint64_t page_size)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIORegion *region = &migration->region.buffer;
>> +    uint64_t count = 0, copied_pfns = 0;
>> +    int ret;
>> +
>> +    ret = pwrite(vbasedev->fd, &start_pfn, sizeof(start_pfn),
>> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                              start_pfn));
>> +    if (ret < 0) {
>> +        error_report("Failed to set dirty pages start address %d %s",
>> +                ret, strerror(errno));
>> +        return;
>> +    }
>> +
>> +    ret = pwrite(vbasedev->fd, &page_size, sizeof(page_size),
>> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                              page_size));
>> +    if (ret < 0) {
>> +        error_report("Failed to set dirty page size %d %s",
>> +                ret, strerror(errno));
>> +        return;
>> +    }
>> +
>> +    ret = pwrite(vbasedev->fd, &pfn_count, sizeof(pfn_count),
>> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                              total_pfns));
>> +    if (ret < 0) {
>> +        error_report("Failed to set dirty page total pfns %d %s",
>> +                ret, strerror(errno));
>> +        return;
>> +    }
>> +
>> +    do {
>> +        uint64_t bitmap_size;
>> +        void *buf = NULL;
>> +        bool buffer_mmaped = false;
>> +
>> +        /* Read dirty_pfns.copied */
>> +        ret = pread(vbasedev->fd, &copied_pfns, sizeof(copied_pfns),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             copied_pfns));
>> +        if (ret < 0) {
>> +            error_report("Failed to get dirty pages bitmap count %d %s",
>> +                    ret, strerror(errno));
>> +            return;
>> +        }
>> +
>> +        if (copied_pfns == 0) {
>> +            /*
>> +             * copied_pfns could be 0 if driver doesn't have any page to report
>> +             * dirty in given range
>> +             */
>> +            break;
>> +        }
>> +
>> +        bitmap_size = (BITS_TO_LONGS(copied_pfns) + 1) * sizeof(unsigned long);
>> +
>> +        if (region->mmaps) {
>> +            int i;
>> +
>> +            for (i = 0; i < region->nr_mmaps; i++) {
>> +                if (region->mmaps[i].size >= bitmap_size) {
>> +                    buf = region->mmaps[i].mmap;
>> +                    buffer_mmaped = true;
>> +                    break;
>> +                }
> 
> Wait, the first sparse mmap region that's big enough to hold the bitmap
> is necessarily the one where we read it?  Shouldn't we test that the
> mmap range covers the correct offset?  This also makes me realize that
> if a driver wanted the device memory mmap'd but not the dirty memory
> backing, I don't think they can do it with this interface.  Perhaps
> there should be separate data and dirty offset fields in the header so
> that a driver can place them at separate offsets with different mmap
> coverage if they desire.  Thanks,
>

Makes sense. I'll add dirty_offset in the header.

Thanks,
Kirti


> Alex
> 


* Re: [Qemu-devel] [PATCH v3 0/5] Add migration support for VFIO device
  2019-02-21  7:10       ` Neo Jia
@ 2019-02-27  1:51         ` Tian, Kevin
  0 siblings, 0 replies; 17+ messages in thread
From: Tian, Kevin @ 2019-02-27  1:51 UTC (permalink / raw)
  To: Neo Jia
  Cc: Kirti Wankhede, Dr. David Alan Gilbert, alex.williamson, Yang,
	Ziye, Liu, Changpeng, Liu, Yi L, mlevitsk, eskultet, cohuck,
	jonathan.davies, eauger, aik, pasic, felipe, Zhengxiao.zx,
	shuangtai.tst, Ken.Xue, Wang, Zhi A, Zhao, Yan Y, yulei.zhang,
	qemu-devel

> From: Neo Jia [mailto:cjia@nvidia.com]
> Sent: Thursday, February 21, 2019 3:10 PM
> 
> On Thu, Feb 21, 2019 at 05:52:53AM +0000, Tian, Kevin wrote:
> > > From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> > > Sent: Thursday, February 21, 2019 1:25 PM
> > >
> > > On 2/20/2019 3:52 PM, Dr. David Alan Gilbert wrote:
> > > > * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> > > >> Add migration support for VFIO device
> > > >
> > > > Hi Kirti,
> > > >   Can you explain how this compares and works with Yan Zhao's
> > > > set?
> > >
> > > This patch set is an incremental version of my previous patch set:
> > > https://patchwork.ozlabs.org/cover/1000719/
> > > It takes care of the feedback received on the previous version.
> > >
> > > This patch set is different from Yan Zhao's set.
> > >
> >
> > I can help give some background about Yan's work:
> >
> > There was a big gap between Kirti's last version and the overall review
> > comments, especially this one:
> > https://www.mail-archive.com/qemu-devel@nongnu.org/msg576652.html
> 
> Hi Kevin,
> 
> >
> > Then there was no reply from Kirti on whether she agreed with the comments
> > and was working on a new version.
> 
> Sorry, we should have acked those comments when we received them last
> time.
> 
> >
> > Then we thought we should jump in to keep the ball moving, based on
> > a fresh implementation following the recommended direction, i.e. focusing
> > on device state management instead of sticking to the migration flow in the
> > kernel API design.
> >
> > More importantly, we also provided a kernel-side implementation based
> > on Intel GVT-g to give the whole picture of both user- and kernel-side changes.
> > That should give people a better understanding of how those new APIs
> > are expected to be used by QEMU, and to be implemented by a vendor driver.
> >
> > That is why Yan just shared her work.
> 
> Really glad to see that the v2 version works for you guys; appreciate the
> driver-side changes.
> 
> >
> > Now it's great to see that Kirti is still actively working on this effort and is
> > also moving in the right direction. Let's have a close look at the two
> > implementations and then choose the cleaner one as a base for future
> > enhancements. :-)
> 
> Yes, the v3 has addressed all the comments / concerns raised on the v2; I think
> we should take a look and keep moving.
> 
> Just a quick thought - would it be possible / better to have Kirti focus on the
> QEMU patches and Yan take care of the GVT-g kernel driver side changes? This
> would give us the best testing coverage. Hope I don't step on anybody's toes
> here. ;-)
> 

That is one option to be considered. For now let's focus on closing the
kernel interface definition. I saw Alex has given lots of comments on both
series. Hope we can discuss and converge on that part soon, before
moving to the next version. :-)

Thanks
Kevin


* Re: [Qemu-devel] [PATCH v3 1/5] VFIO KABI for migration interface
  2019-02-26 20:05     ` Kirti Wankhede
@ 2019-03-01 13:36       ` Cornelia Huck
  2019-03-07 19:45       ` Alex Williamson
  1 sibling, 0 replies; 17+ messages in thread
From: Cornelia Huck @ 2019-03-01 13:36 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Alex Williamson, cjia, kevin.tian, ziye.yang, changpeng.liu,
	yi.l.liu, mlevitsk, eskultet, dgilbert, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	zhi.a.wang, yan.y.zhao, yulei.zhang, qemu-devel

On Wed, 27 Feb 2019 01:35:01 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Alex,
> 
> On 2/22/2019 3:53 AM, Alex Williamson wrote:
> > On Wed, 20 Feb 2019 02:53:16 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> - Defined MIGRATION region type and sub-type.
> >> - Used 2 bits to define VFIO device states.
> >>     Bit 0 => 0/1 => _STOPPED/_RUNNING
> >>     Bit 1 => 0/1 => _RESUMING/_SAVING
> >>     Combination of these bits defines VFIO device's state during migration
> >>     _RUNNING => Normal VFIO device running state.
> >>     _STOPPED => VFIO device stopped.
> >>     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but start
> >>                           saving state of device i.e. pre-copy state
> >>     _SAVING | _STOPPED => vCPUs are stoppped, VFIO device should be stopped, and
> >>                           save device state,i.e. stop-n-copy state
> >>     _RESUMING => VFIO device resuming state.  
> > 
> > Shouldn't we have a non-_RESUMING/_SAVING run state?  If these are
> > indicating direction of flow, maybe we need two bits:
> > 
> >   00b - None, normal runtime
> >   01b - Saving
> >   10b - Resuming
> >   11b - Invalid/reserved (maybe a Failed state indicator)
> >   
> 
> There have to be 2 more states:
> _SAVING | _RUNNING => SAVING while device is RUNNING - pre-copy phase
> _SAVING | _STOPPED => SAVING while device is STOPPED - stop-and-copy phase
> 
> So the 2 bits used in this patch are:
> 00b - _RESUMING
> 01b - _RUNNING - Normal Running
> 10b - _SAVING | _STOPPED - stop-and-copy phase
> 11b - _SAVING | _RUNNING - pre-copy phase

Not sure if the "use two bits" approach is the most elegant here --
rightmost bit unset can mean either _RESUMING or _STOPPED...

What about simply making this four distinct states?

#define _RESUMING 0
#define _RUNNING 1 //or the other way around?
#define _SAVING_STOPPED 2
#define _SAVING_RUNNING 3

You could still check if you're currently SAVING by ANDing with 10b.

(...)

> >> + *
> >> + * data_offset: (read/write)
> >> + *      User application should read data_offset in migration region from where
> >> + *      user application should read data during _SAVING state.
> >> + *      User application would write data_offset in migration region from where
> >> + *      user application is had written data during _RESUMING state.  
> > 
> > Why wouldn't data_offset simply be fixed?  Do we really want to support
> > the user writing arbitrary offsets in the migration region?  Each
> > chunk can simply start at the kernel provided data_offset.
> >  
> 
> data_offset differs based on whether the region is trapped or mmapped. In
> the case when the region is mmapped, QEMU writes data to the mapped region
> and then writes the data_offset field, indicating that data is present in
> the mmapped buffer. Read below for more detailed steps.
> 
> >> + *
> >> + * data_size: (write only)
> >> + *      User application should write size of data copied in migration region
> >> + *      during _RESUMING state.  
> > 
> > How much does the user read on _SAVING then?
> >   
> 
> pending_bytes tells how many bytes should be read on _SAVING.
> 
> >> + *
> >> + * start_pfn: (write only)
> >> + *      Start address pfn to get bitmap of dirty pages from vendor driver duing
> >> + *      _SAVING state.
> >> + *
> >> + * page_size: (write only)
> >> + *      User application should write the page_size of pfn.
> >> + *
> >> + * total_pfns: (write only)
> >> + *      Total pfn count from start_pfn for which dirty bitmap is requested.  
> > 
> > So effectively the area that begins at data_offset is dual purpose
> > (triple purpose vs Yan's, config, memory, and dirty bitmap) 

I'm wondering whether splitting it would combine the best of the two
approaches: Just having an optional dirty bitmap region would make it
more obvious if dirty page tracking is not available.


* Re: [Qemu-devel] [PATCH v3 1/5] VFIO KABI for migration interface
  2019-02-26 20:05     ` Kirti Wankhede
  2019-03-01 13:36       ` Cornelia Huck
@ 2019-03-07 19:45       ` Alex Williamson
  1 sibling, 0 replies; 17+ messages in thread
From: Alex Williamson @ 2019-03-07 19:45 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, yulei.zhang, qemu-devel

On Wed, 27 Feb 2019 01:35:01 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Alex,
> 
> On 2/22/2019 3:53 AM, Alex Williamson wrote:
> > On Wed, 20 Feb 2019 02:53:16 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> - Defined MIGRATION region type and sub-type.
> >> - Used 2 bits to define VFIO device states.
> >>     Bit 0 => 0/1 => _STOPPED/_RUNNING
> >>     Bit 1 => 0/1 => _RESUMING/_SAVING
> >>     Combination of these bits defines VFIO device's state during migration
> >>     _RUNNING => Normal VFIO device running state.
> >>     _STOPPED => VFIO device stopped.
> >>     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but start
> >>                           saving state of device i.e. pre-copy state
> >>     _SAVING | _STOPPED => vCPUs are stoppped, VFIO device should be stopped, and
> >>                           save device state,i.e. stop-n-copy state
> >>     _RESUMING => VFIO device resuming state.  
> > 
> > Shouldn't we have a non-_RESUMING/_SAVING run state?  If these are
> > indicating direction of flow, maybe we need two bits:
> > 
> >   00b - None, normal runtime
> >   01b - Saving
> >   10b - Resuming
> >   11b - Invalid/reserved (maybe a Failed state indicator)
> >   
> 
> There have to be 2 more states:
> _SAVING | _RUNNING => SAVING while device is RUNNING - pre-copy phase
> _SAVING | _STOPPED => SAVING while device is STOPPED - stop-and-copy phase

Sorry, this was unclear; a separate third bit would need to be the stop
bit. The above was only focusing on the data direction, because...

> So the 2 bits used in this patch are:
> 00b - _RESUMING

This is actually:

_RESUMING | _STOPPED

> 01b - _RUNNING - Normal Running

And this is really:

_RESUMING | _RUNNING

We're just selectively ignoring _RESUMING when we choose to.  So my
question is really more like:

|0|00|
 ^ ^^
 | ||
 | |+-- Saving
 | +--- Resuming
 +----- Stopped

Where Saving|Resuming both set is either invalid or a failure indicator.
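
Spelled out as hypothetical defines (illustrative only; note that these
bit positions intentionally differ from the _RUNNING/_SAVING encoding
in the patch):

    #define VFIO_DEVICE_STATE_SAVING    (1 << 0) /* data flows out of device */
    #define VFIO_DEVICE_STATE_RESUMING  (1 << 1) /* data flows into device */
    #define VFIO_DEVICE_STATE_STOPPED   (1 << 2) /* device is not running */

    /* Saving|Resuming both set: invalid, or usable as a failure indicator */
    #define VFIO_DEVICE_STATE_INVALID \
            (VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING)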

> 10b - _SAVING | _STOPPED - stop-and-copy phase
> 11b - _SAVING | _RUNNING - pre-copy phase
> 
> 
> >> - Defined vfio_device_migration_info structure which will be placed at 0th
> >>   offset of migration region to get/set VFIO device related information.
> >>   Defined members of structure and usage on read/write access:
> >>     * device_state: (write only)
> >>         To convey VFIO device state to be transitioned to.  
> > 
> > Seems trivial and potentially useful to support read here; we have 30
> > (or maybe 29) bits yet to define.
> >   
> 
> Ok, those bits can be used later.
> 
> >>     * pending bytes: (read only)
> >>         To get pending bytes yet to be migrated for VFIO device
> >>     * data_offset: (read/write)
> >>         To get or set data offset in migration from where data exist
> >>         during _SAVING and _RESUMING state  
> > 
> > What's the use case for writing this?
> >   
> 
> During resuming, the user space application (QEMU) writes data to the
> migration region. QEMU then writes data_offset to indicate from where
> it wrote the data, so that the vendor driver can read the data from
> that offset. If the data section is mmapped, data_offset is the start
> of the mmapped region, whereas if the data section is trapped,
> data_offset is sizeof(struct vfio_device_migration_info) + 1, just
> after the vfio_device_migration_info structure.

It doesn't make sense to me that this proposal allows the user to write
wherever they want, the vendor driver defines whether they support mmap
for a given operation and the user can choose to make use of mmap or
not.  Therefore I'd suggest that the vendor driver expose a read-only
data_offset that matches a sparse mmap capability entry should the
driver support mmap.  The user should always read or write data from the
vendor defined data_offset.  Otherwise we have scenarios like the other
patch I commented on where the user implicitly decides that the first
mmap region large enough is the one to be used for a given operation,
removing the vendor driver's ability to decide whether it wants to
support mmap for that operation.
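
Concretely, the user can already discover such areas from the region
info capability chain; a minimal sketch, relying only on the existing
sparse mmap capability (error handling omitted):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    static void dump_migration_mmaps(int device_fd, __u32 index)
    {
        struct vfio_region_info *info;
        struct vfio_info_cap_header *hdr;
        struct vfio_region_info_cap_sparse_mmap *sparse;
        __u32 i;

        info = calloc(1, sizeof(*info));
        info->argsz = sizeof(*info);
        info->index = index;
        ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, info); /* learn argsz */
        info = realloc(info, info->argsz);
        ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, info); /* fetch caps */

        if (info->flags & VFIO_REGION_INFO_FLAG_CAPS) {
            for (hdr = (void *)((char *)info + info->cap_offset); ;
                 hdr = (void *)((char *)info + hdr->next)) {
                if (hdr->id == VFIO_REGION_INFO_CAP_SPARSE_MMAP) {
                    sparse = (struct vfio_region_info_cap_sparse_mmap *)hdr;
                    for (i = 0; i < sparse->nr_areas; i++)
                        printf("mmap area %u: offset 0x%llx, size 0x%llx\n", i,
                               (unsigned long long)sparse->areas[i].offset,
                               (unsigned long long)sparse->areas[i].size);
                }
                if (!hdr->next)
                    break;
            }
        }
        free(info);
    }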

> >>     * data_size: (write only)
> >>         To convey size of data copied in migration region during _RESUMING
> >>         state  
> > 
> > How to know how much is available for read?  
> 
> pending_bytes tells how much is still to be read.

But pending_bytes can be bigger than the region, right?  Does the data
area necessarily extend to the end of the region?
 
> >>     * start_pfn, page_size, total_pfns: (write only)
> >>         To get bitmap of dirty pages from vendor driver from given
> >>         start address for total_pfns.  
> > 
> > What would happen if a user wrote in 1MB for page size?  Is the vendor
> > driver expected to support arbitrary page sizes?  Are we only trying to
> > convey the page size and would that page size ever be other than
> > getpagesize()?
> >   
> >>     * copied_pfns: (read only)
> >>         To get number of pfns bitmap copied in migration region.
> >>         Vendor driver should copy the bitmap with bits set only for
> >>         pages to be marked dirty in migration region. Vendor driver
> >>         should return 0 if there are 0 pages dirty in requested
> >>         range.  
> > 
> > This is useful, but I wonder if it's really a required feature for the
> > vendor driver.  For instance, with mdev IOMMU support we could wrap an
> > arbitrary PCI device as mdev, but we don't necessarily have dirty page
> > tracking.  Would a device need to report -1 here if it wanted to
> > indicate any page could be dirty if we only know how to collect the
> > state of the device itself for migration (i.e. force the device to be
> > stopped first)?
> >    
> 
> Does that mean that if -1 is returned here, all pages in the section
> should be marked dirty?

Yes
 
> >> Migration region looks like:
> >>  ------------------------------------------------------------------
> >> |vfio_device_migration_info|    data section                      |
> >> |                          |     ///////////////////////////////  |
> >>  ------------------------------------------------------------------  
> >                              ^ what's this?
> >   
> >>  ^                              ^                              ^
> >>  offset 0-trapped part        data.offset                 data.size  
> > 
> > Isn't data.size above really (data.offset + data.size)?  '.' vs '_'
> > inconsistency vs above.
> >    
> 
> Yes, it should be '_'. I'll correct that.
> 
> 
> >> Data section is always followed by vfio_device_migration_info
> >> structure in the region, so data.offset will always be none-0.  
> > 
> > This seems exactly backwards from the diagram, data section follows
> > vfio_device_migration_info.  Also, "non-zero".
> >   
> >> Offset from where data is copied is decided by kernel driver, data  
> > 
> > But data_offset is listed as read-write.
> >   
> 
> data_offset is read during the _SAVING state so that QEMU knows from
> where to read data.
> data_offset is written during the _RESUMING state so that QEMU can
> convey the offset in the migration region from where the vendor driver
> should consume the input data.

Please let's drop the idea that the user can write data to an arbitrary
offset within the region.  Does it serve any purpose?
 
> >> section can be trapped or mapped depending on how kernel driver
> >> defines data section. If mmapped, then data.offset should be page
> >> aligned, where as initial section which contain
> >> vfio_device_migration_info structure might not end at offset which
> >> is page aligned.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>  linux-headers/linux/vfio.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 65 insertions(+)
> >>
> >> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >> index 12a7b1dc53c8..1b12a9b95e00 100644
> >> --- a/linux-headers/linux/vfio.h
> >> +++ b/linux-headers/linux/vfio.h
> >> @@ -368,6 +368,71 @@ struct vfio_region_gfx_edid {
> >>   */
> >>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
> >>  
> >> +/* Migration region type and sub-type */
> >> +#define VFIO_REGION_TYPE_MIGRATION	        (2)
> >> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> >> +
> >> +/**
> >> + * Structure vfio_device_migration_info is placed at 0th offset of
> >> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> >> + * information. Field accesses from this structure are only supported at their
> >> + * native width and alignment, otherwise should return error.
> >> + *
> >> + * device_state: (write only)
> >> + *      To indicate to the vendor driver the state the VFIO device should be
> >> + *      transitioned to. If a device state transition fails, a write to this
> >> + *      field returns an error. It consists of 2 bits.
> >> + *      - If bit 0 is set, it indicates the _RUNNING state; when it is cleared,
> >> + *        that indicates the _STOPPED state. When the device is changed to
> >> + *        _STOPPED, the driver should stop the device before the write returns.
> >> + *      - If bit 1 is set, it indicates the _SAVING state; when it is cleared,
> >> + *        that indicates the _RESUMING state.  
> >> + *
> >> + * pending bytes: (read only)
> >> + *      Read pending bytes yet to be migrated from vendor driver  
> > 
> > Is this essentially a volatile value, changing based on data previously
> > copied and device activity?  
> 
> Yes.
> 
> >   
> >> + *
> >> + * data_offset: (read/write)
> >> + *      User application should read data_offset in migration region from where
> >> + *      user application should read data during _SAVING state.
> >> + *      User application should write data_offset in migration region from where
> >> + *      user application has written data during _RESUMING state.  
> > 
> > Why wouldn't data_offset simply be fixed?  Do we really want to support
> > the user writing at arbitrary offsets in the migration region?  Each
> > chunk can simply start at the kernel provided data_offset.
> >  
> 
> data_offset differs based on whether the region is trapped or mmapped.
> When the region is mmapped, QEMU writes data to the mmapped region and
> then writes the data_offset field, indicating that data is present in
> the mmapped buffer. Read below for more detailed steps.
> 
> >> + *
> >> + * data_size: (write only)
> >> + *      User application should write size of data copied in migration region
> >> + *      during _RESUMING state.  
> > 
> > How much does the user read on _SAVING then?
> >   
> 
> pending_bytes tells how many bytes should be read on _SAVING.

So pending_bytes is always less than or equal to the size of the
region, thus pending_bytes cannot be used to gauge the _total_ pending
device state?

> >> + *
> >> + * start_pfn: (write only)
> >> + *      Start address pfn to get bitmap of dirty pages from vendor driver during
> >> + *      _SAVING state.
> >> + *
> >> + * page_size: (write only)
> >> + *      User application should write the page_size of pfn.
> >> + *
> >> + * total_pfns: (write only)
> >> + *      Total pfn count from start_pfn for which dirty bitmap is requested.  
> > 
> > So effectively the area that begins at data_offset is dual purpose
> > (triple purpose vs Yan's, config, memory, and dirty bitmap) but the
> > protocol isn't entirely evident to me. I think we need to write it out
> > as Yan provided.  If I'm in the _SAVING state, do I simply read from
> > data_offset to min(data_size, region.size - data_offset)?  If that area
> > is mmap'd, how does the user indicate to the kernel to prepare the data
> > or that X bytes were acquired?  In the _RESUMING state, do I write from
> > data_offset to min(data_size, region.size - data_offset) and indicate
> > the write using data_size?
> >   
> 
> Let me list down the steps for each state, hope that helps to answer all
> above questions.
> 
> In _SAVING|_RUNNING device state:
> - read pending_bytes
> - read data_offset - this indicates to the kernel driver to write data
> to the staging buffer, which is mmapped.
> - if the data section is trapped, pread() from data_offset up to
> min(pending_bytes, remaining_region)
> - if the data section is mmapped, read the mmapped buffer of size
> min(pending_bytes, remaining_region)

Ok, pending_bytes can describe more than the region currently holds.
It still seems worthwhile to have a data_size so that we can extend
this interface.  For instance, what if a driver wanted to trap data
accesses but mmap a dirty bitmap?  This interface doesn't allow for
that since they're both necessarily covered by the same sparse mmap
offset.

> - Write data packet to file stream as below:
> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> VFIO_MIG_FLAG_END_OF_STATE }
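
A rough user-space sketch of one such pre-copy iteration for the
trapped (non-mmapped) case; the helper name and buffer handling are
illustrative, only the VFIO_MIG_FLAG_* framing comes from the patches,
and QEMU's MIN()/qemu_put_*() helpers are assumed:

    static ssize_t vfio_save_iterate_sketch(QEMUFile *f, int device_fd,
                                            struct vfio_region_info *region,
                                            uint8_t *buf)
    {
        uint64_t pending, data_offset, chunk;

        /* read pending_bytes */
        pread(device_fd, &pending, sizeof(pending),
              region->offset +
              offsetof(struct vfio_device_migration_info, pending_bytes));
        if (!pending)
            return 0;

        /* read data_offset, which tells the driver to stage a chunk */
        pread(device_fd, &data_offset, sizeof(data_offset),
              region->offset +
              offsetof(struct vfio_device_migration_info, data_offset));

        /* pread() min(pending_bytes, remaining_region) of device data */
        chunk = MIN(pending, region->size - data_offset);
        pread(device_fd, buf, chunk, region->offset + data_offset);

        /* frame the chunk into the migration stream */
        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
        qemu_put_be64(f, chunk);
        qemu_put_buffer(f, buf, chunk);
        qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);

        return chunk;
    }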
> 
> 
> In _SAVING|_STOPPED device state:
> a. read config space of device and save to migration file stream. This
> doesn't need to be from vendor driver. Any other special config state
> from driver can be saved as data in following iteration.
> b. read pending_bytes - this indicates to the kernel driver to write
> data to the staging buffer, which is mmapped.
> c. if the data section is trapped, pread() from data_offset up to
> min(pending_bytes, remaining_region)
> d. if the data section is mmapped, read the mmapped buffer of size
> min(pending_bytes, remaining_region)
> e. Write data packet as below:
> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> f. iterate through steps b to e while (pending_bytes > 0)

What indicates to the vendor driver to deduct from pending_bytes and
advance the data?  It seems like it's assumed that a read of
pending_bytes indicates this, but I don't like that implicit
interaction, a user could be interrupted and read pending_bytes again
to see if there's still data and introduce data loss.  IMO, there
should be an explicit "Ok, I read # bytes, advance the data stream"
type operation.

> g. Write {VFIO_MIG_FLAG_END_OF_STATE}
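
In terms of the hypothetical helper sketched above, the stop-and-copy
drain then collapses to something like this (illustrative only; note
it reuses the pre-copy packet framing):

    static void vfio_save_complete_sketch(QEMUFile *f, int device_fd,
                                          struct vfio_region_info *region,
                                          uint8_t *buf)
    {
        /* steps b to e, repeated while the driver reports pending data */
        while (vfio_save_iterate_sketch(f, device_fd, region, buf) > 0)
            ;

        /* step g: terminate the device state stream */
        qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
    }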
> 
> 
> Dirty page tracking (.log_sync) is part of the RAM copying state: the
> vendor driver provides, through the migration region, the bitmap of
> pages which it has dirtied, and as part of the RAM copy those pages
> get copied to the file stream.
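
A sketch of one such dirty bitmap query, assuming the field semantics
described in this patch (helper name hypothetical, error handling
omitted; the bitmap holds one bit per pfn):

    static void vfio_log_sync_sketch(int device_fd,
                                     struct vfio_region_info *region,
                                     uint64_t start_pfn, uint64_t total_pfns,
                                     uint8_t *bitmap)
    {
        uint64_t page_size = getpagesize();
        uint64_t copied_pfns, data_offset;

        /* parameters of the query */
        pwrite(device_fd, &start_pfn, sizeof(start_pfn),
               region->offset +
               offsetof(struct vfio_device_migration_info, start_pfn));
        pwrite(device_fd, &page_size, sizeof(page_size),
               region->offset +
               offsetof(struct vfio_device_migration_info, page_size));
        pwrite(device_fd, &total_pfns, sizeof(total_pfns),
               region->offset +
               offsetof(struct vfio_device_migration_info, total_pfns));

        /* copied_pfns reports how many pfns the driver covered */
        pread(device_fd, &copied_pfns, sizeof(copied_pfns),
              region->offset +
              offsetof(struct vfio_device_migration_info, copied_pfns));
        if (!copied_pfns)
            return;             /* no dirty pages in the requested range */

        pread(device_fd, &data_offset, sizeof(data_offset),
              region->offset +
              offsetof(struct vfio_device_migration_info, data_offset));
        pread(device_fd, bitmap, (copied_pfns + 7) / 8,
              region->offset + data_offset);
    }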
> 
> 
> In _RESUMING device state:
> - load config state.
> - For each data packet, until VFIO_MIG_FLAG_END_OF_STATE is reached:
>     - read data_size from the packet, read a buffer of data_size
>         if the region is mmapped, write data of data_size to the
> mmapped region.
>     - write data_offset and data_size.
>         In the case of an mmapped region, the write to data_size
> indicates to the kernel driver that data has been written to the
> staging buffer.
>     - if the region is trapped, pwrite() data of data_size at data_offset.
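
As a sketch of that resume sequence for a trapped data section (helper
name hypothetical, error handling omitted; data_offset placement
follows the description above):

    static void vfio_load_chunk_sketch(QEMUFile *f, int device_fd,
                                       struct vfio_region_info *region,
                                       uint8_t *buf, uint64_t data_size)
    {
        /* for a trapped section, data follows the info structure */
        uint64_t data_offset = sizeof(struct vfio_device_migration_info);

        /* read a buffer of data_size from the migration stream */
        qemu_get_buffer(f, buf, data_size);

        /* trapped region: pwrite() the data at data_offset */
        pwrite(device_fd, buf, data_size, region->offset + data_offset);

        /* convey where and how much was written */
        pwrite(device_fd, &data_offset, sizeof(data_offset),
               region->offset +
               offsetof(struct vfio_device_migration_info, data_offset));
        pwrite(device_fd, &data_size, sizeof(data_size),
               region->offset +
               offsetof(struct vfio_device_migration_info, data_size));
    }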

I still think data_offset should be read-only and this should use the
same operation above to indicate how many bytes were written rather
than read.  Thanks,

Alex

> >> + *
> >> + * copied_pfns: (read only)
> >> + *      pfn count for which dirty bitmap is copied to migration region.
> >> + *      Vendor driver should copy the bitmap with bits set only for pages to be
> >> + *      marked dirty in migration region.
> >> + *      Vendor driver should return 0 if there are 0 pages dirty in requested
> >> + *      range.
> >> + */
> >> +
> >> +struct vfio_device_migration_info {
> >> +        __u32 device_state;         /* VFIO device state */
> >> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> >> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> >> +        __u32 reserved;
> >> +        __u64 pending_bytes;
> >> +        __u64 data_offset;
> >> +        __u64 data_size;
> >> +        __u64 start_pfn;
> >> +        __u64 page_size;
> >> +        __u64 total_pfns;
> >> +        __u64 copied_pfns;
> >> +} __attribute__((packed));
> >> +  
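
As a usage note, a state transition under this proposal is then a
single field write; a sketch (helper name illustrative, region/fd
setup omitted):

    static int vfio_enter_precopy_sketch(int device_fd,
                                         struct vfio_region_info *region)
    {
        __u32 state = VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING;

        /* the write fails if the driver cannot make the transition */
        if (pwrite(device_fd, &state, sizeof(state),
                   region->offset +
                   offsetof(struct vfio_device_migration_info,
                            device_state)) != sizeof(state))
            return -1;

        return 0;
    }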
> > 
> > As you mentioned, and with Yan's version, we still need to figure out
> > something with compatibility and versioning.  Thanks,
> >   
> 
> Yes, we still need to figure out compatibility and versioning.
> 
> Thanks,
> Kirti
> 
> > Alex
> >   
> >>  /*
> >>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
> >>   * which allows direct access to non-MSIX registers which happened to be within  
> >   



Thread overview: 17+ messages
2019-02-19 21:23 [Qemu-devel] [PATCH v3 0/5] Add migration support for VFIO device Kirti Wankhede
2019-02-19 21:23 ` [Qemu-devel] [PATCH v3 1/5] VFIO KABI for migration interface Kirti Wankhede
2019-02-21 22:23   ` Alex Williamson
2019-02-26 20:05     ` Kirti Wankhede
2019-03-01 13:36       ` Cornelia Huck
2019-03-07 19:45       ` Alex Williamson
2019-02-19 21:23 ` [Qemu-devel] [PATCH v3 2/5] Add save and load functions for VFIO PCI devices Kirti Wankhede
2019-02-19 21:23 ` [Qemu-devel] [PATCH v3 3/5] Add migration functions for VFIO devices Kirti Wankhede
2019-02-21 22:38   ` Alex Williamson
2019-02-26 20:16     ` Kirti Wankhede
2019-02-19 21:23 ` [Qemu-devel] [PATCH v3 4/5] Add vfio_listerner_log_sync to mark dirty pages Kirti Wankhede
2019-02-19 21:23 ` [Qemu-devel] [PATCH v3 5/5] Make vfio-pci device migration capable Kirti Wankhede
2019-02-20 10:22 ` [Qemu-devel] [PATCH v3 0/5] Add migration support for VFIO device Dr. David Alan Gilbert
2019-02-21  5:25   ` Kirti Wankhede
2019-02-21  5:52     ` Tian, Kevin
2019-02-21  7:10       ` Neo Jia
2019-02-27  1:51         ` Tian, Kevin
