All of lore.kernel.org
* [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device
@ 2019-07-09  9:49 Kirti Wankhede
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 01/13] vfio: KABI for migration interface Kirti Wankhede
                   ` (13 more replies)
  0 siblings, 14 replies; 77+ messages in thread
From: Kirti Wankhede @ 2019-07-09  9:49 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Add migration support for VFIO device

This patch set includes the following patches:
- Define KABI for VFIO device for migration support.
- Added save and restore functions for PCI configuration space
- Generic migration functionality for VFIO device.
  * This patch set adds functionality only for PCI devices, but can be
    extended to other VFIO devices.
  * Added all the basic functions required for pre-copy, stop-and-copy and
    resume phases of migration.
  * Added a state change notifier; from that notifier function, the VFIO
    device's state change is conveyed to the VFIO device driver.
  * During save setup phase and resume/load setup phase, migration region
    is queried and is used to read/write VFIO device data.
  * .save_live_pending and .save_live_iterate are implemented to use QEMU's
    functionality of iteration during pre-copy phase.
  * In .save_live_complete_precopy, that is, in the stop-and-copy phase, data
    is read from the VFIO device driver iteratively until the pending bytes
    returned by the driver reach zero.
  * Added function to get dirty pages bitmap for the pages which are used by
    driver.
- Add vfio_listerner_log_sync to mark dirty pages.
- Make VFIO PCI device migration capable. If migration region is not provided by
  driver, migration is blocked.

Below is the flow of state change for live migration where states in brackets
represent VM state, migration state and VFIO device state as:
    (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)

Live migration save path:
        QEMU normal running state
        (RUNNING, _NONE, _RUNNING)
                        |
    migrate_init spawns migration_thread.
    (RUNNING, _SETUP, _RUNNING|_SAVING)
    Migration thread then calls each device's .save_setup()
                        |
    (RUNNING, _ACTIVE, _RUNNING|_SAVING)
    If device is active, get pending bytes by .save_live_pending()
    if pending bytes >= threshold_size,  call save_live_iterate()
    Data of VFIO device for pre-copy phase is copied.
    Iterate until pending bytes converge and fall below the threshold
                        |
    On migration completion, vCPUs stop and .save_live_complete_precopy is
    called for each active device. The VFIO device is then transitioned to
    the _SAVING state.
    (FINISH_MIGRATE, _DEVICE, _SAVING)
    For VFIO device, iterate in .save_live_complete_precopy until
    pending data is 0.
    (FINISH_MIGRATE, _DEVICE, _STOPPED)
                        |
    (FINISH_MIGRATE, _COMPLETED, _STOPPED)
    Migration thread schedules the cleanup bottom half and exits

Live migration resume path:
    Incoming migration calls .load_setup for each device
    (RESTORE_VM, _ACTIVE, _STOPPED)
                        |
    For each device, .load_state is called for that device section data
                        |
    At the end, .load_cleanup is called for each device and vCPUs are started.
                        |
        (RUNNING, _NONE, _RUNNING)
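The save path above can be condensed into the device_state bit transitions alone. Below is a small self-contained sketch (not patch code; the bit definitions are taken from patch 01, the helper names are made up for illustration) that walks the device from _RUNNING through pre-copy and stop-and-copy to _STOPPED:

```c
#include <stdint.h>

/* Device state bits as defined in patch 01 (linux-headers/linux/vfio.h) */
#define VFIO_DEVICE_STATE_RUNNING   (1u << 0)
#define VFIO_DEVICE_STATE_SAVING    (1u << 1)
#define VFIO_DEVICE_STATE_RESUMING  (1u << 2)

/* .save_setup(): pre-copy starts while vCPUs keep running */
static uint32_t enter_precopy(uint32_t state)
{
    return state | VFIO_DEVICE_STATE_SAVING;    /* _RUNNING | _SAVING */
}

/* .save_live_complete_precopy(): vCPUs stopped, stop-and-copy phase */
static uint32_t enter_stop_and_copy(uint32_t state)
{
    return state & ~VFIO_DEVICE_STATE_RUNNING;  /* _SAVING only */
}

/* Save complete: all bits clear means _STOPPED */
static uint32_t enter_stopped(uint32_t state)
{
    return state & ~VFIO_DEVICE_STATE_SAVING;
}

/* _SAVING and _RESUMING set together is the invalid state */
static int state_is_valid(uint32_t state)
{
    return (state & (VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING))
           != (VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING);
}
```

The resume path is the mirror image: .load_setup would set _RESUMING, and starting the vCPUs at the end would move the device back to _RUNNING.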

Note that:
- Migration post copy is not supported.

v6 -> v7:
- Fix build failures.

v5 -> v6:
- Fix build failure.

v4 -> v5:
- Added a descriptive comment about the sequence in which members of structure
  vfio_device_migration_info should be accessed, based on Alex's suggestion
- Updated get dirty pages sequence.
- As per Cornelia Huck's suggestion, added callbacks to VFIODeviceOps to
  get_object, save_config and load_config.
- Fixed multiple nit picks.
- Tested live migration with multiple vfio devices assigned to a VM.

v3 -> v4:
- Added one more bit for _RESUMING flag to be set explicitly.
- data_offset field is read-only for user space application.
- data_size is read on every iteration before reading data from the migration
  region; this removes the assumption that data extends to the end of the
  migration region.
- If the vendor driver supports mappable sparse regions, map those regions
  during the setup state of save/load and unmap them in the cleanup routines.
- Handled a race condition that caused data corruption in the migration region
  while saving device state, by adding a mutex and serializing the save_buffer
  and get_dirty_pages routines.
- Skipped calling the get_dirty_pages routine for the mapped MMIO region of the device.
- Added trace events.
- Split into multiple functional patches.

v2 -> v3:
- Removed enum of VFIO device states. Defined VFIO device state with 2 bits.
- Re-structured vfio_device_migration_info to keep it minimal and defined action
  on read and write access on its members.

v1 -> v2:
- Defined MIGRATION region type and sub-type which should be used with region
  type capability.
- Re-structured vfio_device_migration_info. This structure will be placed at 0th
  offset of migration region.
- Replaced ioctl with read/write for trapped part of migration region.
- Added support for both types of access, trapped or mmapped, for the data
  section of the region.
- Moved PCI device functions to pci file.
- Added iteration to get the dirty page bitmap until the bitmap for all
  requested pages is copied.

Thanks,
Kirti

Kirti Wankhede (13):
  vfio: KABI for migration interface
  vfio: Add function to unmap VFIO region
  vfio: Add vfio_get_object callback to VFIODeviceOps
  vfio: Add save and load functions for VFIO PCI devices
  vfio: Add migration region initialization and finalize function
  vfio: Add VM state change handler to know state of VM
  vfio: Add migration state change notifier
  vfio: Register SaveVMHandlers for VFIO device
  vfio: Add save state functions to SaveVMHandlers
  vfio: Add load state functions to SaveVMHandlers
  vfio: Add function to get dirty page list
  vfio: Add vfio_listerner_log_sync to mark dirty pages
  vfio: Make vfio-pci device migration capable.

 hw/vfio/Makefile.objs         |   2 +-
 hw/vfio/common.c              |  55 +++
 hw/vfio/migration.c           | 874 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                 | 137 ++++++-
 hw/vfio/trace-events          |  19 +
 include/hw/vfio/vfio-common.h |  25 ++
 linux-headers/linux/vfio.h    | 166 ++++++++
 7 files changed, 1271 insertions(+), 7 deletions(-)
 create mode 100644 hw/vfio/migration.c

-- 
2.7.0



^ permalink raw reply	[flat|nested] 77+ messages in thread

* [Qemu-devel] [PATCH v7 01/13] vfio: KABI for migration interface
  2019-07-09  9:49 [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device Kirti Wankhede
@ 2019-07-09  9:49 ` Kirti Wankhede
  2019-07-16 20:56   ` Alex Williamson
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 02/13] vfio: Add function to unmap VFIO region Kirti Wankhede
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 77+ messages in thread
From: Kirti Wankhede @ 2019-07-09  9:49 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

- Defined MIGRATION region type and sub-type.
- Used 3 bits to define VFIO device states.
    Bit 0 => _RUNNING
    Bit 1 => _SAVING
    Bit 2 => _RESUMING
    Combination of these bits defines VFIO device's state during migration
    _STOPPED => All bits 0 indicates VFIO device stopped.
    _RUNNING => Normal VFIO device running state.
    _SAVING | _RUNNING => vCPUs are running and the VFIO device is running, but
                          it has started saving device state, i.e. pre-copy state
    _SAVING  => vCPUs are stopped and the VFIO device should be stopped while
                          saving its device state, i.e. stop-and-copy state
    _RESUMING => VFIO device resuming state.
    _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING bits are set
- Defined vfio_device_migration_info structure which will be placed at 0th
  offset of migration region to get/set VFIO device related information.
  Defined members of structure and usage on read/write access:
    * device_state: (read/write)
        To convey VFIO device state to be transitioned to. Only 3 bits are used
        as of now.
    * pending_bytes: (read only)
        To get pending bytes yet to be migrated for VFIO device.
    * data_offset: (read only)
        To get the offset within the migration region from where data can be
        read during the _SAVING state and to where data should be written by
        the user space application during the _RESUMING state.
    * data_size: (read/write)
        To get and set size of data copied in migration region during _SAVING
        and _RESUMING state.
    * start_pfn, page_size, total_pfns: (write only)
        To request the dirty pages bitmap from the vendor driver for total_pfns
        pages starting at the given start address.
    * copied_pfns: (read only)
        To get the number of pfns for which the bitmap was copied to the
        migration region. The vendor driver should copy the bitmap with bits
        set only for pages to be marked dirty. The vendor driver should
        return 0 if no pages are dirty in the requested range, and -1 to mark
        all pages in the section as dirty.

Migration region looks like:
 ------------------------------------------------------------------
|vfio_device_migration_info|    data section                      |
|                          |     ///////////////////////////////  |
 ------------------------------------------------------------------
 ^                              ^                              ^
 offset 0-trapped part        data_offset                 data_size

The vfio_device_migration_info structure is always followed by the data
section in the region, so data_offset will always be non-0. The offset from
where data is copied is decided by the kernel driver; the data section can be
trapped or mapped depending on how the kernel driver defines it. If mmapped,
data_offset should be page aligned, whereas the initial section which
contains the vfio_device_migration_info structure might not end at an
offset which is page aligned.
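The pre-copy read loop the user application performs against this layout can be sketched as follows (not patch code: the struct mirrors the first fields this patch defines, the vendor driver is mocked so each iteration drains one chunk, and a plain array stands in for the staging buffer that a real driver would refill at data_offset):

```c
#include <stdint.h>
#include <string.h>

/* Mirrors the start of the register block at offset 0 of the migration region */
struct mig_info {
    uint32_t device_state;
    uint32_t reserved;
    uint64_t pending_bytes;
    uint64_t data_offset;
    uint64_t data_size;
};

/*
 * One pre-copy pass: while the driver reports pending data, read data_size
 * (here staged by a mock "vendor driver"), copy that many bytes from the
 * staging buffer into the outgoing stream, and repeat.  device_data + total
 * stands in for the staging buffer at data_offset, refilled each iteration.
 * Returns the total number of bytes saved.
 */
static uint64_t precopy_save(struct mig_info *info, const uint8_t *device_data,
                             uint8_t *stream, uint64_t chunk)
{
    uint64_t total = 0;

    while (info->pending_bytes > 0) {
        uint64_t size = chunk < info->pending_bytes ? chunk
                                                    : info->pending_bytes;

        info->data_size = size;                    /* mock driver response */
        memcpy(stream + total, device_data + total, size);
        total += size;
        info->pending_bytes -= size;               /* mock driver drains */
    }
    return total;
}
```

A real application would additionally read data_offset on each iteration and read from that offset of the region, since the driver may move the staging area between iterations.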

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 linux-headers/linux/vfio.h | 166 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 166 insertions(+)

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 24f505199f83..6696a4600545 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -372,6 +372,172 @@ struct vfio_region_gfx_edid {
  */
 #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
 
+/* Migration region type and sub-type */
+#define VFIO_REGION_TYPE_MIGRATION	        (2)
+#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
+
+/**
+ * Structure vfio_device_migration_info is placed at 0th offset of
+ * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
+ * information. Field accesses are only supported at their native width and
+ * alignment; any other access should return an error.
+ *
+ * device_state: (read/write)
+ *      To indicate to the vendor driver which state the VFIO device should
+ *      transition to. If the transition fails, a write to this field returns an error.
+ *      It consists of 3 bits:
+ *      - If bit 0 is set, it indicates the _RUNNING state; when reset, it
+ *        indicates the _STOPPED state. When the device is changed to _STOPPED,
+ *        the driver should stop the device before write() returns.
+ *      - If bit 1 set, indicates _SAVING state.
+ *      - If bit 2 set, indicates _RESUMING state.
+ *      _SAVING and _RESUMING set at the same time is invalid state.
+ *
+ * pending_bytes: (read only)
+ *      Number of pending bytes yet to be migrated from vendor driver
+ *
+ * data_offset: (read only)
+ *      The user application should read data_offset to learn from where in
+ *      the migration region to read device data during the _SAVING state, to
+ *      where to write device data during the _RESUMING state, or from where
+ *      to read the dirty pages bitmap. See below for the detailed sequence.
+ *
+ * data_size: (read/write)
+ *      User application should read data_size to get size of data copied in
+ *      migration region during _SAVING state and write size of data copied in
+ *      migration region during _RESUMING state.
+ *
+ * start_pfn: (write only)
+ *      Start address pfn from which to get the bitmap of dirty pages from
+ *      the vendor driver during the _SAVING state.
+ *
+ * page_size: (write only)
+ *      User application should write the page_size of pfn.
+ *
+ * total_pfns: (write only)
+ *      Total pfn count from start_pfn for which dirty bitmap is requested.
+ *
+ * copied_pfns: (read only)
+ *      pfn count for which dirty bitmap is copied to migration region.
+ *      Vendor driver should copy the bitmap with bits set only for pages to be
+ *      marked dirty in migration region.
+ *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if none of the
+ *        pages are dirty in requested range or rest of the range.
+ *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to mark all
+ *        pages dirty in the given section.
+ *      - Vendor driver should return pfn count for which bitmap is written in
+ *        the region.
+ *
+ * Migration region looks like:
+ *  ------------------------------------------------------------------
+ * |vfio_device_migration_info|    data section                      |
+ * |                          |     ///////////////////////////////  |
+ * ------------------------------------------------------------------
+ *   ^                              ^                              ^
+ *  offset 0-trapped part        data_offset                 data_size
+ *
+ * The vfio_device_migration_info structure is always followed by the data
+ * section in the region, so data_offset will always be non-0. The offset
+ * from where data is copied is decided by the kernel driver; the data
+ * section can be trapped or mapped depending on how the kernel driver
+ * defines it. If mmapped, data_offset should be page aligned, whereas the
+ * initial section which contains the vfio_device_migration_info structure
+ * might not end at an offset which is page aligned.
+ * data_offset can be the same or different for device data and the dirty
+ * page bitmap. The vendor driver should decide whether to partition the
+ * data section and how to partition it, and should return data_offset
+ * accordingly.
+ *
+ * Sequence to be followed:
+ * In _SAVING|_RUNNING device state or pre-copy phase:
+ * a. read pending_bytes. If pending_bytes > 0, go through below steps.
+ * b. read data_offset; this indicates to the kernel driver that it should
+ *    write data to the mmapped staging buffer.
+ * c. read data_size, amount of data in bytes written by vendor driver in
+ *    migration region.
+ * d. if the data section is trapped, read data_size bytes from data_offset.
+ * e. if the data section is mmapped, read data_size bytes from the mmapped
+ *    buffer at data_offset in the migration region.
+ * f. Write data_size and data to file stream.
+ * g. iterate through steps a to f while (pending_bytes > 0)
+ *
+ * In _SAVING device state or stop-and-copy phase:
+ * a. read config space of device and save to migration file stream. This
+ *    doesn't need to be from vendor driver. Any other special config state
+ *    from driver can be saved as data in following iteration.
+ * b. read pending_bytes.
+ * c. read data_offset; this indicates to the kernel driver that it should
+ *    write data to the mmapped staging buffer.
+ * d. read data_size, amount of data in bytes written by vendor driver in
+ *    migration region.
+ * e. if the data section is trapped, read data_size bytes from data_offset.
+ * f. if the data section is mmapped, read data_size bytes from the mmapped
+ *    buffer at data_offset in the migration region.
+ * g. Write data_size and data to file stream
+ * h. iterate through steps b to g while (pending_bytes > 0)
+ *
+ * When the data region is mapped, it is the user's responsibility to read
+ * data_size bytes from data_offset before moving on to the next steps.
+ *
+ * Dirty page tracking is part of the RAM copy state: the vendor driver
+ * provides, through the migration region, the bitmap of pages dirtied by
+ * the device, and as part of the RAM copy those pages get copied to the
+ * file stream.
+ *
+ * To get dirty page bitmap:
+ * a. write start_pfn, page_size and total_pfns.
+ * b. read copied_pfns.
+ *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if driver
+ *       doesn't have any page to report dirty in given range or rest of the
+ *       range. Exit loop.
+ *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to mark all
+ *       pages dirty for given range. Mark all pages in the range as dirty and
+ *       exit the loop.
+ *     - Otherwise, the vendor driver should return the copied_pfns count and
+ *       provide the bitmap for those pfns; the bitmap copied for the given
+ *       range then contains per-page information, some bits 0 and some 1.
+ * c. read data_offset, where vendor driver has written bitmap.
+ * d. read bitmap from the region or mmaped part of the region.
+ * e. Iterate through steps a to d while (total copied_pfns < total_pfns)
+ *
+ * In _RESUMING device state:
+ * - Load device config state.
+ * - While end of data for this device is not reached, repeat below steps:
+ *      - read data_size from file stream, read data from file stream of
+ *        data_size.
+ *      - read data_offset from where User application should write data.
+ *          if region is mmaped, write data of data_size to mmaped region.
+ *      - write data_size.
+ *          In the case of an mmapped region, a write to data_size indicates
+ *          to the kernel driver that data has been written to the staging buffer.
+ *      - if region is trapped, write data of data_size from data_offset.
+ *
+ * For user application, data is opaque. User should write data in the same
+ * order as received.
+ */
+
+struct vfio_device_migration_info {
+        __u32 device_state;         /* VFIO device state */
+#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
+#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
+#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
+#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
+                                     VFIO_DEVICE_STATE_SAVING | \
+                                     VFIO_DEVICE_STATE_RESUMING)
+#define VFIO_DEVICE_STATE_INVALID   (VFIO_DEVICE_STATE_SAVING | \
+                                     VFIO_DEVICE_STATE_RESUMING)
+        __u32 reserved;
+        __u64 pending_bytes;
+        __u64 data_offset;
+        __u64 data_size;
+        __u64 start_pfn;
+        __u64 page_size;
+        __u64 total_pfns;
+        __u64 copied_pfns;
+#define VFIO_DEVICE_DIRTY_PFNS_NONE     (0)
+#define VFIO_DEVICE_DIRTY_PFNS_ALL      (~0ULL)
+} __attribute__((packed));
+
 /*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within
-- 
2.7.0




* [Qemu-devel] [PATCH v7 02/13] vfio: Add function to unmap VFIO region
  2019-07-09  9:49 [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device Kirti Wankhede
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 01/13] vfio: KABI for migration interface Kirti Wankhede
@ 2019-07-09  9:49 ` Kirti Wankhede
  2019-07-16 16:29   ` Cornelia Huck
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 03/13] vfio: Add vfio_get_object callback to VFIODeviceOps Kirti Wankhede
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 77+ messages in thread
From: Kirti Wankhede @ 2019-07-09  9:49 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

This function is used in a following patch in this series.
Migration region is mmaped when migration starts and will be unmapped when
migration is complete.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/common.c              | 20 ++++++++++++++++++++
 hw/vfio/trace-events          |  1 +
 include/hw/vfio/vfio-common.h |  1 +
 3 files changed, 22 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index a859298fdad9..de74dae8d6a6 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -964,6 +964,26 @@ int vfio_region_mmap(VFIORegion *region)
     return 0;
 }
 
+void vfio_region_unmap(VFIORegion *region)
+{
+    int i;
+
+    if (!region->mem) {
+        return;
+    }
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        trace_vfio_region_unmap(memory_region_name(&region->mmaps[i].mem),
+                                region->mmaps[i].offset,
+                                region->mmaps[i].offset +
+                                region->mmaps[i].size - 1);
+        memory_region_del_subregion(region->mem, &region->mmaps[i].mem);
+        munmap(region->mmaps[i].mmap, region->mmaps[i].size);
+        object_unparent(OBJECT(&region->mmaps[i].mem));
+        region->mmaps[i].mmap = NULL;
+    }
+}
+
 void vfio_region_exit(VFIORegion *region)
 {
     int i;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index b1ef55a33ffd..8cdc27946cb8 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -111,6 +111,7 @@ vfio_region_mmap(const char *name, unsigned long offset, unsigned long end) "Reg
 vfio_region_exit(const char *name, int index) "Device %s, region %d"
 vfio_region_finalize(const char *name, int index) "Device %s, region %d"
 vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
+vfio_region_unmap(const char *name, unsigned long offset, unsigned long end) "Region %s unmap [0x%lx - 0x%lx]"
 vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Device %s region %d: %d sparse mmap entries"
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 9107bd41c030..93493891ba40 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -171,6 +171,7 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
                       int index, const char *name);
 int vfio_region_mmap(VFIORegion *region);
 void vfio_region_mmaps_set_enabled(VFIORegion *region, bool enabled);
+void vfio_region_unmap(VFIORegion *region);
 void vfio_region_exit(VFIORegion *region);
 void vfio_region_finalize(VFIORegion *region);
 void vfio_reset_handler(void *opaque);
-- 
2.7.0




* [Qemu-devel] [PATCH v7 03/13] vfio: Add vfio_get_object callback to VFIODeviceOps
  2019-07-09  9:49 [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device Kirti Wankhede
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 01/13] vfio: KABI for migration interface Kirti Wankhede
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 02/13] vfio: Add function to unmap VFIO region Kirti Wankhede
@ 2019-07-09  9:49 ` Kirti Wankhede
  2019-07-16 16:32   ` Cornelia Huck
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 04/13] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 77+ messages in thread
From: Kirti Wankhede @ 2019-07-09  9:49 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Hook vfio_get_object callback for PCI devices.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Suggested-by: Cornelia Huck <cohuck@redhat.com>
---
 hw/vfio/pci.c                 | 8 ++++++++
 include/hw/vfio/vfio-common.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index d7a4e1875c05..de0d286fc9dd 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2388,10 +2388,18 @@ static void vfio_pci_compute_needs_reset(VFIODevice *vbasedev)
     }
 }
 
+static Object *vfio_pci_get_object(VFIODevice *vbasedev)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+
+    return OBJECT(vdev);
+}
+
 static VFIODeviceOps vfio_pci_ops = {
     .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
     .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
     .vfio_eoi = vfio_intx_eoi,
+    .vfio_get_object = vfio_pci_get_object,
 };
 
 int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 93493891ba40..771b6d59a3db 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -119,6 +119,7 @@ struct VFIODeviceOps {
     void (*vfio_compute_needs_reset)(VFIODevice *vdev);
     int (*vfio_hot_reset_multi)(VFIODevice *vdev);
     void (*vfio_eoi)(VFIODevice *vdev);
+    Object *(*vfio_get_object)(VFIODevice *vdev);
 };
 
 typedef struct VFIOGroup {
-- 
2.7.0




* [Qemu-devel] [PATCH v7 04/13] vfio: Add save and load functions for VFIO PCI devices
  2019-07-09  9:49 [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (2 preceding siblings ...)
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 03/13] vfio: Add vfio_get_object callback to VFIODeviceOps Kirti Wankhede
@ 2019-07-09  9:49 ` Kirti Wankhede
  2019-07-11 12:07   ` Dr. David Alan Gilbert
  2019-07-16 21:14   ` Alex Williamson
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 05/13] vfio: Add migration region initialization and finalize function Kirti Wankhede
                   ` (9 subsequent siblings)
  13 siblings, 2 replies; 77+ messages in thread
From: Kirti Wankhede @ 2019-07-09  9:49 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

These functions save and restore PCI device specific data, i.e. the config
space of the PCI device.
Save and restore were tested with the MSI and MSI-X interrupt types.
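For the MSI case, the config-state section that vfio_pci_save_config() below emits has a fixed byte layout. This is a sketch of the offsets, derived from the field order in the patch; the enum names are made up for illustration, not part of any API:

```c
/*
 * Byte offsets within one device's config-state section of the migration
 * stream, for a device using MSI (field order per vfio_pci_save_config()).
 */
enum {
    CFG_BARS        = 0,                      /* 6 BARs, one be32 each       */
    CFG_INT_TYPE    = CFG_BARS + 6 * 4,       /* be32 interrupt type         */
    CFG_MSI_ADDR_LO = CFG_INT_TYPE + 4,       /* be32 MSI address low        */
    CFG_MSI_ADDR_HI = CFG_MSI_ADDR_LO + 4,    /* be32 MSI address high       */
    CFG_MSI_DATA    = CFG_MSI_ADDR_HI + 4,    /* be32 MSI data               */
    CFG_PCI_CMD     = CFG_MSI_DATA + 4,       /* be16 PCI_COMMAND, saved last */
    CFG_MSI_TOTAL   = CFG_PCI_CMD + 2,        /* total bytes for MSI device  */
};
```

Note that the patch writes msi_addr_hi unconditionally (it stays 0 for 32-bit MSI), which is what keeps this layout fixed; the MSI-X branch instead saves a be16 flags word followed by the variable-size msix_save() payload.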

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/pci.c                 | 114 ++++++++++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h |   2 +
 2 files changed, 116 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index de0d286fc9dd..5fe4f8076cac 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2395,11 +2395,125 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
     return OBJECT(vdev);
 }
 
+static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+    PCIDevice *pdev = &vdev->pdev;
+    uint16_t pci_cmd;
+    int i;
+
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        uint32_t bar;
+
+        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
+        qemu_put_be32(f, bar);
+    }
+
+    qemu_put_be32(f, vdev->interrupt);
+    if (vdev->interrupt == VFIO_INT_MSI) {
+        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
+        bool msi_64bit;
+
+        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+                                            2);
+        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
+
+        msi_addr_lo = pci_default_read_config(pdev,
+                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
+        qemu_put_be32(f, msi_addr_lo);
+
+        if (msi_64bit) {
+            msi_addr_hi = pci_default_read_config(pdev,
+                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
+                                             4);
+        }
+        qemu_put_be32(f, msi_addr_hi);
+
+        msi_data = pci_default_read_config(pdev,
+                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
+                2);
+        qemu_put_be32(f, msi_data);
+    } else if (vdev->interrupt == VFIO_INT_MSIX) {
+        uint16_t offset;
+
+        /* save enable bit and maskall bit */
+        offset = pci_default_read_config(pdev,
+                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
+        qemu_put_be16(f, offset);
+        msix_save(pdev, f);
+    }
+    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
+    qemu_put_be16(f, pci_cmd);
+}
+
+static void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+    PCIDevice *pdev = &vdev->pdev;
+    uint32_t interrupt_type;
+    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
+    uint16_t pci_cmd;
+    bool msi_64bit;
+    int i;
+
+    /* restore pci bar configuration */
+    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
+    vfio_pci_write_config(pdev, PCI_COMMAND,
+                        pci_cmd & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        uint32_t bar = qemu_get_be32(f);
+
+        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
+    }
+    vfio_pci_write_config(pdev, PCI_COMMAND,
+                          pci_cmd | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
+
+    interrupt_type = qemu_get_be32(f);
+
+    if (interrupt_type == VFIO_INT_MSI) {
+        /* restore msi configuration */
+        msi_flags = pci_default_read_config(pdev,
+                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
+        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
+
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+                              msi_flags & ~PCI_MSI_FLAGS_ENABLE, 2);
+
+        msi_addr_lo = qemu_get_be32(f);
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
+                              msi_addr_lo, 4);
+
+        msi_addr_hi = qemu_get_be32(f);
+        if (msi_64bit) {
+            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
+                                  msi_addr_hi, 4);
+        }
+        msi_data = qemu_get_be32(f);
+        vfio_pci_write_config(pdev,
+                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
+                msi_data, 2);
+
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
+    } else if (interrupt_type == VFIO_INT_MSIX) {
+        uint16_t offset = qemu_get_be16(f);
+
+        /* load enable bit and maskall bit */
+        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
+                              offset, 2);
+        msix_load(pdev, f);
+    }
+    pci_cmd = qemu_get_be16(f);
+    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
+}
+
 static VFIODeviceOps vfio_pci_ops = {
     .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
     .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
     .vfio_eoi = vfio_intx_eoi,
     .vfio_get_object = vfio_pci_get_object,
+    .vfio_save_config = vfio_pci_save_config,
+    .vfio_load_config = vfio_pci_load_config,
 };
 
 int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 771b6d59a3db..ee72bd984a36 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -120,6 +120,8 @@ struct VFIODeviceOps {
     int (*vfio_hot_reset_multi)(VFIODevice *vdev);
     void (*vfio_eoi)(VFIODevice *vdev);
     Object *(*vfio_get_object)(VFIODevice *vdev);
+    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
+    void (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
 };
 
 typedef struct VFIOGroup {
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [Qemu-devel] [PATCH v7 05/13] vfio: Add migration region initialization and finalize function
  2019-07-09  9:49 [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (3 preceding siblings ...)
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 04/13] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
@ 2019-07-09  9:49 ` Kirti Wankhede
  2019-07-16 21:37   ` Alex Williamson
  2019-07-23 12:52   ` Cornelia Huck
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 06/13] vfio: Add VM state change handler to know state of VM Kirti Wankhede
                   ` (8 subsequent siblings)
  13 siblings, 2 replies; 77+ messages in thread
From: Kirti Wankhede @ 2019-07-09  9:49 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

- Migration functions are implemented for VFIO_DEVICE_TYPE_PCI devices in this
  patch series.
- Whether a VFIO device supports migration is decided based on the migration
  region query. If the migration region query and the migration region
  initialization both succeed, migration is supported; otherwise migration is
  blocked.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/Makefile.objs         |   2 +-
 hw/vfio/migration.c           | 145 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events          |   3 +
 include/hw/vfio/vfio-common.h |  14 ++++
 4 files changed, 163 insertions(+), 1 deletion(-)
 create mode 100644 hw/vfio/migration.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index abad8b818c9b..36033d1437c5 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,4 +1,4 @@
-obj-y += common.o spapr.o
+obj-y += common.o spapr.o migration.o
 obj-$(CONFIG_VFIO_PCI) += pci.o pci-quirks.o display.o
 obj-$(CONFIG_VFIO_CCW) += ccw.o
 obj-$(CONFIG_VFIO_PLATFORM) += platform.o
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
new file mode 100644
index 000000000000..a2cfbd5af2e1
--- /dev/null
+++ b/hw/vfio/migration.c
@@ -0,0 +1,145 @@
+/*
+ * Migration support for VFIO devices
+ *
+ * Copyright NVIDIA, Inc. 2019
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <linux/vfio.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "cpu.h"
+#include "migration/migration.h"
+#include "migration/qemu-file.h"
+#include "migration/register.h"
+#include "migration/blocker.h"
+#include "migration/misc.h"
+#include "qapi/error.h"
+#include "exec/ramlist.h"
+#include "exec/ram_addr.h"
+#include "pci.h"
+#include "trace.h"
+
+static void vfio_migration_region_exit(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+
+    if (!migration) {
+        return;
+    }
+
+    if (migration->region.buffer.size) {
+        vfio_region_exit(&migration->region.buffer);
+        vfio_region_finalize(&migration->region.buffer);
+    }
+}
+
+static int vfio_migration_region_init(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    Object *obj = NULL;
+    int ret = -EINVAL;
+
+    if (!migration) {
+        return ret;
+    }
+
+    if (!vbasedev->ops || !vbasedev->ops->vfio_get_object) {
+        return ret;
+    }
+
+    obj = vbasedev->ops->vfio_get_object(vbasedev);
+    if (!obj) {
+        return ret;
+    }
+
+    ret = vfio_region_setup(obj, vbasedev, &migration->region.buffer,
+                            migration->region.index, "migration");
+    if (ret) {
+        error_report("%s: Failed to setup VFIO migration region %d: %s",
+                     vbasedev->name, migration->region.index, strerror(-ret));
+        goto err;
+    }
+
+    if (!migration->region.buffer.size) {
+        ret = -EINVAL;
+        error_report("%s: Invalid region size of VFIO migration region %d: %s",
+                     vbasedev->name, migration->region.index, strerror(-ret));
+        goto err;
+    }
+
+    return 0;
+
+err:
+    vfio_migration_region_exit(vbasedev);
+    return ret;
+}
+
+static int vfio_migration_init(VFIODevice *vbasedev,
+                               struct vfio_region_info *info)
+{
+    int ret;
+
+    vbasedev->migration = g_new0(VFIOMigration, 1);
+    vbasedev->migration->region.index = info->index;
+
+    ret = vfio_migration_region_init(vbasedev);
+    if (ret) {
+        error_report("%s: Failed to initialise migration region",
+                     vbasedev->name);
+        return ret;
+    }
+
+    return 0;
+}
+
+/* ---------------------------------------------------------------------- */
+
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
+{
+    struct vfio_region_info *info;
+    Error *local_err = NULL;
+    int ret;
+
+    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
+                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
+    if (ret) {
+        goto add_blocker;
+    }
+
+    ret = vfio_migration_init(vbasedev, info);
+    if (ret) {
+        goto add_blocker;
+    }
+
+    trace_vfio_migration_probe(vbasedev->name, info->index);
+    return 0;
+
+add_blocker:
+    error_setg(&vbasedev->migration_blocker,
+               "VFIO device doesn't support migration");
+    ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        error_free(vbasedev->migration_blocker);
+    }
+    return ret;
+}
+
+void vfio_migration_finalize(VFIODevice *vbasedev)
+{
+    if (!vbasedev->migration) {
+        return;
+    }
+
+    if (vbasedev->migration_blocker) {
+        migrate_del_blocker(vbasedev->migration_blocker);
+        error_free(vbasedev->migration_blocker);
+    }
+
+    vfio_migration_region_exit(vbasedev);
+    g_free(vbasedev->migration);
+}
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 8cdc27946cb8..191a726a1312 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -143,3 +143,6 @@ vfio_display_edid_link_up(void) ""
 vfio_display_edid_link_down(void) ""
 vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
 vfio_display_edid_write_error(void) ""
+
+# migration.c
+vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index ee72bd984a36..152da3f8d6f3 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -57,6 +57,15 @@ typedef struct VFIORegion {
     uint8_t nr; /* cache the region number for debug */
 } VFIORegion;
 
+typedef struct VFIOMigration {
+    struct {
+        VFIORegion buffer;
+        uint32_t index;
+    } region;
+    uint64_t pending_bytes;
+    QemuMutex lock;
+} VFIOMigration;
+
 typedef struct VFIOAddressSpace {
     AddressSpace *as;
     QLIST_HEAD(, VFIOContainer) containers;
@@ -113,6 +122,8 @@ typedef struct VFIODevice {
     unsigned int num_irqs;
     unsigned int num_regions;
     unsigned int flags;
+    VFIOMigration *migration;
+    Error *migration_blocker;
 } VFIODevice;
 
 struct VFIODeviceOps {
@@ -204,4 +215,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
 int vfio_spapr_remove_window(VFIOContainer *container,
                              hwaddr offset_within_address_space);
 
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
+void vfio_migration_finalize(VFIODevice *vbasedev);
+
 #endif /* HW_VFIO_VFIO_COMMON_H */
-- 
2.7.0

* [Qemu-devel] [PATCH v7 06/13] vfio: Add VM state change handler to know state of VM
  2019-07-09  9:49 [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (4 preceding siblings ...)
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 05/13] vfio: Add migration region initialization and finalize function Kirti Wankhede
@ 2019-07-09  9:49 ` Kirti Wankhede
  2019-07-11 12:13   ` Dr. David Alan Gilbert
                     ` (2 more replies)
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 07/13] vfio: Add migration state change notifier Kirti Wankhede
                   ` (7 subsequent siblings)
  13 siblings, 3 replies; 77+ messages in thread
From: Kirti Wankhede @ 2019-07-09  9:49 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

The VM state change handler gets called on a change in the VM's state. This is
used to set the VFIO device state to _RUNNING.
The VM state change handler, migration state change handler and log_sync
listener are called asynchronously, which can lead to data corruption in the
migration region. Initialise a mutex that is used to serialize operations on
the migration data region during saving state.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c           | 64 +++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events          |  2 ++
 include/hw/vfio/vfio-common.h |  4 +++
 3 files changed, 70 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index a2cfbd5af2e1..c01f08b659d0 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -78,6 +78,60 @@ err:
     return ret;
 }
 
+static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region.buffer;
+    uint32_t device_state;
+    int ret = 0;
+
+    device_state = (state & VFIO_DEVICE_STATE_MASK) |
+                   (vbasedev->device_state & ~VFIO_DEVICE_STATE_MASK);
+
+    if ((device_state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_INVALID) {
+        return -EINVAL;
+    }
+
+    ret = pwrite(vbasedev->fd, &device_state, sizeof(device_state),
+                 region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                              device_state));
+    if (ret < 0) {
+        error_report("%s: Failed to set device state %d %s",
+                     vbasedev->name, ret, strerror(errno));
+        return ret;
+    }
+
+    vbasedev->device_state = device_state;
+    trace_vfio_migration_set_state(vbasedev->name, device_state);
+    return 0;
+}
+
+static void vfio_vmstate_change(void *opaque, int running, RunState state)
+{
+    VFIODevice *vbasedev = opaque;
+
+    if (vbasedev->vm_running != running) {
+        int ret;
+        uint32_t dev_state;
+
+        if (running) {
+            dev_state = VFIO_DEVICE_STATE_RUNNING;
+        } else {
+            dev_state = (vbasedev->device_state & VFIO_DEVICE_STATE_MASK) &
+                     ~VFIO_DEVICE_STATE_RUNNING;
+        }
+
+        ret = vfio_migration_set_state(vbasedev, dev_state);
+        if (ret) {
+            error_report("%s: Failed to set device state 0x%x",
+                         vbasedev->name, dev_state);
+        }
+        vbasedev->vm_running = running;
+        trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
+                                  dev_state);
+    }
+}
+
 static int vfio_migration_init(VFIODevice *vbasedev,
                                struct vfio_region_info *info)
 {
@@ -93,6 +147,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
         return ret;
     }
 
+    qemu_mutex_init(&vbasedev->migration->lock);
+
+    vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
+                                                          vbasedev);
+
     return 0;
 }
 
@@ -135,11 +194,16 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
         return;
     }
 
+    if (vbasedev->vm_state) {
+        qemu_del_vm_change_state_handler(vbasedev->vm_state);
+    }
+
     if (vbasedev->migration_blocker) {
         migrate_del_blocker(vbasedev->migration_blocker);
         error_free(vbasedev->migration_blocker);
     }
 
+    qemu_mutex_destroy(&vbasedev->migration->lock);
     vfio_migration_region_exit(vbasedev);
     g_free(vbasedev->migration);
 }
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 191a726a1312..3d15bacd031a 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -146,3 +146,5 @@ vfio_display_edid_write_error(void) ""
 
 # migration.c
 vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
+vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
+vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 152da3f8d6f3..f6c70db3a9c1 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -29,6 +29,7 @@
 #ifdef CONFIG_LINUX
 #include <linux/vfio.h>
 #endif
+#include "sysemu/sysemu.h"
 
 #define VFIO_MSG_PREFIX "vfio %s: "
 
@@ -124,6 +125,9 @@ typedef struct VFIODevice {
     unsigned int flags;
     VFIOMigration *migration;
     Error *migration_blocker;
+    uint32_t device_state;
+    VMChangeStateEntry *vm_state;
+    int vm_running;
 } VFIODevice;
 
 struct VFIODeviceOps {
-- 
2.7.0

* [Qemu-devel] [PATCH v7 07/13] vfio: Add migration state change notifier
  2019-07-09  9:49 [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (5 preceding siblings ...)
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 06/13] vfio: Add VM state change handler to know state of VM Kirti Wankhede
@ 2019-07-09  9:49 ` Kirti Wankhede
  2019-07-17  2:25   ` Yan Zhao
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 08/13] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 77+ messages in thread
From: Kirti Wankhede @ 2019-07-09  9:49 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Added a migration state change notifier to get notifications of migration
state changes. These states are translated to the VFIO device state and
conveyed to the vendor driver.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c           | 54 +++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events          |  1 +
 include/hw/vfio/vfio-common.h |  1 +
 3 files changed, 56 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index c01f08b659d0..e4a89a6f9bc7 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -132,6 +132,53 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state)
     }
 }
 
+static void vfio_migration_state_notifier(Notifier *notifier, void *data)
+{
+    MigrationState *s = data;
+    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
+    int ret;
+
+    trace_vfio_migration_state_notifier(vbasedev->name, s->state);
+
+    switch (s->state) {
+    case MIGRATION_STATUS_ACTIVE:
+        if (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING) {
+            if (vbasedev->vm_running) {
+                ret = vfio_migration_set_state(vbasedev,
+                          VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING);
+                if (ret) {
+                    error_report("%s: Failed to set state RUNNING and SAVING",
+                                  vbasedev->name);
+                }
+            } else {
+                ret = vfio_migration_set_state(vbasedev,
+                                               VFIO_DEVICE_STATE_SAVING);
+                if (ret) {
+                    error_report("%s: Failed to set state STOP and SAVING",
+                                 vbasedev->name);
+                }
+            }
+        } else {
+            ret = vfio_migration_set_state(vbasedev,
+                                           VFIO_DEVICE_STATE_RESUMING);
+            if (ret) {
+                error_report("%s: Failed to set state RESUMING",
+                             vbasedev->name);
+            }
+        }
+        return;
+
+    case MIGRATION_STATUS_CANCELLING:
+    case MIGRATION_STATUS_CANCELLED:
+    case MIGRATION_STATUS_FAILED:
+        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING);
+        if (ret) {
+            error_report("%s: Failed to set state RUNNING", vbasedev->name);
+        }
+        return;
+    }
+}
+
 static int vfio_migration_init(VFIODevice *vbasedev,
                                struct vfio_region_info *info)
 {
@@ -152,6 +199,9 @@ static int vfio_migration_init(VFIODevice *vbasedev,
     vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
                                                           vbasedev);
 
+    vbasedev->migration_state.notify = vfio_migration_state_notifier;
+    add_migration_state_change_notifier(&vbasedev->migration_state);
+
     return 0;
 }
 
@@ -194,6 +244,10 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
         return;
     }
 
+    if (vbasedev->migration_state.notify) {
+        remove_migration_state_change_notifier(&vbasedev->migration_state);
+    }
+
     if (vbasedev->vm_state) {
         qemu_del_vm_change_state_handler(vbasedev->vm_state);
     }
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 3d15bacd031a..69503228f20e 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -148,3 +148,4 @@ vfio_display_edid_write_error(void) ""
 vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
 vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
 vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
+vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index f6c70db3a9c1..a022484d2636 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -128,6 +128,7 @@ typedef struct VFIODevice {
     uint32_t device_state;
     VMChangeStateEntry *vm_state;
     int vm_running;
+    Notifier migration_state;
 } VFIODevice;
 
 struct VFIODeviceOps {
-- 
2.7.0

* [Qemu-devel] [PATCH v7 08/13] vfio: Register SaveVMHandlers for VFIO device
  2019-07-09  9:49 [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (6 preceding siblings ...)
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 07/13] vfio: Add migration state change notifier Kirti Wankhede
@ 2019-07-09  9:49 ` Kirti Wankhede
  2019-07-22  8:34   ` Yan Zhao
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 09/13] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 77+ messages in thread
From: Kirti Wankhede @ 2019-07-09  9:49 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Define flags to be used as delimiters in the migration file stream.
Added .save_setup and .save_cleanup functions, which map and unmap the
migration region at the source during the saving or pre-copy phase.
Set the VFIO device state depending on the VM's state. During live migration
the VM is running when .save_setup is called, so the _SAVING | _RUNNING state
is set for the VFIO device. During save-restore the VM is paused, so only the
_SAVING state is set for the VFIO device.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c  | 82 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 hw/vfio/trace-events |  2 ++
 2 files changed, 83 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index e4a89a6f9bc7..0597a45fda2d 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -23,6 +23,17 @@
 #include "pci.h"
 #include "trace.h"
 
+/*
+ * Flags used as delimiter:
+ * 0xffffffff => MSB 32-bit all 1s
+ * 0xef10     => emulated (virtual) function IO
+ * 0x0000     => 16-bits reserved for flags
+ */
+#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
+#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
+#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
+#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
+
 static void vfio_migration_region_exit(VFIODevice *vbasedev)
 {
     VFIOMigration *migration = vbasedev->migration;
@@ -106,6 +117,74 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
     return 0;
 }
 
+/* ---------------------------------------------------------------------- */
+
+static int vfio_save_setup(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
+
+    if (migration->region.buffer.mmaps) {
+        qemu_mutex_lock_iothread();
+        ret = vfio_region_mmap(&migration->region.buffer);
+        qemu_mutex_unlock_iothread();
+        if (ret) {
+            error_report("%s: Failed to mmap VFIO migration region %d: %s",
+                         vbasedev->name, migration->region.index,
+                         strerror(-ret));
+            return ret;
+        }
+    }
+
+    if (vbasedev->vm_running) {
+        ret = vfio_migration_set_state(vbasedev,
+                         VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING);
+        if (ret) {
+            error_report("%s: Failed to set state RUNNING and SAVING",
+                         vbasedev->name);
+            return ret;
+        }
+    } else {
+        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_SAVING);
+        if (ret) {
+            error_report("%s: Failed to set state STOP and SAVING",
+                         vbasedev->name);
+            return ret;
+        }
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    trace_vfio_save_setup(vbasedev->name);
+    return 0;
+}
+
+static void vfio_save_cleanup(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    if (migration->region.buffer.mmaps) {
+        vfio_region_unmap(&migration->region.buffer);
+    }
+    trace_vfio_save_cleanup(vbasedev->name);
+}
+
+static SaveVMHandlers savevm_vfio_handlers = {
+    .save_setup = vfio_save_setup,
+    .save_cleanup = vfio_save_cleanup,
+};
+
+/* ---------------------------------------------------------------------- */
+
 static void vfio_vmstate_change(void *opaque, int running, RunState state)
 {
     VFIODevice *vbasedev = opaque;
@@ -195,7 +274,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
     }
 
     qemu_mutex_init(&vbasedev->migration->lock);
-
+    register_savevm_live(vbasedev->dev, "vfio", -1, 1, &savevm_vfio_handlers,
+                         vbasedev);
     vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
                                                           vbasedev);
 
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 69503228f20e..4bb43f18f315 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -149,3 +149,5 @@ vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
 vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
 vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
 vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
+vfio_save_setup(char *name) " (%s)"
+vfio_save_cleanup(char *name) " (%s)"
-- 
2.7.0

* [Qemu-devel] [PATCH v7 09/13] vfio: Add save state functions to SaveVMHandlers
  2019-07-09  9:49 [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (7 preceding siblings ...)
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 08/13] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
@ 2019-07-09  9:49 ` Kirti Wankhede
  2019-07-12  2:44   ` Yan Zhao
  2019-07-17  2:50   ` Yan Zhao
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 10/13] vfio: Add load " Kirti Wankhede
                   ` (4 subsequent siblings)
  13 siblings, 2 replies; 77+ messages in thread
From: Kirti Wankhede @ 2019-07-09  9:49 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
functions. These functions handle the pre-copy and stop-and-copy phases.

In the _SAVING|_RUNNING device state, i.e. the pre-copy phase:
- read pending_bytes
- read data_offset - instructs the vendor driver to write data to the staging
  buffer, which is mmapped.
- read data_size - amount of data in bytes written by the vendor driver in the
  migration region.
- if the data section is trapped, pread() data_size bytes from data_offset.
- if the data section is mmapped, read data_size bytes from the mmapped buffer.
- write the data packet to the file stream as below:
{VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
VFIO_MIG_FLAG_END_OF_STATE}

In the _SAVING device state, i.e. the stop-and-copy phase:
a. read the config space of the device and save it to the migration file
   stream. This doesn't need to come from the vendor driver; any other special
   config state from the driver can be saved as data in the following
   iterations.
b. read pending_bytes
c. read data_offset - instructs the vendor driver to write data to the staging
   buffer, which is mmapped.
d. read data_size - amount of data in bytes written by the vendor driver in
   the migration region.
e. if the data section is trapped, pread() data_size bytes from data_offset.
f. if the data section is mmapped, read data_size bytes from the mmapped
   buffer.
g. write the data packet as below:
   {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
h. iterate through steps b to g while (pending_bytes > 0)
i. write {VFIO_MIG_FLAG_END_OF_STATE}

When the data region is mapped, it is the user's responsibility to read
data_size bytes from data_offset before moving to the next step.

.save_live_iterate runs outside the iothread lock in the migration case, which
could race with the asynchronous call to get the dirty page list, causing data
corruption in the mapped migration region. A mutex is added here to serialize
migration buffer read operations.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c  | 246 +++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events |   6 ++
 2 files changed, 252 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 0597a45fda2d..4e9b4cce230b 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -117,6 +117,138 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
     return 0;
 }
 
+static void *find_data_region(VFIORegion *region,
+                              uint64_t data_offset,
+                              uint64_t data_size)
+{
+    void *ptr = NULL;
+    int i;
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        if ((data_offset >= region->mmaps[i].offset) &&
+            (data_offset < region->mmaps[i].offset + region->mmaps[i].size) &&
+            (data_size <= region->mmaps[i].size)) {
+            ptr = region->mmaps[i].mmap + (data_offset -
+                                           region->mmaps[i].offset);
+            break;
+        }
+    }
+    return ptr;
+}
+
+static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region.buffer;
+    uint64_t data_offset = 0, data_size = 0;
+    int ret;
+
+    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             data_offset));
+    if (ret != sizeof(data_offset)) {
+        error_report("%s: Failed to get migration buffer data offset %d",
+                     vbasedev->name, ret);
+        return -EINVAL;
+    }
+
+    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             data_size));
+    if (ret != sizeof(data_size)) {
+        error_report("%s: Failed to get migration buffer data size %d",
+                     vbasedev->name, ret);
+        return -EINVAL;
+    }
+
+    if (data_size > 0) {
+        void *buf = NULL;
+        bool buffer_mmaped;
+
+        if (region->mmaps) {
+            buf = find_data_region(region, data_offset, data_size);
+        }
+
+        buffer_mmaped = (buf != NULL);
+
+        if (!buffer_mmaped) {
+            buf = g_try_malloc0(data_size);
+            if (!buf) {
+                error_report("%s: Error allocating buffer ", __func__);
+                return -ENOMEM;
+            }
+
+            ret = pread(vbasedev->fd, buf, data_size,
+                        region->fd_offset + data_offset);
+            if (ret != data_size) {
+                error_report("%s: Failed to get migration data %d",
+                             vbasedev->name, ret);
+                g_free(buf);
+                return -EINVAL;
+            }
+        }
+
+        qemu_put_be64(f, data_size);
+        qemu_put_buffer(f, buf, data_size);
+
+        if (!buffer_mmaped) {
+            g_free(buf);
+        }
+        migration->pending_bytes -= data_size;
+    } else {
+        qemu_put_be64(f, data_size);
+    }
+
+    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
+                           migration->pending_bytes);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    return data_size;
+}
+
+static int vfio_update_pending(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region.buffer;
+    uint64_t pending_bytes = 0;
+    int ret;
+
+    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             pending_bytes));
+    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
+        error_report("%s: Failed to get pending bytes %d",
+                     vbasedev->name, ret);
+        migration->pending_bytes = 0;
+        return (ret < 0) ? ret : -EINVAL;
+    }
+
+    migration->pending_bytes = pending_bytes;
+    trace_vfio_update_pending(vbasedev->name, pending_bytes);
+    return 0;
+}
+
+static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
+
+    if (vbasedev->ops && vbasedev->ops->vfio_save_config) {
+        vbasedev->ops->vfio_save_config(vbasedev, f);
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    trace_vfio_save_device_config_state(vbasedev->name);
+
+    return qemu_file_get_error(f);
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -178,9 +310,123 @@ static void vfio_save_cleanup(void *opaque)
     trace_vfio_save_cleanup(vbasedev->name);
 }
 
+static void vfio_save_pending(QEMUFile *f, void *opaque,
+                              uint64_t threshold_size,
+                              uint64_t *res_precopy_only,
+                              uint64_t *res_compatible,
+                              uint64_t *res_postcopy_only)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    ret = vfio_update_pending(vbasedev);
+    if (ret) {
+        return;
+    }
+
+    *res_precopy_only += migration->pending_bytes;
+
+    trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
+                            *res_postcopy_only, *res_compatible);
+}
+
+static int vfio_save_iterate(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret, data_size;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
+
+    qemu_mutex_lock(&migration->lock);
+    data_size = vfio_save_buffer(f, vbasedev);
+    qemu_mutex_unlock(&migration->lock);
+
+    if (data_size < 0) {
+        error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
+                     strerror(errno));
+        return data_size;
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    trace_vfio_save_iterate(vbasedev->name, data_size);
+    if (data_size == 0) {
+        /* indicates data finished, goto complete phase */
+        return 1;
+    }
+
+    return 0;
+}
+
+static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_SAVING);
+    if (ret) {
+        error_report("%s: Failed to set state STOP and SAVING",
+                     vbasedev->name);
+        return ret;
+    }
+
+    ret = vfio_save_device_config_state(f, opaque);
+    if (ret) {
+        return ret;
+    }
+
+    ret = vfio_update_pending(vbasedev);
+    if (ret) {
+        return ret;
+    }
+
+    while (migration->pending_bytes > 0) {
+        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
+        ret = vfio_save_buffer(f, vbasedev);
+        if (ret < 0) {
+            error_report("%s: Failed to save buffer", vbasedev->name);
+            return ret;
+        } else if (ret == 0) {
+            break;
+        }
+
+        ret = vfio_update_pending(vbasedev);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_MASK);
+    if (ret) {
+        error_report("%s: Failed to set state STOPPED", vbasedev->name);
+        return ret;
+    }
+
+    trace_vfio_save_complete_precopy(vbasedev->name);
+    return ret;
+}
+
 static SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
+    .save_live_pending = vfio_save_pending,
+    .save_live_iterate = vfio_save_iterate,
+    .save_live_complete_precopy = vfio_save_complete_precopy,
 };
 
 /* ---------------------------------------------------------------------- */
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 4bb43f18f315..bdf40ba368c7 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -151,3 +151,9 @@ vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_st
 vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
 vfio_save_setup(char *name) " (%s)"
 vfio_save_cleanup(char *name) " (%s)"
+vfio_save_buffer(char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
+vfio_update_pending(char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
+vfio_save_device_config_state(char *name) " (%s)"
+vfio_save_pending(char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
+vfio_save_iterate(char *name, int data_size) " (%s) data_size %d"
+vfio_save_complete_precopy(char *name) " (%s)"
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [Qemu-devel] [PATCH v7 10/13] vfio: Add load state functions to SaveVMHandlers
  2019-07-09  9:49 [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (8 preceding siblings ...)
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 09/13] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
@ 2019-07-09  9:49 ` Kirti Wankhede
  2019-07-12  2:52   ` Yan Zhao
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 11/13] vfio: Add function to get dirty page list Kirti Wankhede
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 77+ messages in thread
From: Kirti Wankhede @ 2019-07-09  9:49 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Flow during _RESUMING device state:
- If the vendor driver defines a mappable region, mmap the migration region.
- Load config state.
- For each data packet, until VFIO_MIG_FLAG_END_OF_STATE is reached:
    - read data_size from the packet, then read a buffer of data_size bytes.
    - read data_offset, which tells QEMU where to write the data.
        If the region is mmaped, write the data_size bytes of data to the
        mmaped region.
    - write data_size.
        For an mmaped region, writing data_size signals the kernel
        driver that the data has been placed in the staging buffer.
    - if the region is trapped, pwrite() the data_size bytes of data at
        data_offset.
- Repeat the above until VFIO_MIG_FLAG_END_OF_STATE.
- Unmap the migration region.

For the user the data is opaque; the user should write data in the same
order as it was received.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c  | 162 +++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events |   3 +
 2 files changed, 165 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 4e9b4cce230b..5fb4c5329ede 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -249,6 +249,26 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
     return qemu_file_get_error(f);
 }
 
+static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    uint64_t data;
+
+    if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
+        vbasedev->ops->vfio_load_config(vbasedev, f);
+    }
+
+    data = qemu_get_be64(f);
+    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
+        error_report("%s: Failed loading device config space, "
+                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
+        return -EINVAL;
+    }
+
+    trace_vfio_load_device_config_state(vbasedev->name);
+    return qemu_file_get_error(f);
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -421,12 +441,154 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     return ret;
 }
 
+static int vfio_load_setup(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret = 0;
+
+    if (migration->region.buffer.mmaps) {
+        ret = vfio_region_mmap(&migration->region.buffer);
+        if (ret) {
+            error_report("%s: Failed to mmap VFIO migration region %d: %s",
+                         vbasedev->name, migration->region.index,
+                         strerror(-ret));
+            return ret;
+        }
+    }
+
+    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING);
+    if (ret) {
+        error_report("%s: Failed to set state RESUMING", vbasedev->name);
+    }
+    return ret;
+}
+
+static int vfio_load_cleanup(void *opaque)
+{
+    vfio_save_cleanup(opaque);
+    return 0;
+}
+
+static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret = 0;
+    uint64_t data, data_size;
+
+    data = qemu_get_be64(f);
+    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
+
+        trace_vfio_load_state(vbasedev->name, data);
+
+        switch (data) {
+        case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
+        {
+            ret = vfio_load_device_config_state(f, opaque);
+            if (ret) {
+                return ret;
+            }
+            break;
+        }
+        case VFIO_MIG_FLAG_DEV_SETUP_STATE:
+        {
+            data = qemu_get_be64(f);
+            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
+                return ret;
+            } else {
+                error_report("%s: SETUP STATE: EOS not found 0x%"PRIx64,
+                             vbasedev->name, data);
+                return -EINVAL;
+            }
+            break;
+        }
+        case VFIO_MIG_FLAG_DEV_DATA_STATE:
+        {
+            VFIORegion *region = &migration->region.buffer;
+            void *buf = NULL;
+            bool buffer_mmaped = false;
+            uint64_t data_offset = 0;
+
+            data_size = qemu_get_be64(f);
+            if (data_size == 0) {
+                break;
+            }
+
+            ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
+                        region->fd_offset +
+                        offsetof(struct vfio_device_migration_info,
+                        data_offset));
+            if (ret != sizeof(data_offset)) {
+                error_report("%s:Failed to get migration buffer data offset %d",
+                             vbasedev->name, ret);
+                return -EINVAL;
+            }
+
+            if (region->mmaps) {
+                buf = find_data_region(region, data_offset, data_size);
+            }
+
+            buffer_mmaped = (buf != NULL) ? true : false;
+
+            if (!buffer_mmaped) {
+                buf = g_try_malloc0(data_size);
+                if (!buf) {
+                    error_report("%s: Error allocating buffer ", __func__);
+                    return -ENOMEM;
+                }
+            }
+
+            qemu_get_buffer(f, buf, data_size);
+
+            ret = pwrite(vbasedev->fd, &data_size, sizeof(data_size),
+                         region->fd_offset +
+                       offsetof(struct vfio_device_migration_info, data_size));
+            if (ret != sizeof(data_size)) {
+                error_report("%s: Failed to set migration buffer data size %d",
+                             vbasedev->name, ret);
+                if (!buffer_mmaped) {
+                    g_free(buf);
+                }
+                return -EINVAL;
+            }
+
+            if (!buffer_mmaped) {
+                ret = pwrite(vbasedev->fd, buf, data_size,
+                             region->fd_offset + data_offset);
+                g_free(buf);
+
+                if (ret != data_size) {
+                    error_report("%s: Failed to set migration buffer %d",
+                                 vbasedev->name, ret);
+                    return -EINVAL;
+                }
+            }
+            trace_vfio_load_state_device_data(vbasedev->name, data_offset,
+                                              data_size);
+            break;
+        }
+        }
+
+        ret = qemu_file_get_error(f);
+        if (ret) {
+            return ret;
+        }
+        data = qemu_get_be64(f);
+    }
+
+    return ret;
+}
+
 static SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
     .save_live_pending = vfio_save_pending,
     .save_live_iterate = vfio_save_iterate,
     .save_live_complete_precopy = vfio_save_complete_precopy,
+    .load_setup = vfio_load_setup,
+    .load_cleanup = vfio_load_cleanup,
+    .load_state = vfio_load_state,
 };
 
 /* ---------------------------------------------------------------------- */
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index bdf40ba368c7..ac065b559f4e 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -157,3 +157,6 @@ vfio_save_device_config_state(char *name) " (%s)"
 vfio_save_pending(char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
 vfio_save_iterate(char *name, int data_size) " (%s) data_size %d"
 vfio_save_complete_precopy(char *name) " (%s)"
+vfio_load_device_config_state(char *name) " (%s)"
+vfio_load_state(char *name, uint64_t data) " (%s) data 0x%"PRIx64
+vfio_load_state_device_data(char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [Qemu-devel] [PATCH v7 11/13] vfio: Add function to get dirty page list
  2019-07-09  9:49 [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (9 preceding siblings ...)
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 10/13] vfio: Add load " Kirti Wankhede
@ 2019-07-09  9:49 ` Kirti Wankhede
  2019-07-12  0:33   ` Yan Zhao
  2019-07-22  8:39   ` Yan Zhao
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 12/13] vfio: Add vfio_listerner_log_sync to mark dirty pages Kirti Wankhede
                   ` (2 subsequent siblings)
  13 siblings, 2 replies; 77+ messages in thread
From: Kirti Wankhede @ 2019-07-09  9:49 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Dirty page tracking (.log_sync) is part of the RAM copying state: the
vendor driver provides, through the migration region, a bitmap of the
pages it has dirtied, and as part of the RAM copy those pages get copied
to the file stream.

To get the dirty page bitmap:
- write start address, page_size and pfn count.
- read the count of pfns copied.
    - The vendor driver should return 0 if it has no dirty pages to
      report in the given range.
    - The vendor driver should return -1 to mark all pages dirty in the
      given range.
- read data_offset, where the vendor driver has written the bitmap.
- read the bitmap from the region, or from the mmaped part of the region.
- Iterate the above steps until the page bitmap for all requested pfns
  has been copied.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c           | 123 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events          |   1 +
 include/hw/vfio/vfio-common.h |   2 +
 3 files changed, 126 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 5fb4c5329ede..ca1a8c0f5f1f 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -269,6 +269,129 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
     return qemu_file_get_error(f);
 }
 
+void vfio_get_dirty_page_list(VFIODevice *vbasedev,
+                              uint64_t start_pfn,
+                              uint64_t pfn_count,
+                              uint64_t page_size)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region.buffer;
+    uint64_t count = 0;
+    int64_t copied_pfns = 0;
+    int64_t total_pfns = pfn_count;
+    int ret;
+
+    qemu_mutex_lock(&migration->lock);
+
+    while (total_pfns > 0) {
+        uint64_t bitmap_size, data_offset = 0;
+        uint64_t start = start_pfn + count;
+        void *buf = NULL;
+        bool buffer_mmaped = false;
+
+        ret = pwrite(vbasedev->fd, &start, sizeof(start),
+                 region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                              start_pfn));
+        if (ret < 0) {
+            error_report("%s: Failed to set dirty pages start address %d %s",
+                         vbasedev->name, ret, strerror(errno));
+            goto dpl_unlock;
+        }
+
+        ret = pwrite(vbasedev->fd, &page_size, sizeof(page_size),
+                 region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                              page_size));
+        if (ret < 0) {
+            error_report("%s: Failed to set dirty page size %d %s",
+                         vbasedev->name, ret, strerror(errno));
+            goto dpl_unlock;
+        }
+
+        ret = pwrite(vbasedev->fd, &total_pfns, sizeof(total_pfns),
+                 region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                              total_pfns));
+        if (ret < 0) {
+            error_report("%s: Failed to set dirty page total pfns %d %s",
+                         vbasedev->name, ret, strerror(errno));
+            goto dpl_unlock;
+        }
+
+        /* Read copied dirty pfns */
+        ret = pread(vbasedev->fd, &copied_pfns, sizeof(copied_pfns),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             copied_pfns));
+        if (ret < 0) {
+            error_report("%s: Failed to get dirty pages bitmap count %d %s",
+                         vbasedev->name, ret, strerror(errno));
+            goto dpl_unlock;
+        }
+
+        if (copied_pfns == VFIO_DEVICE_DIRTY_PFNS_NONE) {
+            /*
+             * copied_pfns could be 0 if driver doesn't have any page to
+             * report dirty in given range
+             */
+            break;
+        } else if (copied_pfns == VFIO_DEVICE_DIRTY_PFNS_ALL) {
+            /* Mark all pages dirty for this range */
+            cpu_physical_memory_set_dirty_range(start_pfn * page_size,
+                                                pfn_count * page_size,
+                                                DIRTY_MEMORY_MIGRATION);
+            break;
+        }
+
+        bitmap_size = (BITS_TO_LONGS(copied_pfns) + 1) * sizeof(unsigned long);
+
+        ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             data_offset));
+        if (ret != sizeof(data_offset)) {
+            error_report("%s: Failed to get migration buffer data offset %d",
+                         vbasedev->name, ret);
+            goto dpl_unlock;
+        }
+
+        if (region->mmaps) {
+            buf = find_data_region(region, data_offset, bitmap_size);
+        }
+
+        buffer_mmaped = (buf != NULL) ? true : false;
+
+        if (!buffer_mmaped) {
+            buf = g_try_malloc0(bitmap_size);
+            if (!buf) {
+                error_report("%s: Error allocating buffer ", __func__);
+                goto dpl_unlock;
+            }
+
+            ret = pread(vbasedev->fd, buf, bitmap_size,
+                        region->fd_offset + data_offset);
+            if (ret != bitmap_size) {
+                error_report("%s: Failed to get dirty pages bitmap %d",
+                             vbasedev->name, ret);
+                g_free(buf);
+                goto dpl_unlock;
+            }
+        }
+
+        cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
+                                               (start_pfn + count) * page_size,
+                                                copied_pfns);
+        count      += copied_pfns;
+        total_pfns -= copied_pfns;
+
+        if (!buffer_mmaped) {
+            g_free(buf);
+        }
+    }
+
+    trace_vfio_get_dirty_page_list(vbasedev->name, start_pfn, pfn_count,
+                                   page_size);
+
+dpl_unlock:
+    qemu_mutex_unlock(&migration->lock);
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index ac065b559f4e..414a5e69ec5e 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -160,3 +160,4 @@ vfio_save_complete_precopy(char *name) " (%s)"
 vfio_load_device_config_state(char *name) " (%s)"
 vfio_load_state(char *name, uint64_t data) " (%s) data 0x%"PRIx64
 vfio_load_state_device_data(char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
+vfio_get_dirty_page_list(char *name, uint64_t start, uint64_t pfn_count, uint64_t page_size) " (%s) start 0x%"PRIx64" pfn_count 0x%"PRIx64 " page size 0x%"PRIx64
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index a022484d2636..dc1b83a0b4ef 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -222,5 +222,7 @@ int vfio_spapr_remove_window(VFIOContainer *container,
 
 int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
 void vfio_migration_finalize(VFIODevice *vbasedev);
+void vfio_get_dirty_page_list(VFIODevice *vbasedev, uint64_t start_pfn,
+                               uint64_t pfn_count, uint64_t page_size);
 
 #endif /* HW_VFIO_VFIO_COMMON_H */
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [Qemu-devel] [PATCH v7 12/13] vfio: Add vfio_listerner_log_sync to mark dirty pages
  2019-07-09  9:49 [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (10 preceding siblings ...)
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 11/13] vfio: Add function to get dirty page list Kirti Wankhede
@ 2019-07-09  9:49 ` Kirti Wankhede
  2019-07-23 13:18   ` Cornelia Huck
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 13/13] vfio: Make vfio-pci device migration capable Kirti Wankhede
  2019-07-11  2:55 ` [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device Yan Zhao
  13 siblings, 1 reply; 77+ messages in thread
From: Kirti Wankhede @ 2019-07-09  9:49 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

vfio_listerner_log_sync gets the list of dirty pages from the vendor
driver and marks those pages dirty when in _SAVING state.
Return early for RAM block sections of mapped MMIO regions.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/common.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index de74dae8d6a6..d5ee35c95e76 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -36,6 +36,7 @@
 #include "sysemu/kvm.h"
 #include "trace.h"
 #include "qapi/error.h"
+#include "migration/migration.h"
 
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -794,9 +795,43 @@ static void vfio_listener_region_del(MemoryListener *listener,
     }
 }
 
+static void vfio_listerner_log_sync(MemoryListener *listener,
+                                    MemoryRegionSection *section)
+{
+    uint64_t start_addr, size, pfn_count;
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+
+    if (memory_region_is_ram_device(section->mr)) {
+        return;
+    }
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            if (vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) {
+                continue;
+            } else {
+                return;
+            }
+        }
+    }
+
+    start_addr = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+    size = int128_get64(section->size);
+    pfn_count = size >> TARGET_PAGE_BITS;
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            vfio_get_dirty_page_list(vbasedev, start_addr >> TARGET_PAGE_BITS,
+                                     pfn_count, TARGET_PAGE_SIZE);
+        }
+    }
+}
+
 static const MemoryListener vfio_memory_listener = {
     .region_add = vfio_listener_region_add,
     .region_del = vfio_listener_region_del,
+    .log_sync = vfio_listerner_log_sync,
 };
 
 static void vfio_listener_release(VFIOContainer *container)
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [Qemu-devel] [PATCH v7 13/13] vfio: Make vfio-pci device migration capable.
  2019-07-09  9:49 [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (11 preceding siblings ...)
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 12/13] vfio: Add vfio_listerner_log_sync to mark dirty pages Kirti Wankhede
@ 2019-07-09  9:49 ` Kirti Wankhede
  2019-07-11  2:55 ` [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device Yan Zhao
  13 siblings, 0 replies; 77+ messages in thread
From: Kirti Wankhede @ 2019-07-09  9:49 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Call the vfio_migration_probe() and vfio_migration_finalize() functions
for the vfio-pci device to enable migration for VFIO PCI devices.
Remove the vfio_pci_vmstate structure.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/pci.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5fe4f8076cac..2ea17a814d55 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2852,6 +2852,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     vdev->vbasedev.ops = &vfio_pci_ops;
     vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
     vdev->vbasedev.dev = DEVICE(vdev);
+    vdev->vbasedev.device_state = 0;
 
     tmp = g_strdup_printf("%s/iommu_group", vdev->vbasedev.sysfsdev);
     len = readlink(tmp, group_path, sizeof(group_path));
@@ -3112,6 +3113,12 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         }
     }
 
+    ret = vfio_migration_probe(&vdev->vbasedev, errp);
+    if (ret) {
+            error_report("%s: Failed to setup for migration",
+                         vdev->vbasedev.name);
+    }
+
     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);
@@ -3131,6 +3138,7 @@ static void vfio_instance_finalize(Object *obj)
     VFIOPCIDevice *vdev = PCI_VFIO(obj);
     VFIOGroup *group = vdev->vbasedev.group;
 
+    vdev->vbasedev.device_state = 0;
     vfio_display_finalize(vdev);
     vfio_bars_finalize(vdev);
     g_free(vdev->emulated_config_bits);
@@ -3159,6 +3167,7 @@ static void vfio_exitfn(PCIDevice *pdev)
     }
     vfio_teardown_msi(vdev);
     vfio_bars_exit(vdev);
+    vfio_migration_finalize(&vdev->vbasedev);
 }
 
 static void vfio_pci_reset(DeviceState *dev)
@@ -3267,11 +3276,6 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
-static const VMStateDescription vfio_pci_vmstate = {
-    .name = "vfio-pci",
-    .unmigratable = 1,
-};
-
 static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
@@ -3279,7 +3283,6 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 
     dc->reset = vfio_pci_reset;
     dc->props = vfio_pci_dev_properties;
-    dc->vmsd = &vfio_pci_vmstate;
     dc->desc = "VFIO-based PCI device assignment";
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
     pdc->realize = vfio_realize;
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device
  2019-07-09  9:49 [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (12 preceding siblings ...)
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 13/13] vfio: Make vfio-pci device migration capable Kirti Wankhede
@ 2019-07-11  2:55 ` Yan Zhao
  2019-07-11 10:50   ` Dr. David Alan Gilbert
  13 siblings, 1 reply; 77+ messages in thread
From: Yan Zhao @ 2019-07-11  2:55 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

Hi Kirti,
There are still unaddressed comments on your v4 patches.
Would you mind addressing them?

1. should we register two migration interfaces simultaneously
(https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04750.html)
2. in each save iteration, how much data is to be saved
(https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04683.html)
3. do we need extra interface to get data for device state only
(https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04812.html)
4. definition of dirty page copied_pfn
(https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg05592.html)

Also, I'm glad to see that you updated the code following my comments below,
but please don't forget to reply to my comments next time :)
https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg05357.html
https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg06454.html

Thanks
Yan

On Tue, Jul 09, 2019 at 05:49:07PM +0800, Kirti Wankhede wrote:
> Add migration support for VFIO device
> 
> This Patch set include patches as below:
> - Define KABI for VFIO device for migration support.
> - Added save and restore functions for PCI configuration space
> - Generic migration functionality for VFIO device.
>   * This patch set adds functionality only for PCI devices, but can be
>     extended to other VFIO devices.
>   * Added all the basic functions required for pre-copy, stop-and-copy and
>     resume phases of migration.
>   * Added state change notifier and from that notifier function, VFIO
>     device's state changed is conveyed to VFIO device driver.
>   * During save setup phase and resume/load setup phase, migration region
>     is queried and is used to read/write VFIO device data.
>   * .save_live_pending and .save_live_iterate are implemented to use QEMU's
>     functionality of iteration during pre-copy phase.
>   * In .save_live_complete_precopy, that is, in the stop-and-copy phase,
>     iteration to read data from the VFIO device driver is implemented until
>     the pending bytes returned by the driver reach zero.
>   * Added function to get dirty pages bitmap for the pages which are used by
>     driver.
> - Add vfio_listerner_log_sync to mark dirty pages.
> - Make VFIO PCI device migration capable. If migration region is not provided by
>   driver, migration is blocked.
> 
> Below is the flow of state change for live migration where states in brackets
> represent VM state, migration state and VFIO device state as:
>     (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)
> 
> Live migration save path:
>         QEMU normal running state
>         (RUNNING, _NONE, _RUNNING)
>                         |
>     migrate_init spawns migration_thread.
>     (RUNNING, _SETUP, _RUNNING|_SAVING)
>     Migration thread then calls each device's .save_setup()
>                         |
>     (RUNNING, _ACTIVE, _RUNNING|_SAVING)
>     If the device is active, get pending bytes via .save_live_pending();
>     if pending bytes >= threshold_size, call .save_live_iterate().
>     VFIO device data for the pre-copy phase is copied.
>     Iterate until pending bytes converge and drop below the threshold.
>                         |
>     On migration completion, vCPUs are stopped and .save_live_complete_precopy
>     is called for each active device. The VFIO device is then transitioned
>     into the _SAVING state.
>     (FINISH_MIGRATE, _DEVICE, _SAVING)
>     For VFIO device, iterate in  .save_live_complete_precopy  until
>     pending data is 0.
>     (FINISH_MIGRATE, _DEVICE, _STOPPED)
>                         |
>     (FINISH_MIGRATE, _COMPLETED, _STOPPED)
>     Migration thread schedules the cleanup bottom half and exits
> 
> Live migration resume path:
>     Incoming migration calls .load_setup for each device
>     (RESTORE_VM, _ACTIVE, _STOPPED)
>                         |
>     For each device, .load_state is called for that device section data
>                         |
>     At the end, .load_cleanup is called for each device and vCPUs are started.
>                         |
>         (RUNNING, _NONE, _RUNNING)
> 
> Note that:
> - Migration post copy is not supported.
> 
> v6 -> v7:
> - Fix build failures.
> 
> v5 -> v6:
> - Fix build failure.
> 
> v4 -> v5:
> - Added a descriptive comment about the sequence of access to members of the
>   vfio_device_migration_info structure, based on Alex's suggestion
> - Updated get dirty pages sequence.
> - As per Cornelia Huck's suggestion, added callbacks to VFIODeviceOps to
>   get_object, save_config and load_config.
> - Fixed multiple nit picks.
> - Tested live migration with multiple vfio devices assigned to a VM.
> 
> v3 -> v4:
> - Added one more bit for _RESUMING flag to be set explicitly.
> - data_offset field is read-only for user space application.
> - data_size is read on every iteration before reading data from the migration
>   region, removing the assumption that data extends to the end of the region.
> - If the vendor driver supports mappable sparse regions, map those regions
>   during the setup state of save/load; similarly, unmap them in the cleanup
>   routines.
> - Handle a race condition that caused data corruption in the migration region
>   while saving device state, by adding a mutex and serializing the save_buffer
>   and get_dirty_pages routines.
> - Skip calling the get_dirty_pages routine for the device's mapped MMIO region.
> - Added trace events.
> - Split into multiple functional patches.
> 
> v2 -> v3:
> - Removed enum of VFIO device states. Defined VFIO device state with 2 bits.
> - Re-structured vfio_device_migration_info to keep it minimal, and defined the
>   action on read and write access to its members.
> 
> v1 -> v2:
> - Defined MIGRATION region type and sub-type which should be used with region
>   type capability.
> - Re-structured vfio_device_migration_info. This structure will be placed at 0th
>   offset of migration region.
> - Replaced ioctl with read/write for trapped part of migration region.
> - Added support for both access types, trapped or mmapped, for the data
>   section of the region.
> - Moved PCI device functions to pci file.
> - Added iteration to get the dirty page bitmap until bitmaps for all requested
>   pages are copied.
> 
> Thanks,
> Kirti
> 
> Kirti Wankhede (13):
>   vfio: KABI for migration interface
>   vfio: Add function to unmap VFIO region
>   vfio: Add vfio_get_object callback to VFIODeviceOps
>   vfio: Add save and load functions for VFIO PCI devices
>   vfio: Add migration region initialization and finalize function
>   vfio: Add VM state change handler to know state of VM
>   vfio: Add migration state change notifier
>   vfio: Register SaveVMHandlers for VFIO device
>   vfio: Add save state functions to SaveVMHandlers
>   vfio: Add load state functions to SaveVMHandlers
>   vfio: Add function to get dirty page list
>   vfio: Add vfio_listerner_log_sync to mark dirty pages
>   vfio: Make vfio-pci device migration capable.
> 
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/common.c              |  55 +++
>  hw/vfio/migration.c           | 874 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.c                 | 137 ++++++-
>  hw/vfio/trace-events          |  19 +
>  include/hw/vfio/vfio-common.h |  25 ++
>  linux-headers/linux/vfio.h    | 166 ++++++++
>  7 files changed, 1271 insertions(+), 7 deletions(-)
>  create mode 100644 hw/vfio/migration.c
> 
> -- 
> 2.7.0
> 


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device
  2019-07-11  2:55 ` [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device Yan Zhao
@ 2019-07-11 10:50   ` Dr. David Alan Gilbert
  2019-07-11 11:47     ` Yan Zhao
  0 siblings, 1 reply; 77+ messages in thread
From: Dr. David Alan Gilbert @ 2019-07-11 10:50 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Zhengxiao.zx@Alibaba-inc.com, Tian, Kevin, Liu, Yi L, cjia,
	eskultet, Yang, Ziye, qemu-devel, cohuck, shuangtai.tst,
	alex.williamson, Wang, Zhi A, mlevitsk, pasic, aik,
	Kirti Wankhede, eauger, felipe, jonathan.davies, Liu, Changpeng,
	Ken.Xue

* Yan Zhao (yan.y.zhao@intel.com) wrote:
> Hi Kirti,
> There are still unaddressed comments to your patches v4.
> Would you mind addressing them?
> 
> 1. should we register two migration interfaces simultaneously
> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04750.html)

Please don't do this.
As far as I'm aware, we currently have only one device that does that
(vmxnet3), and a patch has just been posted that fixes/removes it.

Dave

> 2. in each save iteration, how much data is to be saved
> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04683.html)
> 3. do we need extra interface to get data for device state only
> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04812.html)
> 4. definition of dirty page copied_pfn
> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg05592.html)
> 
> Also, I'm glad to see that you updated code by following my comments below,
> but please don't forget to reply my comments next time:)
> https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg05357.html
> https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg06454.html
> 
> Thanks
> Yan
> 
> > On Tue, Jul 09, 2019 at 05:49:07PM +0800, Kirti Wankhede wrote:
> > > [cover letter snipped]
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device
  2019-07-11 10:50   ` Dr. David Alan Gilbert
@ 2019-07-11 11:47     ` Yan Zhao
  2019-07-11 16:23       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 77+ messages in thread
From: Yan Zhao @ 2019-07-11 11:47 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Zhengxiao.zx@Alibaba-inc.com, Tian, Kevin, Liu, Yi L, cjia,
	eskultet, Yang, Ziye, qemu-devel, cohuck, shuangtai.tst,
	alex.williamson, Wang, Zhi A, mlevitsk, pasic, aik,
	Kirti Wankhede, eauger, felipe, jonathan.davies, Liu, Changpeng,
	Ken.Xue

On Thu, Jul 11, 2019 at 06:50:12PM +0800, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > Hi Kirti,
> > There are still unaddressed comments to your patches v4.
> > Would you mind addressing them?
> > 
> > 1. should we register two migration interfaces simultaneously
> > (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04750.html)
> 
> Please don't do this.
> As far as I'm aware we currently only have one device that does that
> (vmxnet3) and a patch has just been posted that fixes/removes that.
> 
> Dave
>
Hi Dave,
Thanks for pointing this out. But if we want to support postcopy in the
future, after the device stops, what interface could we use to transfer
device state data only?
For postcopy, when the source device stops, we need to transfer only the
necessary device state to the target VM before the target VM starts; we
don't want to transfer device memory, as we'll do that after the target
VM resumes.

Thanks
Yan

> > 2. in each save iteration, how much data is to be saved
> > (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04683.html)
> > 3. do we need extra interface to get data for device state only
> > (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04812.html)
> > 4. definition of dirty page copied_pfn
> > (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg05592.html)
> > 
> > Also, I'm glad to see that you updated code by following my comments below,
> > but please don't forget to reply my comments next time:)
> > https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg05357.html
> > https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg06454.html
> > 
> > Thanks
> > Yan
> > 
> > On Tue, Jul 09, 2019 at 05:49:07PM +0800, Kirti Wankhede wrote:
> > > [cover letter snipped]
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 04/13] vfio: Add save and load functions for VFIO PCI devices
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 04/13] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
@ 2019-07-11 12:07   ` Dr. David Alan Gilbert
  2019-08-22  4:50     ` Kirti Wankhede
  2019-07-16 21:14   ` Alex Williamson
  1 sibling, 1 reply; 77+ messages in thread
From: Dr. David Alan Gilbert @ 2019-07-11 12:07 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> These functions save and restore PCI device specific data - config
> space of PCI device.
> Tested save and restore with MSI and MSIX type.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/pci.c                 | 114 ++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |   2 +
>  2 files changed, 116 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index de0d286fc9dd..5fe4f8076cac 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2395,11 +2395,125 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
>      return OBJECT(vdev);
>  }
>  
> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> +{
> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint16_t pci_cmd;
> +    int i;
> +
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        uint32_t bar;
> +
> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> +        qemu_put_be32(f, bar);
> +    }
> +
> +    qemu_put_be32(f, vdev->interrupt);
> +    if (vdev->interrupt == VFIO_INT_MSI) {
> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> +        bool msi_64bit;
> +
> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                                            2);
> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> +
> +        msi_addr_lo = pci_default_read_config(pdev,
> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> +        qemu_put_be32(f, msi_addr_lo);
> +
> +        if (msi_64bit) {
> +            msi_addr_hi = pci_default_read_config(pdev,
> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                                             4);
> +        }
> +        qemu_put_be32(f, msi_addr_hi);
> +
> +        msi_data = pci_default_read_config(pdev,
> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +                2);
> +        qemu_put_be32(f, msi_data);
> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
> +        uint16_t offset;
> +
> +        /* save enable bit and maskall bit */
> +        offset = pci_default_read_config(pdev,
> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> +        qemu_put_be16(f, offset);
> +        msix_save(pdev, f);
> +    }
> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> +    qemu_put_be16(f, pci_cmd);
> +}
> +
> +static void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> +{
> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint32_t interrupt_type;
> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> +    uint16_t pci_cmd;
> +    bool msi_64bit;
> +    int i;
> +
> +    /* retore pci bar configuration */
> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> +                        pci_cmd & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        uint32_t bar = qemu_get_be32(f);
> +
> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> +    }

Is it possible to validate the BARs at all?  We just had a bug on a
virtual device where one version was asking for a larger BAR than the
other; our validation caught this in some cases, so we could tell that
the guest had a BAR aligned at the wrong alignment.

> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> +                          pci_cmd | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);

Can you explain what this is for?  You write the command register at the
end of the function with the original value; there's no guarantee that
the device is using I/O, for example, so ORing it in seems odd.

Also, are the other flags in COMMAND safe at this point - e.g. what
about interrupts and stuff?

> +    interrupt_type = qemu_get_be32(f);
> +
> +    if (interrupt_type == VFIO_INT_MSI) {
> +        /* restore msi configuration */
> +        msi_flags = pci_default_read_config(pdev,
> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> +
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                              msi_flags & (!PCI_MSI_FLAGS_ENABLE), 2);
> +
> +        msi_addr_lo = qemu_get_be32(f);
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
> +                              msi_addr_lo, 4);
> +
> +        msi_addr_hi = qemu_get_be32(f);
> +        if (msi_64bit) {
> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                                  msi_addr_hi, 4);
> +        }
> +        msi_data = qemu_get_be32(f);
> +        vfio_pci_write_config(pdev,
> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +                msi_data, 2);
> +
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
> +    } else if (interrupt_type == VFIO_INT_MSIX) {
> +        uint16_t offset = qemu_get_be16(f);
> +
> +        /* load enable bit and maskall bit */
> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> +                              offset, 2);
> +        msix_load(pdev, f);
> +    }
> +    pci_cmd = qemu_get_be16(f);
> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
> +}
> +
>  static VFIODeviceOps vfio_pci_ops = {
>      .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
>      .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
>      .vfio_eoi = vfio_intx_eoi,
>      .vfio_get_object = vfio_pci_get_object,
> +    .vfio_save_config = vfio_pci_save_config,
> +    .vfio_load_config = vfio_pci_load_config,
>  };
>  
>  int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 771b6d59a3db..ee72bd984a36 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -120,6 +120,8 @@ struct VFIODeviceOps {
>      int (*vfio_hot_reset_multi)(VFIODevice *vdev);
>      void (*vfio_eoi)(VFIODevice *vdev);
>      Object *(*vfio_get_object)(VFIODevice *vdev);
> +    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
> +    void (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
>  };
>  
>  typedef struct VFIOGroup {
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 06/13] vfio: Add VM state change handler to know state of VM
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 06/13] vfio: Add VM state change handler to know state of VM Kirti Wankhede
@ 2019-07-11 12:13   ` Dr. David Alan Gilbert
  2019-07-11 19:14     ` Kirti Wankhede
  2019-07-16 22:03   ` Alex Williamson
  2019-07-22  8:37   ` Yan Zhao
  2 siblings, 1 reply; 77+ messages in thread
From: Dr. David Alan Gilbert @ 2019-07-11 12:13 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> VM state change handler gets called on change in VM's state. This is used to set
> VFIO device state to _RUNNING.
> VM state change handler, migration state change handler and log_sync listener
> are called asynchronously, which sometimes lead to data corruption in migration
> region. Initialised mutex that is used to serialize operations on migration data
> region during saving state.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c           | 64 +++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |  2 ++
>  include/hw/vfio/vfio-common.h |  4 +++
>  3 files changed, 70 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index a2cfbd5af2e1..c01f08b659d0 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -78,6 +78,60 @@ err:
>      return ret;
>  }
>  
> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    uint32_t device_state;
> +    int ret = 0;
> +
> +    device_state = (state & VFIO_DEVICE_STATE_MASK) |
> +                   (vbasedev->device_state & ~VFIO_DEVICE_STATE_MASK);
> +
> +    if ((device_state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_INVALID) {
> +        return -EINVAL;
> +    }
> +
> +    ret = pwrite(vbasedev->fd, &device_state, sizeof(device_state),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              device_state));
> +    if (ret < 0) {
> +        error_report("%s: Failed to set device state %d %s",
> +                     vbasedev->name, ret, strerror(errno));
> +        return ret;
> +    }
> +
> +    vbasedev->device_state = device_state;
> +    trace_vfio_migration_set_state(vbasedev->name, device_state);
> +    return 0;
> +}
> +
> +static void vfio_vmstate_change(void *opaque, int running, RunState state)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    if ((vbasedev->vm_running != running)) {
> +        int ret;
> +        uint32_t dev_state;
> +
> +        if (running) {
> +            dev_state = VFIO_DEVICE_STATE_RUNNING;
> +        } else {
> +            dev_state = (vbasedev->device_state & VFIO_DEVICE_STATE_MASK) &
> +                     ~VFIO_DEVICE_STATE_RUNNING;
> +        }
> +
> +        ret = vfio_migration_set_state(vbasedev, dev_state);
> +        if (ret) {
> +            error_report("%s: Failed to set device state 0x%x",
> +                         vbasedev->name, dev_state);
> +        }
> +        vbasedev->vm_running = running;
> +        trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
> +                                  dev_state);
> +    }
> +}
> +
>  static int vfio_migration_init(VFIODevice *vbasedev,
>                                 struct vfio_region_info *info)
>  {
> @@ -93,6 +147,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>          return ret;
>      }
>  
> +    qemu_mutex_init(&vbasedev->migration->lock);

Does this and its friend below belong in this patch?  As far as I can
tell you init/deinit the lock here but don't use it, which is strange.

Dave

> +    vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
> +                                                          vbasedev);
> +
>      return 0;
>  }
>  
> @@ -135,11 +194,16 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
>          return;
>      }
>  
> +    if (vbasedev->vm_state) {
> +        qemu_del_vm_change_state_handler(vbasedev->vm_state);
> +    }
> +
>      if (vbasedev->migration_blocker) {
>          migrate_del_blocker(vbasedev->migration_blocker);
>          error_free(vbasedev->migration_blocker);
>      }
>  
> +    qemu_mutex_destroy(&vbasedev->migration->lock);
>      vfio_migration_region_exit(vbasedev);
>      g_free(vbasedev->migration);
>  }
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 191a726a1312..3d15bacd031a 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -146,3 +146,5 @@ vfio_display_edid_write_error(void) ""
>  
>  # migration.c
>  vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
> +vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
> +vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 152da3f8d6f3..f6c70db3a9c1 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -29,6 +29,7 @@
>  #ifdef CONFIG_LINUX
>  #include <linux/vfio.h>
>  #endif
> +#include "sysemu/sysemu.h"
>  
>  #define VFIO_MSG_PREFIX "vfio %s: "
>  
> @@ -124,6 +125,9 @@ typedef struct VFIODevice {
>      unsigned int flags;
>      VFIOMigration *migration;
>      Error *migration_blocker;
> +    uint32_t device_state;
> +    VMChangeStateEntry *vm_state;
> +    int vm_running;
>  } VFIODevice;
>  
>  struct VFIODeviceOps {
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device
  2019-07-11 11:47     ` Yan Zhao
@ 2019-07-11 16:23       ` Dr. David Alan Gilbert
  2019-07-11 19:08         ` Kirti Wankhede
  2019-07-12  0:14         ` Yan Zhao
  0 siblings, 2 replies; 77+ messages in thread
From: Dr. David Alan Gilbert @ 2019-07-11 16:23 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Zhengxiao.zx@Alibaba-inc.com, Tian, Kevin, Liu, Yi L, cjia,
	eskultet, Yang, Ziye, qemu-devel, cohuck, shuangtai.tst,
	alex.williamson, Wang, Zhi A, mlevitsk, pasic, aik,
	Kirti Wankhede, eauger, felipe, jonathan.davies, Liu, Changpeng,
	Ken.Xue

* Yan Zhao (yan.y.zhao@intel.com) wrote:
> On Thu, Jul 11, 2019 at 06:50:12PM +0800, Dr. David Alan Gilbert wrote:
> > * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > > Hi Kirti,
> > > There are still unaddressed comments to your patches v4.
> > > Would you mind addressing them?
> > > 
> > > 1. should we register two migration interfaces simultaneously
> > > (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04750.html)
> > 
> > Please don't do this.
> > As far as I'm aware we currently only have one device that does that
> > (vmxnet3) and a patch has just been posted that fixes/removes that.
> > 
> > Dave
> >
> hi Dave,
> Thanks for notifying this. but if we want to support postcopy in future,
> after device stops, what interface could we use to transfer data of
> device state only?
> for postcopy, when source device stops, we need to transfer only
> necessary device state to target vm before target vm starts, and we
> don't want to transfer device memory as we'll do that after target vm
> resuming.

Hmm ok, let's see; that's got to happen in the call to:
    qemu_savevm_state_complete_precopy(fb, false, false);
that's made from postcopy_start.
 (the two false arguments are iterable_only and inactivate_disks)

and at that time I believe the state is POSTCOPY_ACTIVE, so in_postcopy
is true.

If you're doing postcopy, then you'll probably define a has_postcopy()
function, so qemu_savevm_state_complete_precopy will skip the
save_live_complete_precopy call from its loop for at least two of the
reasons in its big if.

So you're right; you need the VMSD for this to happen in the second
loop in qemu_savevm_state_complete_precopy.  Hmm.

Now, what worries me, and I don't know the answer, is how the section
header for the vmstate and the section header for an iteration look
on the stream; how are they different?

Dave

> Thanks
> Yan
> 
> > > 2. in each save iteration, how much data is to be saved
> > > (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04683.html)
> > > 3. do we need extra interface to get data for device state only
> > > (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04812.html)
> > > 4. definition of dirty page copied_pfn
> > > (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg05592.html)
> > > 
> > > Also, I'm glad to see that you updated code by following my comments below,
> > > but please don't forget to reply my comments next time:)
> > > https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg05357.html
> > > https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg06454.html
> > > 
> > > Thanks
> > > Yan
> > > 
> > > On Tue, Jul 09, 2019 at 05:49:07PM +0800, Kirti Wankhede wrote:
> > > > Add migration support for VFIO device
> > > > 
> > > > This patch set includes the patches below:
> > > > - Define KABI for VFIO device for migration support.
> > > > - Added save and restore functions for PCI configuration space
> > > > - Generic migration functionality for VFIO device.
> > > >   * This patch set adds functionality only for PCI devices, but can be
> > > >     extended to other VFIO devices.
> > > >   * Added all the basic functions required for pre-copy, stop-and-copy and
> > > >     resume phases of migration.
> > > >   * Added a state change notifier; from that notifier function, the VFIO
> > > >     device's state change is conveyed to the VFIO device driver.
> > > >   * During save setup phase and resume/load setup phase, migration region
> > > >     is queried and is used to read/write VFIO device data.
> > > >   * .save_live_pending and .save_live_iterate are implemented to use QEMU's
> > > >     functionality of iteration during pre-copy phase.
> > > >   * In .save_live_complete_precopy, that is, in the stop-and-copy phase,
> > > >     iteration to read data from the VFIO device driver is implemented until
> > > >     the pending bytes returned by the driver reach zero.
> > > >   * Added function to get dirty pages bitmap for the pages which are used by
> > > >     driver.
> > > > - Add vfio_listerner_log_sync to mark dirty pages.
> > > > - Make VFIO PCI device migration capable. If migration region is not provided by
> > > >   driver, migration is blocked.
> > > > 
> > > > Below is the flow of state change for live migration where states in brackets
> > > > represent VM state, migration state and VFIO device state as:
> > > >     (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)
> > > > 
> > > > Live migration save path:
> > > >         QEMU normal running state
> > > >         (RUNNING, _NONE, _RUNNING)
> > > >                         |
> > > >     migrate_init spawns migration_thread.
> > > >     (RUNNING, _SETUP, _RUNNING|_SAVING)
> > > >     Migration thread then calls each device's .save_setup()
> > > >                         |
> > > >     (RUNNING, _ACTIVE, _RUNNING|_SAVING)
> > > >     If device is active, get pending bytes by .save_live_pending()
> > > >     if pending bytes >= threshold_size,  call save_live_iterate()
> > > >     Data of VFIO device for pre-copy phase is copied.
> > > >     Iterate till pending bytes converge and are less than threshold
> > > >                         |
> > > >     On migration completion, vCPUs stops and calls .save_live_complete_precopy
> > > >     for each active device. VFIO device is then transitioned in
> > > >      _SAVING state.
> > > >     (FINISH_MIGRATE, _DEVICE, _SAVING)
> > > >     For VFIO device, iterate in  .save_live_complete_precopy  until
> > > >     pending data is 0.
> > > >     (FINISH_MIGRATE, _DEVICE, _STOPPED)
> > > >                         |
> > > >     (FINISH_MIGRATE, _COMPLETED, STOPPED)
> > > >     Migration thread schedules cleanup bottom half and exits
> > > > 
> > > > Live migration resume path:
> > > >     Incoming migration calls .load_setup for each device
> > > >     (RESTORE_VM, _ACTIVE, STOPPED)
> > > >                         |
> > > >     For each device, .load_state is called for that device section data
> > > >                         |
> > > >     At the end, called .load_cleanup for each device and vCPUs are started.
> > > >                         |
> > > >         (RUNNING, _NONE, _RUNNING)
> > > > 
> > > > Note that:
> > > > - Migration post copy is not supported.
> > > > 
> > > > v6 -> v7:
> > > > - Fix build failures.
> > > > 
> > > > v5 -> v6:
> > > > - Fix build failure.
> > > > 
> > > > v4 -> v5:
> > > > - Added a descriptive comment about the sequence of access of members of
> > > >   structure vfio_device_migration_info, to be followed based on Alex's suggestion
> > > > - Updated get dirty pages sequence.
> > > > - As per Cornelia Huck's suggestion, added callbacks to VFIODeviceOps to
> > > >   get_object, save_config and load_config.
> > > > - Fixed multiple nit picks.
> > > > - Tested live migration with multiple vfio device assigned to a VM.
> > > > 
> > > > v3 -> v4:
> > > > - Added one more bit for _RESUMING flag to be set explicitly.
> > > > - data_offset field is read-only for user space application.
> > > > - data_size is read on every iteration before reading data from the migration
> > > >   region; this removes the assumption that data extends to the end of the region.
> > > > - If the vendor driver supports mappable sparse regions, map those regions during
> > > >   the setup state of save/load; similarly, unmap them from the cleanup routines.
> > > > - Handled a race condition that causes data corruption in the migration region
> > > >   during save device state, by adding a mutex and serializing the save_buffer and
> > > >   get_dirty_pages routines.
> > > > - Skipped the get_dirty_pages routine for mapped MMIO regions of the device.
> > > > - Added trace events.
> > > > - Split into multiple functional patches.
> > > > 
> > > > v2 -> v3:
> > > > - Removed enum of VFIO device states. Defined VFIO device state with 2 bits.
> > > > - Re-structured vfio_device_migration_info to keep it minimal and defined action
> > > >   on read and write access on its members.
> > > > 
> > > > v1 -> v2:
> > > > - Defined MIGRATION region type and sub-type which should be used with region
> > > >   type capability.
> > > > - Re-structured vfio_device_migration_info. This structure will be placed at 0th
> > > >   offset of migration region.
> > > > - Replaced ioctl with read/write for trapped part of migration region.
> > > > - Added both type of access support, trapped or mmapped, for data section of the
> > > >   region.
> > > > - Moved PCI device functions to pci file.
> > > > - Added iteration to get dirty page bitmap until bitmap for all requested pages
> > > >   are copied.
> > > > 
> > > > Thanks,
> > > > Kirti
> > > > 
> > > > Kirti Wankhede (13):
> > > >   vfio: KABI for migration interface
> > > >   vfio: Add function to unmap VFIO region
> > > >   vfio: Add vfio_get_object callback to VFIODeviceOps
> > > >   vfio: Add save and load functions for VFIO PCI devices
> > > >   vfio: Add migration region initialization and finalize function
> > > >   vfio: Add VM state change handler to know state of VM
> > > >   vfio: Add migration state change notifier
> > > >   vfio: Register SaveVMHandlers for VFIO device
> > > >   vfio: Add save state functions to SaveVMHandlers
> > > >   vfio: Add load state functions to SaveVMHandlers
> > > >   vfio: Add function to get dirty page list
> > > >   vfio: Add vfio_listerner_log_sync to mark dirty pages
> > > >   vfio: Make vfio-pci device migration capable.
> > > > 
> > > >  hw/vfio/Makefile.objs         |   2 +-
> > > >  hw/vfio/common.c              |  55 +++
> > > >  hw/vfio/migration.c           | 874 ++++++++++++++++++++++++++++++++++++++++++
> > > >  hw/vfio/pci.c                 | 137 ++++++-
> > > >  hw/vfio/trace-events          |  19 +
> > > >  include/hw/vfio/vfio-common.h |  25 ++
> > > >  linux-headers/linux/vfio.h    | 166 ++++++++
> > > >  7 files changed, 1271 insertions(+), 7 deletions(-)
> > > >  create mode 100644 hw/vfio/migration.c
> > > > 
> > > > -- 
> > > > 2.7.0
> > > > 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device
  2019-07-11 16:23       ` Dr. David Alan Gilbert
@ 2019-07-11 19:08         ` Kirti Wankhede
  2019-07-12  0:32           ` Yan Zhao
  2019-07-12 17:42           ` Dr. David Alan Gilbert
  2019-07-12  0:14         ` Yan Zhao
  1 sibling, 2 replies; 77+ messages in thread
From: Kirti Wankhede @ 2019-07-11 19:08 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Yan Zhao
  Cc: Zhengxiao.zx@Alibaba-inc.com, Tian, Kevin, Liu, Yi L, cjia,
	eskultet, Yang, Ziye, cohuck, shuangtai.tst, qemu-devel, Wang,
	 Zhi A, mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue



On 7/11/2019 9:53 PM, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.zhao@intel.com) wrote:
>> On Thu, Jul 11, 2019 at 06:50:12PM +0800, Dr. David Alan Gilbert wrote:
>>> * Yan Zhao (yan.y.zhao@intel.com) wrote:
>>>> Hi Kirti,
>>>> There are still unaddressed comments to your patches v4.
>>>> Would you mind addressing them?
>>>>
>>>> 1. should we register two migration interfaces simultaneously
>>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04750.html)
>>>
>>> Please don't do this.
>>> As far as I'm aware we currently only have one device that does that
>>> (vmxnet3) and a patch has just been posted that fixes/removes that.
>>>
>>> Dave
>>>
>> hi Dave,
>> Thanks for notifying this. but if we want to support postcopy in future,
>> after device stops, what interface could we use to transfer data of
>> device state only?
>> for postcopy, when source device stops, we need to transfer only
>> necessary device state to target vm before target vm starts, and we
>> don't want to transfer device memory as we'll do that after target vm
>> resuming.
> 
> Hmm ok, lets see; that's got to happen in the call to:
>     qemu_savevm_state_complete_precopy(fb, false, false);
> that's made from postcopy_start.
>  (the false's are iterable_only and inactivate_disks)
> 
> and at that time I believe the state is POSTCOPY_ACTIVE, so in_postcopy
> is true.
> 
> If you're doing postcopy, then you'll probably define a has_postcopy()
> function, so qemu_savevm_state_complete_precopy will skip the
> save_live_complete_precopy call from it's loop for at least two of the
> reasons in it's big if.
> 
> So you're right; you need the VMSD for this to happen in the second
> loop in qemu_savevm_state_complete_precopy.  Hmm.
> 
> Now, what worries me, and I don't know the answer, is how the section
> header for the vmstate and the section header for an iteration look
> on the stream; how are they different?
> 

I don't have a way to test postcopy migration - that is one of the major
reasons I had not included postcopy support in this patchset, as clearly
called out in the cover letter.
This patchset is thoroughly tested for precopy migration.
If anyone has hardware that supports faulting, then I would prefer to add
postcopy support as an incremental change later, which can be tested before
submitting.

Just a suggestion: instead of using VMSD, is it possible to have some
additional check to call save_live_complete_precopy from
qemu_savevm_state_complete_precopy?


>>
>>>> 2. in each save iteration, how much data is to be saved
>>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04683.html)

> how big is the data_size ?
> if this size is too big, it may take too much time and block others.

I did mention this in the comment about the structure in the vfio.h
header. data_size will be provided by the vendor driver and obviously will
not be greater than the migration region size. The vendor driver should be
responsible for keeping its solution optimized.


>>>> 3. do we need extra interface to get data for device state only
>>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04812.html)

I don't think so. Opaque device data from the vendor driver can include
device state and device memory. The vendor driver managing the device
can decide how to place the data over the stream.

>>>> 4. definition of dirty page copied_pfn
>>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg05592.html)
>>>>

This was in line with the discussion going on with Alex, and I addressed
the concern there. Please check the current patchset, which addresses the
concerns raised.

>>>> Also, I'm glad to see that you updated code by following my comments below,
>>>> but please don't forget to reply my comments next time:)

I tried to reply at the top of the threads and addressed the common
concerns raised there. Sorry if I missed any; I'll make sure to point you
to my replies going forward.

Thanks,
Kirti




* Re: [Qemu-devel] [PATCH v7 06/13] vfio: Add VM state change handler to know state of VM
  2019-07-11 12:13   ` Dr. David Alan Gilbert
@ 2019-07-11 19:14     ` Kirti Wankhede
  2019-07-22  8:23       ` Yan Zhao
  0 siblings, 1 reply; 77+ messages in thread
From: Kirti Wankhede @ 2019-07-11 19:14 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue



On 7/11/2019 5:43 PM, Dr. David Alan Gilbert wrote:
> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
>> VM state change handler gets called on change in VM's state. This is used to set
>> VFIO device state to _RUNNING.
>> VM state change handler, migration state change handler and log_sync listener
>> are called asynchronously, which sometimes lead to data corruption in migration
>> region. Initialised mutex that is used to serialize operations on migration data
>> region during saving state.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  hw/vfio/migration.c           | 64 +++++++++++++++++++++++++++++++++++++++++++
>>  hw/vfio/trace-events          |  2 ++
>>  include/hw/vfio/vfio-common.h |  4 +++
>>  3 files changed, 70 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index a2cfbd5af2e1..c01f08b659d0 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -78,6 +78,60 @@ err:
>>      return ret;
>>  }
>>  
>> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIORegion *region = &migration->region.buffer;
>> +    uint32_t device_state;
>> +    int ret = 0;
>> +
>> +    device_state = (state & VFIO_DEVICE_STATE_MASK) |
>> +                   (vbasedev->device_state & ~VFIO_DEVICE_STATE_MASK);
>> +
>> +    if ((device_state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_INVALID) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    ret = pwrite(vbasedev->fd, &device_state, sizeof(device_state),
>> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                              device_state));
>> +    if (ret < 0) {
>> +        error_report("%s: Failed to set device state %d %s",
>> +                     vbasedev->name, ret, strerror(errno));
>> +        return ret;
>> +    }
>> +
>> +    vbasedev->device_state = device_state;
>> +    trace_vfio_migration_set_state(vbasedev->name, device_state);
>> +    return 0;
>> +}
>> +
>> +static void vfio_vmstate_change(void *opaque, int running, RunState state)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    if ((vbasedev->vm_running != running)) {
>> +        int ret;
>> +        uint32_t dev_state;
>> +
>> +        if (running) {
>> +            dev_state = VFIO_DEVICE_STATE_RUNNING;
>> +        } else {
>> +            dev_state = (vbasedev->device_state & VFIO_DEVICE_STATE_MASK) &
>> +                     ~VFIO_DEVICE_STATE_RUNNING;
>> +        }
>> +
>> +        ret = vfio_migration_set_state(vbasedev, dev_state);
>> +        if (ret) {
>> +            error_report("%s: Failed to set device state 0x%x",
>> +                         vbasedev->name, dev_state);
>> +        }
>> +        vbasedev->vm_running = running;
>> +        trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
>> +                                  dev_state);
>> +    }
>> +}
>> +
>>  static int vfio_migration_init(VFIODevice *vbasedev,
>>                                 struct vfio_region_info *info)
>>  {
>> @@ -93,6 +147,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>>          return ret;
>>      }
>>  
>> +    qemu_mutex_init(&vbasedev->migration->lock);
> 
> Does this and it's friend below belong in this patch?  As far as I can
> tell you init/deinit the lock here but don't use it which is strange.
> 

This lock is used in
0009-vfio-Add-save-state-functions-to-SaveVMHandlers.patch and
0011-vfio-Add-function-to-get-dirty-page-list.patch

Hm. I'll move this init/deinit to patch 0009 in the next iteration.

Thanks,
Kirti




^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device
  2019-07-11 16:23       ` Dr. David Alan Gilbert
  2019-07-11 19:08         ` Kirti Wankhede
@ 2019-07-12  0:14         ` Yan Zhao
  1 sibling, 0 replies; 77+ messages in thread
From: Yan Zhao @ 2019-07-12  0:14 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Zhengxiao.zx@Alibaba-inc.com, Tian, Kevin, Liu, Yi L, cjia,
	eskultet, Yang, Ziye, qemu-devel, cohuck, shuangtai.tst,
	alex.williamson, Wang, Zhi A, mlevitsk, pasic, aik,
	Kirti Wankhede, eauger, felipe, jonathan.davies, Liu, Changpeng,
	Ken.Xue

On Fri, Jul 12, 2019 at 12:23:15AM +0800, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > On Thu, Jul 11, 2019 at 06:50:12PM +0800, Dr. David Alan Gilbert wrote:
> > > * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > > > Hi Kirti,
> > > > There are still unaddressed comments to your patches v4.
> > > > Would you mind addressing them?
> > > > 
> > > > 1. should we register two migration interfaces simultaneously
> > > > (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04750.html)
> > > 
> > > Please don't do this.
> > > As far as I'm aware we currently only have one device that does that
> > > (vmxnet3) and a patch has just been posted that fixes/removes that.
> > > 
> > > Dave
> > >
> > hi Dave,
> > Thanks for notifying this. but if we want to support postcopy in future,
> > after device stops, what interface could we use to transfer data of
> > device state only?
> > for postcopy, when source device stops, we need to transfer only
> > necessary device state to target vm before target vm starts, and we
> > don't want to transfer device memory as we'll do that after target vm
> > resuming.
> 
> Hmm ok, lets see; that's got to happen in the call to:
>     qemu_savevm_state_complete_precopy(fb, false, false);
> that's made from postcopy_start.
>  (the false's are iterable_only and inactivate_disks)
> 
> and at that time I believe the state is POSTCOPY_ACTIVE, so in_postcopy
> is true.
> 
> If you're doing postcopy, then you'll probably define a has_postcopy()
> function, so qemu_savevm_state_complete_precopy will skip the
> save_live_complete_precopy call from it's loop for at least two of the
> reasons in it's big if.
> 
> So you're right; you need the VMSD for this to happen in the second
> loop in qemu_savevm_state_complete_precopy.  Hmm.
> 
> Now, what worries me, and I don't know the answer, is how the section
> header for the vmstate and the section header for an iteration look
> on the stream; how are they different?
>
May we name one "vfio" and the other "vfio-vmsd", and use the iteration
interface for device memory data and the vmstate interface for device
state data?

Thanks
Yan
> Dave
> 
> > Thanks
> > Yan
> > 
> > > > 2. in each save iteration, how much data is to be saved
> > > > (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04683.html)
> > > > 3. do we need extra interface to get data for device state only
> > > > (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04812.html)
> > > > 4. definition of dirty page copied_pfn
> > > > (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg05592.html)
> > > > 
> > > > Also, I'm glad to see that you updated code by following my comments below,
> > > > but please don't forget to reply my comments next time:)
> > > > https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg05357.html
> > > > https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg06454.html
> > > > 
> > > > Thanks
> > > > Yan
> > > > 
> > > > On Tue, Jul 09, 2019 at 05:49:07PM +0800, Kirti Wankhede wrote:
> > > > > [...]
> > > --
> > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device
  2019-07-11 19:08         ` Kirti Wankhede
@ 2019-07-12  0:32           ` Yan Zhao
  2019-07-18 18:32             ` Kirti Wankhede
  2019-07-12 17:42           ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 77+ messages in thread
From: Yan Zhao @ 2019-07-12  0:32 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx@Alibaba-inc.com, Tian, Kevin, Liu, Yi L, cjia,
	eskultet, Yang, Ziye, qemu-devel, cohuck, shuangtai.tst,
	Dr. David Alan Gilbert, Wang, Zhi A, mlevitsk, pasic, aik,
	alex.williamson, eauger, felipe, jonathan.davies, Liu, Changpeng,
	Ken.Xue

On Fri, Jul 12, 2019 at 03:08:31AM +0800, Kirti Wankhede wrote:
> 
> 
> On 7/11/2019 9:53 PM, Dr. David Alan Gilbert wrote:
> > * Yan Zhao (yan.y.zhao@intel.com) wrote:
> >> On Thu, Jul 11, 2019 at 06:50:12PM +0800, Dr. David Alan Gilbert wrote:
> >>> * Yan Zhao (yan.y.zhao@intel.com) wrote:
> >>>> Hi Kirti,
> >>>> There are still unaddressed comments to your patches v4.
> >>>> Would you mind addressing them?
> >>>>
> >>>> 1. should we register two migration interfaces simultaneously
> >>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04750.html)
> >>>
> >>> Please don't do this.
> >>> As far as I'm aware we currently only have one device that does that
> >>> (vmxnet3) and a patch has just been posted that fixes/removes that.
> >>>
> >>> Dave
> >>>
> >> hi Dave,
> >> Thanks for notifying this. but if we want to support postcopy in future,
> >> after device stops, what interface could we use to transfer data of
> >> device state only?
> >> for postcopy, when source device stops, we need to transfer only
> >> necessary device state to target vm before target vm starts, and we
> >> don't want to transfer device memory as we'll do that after target vm
> >> resuming.
> > 
> > Hmm ok, lets see; that's got to happen in the call to:
> >     qemu_savevm_state_complete_precopy(fb, false, false);
> > that's made from postcopy_start.
> >  (the false's are iterable_only and inactivate_disks)
> > 
> > and at that time I believe the state is POSTCOPY_ACTIVE, so in_postcopy
> > is true.
> > 
> > If you're doing postcopy, then you'll probably define a has_postcopy()
> > function, so qemu_savevm_state_complete_precopy will skip the
> > save_live_complete_precopy call from it's loop for at least two of the
> > reasons in it's big if.
> > 
> > So you're right; you need the VMSD for this to happen in the second
> > loop in qemu_savevm_state_complete_precopy.  Hmm.
> > 
> > Now, what worries me, and I don't know the answer, is how the section
> > header for the vmstate and the section header for an iteration look
> > on the stream; how are they different?
> > 
> 
> I don't have a way to test postcopy migration - that is one of the
> major reasons I did not include postcopy support in this patchset, as
> clearly called out in the cover letter.
> This patchset is thoroughly tested for precopy migration.
> If anyone has hardware that supports faulting, I would prefer to add
> postcopy support as an incremental change later, so it can be tested
> before submitting.
> 
> Just a suggestion: instead of using a VMSD, is it possible to have an
> additional check to call save_live_complete_precopy from
> qemu_savevm_state_complete_precopy?
> 
> 
> >>
> >>>> 2. in each save iteration, how much data is to be saved
> >>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04683.html)
> 
> > how big is the data_size ?
> > if this size is too big, it may take too much time and block others.
> 
> I did mention this in the comment about the structure in the vfio.h
> header. data_size will be provided by the vendor driver and obviously
> will not be greater than the migration region size. The vendor driver
> should be responsible for keeping its solution optimized.
>
>
If the data_size is no bigger than the migration region size, and each
iteration only saves data_size bytes, I'm afraid it will cause prolonged
downtime; after all, the migration region size cannot be very big.
Also, if the vendor driver alone determines how much data to save in
each iteration, with no checks in QEMU, it may squeeze other devices'
migration time.

> 
> >>>> 3. do we need extra interface to get data for device state only
> >>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04812.html)
> 
> I don't think so. Opaque device data from the vendor driver can include
> device state and device memory. The vendor driver managing the device
> can decide how to place data on the stream.
>
>
I know the current design is opaque device data. Then, to support
postcopy, we may have to add an extra device state like in-postcopy; but
postcopy is more of a QEMU state, not a device state.
To address it elegantly, may we add an extra interface besides
vfio_save_buffer() to get data for device state only?

> >>>> 4. definition of dirty page copied_pfn
> >>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg05592.html)
> >>>>
> 
> This was inline with the discussion going on with Alex. I addressed the
> concern there. Please check the current patchset, which addresses the
> concerns raised.
>
OK. I saw you also updated the flow in that part; please check my
comment in that patch for details. As a suggestion, I think
processed_pfns is a better name than copied_pfns :)

> >>>> Also, I'm glad to see that you updated code by following my comments below,
> >>>> but please don't forget to reply my comments next time:)
> 
> I tried to reply at the top of threads and addressed the common
> concerns raised there. Sorry if I missed any; I'll make sure to point
> you to my replies going ahead.
> 
ok. let's cooperate:)

Thanks
Yan

> Thanks,
> Kirti
> 
> >>>> https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg05357.html
> >>>> https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg06454.html
> >>>>
> >>>> Thanks
> >>>> Yan
> >>>>
> >>>> On Tue, Jul 09, 2019 at 05:49:07PM +0800, Kirti Wankhede wrote:
> >>>>> [...]
> >>> --
> >>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 11/13] vfio: Add function to get dirty page list
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 11/13] vfio: Add function to get dirty page list Kirti Wankhede
@ 2019-07-12  0:33   ` Yan Zhao
  2019-07-18 18:39     ` Kirti Wankhede
  2019-07-22  8:39   ` Yan Zhao
  1 sibling, 1 reply; 77+ messages in thread
From: Yan Zhao @ 2019-07-12  0:33 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Tue, Jul 09, 2019 at 05:49:18PM +0800, Kirti Wankhede wrote:
> Dirty page tracking (.log_sync) is part of RAM copying state, where
> vendor driver provides the bitmap of pages which are dirtied by vendor
> driver through migration region and as part of RAM copy, those pages
> gets copied to file stream.
> 
> To get dirty page bitmap:
> - write start address, page_size and pfn count.
> - read count of pfns copied.
>     - Vendor driver should return 0 if driver doesn't have any page to
>       report dirty in given range.
>     - Vendor driver should return -1 to mark all pages dirty for given range.
> - read data_offset, where vendor driver has written bitmap.
> - read bitmap from the region or mmaped part of the region.
> - Iterate above steps till page bitmap for all requested pfns are copied.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c           | 123 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |   1 +
>  include/hw/vfio/vfio-common.h |   2 +
>  3 files changed, 126 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 5fb4c5329ede..ca1a8c0f5f1f 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -269,6 +269,129 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>      return qemu_file_get_error(f);
>  }
>  
> +void vfio_get_dirty_page_list(VFIODevice *vbasedev,
> +                              uint64_t start_pfn,
> +                              uint64_t pfn_count,
> +                              uint64_t page_size)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    uint64_t count = 0;
> +    int64_t copied_pfns = 0;
> +    int64_t total_pfns = pfn_count;
> +    int ret;
> +
> +    qemu_mutex_lock(&migration->lock);
> +
> +    while (total_pfns > 0) {
> +        uint64_t bitmap_size, data_offset = 0;
> +        uint64_t start = start_pfn + count;
> +        void *buf = NULL;
> +        bool buffer_mmaped = false;
> +
> +        ret = pwrite(vbasedev->fd, &start, sizeof(start),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              start_pfn));
> +        if (ret < 0) {
> +            error_report("%s: Failed to set dirty pages start address %d %s",
> +                         vbasedev->name, ret, strerror(errno));
> +            goto dpl_unlock;
> +        }
> +
> +        ret = pwrite(vbasedev->fd, &page_size, sizeof(page_size),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              page_size));
> +        if (ret < 0) {
> +            error_report("%s: Failed to set dirty page size %d %s",
> +                         vbasedev->name, ret, strerror(errno));
> +            goto dpl_unlock;
> +        }
> +
> +        ret = pwrite(vbasedev->fd, &total_pfns, sizeof(total_pfns),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              total_pfns));
> +        if (ret < 0) {
> +            error_report("%s: Failed to set dirty page total pfns %d %s",
> +                         vbasedev->name, ret, strerror(errno));
> +            goto dpl_unlock;
> +        }
> +
> +        /* Read copied dirty pfns */
> +        ret = pread(vbasedev->fd, &copied_pfns, sizeof(copied_pfns),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             copied_pfns));
> +        if (ret < 0) {
> +            error_report("%s: Failed to get dirty pages bitmap count %d %s",
> +                         vbasedev->name, ret, strerror(errno));
> +            goto dpl_unlock;
> +        }
> +
> +        if (copied_pfns == VFIO_DEVICE_DIRTY_PFNS_NONE) {
> +            /*
> +             * copied_pfns could be 0 if driver doesn't have any page to
> +             * report dirty in given range
> +             */
> +            break;
> +        } else if (copied_pfns == VFIO_DEVICE_DIRTY_PFNS_ALL) {
> +            /* Mark all pages dirty for this range */
> +            cpu_physical_memory_set_dirty_range(start_pfn * page_size,
> +                                                pfn_count * page_size,
> +                                                DIRTY_MEMORY_MIGRATION);
Seems pfn_count here is not right.
> +            break;
> +        }
> +
> +        bitmap_size = (BITS_TO_LONGS(copied_pfns) + 1) * sizeof(unsigned long);
> +
> +        ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data_offset));
> +        if (ret != sizeof(data_offset)) {
> +            error_report("%s: Failed to get migration buffer data offset %d",
> +                         vbasedev->name, ret);
> +            goto dpl_unlock;
> +        }
> +
> +        if (region->mmaps) {
> +            buf = find_data_region(region, data_offset, bitmap_size);
> +        }
> +
> +        buffer_mmaped = (buf != NULL) ? true : false;
> +
> +        if (!buffer_mmaped) {
> +            buf = g_try_malloc0(bitmap_size);
> +            if (!buf) {
> +                error_report("%s: Error allocating buffer ", __func__);
> +                goto dpl_unlock;
> +            }
> +
> +            ret = pread(vbasedev->fd, buf, bitmap_size,
> +                        region->fd_offset + data_offset);
> +            if (ret != bitmap_size) {
> +                error_report("%s: Failed to get dirty pages bitmap %d",
> +                             vbasedev->name, ret);
> +                g_free(buf);
> +                goto dpl_unlock;
> +            }
> +        }
> +
> +        cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
> +                                               (start_pfn + count) * page_size,
> +                                                copied_pfns);
> +        count      += copied_pfns;
> +        total_pfns -= copied_pfns;
> +
> +        if (!buffer_mmaped) {
> +            g_free(buf);
> +        }
> +    }
> +
> +    trace_vfio_get_dirty_page_list(vbasedev->name, start_pfn, pfn_count,
> +                                   page_size);
> +
> +dpl_unlock:
> +    qemu_mutex_unlock(&migration->lock);
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index ac065b559f4e..414a5e69ec5e 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -160,3 +160,4 @@ vfio_save_complete_precopy(char *name) " (%s)"
>  vfio_load_device_config_state(char *name) " (%s)"
>  vfio_load_state(char *name, uint64_t data) " (%s) data 0x%"PRIx64
>  vfio_load_state_device_data(char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
> +vfio_get_dirty_page_list(char *name, uint64_t start, uint64_t pfn_count, uint64_t page_size) " (%s) start 0x%"PRIx64" pfn_count 0x%"PRIx64 " page size 0x%"PRIx64
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index a022484d2636..dc1b83a0b4ef 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -222,5 +222,7 @@ int vfio_spapr_remove_window(VFIOContainer *container,
>  
>  int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
>  void vfio_migration_finalize(VFIODevice *vbasedev);
> +void vfio_get_dirty_page_list(VFIODevice *vbasedev, uint64_t start_pfn,
> +                               uint64_t pfn_count, uint64_t page_size);
>  
>  #endif /* HW_VFIO_VFIO_COMMON_H */
> -- 
> 2.7.0
> 


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 09/13] vfio: Add save state functions to SaveVMHandlers
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 09/13] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
@ 2019-07-12  2:44   ` Yan Zhao
  2019-07-18 18:45     ` Kirti Wankhede
  2019-07-17  2:50   ` Yan Zhao
  1 sibling, 1 reply; 77+ messages in thread
From: Yan Zhao @ 2019-07-12  2:44 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Tue, Jul 09, 2019 at 05:49:16PM +0800, Kirti Wankhede wrote:
> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
> functions. These functions handle the pre-copy and stop-and-copy phases.
> 
> In _SAVING|_RUNNING device state or pre-copy phase:
> - read pending_bytes
> - read data_offset - this signals the kernel driver to write data to the
>   staging buffer, which is mmapped.
> - read data_size - amount of data in bytes written by the vendor driver in
>   the migration region.
> - if the data section is trapped, pread() data_size bytes from data_offset.
> - if the data section is mmapped, read data_size bytes from the mmapped buffer.
> - Write the data packet to the file stream as below:
> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> VFIO_MIG_FLAG_END_OF_STATE }
> 
> In _SAVING device state or stop-and-copy phase:
> a. read the config space of the device and save it to the migration file
>    stream. This doesn't need to come from the vendor driver. Any other
>    special config state from the driver can be saved as data in a following
>    iteration.
> b. read pending_bytes
> c. read data_offset - this signals the kernel driver to write data to the
>    staging buffer, which is mmapped.
> d. read data_size - amount of data in bytes written by the vendor driver in
>    the migration region.
> e. if the data section is trapped, pread() data_size bytes from data_offset.
> f. if the data section is mmapped, read data_size bytes from the mmapped buffer.
> g. Write the data packet as below:
>    {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> h. iterate through steps b to g while (pending_bytes > 0)
> i. Write {VFIO_MIG_FLAG_END_OF_STATE}
> 
> When the data region is mapped, it is the user's responsibility to read
> data_size bytes of data from data_offset before moving to the next step.
> 
> .save_live_iterate runs outside the iothread lock in the migration case,
> which could race with the asynchronous call to get the dirty page list,
> causing data corruption in the mapped migration region. A mutex is added
> here to serialize migration buffer read operations.
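As a hedged illustration of the packet framing the commit message describes (this is not QEMU code; the flag values below are assumptions, and QEMU writes through a QEMUFile rather than a flat buffer), each device-data packet is a big-endian 64-bit flag word, a data_size word, data_size opaque bytes, then an end marker:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>
#include <assert.h>

/* Flag values assumed for illustration only. */
#define VFIO_MIG_FLAG_DEV_DATA_STATE 0xffffffffef100004ULL
#define VFIO_MIG_FLAG_END_OF_STATE   0xffffffffef100001ULL

/* Store a 64-bit value most-significant byte first (network/"be64" order). */
static size_t put_be64(uint8_t *out, uint64_t v)
{
    for (int i = 0; i < 8; i++) {
        out[i] = (uint8_t)(v >> (56 - 8 * i));
    }
    return 8;
}

/* Frame one device-data packet as described above; returns bytes written.
 * Layout: {DEV_DATA_STATE, data_size, actual data, END_OF_STATE}. */
static size_t frame_data_packet(uint8_t *out, const uint8_t *data,
                                uint64_t data_size)
{
    size_t n = 0;
    n += put_be64(out + n, VFIO_MIG_FLAG_DEV_DATA_STATE);
    n += put_be64(out + n, data_size);
    memcpy(out + n, data, data_size);   /* opaque vendor-driver data */
    n += data_size;
    n += put_be64(out + n, VFIO_MIG_FLAG_END_OF_STATE);
    return n;
}
```

In the stop-and-copy loop the END_OF_STATE marker is written once after all iterations rather than per packet; the sketch above mirrors the per-call framing of .save_live_iterate.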
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c  | 246 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events |   6 ++
>  2 files changed, 252 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 0597a45fda2d..4e9b4cce230b 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -117,6 +117,138 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
>      return 0;
>  }
>  
> +static void *find_data_region(VFIORegion *region,
> +                              uint64_t data_offset,
> +                              uint64_t data_size)
> +{
> +    void *ptr = NULL;
> +    int i;
> +
> +    for (i = 0; i < region->nr_mmaps; i++) {
> +        if ((data_offset >= region->mmaps[i].offset) &&
> +            (data_offset < region->mmaps[i].offset + region->mmaps[i].size) &&
> +            (data_size <= region->mmaps[i].size)) {
> +            ptr = region->mmaps[i].mmap + (data_offset -
> +                                           region->mmaps[i].offset);
> +            break;
> +        }
> +    }
> +    return ptr;
> +}
> +
> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    uint64_t data_offset = 0, data_size = 0;
> +    int ret;
> +
> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data_offset));
> +    if (ret != sizeof(data_offset)) {
> +        error_report("%s: Failed to get migration buffer data offset %d",
> +                     vbasedev->name, ret);
> +        return -EINVAL;
> +    }
> +
> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data_size));
> +    if (ret != sizeof(data_size)) {
> +        error_report("%s: Failed to get migration buffer data size %d",
> +                     vbasedev->name, ret);
> +        return -EINVAL;
> +    }
> +
> +    if (data_size > 0) {
> +        void *buf = NULL;
> +        bool buffer_mmaped;
> +
> +        if (region->mmaps) {
> +            buf = find_data_region(region, data_offset, data_size);
> +        }
> +
> +        buffer_mmaped = (buf != NULL) ? true : false;
> +
> +        if (!buffer_mmaped) {
> +            buf = g_try_malloc0(data_size);
> +            if (!buf) {
> +                error_report("%s: Error allocating buffer ", __func__);
> +                return -ENOMEM;
> +            }
> +
> +            ret = pread(vbasedev->fd, buf, data_size,
> +                        region->fd_offset + data_offset);
> +            if (ret != data_size) {
> +                error_report("%s: Failed to get migration data %d",
> +                             vbasedev->name, ret);
> +                g_free(buf);
> +                return -EINVAL;
> +            }
> +        }
> +
> +        qemu_put_be64(f, data_size);
> +        qemu_put_buffer(f, buf, data_size);
> +
> +        if (!buffer_mmaped) {
> +            g_free(buf);
> +        }
> +        migration->pending_bytes -= data_size;
This line "migration->pending_bytes -= data_size;" is not necessary, as
it will be updated anyway in vfio_update_pending()

> +    } else {
> +        qemu_put_be64(f, data_size);
> +    }
> +
> +    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
> +                           migration->pending_bytes);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return data_size;
> +}
> +
> +static int vfio_update_pending(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    uint64_t pending_bytes = 0;
> +    int ret;
> +
> +    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             pending_bytes));
> +    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
> +        error_report("%s: Failed to get pending bytes %d",
> +                     vbasedev->name, ret);
> +        migration->pending_bytes = 0;
> +        return (ret < 0) ? ret : -EINVAL;
> +    }
> +
> +    migration->pending_bytes = pending_bytes;
> +    trace_vfio_update_pending(vbasedev->name, pending_bytes);
> +    return 0;
> +}
> +
> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
> +
> +    if (vbasedev->ops && vbasedev->ops->vfio_save_config) {
> +        vbasedev->ops->vfio_save_config(vbasedev, f);
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    trace_vfio_save_device_config_state(vbasedev->name);
> +
> +    return qemu_file_get_error(f);
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -178,9 +310,123 @@ static void vfio_save_cleanup(void *opaque)
>      trace_vfio_save_cleanup(vbasedev->name);
>  }
>  
> +static void vfio_save_pending(QEMUFile *f, void *opaque,
> +                              uint64_t threshold_size,
> +                              uint64_t *res_precopy_only,
> +                              uint64_t *res_compatible,
> +                              uint64_t *res_postcopy_only)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    ret = vfio_update_pending(vbasedev);
> +    if (ret) {
> +        return;
> +    }
> +
> +    *res_precopy_only += migration->pending_bytes;
> +
> +    trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
> +                            *res_postcopy_only, *res_compatible);
> +}
> +
> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret, data_size;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +
> +    qemu_mutex_lock(&migration->lock);
> +    data_size = vfio_save_buffer(f, vbasedev);
> +    qemu_mutex_unlock(&migration->lock);
> +
> +    if (data_size < 0) {
> +        error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
> +                     strerror(errno));
> +        return data_size;
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    trace_vfio_save_iterate(vbasedev->name, data_size);
> +    if (data_size == 0) {
> +        /* indicates data finished, goto complete phase */
> +        return 1;
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_SAVING);
> +    if (ret) {
> +        error_report("%s: Failed to set state STOP and SAVING",
> +                     vbasedev->name);
> +        return ret;
> +    }
> +
> +    ret = vfio_save_device_config_state(f, opaque);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_update_pending(vbasedev);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    while (migration->pending_bytes > 0) {
> +        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +        ret = vfio_save_buffer(f, vbasedev);
> +        if (ret < 0) {
> +            error_report("%s: Failed to save buffer", vbasedev->name);
> +            return ret;
> +        } else if (ret == 0) {
> +            break;
> +        }
> +
> +        ret = vfio_update_pending(vbasedev);
> +        if (ret) {
> +            return ret;
> +        }
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_MASK);
> +    if (ret) {
> +        error_report("%s: Failed to set state STOPPED", vbasedev->name);
> +        return ret;
> +    }
> +
> +    trace_vfio_save_complete_precopy(vbasedev->name);
> +    return ret;
> +}
> +
>  static SaveVMHandlers savevm_vfio_handlers = {
>      .save_setup = vfio_save_setup,
>      .save_cleanup = vfio_save_cleanup,
> +    .save_live_pending = vfio_save_pending,
> +    .save_live_iterate = vfio_save_iterate,
> +    .save_live_complete_precopy = vfio_save_complete_precopy,
>  };
>  
>  /* ---------------------------------------------------------------------- */
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 4bb43f18f315..bdf40ba368c7 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -151,3 +151,9 @@ vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_st
>  vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
>  vfio_save_setup(char *name) " (%s)"
>  vfio_save_cleanup(char *name) " (%s)"
> +vfio_save_buffer(char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
> +vfio_update_pending(char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
> +vfio_save_device_config_state(char *name) " (%s)"
> +vfio_save_pending(char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
> +vfio_save_iterate(char *name, int data_size) " (%s) data_size %d"
> +vfio_save_complete_precopy(char *name) " (%s)"
> -- 
> 2.7.0
> 



* Re: [Qemu-devel] [PATCH v7 10/13] vfio: Add load state functions to SaveVMHandlers
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 10/13] vfio: Add load " Kirti Wankhede
@ 2019-07-12  2:52   ` Yan Zhao
  2019-07-18 19:00     ` Kirti Wankhede
  0 siblings, 1 reply; 77+ messages in thread
From: Yan Zhao @ 2019-07-12  2:52 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Tue, Jul 09, 2019 at 05:49:17PM +0800, Kirti Wankhede wrote:
> Flow during the _RESUMING device state:
> - If the vendor driver defines a mappable region, mmap the migration region.
> - Load config state.
> - For each data packet, until VFIO_MIG_FLAG_END_OF_STATE is reached:
>     - read data_size from the packet, read a buffer of data_size
>     - read data_offset, which tells QEMU where to write the data.
>         if the region is mmapped, write data_size bytes of data to the
>         mmapped region.
>     - write data_size.
>         In the case of an mmapped region, the write to data_size indicates
>         to the kernel driver that the data has been written to the staging
>         buffer.
>     - if the region is trapped, pwrite() data_size bytes of data at
>       data_offset.
> - Repeat the above until VFIO_MIG_FLAG_END_OF_STATE.
> - Unmap the migration region.
> 
> To the user, the data is opaque. The user should write data in the same
> order as it was received.
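The load-side loop described above can be sketched as a minimal parser (a hedged illustration, not QEMU code; the flag values are assumptions, and the real .load_state also handles CONFIG/SETUP sections and writes the payload into the migration region instead of skipping it):

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Flag values assumed for illustration only. */
#define VFIO_MIG_FLAG_END_OF_STATE   0xffffffffef100001ULL
#define VFIO_MIG_FLAG_DEV_DATA_STATE 0xffffffffef100004ULL

/* Read a big-endian 64-bit word. */
static uint64_t get_be64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) {
        v = (v << 8) | p[i];
    }
    return v;
}

/* Walk a stream of {DEV_DATA_STATE, data_size, data} packets terminated by
 * END_OF_STATE.  Returns total opaque payload bytes seen, or -1 on an
 * unknown flag or a truncated stream. */
static long parse_device_stream(const uint8_t *buf, size_t len)
{
    size_t pos = 0;
    long payload = 0;

    while (pos + 8 <= len) {
        uint64_t flag = get_be64(buf + pos);
        pos += 8;
        if (flag == VFIO_MIG_FLAG_END_OF_STATE) {
            return payload;
        }
        if (flag != VFIO_MIG_FLAG_DEV_DATA_STATE) {
            return -1;                            /* unexpected section */
        }
        uint64_t data_size = get_be64(buf + pos); /* next word: payload size */
        pos += 8 + data_size;                     /* skip opaque device data */
        payload += (long)data_size;
    }
    return -1;                                    /* no END_OF_STATE found */
}
```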
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c  | 162 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events |   3 +
>  2 files changed, 165 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 4e9b4cce230b..5fb4c5329ede 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -249,6 +249,26 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
>      return qemu_file_get_error(f);
>  }
>  
> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    uint64_t data;
> +
> +    if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
> +        vbasedev->ops->vfio_load_config(vbasedev, f);
> +    }
> +
> +    data = qemu_get_be64(f);
> +    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
> +        error_report("%s: Failed loading device config space, "
> +                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
> +        return -EINVAL;
> +    }
> +
> +    trace_vfio_load_device_config_state(vbasedev->name);
> +    return qemu_file_get_error(f);
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -421,12 +441,154 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>      return ret;
>  }
>  
> +static int vfio_load_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret = 0;
> +
> +    if (migration->region.buffer.mmaps) {
> +        ret = vfio_region_mmap(&migration->region.buffer);
> +        if (ret) {
> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> +                         vbasedev->name, migration->region.index,
> +                         strerror(-ret));
> +            return ret;
> +        }
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING);
> +    if (ret) {
> +        error_report("%s: Failed to set state RESUMING", vbasedev->name);
> +    }
> +    return ret;
> +}
> +
> +static int vfio_load_cleanup(void *opaque)
> +{
> +    vfio_save_cleanup(opaque);
> +    return 0;
> +}
> +
> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret = 0;
> +    uint64_t data, data_size;
> +
I think checking of version_id is still needed.

Thanks
Yan

> +    data = qemu_get_be64(f);
> +    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
> +
> +        trace_vfio_load_state(vbasedev->name, data);
> +
> +        switch (data) {
> +        case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
> +        {
> +            ret = vfio_load_device_config_state(f, opaque);
> +            if (ret) {
> +                return ret;
> +            }
> +            break;
> +        }
> +        case VFIO_MIG_FLAG_DEV_SETUP_STATE:
> +        {
> +            data = qemu_get_be64(f);
> +            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
> +                return ret;
> +            } else {
> +                error_report("%s: SETUP STATE: EOS not found 0x%"PRIx64,
> +                             vbasedev->name, data);
> +                return -EINVAL;
> +            }
> +            break;
> +        }
> +        case VFIO_MIG_FLAG_DEV_DATA_STATE:
> +        {
> +            VFIORegion *region = &migration->region.buffer;
> +            void *buf = NULL;
> +            bool buffer_mmaped = false;
> +            uint64_t data_offset = 0;
> +
> +            data_size = qemu_get_be64(f);
> +            if (data_size == 0) {
> +                break;
> +            }
> +
> +            ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> +                        region->fd_offset +
> +                        offsetof(struct vfio_device_migration_info,
> +                        data_offset));
> +            if (ret != sizeof(data_offset)) {
> +                error_report("%s:Failed to get migration buffer data offset %d",
> +                             vbasedev->name, ret);
> +                return -EINVAL;
> +            }
> +
> +            if (region->mmaps) {
> +                buf = find_data_region(region, data_offset, data_size);
> +            }
> +
> +            buffer_mmaped = (buf != NULL) ? true : false;
> +
> +            if (!buffer_mmaped) {
> +                buf = g_try_malloc0(data_size);
> +                if (!buf) {
> +                    error_report("%s: Error allocating buffer ", __func__);
> +                    return -ENOMEM;
> +                }
> +            }
> +
> +            qemu_get_buffer(f, buf, data_size);
> +
> +            ret = pwrite(vbasedev->fd, &data_size, sizeof(data_size),
> +                         region->fd_offset +
> +                       offsetof(struct vfio_device_migration_info, data_size));
> +            if (ret != sizeof(data_size)) {
> +                error_report("%s: Failed to set migration buffer data size %d",
> +                             vbasedev->name, ret);
> +                if (!buffer_mmaped) {
> +                    g_free(buf);
> +                }
> +                return -EINVAL;
> +            }
> +
> +            if (!buffer_mmaped) {
> +                ret = pwrite(vbasedev->fd, buf, data_size,
> +                             region->fd_offset + data_offset);
> +                g_free(buf);
> +
> +                if (ret != data_size) {
> +                    error_report("%s: Failed to set migration buffer %d",
> +                                 vbasedev->name, ret);
> +                    return -EINVAL;
> +                }
> +            }
> +            trace_vfio_load_state_device_data(vbasedev->name, data_offset,
> +                                              data_size);
> +            break;
> +        }
> +        }
> +
> +        ret = qemu_file_get_error(f);
> +        if (ret) {
> +            return ret;
> +        }
> +        data = qemu_get_be64(f);
> +    }
> +
> +    return ret;
> +}
> +
>  static SaveVMHandlers savevm_vfio_handlers = {
>      .save_setup = vfio_save_setup,
>      .save_cleanup = vfio_save_cleanup,
>      .save_live_pending = vfio_save_pending,
>      .save_live_iterate = vfio_save_iterate,
>      .save_live_complete_precopy = vfio_save_complete_precopy,
> +    .load_setup = vfio_load_setup,
> +    .load_cleanup = vfio_load_cleanup,
> +    .load_state = vfio_load_state,
>  };
>  
>  /* ---------------------------------------------------------------------- */
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index bdf40ba368c7..ac065b559f4e 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -157,3 +157,6 @@ vfio_save_device_config_state(char *name) " (%s)"
>  vfio_save_pending(char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
>  vfio_save_iterate(char *name, int data_size) " (%s) data_size %d"
>  vfio_save_complete_precopy(char *name) " (%s)"
> +vfio_load_device_config_state(char *name) " (%s)"
> +vfio_load_state(char *name, uint64_t data) " (%s) data 0x%"PRIx64
> +vfio_load_state_device_data(char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
> -- 
> 2.7.0
> 



* Re: [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device
  2019-07-11 19:08         ` Kirti Wankhede
  2019-07-12  0:32           ` Yan Zhao
@ 2019-07-12 17:42           ` Dr. David Alan Gilbert
  2019-07-15  0:35             ` Yan Zhao
  1 sibling, 1 reply; 77+ messages in thread
From: Dr. David Alan Gilbert @ 2019-07-12 17:42 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx@Alibaba-inc.com, Tian, Kevin, Liu, Yi L, cjia,
	eskultet, Yang, Ziye, cohuck, shuangtai.tst, qemu-devel, Wang,
	Zhi A, mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Yan Zhao, Liu, Changpeng, Ken.Xue

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> 
> 
> On 7/11/2019 9:53 PM, Dr. David Alan Gilbert wrote:
> > * Yan Zhao (yan.y.zhao@intel.com) wrote:
> >> On Thu, Jul 11, 2019 at 06:50:12PM +0800, Dr. David Alan Gilbert wrote:
> >>> * Yan Zhao (yan.y.zhao@intel.com) wrote:
> >>>> Hi Kirti,
> >>>> There are still unaddressed comments to your patches v4.
> >>>> Would you mind addressing them?
> >>>>
> >>>> 1. should we register two migration interfaces simultaneously
> >>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04750.html)
> >>>
> >>> Please don't do this.
> >>> As far as I'm aware we currently only have one device that does that
> >>> (vmxnet3) and a patch has just been posted that fixes/removes that.
> >>>
> >>> Dave
> >>>
> >> hi Dave,
> >> Thanks for notifying this. But if we want to support postcopy in the
> >> future, after the device stops, what interface could we use to transfer
> >> the device state data only?
> >> For postcopy, when the source device stops, we need to transfer only the
> >> necessary device state to the target VM before the target VM starts, and
> >> we don't want to transfer device memory, as we'll do that after the
> >> target VM resumes.
> > 
> > Hmm ok, let's see; that's got to happen in the call to:
> >     qemu_savevm_state_complete_precopy(fb, false, false);
> > that's made from postcopy_start.
> >  (the false's are iterable_only and inactivate_disks)
> > 
> > and at that time I believe the state is POSTCOPY_ACTIVE, so in_postcopy
> > is true.
> > 
> > If you're doing postcopy, then you'll probably define a has_postcopy()
> > function, so qemu_savevm_state_complete_precopy will skip the
> > save_live_complete_precopy call from its loop for at least two of the
> > reasons in its big if.
> > 
> > So you're right; you need the VMSD for this to happen in the second
> > loop in qemu_savevm_state_complete_precopy.  Hmm.
> > 
> > Now, what worries me, and I don't know the answer, is how the section
> > header for the vmstate and the section header for an iteration look
> > on the stream; how are they different?
> > 
> 
> I don't have a way to test postcopy migration - that is one of the major
> reasons I did not include postcopy support in this patchset, as clearly
> called out in the cover letter.
> This patchset is thoroughly tested for precopy migration.
> If anyone has hardware that supports faulting, then I would prefer to add
> postcopy support as an incremental change later, which can be tested before
> submitting.

Agreed; although I think Yan's right to think about how it's going to
work.

> Just a suggestion: instead of using VMSD, is it possible to have some
> additional check to call save_live_complete_precopy from
> qemu_savevm_state_complete_precopy?

Probably yes; although as you can tell the logic in there is already
pretty hairy.

Dave

> 
> >>
> >>>> 2. in each save iteration, how much data is to be saved
> >>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04683.html)
> 
> > how big is the data_size ?
> > if this size is too big, it may take too much time and block others.
> 
> I did mention this in the comment about the structure in the vfio.h
> header. data_size will be provided by the vendor driver and obviously will
> not be greater than the migration region size. The vendor driver should be
> responsible for keeping its solution optimized.
> 
> 
> >>>> 3. do we need extra interface to get data for device state only
> >>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04812.html)
> 
> I don't think so. Opaque device data from the vendor driver can include
> device state and device memory. The vendor driver, which manages the
> device, can decide how to lay out the data in the stream.
> 
> >>>> 4. definition of dirty page copied_pfn
> >>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg05592.html)
> >>>>
> 
> This was in line with the ongoing discussion with Alex; I addressed the
> concern there. Please check the current patchset, which addresses the
> concerns raised.
> 
> >>>> Also, I'm glad to see that you updated code by following my comments below,
> >>>> but please don't forget to reply my comments next time:)
> 
> I tried to reply at the top of threads and addressed the common concerns
> raised there. Sorry if I missed any; I'll make sure to point you to my
> replies going forward.
> 
> Thanks,
> Kirti
> 
> >>>> https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg05357.html
> >>>> https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg06454.html
> >>>>
> >>>> Thanks
> >>>> Yan
> >>>>
> >>>> On Tue, Jul 09, 2019 at 05:49:07PM +0800, Kirti Wankhede wrote:
> >>>>> Add migration support for VFIO device
> >>>>>
> >>>>> This Patch set include patches as below:
> >>>>> - Define KABI for VFIO device for migration support.
> >>>>> - Added save and restore functions for PCI configuration space
> >>>>> - Generic migration functionality for VFIO device.
> >>>>>   * This patch set adds functionality only for PCI devices, but can be
> >>>>>     extended to other VFIO devices.
> >>>>>   * Added all the basic functions required for pre-copy, stop-and-copy and
> >>>>>     resume phases of migration.
> >>>>>   * Added a state change notifier; from that notifier function, the VFIO
> >>>>>     device's state change is conveyed to the VFIO device driver.
> >>>>>   * During save setup phase and resume/load setup phase, migration region
> >>>>>     is queried and is used to read/write VFIO device data.
> >>>>>   * .save_live_pending and .save_live_iterate are implemented to use QEMU's
> >>>>>     functionality of iteration during pre-copy phase.
> >>>>>   * In .save_live_complete_precopy, that is in the stop-and-copy phase,
> >>>>>     iteration to read data from the VFIO device driver is implemented
> >>>>>     until the pending bytes returned by the driver are zero.
> >>>>>   * Added function to get dirty pages bitmap for the pages which are used by
> >>>>>     driver.
> >>>>> - Add vfio_listerner_log_sync to mark dirty pages.
> >>>>> - Make VFIO PCI device migration capable. If migration region is not provided by
> >>>>>   driver, migration is blocked.
> >>>>>
> >>>>> Below is the flow of state change for live migration where states in brackets
> >>>>> represent VM state, migration state and VFIO device state as:
> >>>>>     (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)
> >>>>>
> >>>>> Live migration save path:
> >>>>>         QEMU normal running state
> >>>>>         (RUNNING, _NONE, _RUNNING)
> >>>>>                         |
> >>>>>     migrate_init spawns migration_thread.
> >>>>>     (RUNNING, _SETUP, _RUNNING|_SAVING)
> >>>>>     Migration thread then calls each device's .save_setup()
> >>>>>                         |
> >>>>>     (RUNNING, _ACTIVE, _RUNNING|_SAVING)
> >>>>>     If the device is active, get pending bytes by .save_live_pending();
> >>>>>     if pending bytes >= threshold_size, call save_live_iterate().
> >>>>>     Data of the VFIO device for the pre-copy phase is copied.
> >>>>>     Iterate until pending bytes converge and drop below the threshold
> >>>>>                         |
> >>>>>     On migration completion, vCPUs stop and .save_live_complete_precopy
> >>>>>     is called for each active device. The VFIO device is then
> >>>>>     transitioned to the _SAVING state.
> >>>>>     (FINISH_MIGRATE, _DEVICE, _SAVING)
> >>>>>     For VFIO device, iterate in  .save_live_complete_precopy  until
> >>>>>     pending data is 0.
> >>>>>     (FINISH_MIGRATE, _DEVICE, _STOPPED)
> >>>>>                         |
> >>>>>     (FINISH_MIGRATE, _COMPLETED, STOPPED)
> >>>>>     Migration thread schedules the cleanup bottom half and exits
> >>>>>
> >>>>> Live migration resume path:
> >>>>>     Incoming migration calls .load_setup for each device
> >>>>>     (RESTORE_VM, _ACTIVE, STOPPED)
> >>>>>                         |
> >>>>>     For each device, .load_state is called for that device section data
> >>>>>                         |
> >>>>>     At the end, .load_cleanup is called for each device and vCPUs are started.
> >>>>>                         |
> >>>>>         (RUNNING, _NONE, _RUNNING)
> >>>>>
> >>>>> Note that:
> >>>>> - Migration post copy is not supported.
> >>>>>
> >>>>> v6 -> v7:
> >>>>> - Fix build failures.
> >>>>>
> >>>>> v5 -> v6:
> >>>>> - Fix build failure.
> >>>>>
> >>>>> v4 -> v5:
> >>>>> - Added a descriptive comment about the sequence in which members of the
> >>>>>   vfio_device_migration_info structure should be accessed, based on
> >>>>>   Alex's suggestion
> >>>>> - Updated get dirty pages sequence.
> >>>>> - As per Cornelia Huck's suggestion, added callbacks to VFIODeviceOps to
> >>>>>   get_object, save_config and load_config.
> >>>>> - Fixed multiple nit picks.
> >>>>> - Tested live migration with multiple VFIO devices assigned to a VM.
> >>>>>
> >>>>> v3 -> v4:
> >>>>> - Added one more bit for _RESUMING flag to be set explicitly.
> >>>>> - data_offset field is read-only for user space application.
> >>>>> - data_size is read on every iteration before reading data from the
> >>>>>   migration region; this removes the assumption that data extends to
> >>>>>   the end of the migration region.
> >>>>> - If the vendor driver supports mappable sparse regions, map those
> >>>>>   regions during the setup state of save/load, and unmap them in the
> >>>>>   cleanup routines.
> >>>>> - Handle a race condition that caused data corruption in the migration
> >>>>>   region during save of device state, by adding a mutex and serializing
> >>>>>   the save_buffer and get_dirty_pages routines.
> >>>>> - Skip calling the get_dirty_pages routine for mapped MMIO regions of
> >>>>>   the device.
> >>>>> - Added trace events.
> >>>>> - Split into multiple functional patches.
> >>>>>
> >>>>> v2 -> v3:
> >>>>> - Removed enum of VFIO device states. Defined VFIO device state with 2 bits.
> >>>>> - Re-structured vfio_device_migration_info to keep it minimal and defined action
> >>>>>   on read and write access on its members.
> >>>>>
> >>>>> v1 -> v2:
> >>>>> - Defined MIGRATION region type and sub-type which should be used with region
> >>>>>   type capability.
> >>>>> - Re-structured vfio_device_migration_info. This structure will be placed at 0th
> >>>>>   offset of migration region.
> >>>>> - Replaced ioctl with read/write for trapped part of migration region.
> >>>>> - Added both type of access support, trapped or mmapped, for data section of the
> >>>>>   region.
> >>>>> - Moved PCI device functions to pci file.
> >>>>> - Added iteration to get the dirty page bitmap until the bitmap for
> >>>>>   all requested pages is copied.
> >>>>>
> >>>>> Thanks,
> >>>>> Kirti
> >>>>>
> >>>>> Kirti Wankhede (13):
> >>>>>   vfio: KABI for migration interface
> >>>>>   vfio: Add function to unmap VFIO region
> >>>>>   vfio: Add vfio_get_object callback to VFIODeviceOps
> >>>>>   vfio: Add save and load functions for VFIO PCI devices
> >>>>>   vfio: Add migration region initialization and finalize function
> >>>>>   vfio: Add VM state change handler to know state of VM
> >>>>>   vfio: Add migration state change notifier
> >>>>>   vfio: Register SaveVMHandlers for VFIO device
> >>>>>   vfio: Add save state functions to SaveVMHandlers
> >>>>>   vfio: Add load state functions to SaveVMHandlers
> >>>>>   vfio: Add function to get dirty page list
> >>>>>   vfio: Add vfio_listerner_log_sync to mark dirty pages
> >>>>>   vfio: Make vfio-pci device migration capable.
> >>>>>
> >>>>>  hw/vfio/Makefile.objs         |   2 +-
> >>>>>  hw/vfio/common.c              |  55 +++
> >>>>>  hw/vfio/migration.c           | 874 ++++++++++++++++++++++++++++++++++++++++++
> >>>>>  hw/vfio/pci.c                 | 137 ++++++-
> >>>>>  hw/vfio/trace-events          |  19 +
> >>>>>  include/hw/vfio/vfio-common.h |  25 ++
> >>>>>  linux-headers/linux/vfio.h    | 166 ++++++++
> >>>>>  7 files changed, 1271 insertions(+), 7 deletions(-)
> >>>>>  create mode 100644 hw/vfio/migration.c
> >>>>>
> >>>>> -- 
> >>>>> 2.7.0
> >>>>>
> >>> --
> >>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device
  2019-07-12 17:42           ` Dr. David Alan Gilbert
@ 2019-07-15  0:35             ` Yan Zhao
  0 siblings, 0 replies; 77+ messages in thread
From: Yan Zhao @ 2019-07-15  0:35 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Zhengxiao.zx@Alibaba-inc.com, Tian, Kevin, Liu, Yi L, cjia,
	eskultet, Yang, Ziye, qemu-devel, cohuck, shuangtai.tst,
	alex.williamson, Wang, Zhi A, mlevitsk, pasic, aik,
	Kirti Wankhede, eauger, felipe, jonathan.davies, Liu, Changpeng,
	Ken.Xue

On Sat, Jul 13, 2019 at 01:42:39AM +0800, Dr. David Alan Gilbert wrote:
> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> > 
> > 
> > On 7/11/2019 9:53 PM, Dr. David Alan Gilbert wrote:
> > > * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > >> On Thu, Jul 11, 2019 at 06:50:12PM +0800, Dr. David Alan Gilbert wrote:
> > >>> * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > >>>> Hi Kirti,
> > >>>> There are still unaddressed comments to your patches v4.
> > >>>> Would you mind addressing them?
> > >>>>
> > >>>> 1. should we register two migration interfaces simultaneously
> > >>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04750.html)
> > >>>
> > >>> Please don't do this.
> > >>> As far as I'm aware we currently only have one device that does that
> > >>> (vmxnet3) and a patch has just been posted that fixes/removes that.
> > >>>
> > >>> Dave
> > >>>
> > >> hi Dave,
> > >> Thanks for notifying this. but if we want to support postcopy in future,
> > >> after device stops, what interface could we use to transfer data of
> > >> device state only?
> > >> for postcopy, when source device stops, we need to transfer only
> > >> necessary device state to target vm before target vm starts, and we
> > >> don't want to transfer device memory as we'll do that after target vm
> > >> resuming.
> > > 
> > > Hmm ok, lets see; that's got to happen in the call to:
> > >     qemu_savevm_state_complete_precopy(fb, false, false);
> > > that's made from postcopy_start.
> > >  (the false's are iterable_only and inactivate_disks)
> > > 
> > > and at that time I believe the state is POSTCOPY_ACTIVE, so in_postcopy
> > > is true.
> > > 
> > > If you're doing postcopy, then you'll probably define a has_postcopy()
> > > function, so qemu_savevm_state_complete_precopy will skip the
> > > save_live_complete_precopy call from its loop for at least two of the
> > > reasons in its big if.
> > > 
> > > So you're right; you need the VMSD for this to happen in the second
> > > loop in qemu_savevm_state_complete_precopy.  Hmm.
> > > 
> > > Now, what worries me, and I don't know the answer, is how the section
> > > header for the vmstate and the section header for an iteration look
> > > on the stream; how are they different?
> > > 
> > 
> > Not having a way to test postcopy migration is one of the major reasons
> > I did not include postcopy support in this patchset, as clearly called
> > out in the cover letter.
> > This patchset is thoroughly tested for precopy migration.
> > If anyone has hardware that supports faulting, then I would prefer to
> > add postcopy support as an incremental change later, which can be
> > tested before submitting.
> 
> Agreed; although I think Yan's right to think about how it's going to
> work.
> 
> > Just a suggestion, instead of using VMSD, is it possible to have some
> > additional check to call save_live_complete_precopy from
> > qemu_savevm_state_complete_precopy?
> 
> Probably yes; although as you can tell the logic in there is already
> pretty hairy.
> 
This might be a solution, but in that case lots of modules'
.save_live_complete_precopy implementations would need to be rewritten,
including that of RAM, and we would need to redefine the interface of
.save_live_complete_precopy.
Maybe we could add a new interface like .save_live_pre_postcopy in SaveVMHandlers?

Thanks
Yan

> Dave
> 
> > 
> > >>
> > >>>> 2. in each save iteration, how much data is to be saved
> > >>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04683.html)
> > 
> > > how big is the data_size ?
> > > if this size is too big, it may take too much time and block others.
> > 
> > I did mention this in the comment about the structure in the vfio.h
> > header. data_size will be provided by the vendor driver and obviously
> > will not be greater than the migration region size. The vendor driver
> > is responsible for keeping its solution optimized.
> > 
> > 
> > >>>> 3. do we need extra interface to get data for device state only
> > >>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04812.html)
> > 
> > I don't think so. Opaque device data from the vendor driver can include
> > device state and device memory. The vendor driver managing the device
> > can decide how to lay the data out on the stream.
> > 
> > >>>> 4. definition of dirty page copied_pfn
> > >>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg05592.html)
> > >>>>
> > 
> > This was part of the ongoing discussion with Alex; I addressed the
> > concern there. Please check the current patchset, which addresses the
> > concerns raised.
> > 
> > >>>> Also, I'm glad to see that you updated code by following my comments below,
> > >>>> but please don't forget to reply my comments next time:)
> > 
> > I tried to reply at the top of threads and addressed the common
> > concerns raised there. Sorry if I missed any; I'll make sure to point
> > you to my replies going forward.
> > 
> > Thanks,
> > Kirti
> > 
> > >>>> https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg05357.html
> > >>>> https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg06454.html
> > >>>>
> > >>>> Thanks
> > >>>> Yan
> > >>>>
> > >>>> On Tue, Jul 09, 2019 at 05:49:07PM +0800, Kirti Wankhede wrote:
> > >>>>> [full v7 cover letter snipped; quoted in full earlier in the thread]
> > >>>>>
> > >>> --
> > >>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > --
> > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 02/13] vfio: Add function to unmap VFIO region
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 02/13] vfio: Add function to unmap VFIO region Kirti Wankhede
@ 2019-07-16 16:29   ` Cornelia Huck
  2019-07-18 18:54     ` Kirti Wankhede
  0 siblings, 1 reply; 77+ messages in thread
From: Cornelia Huck @ 2019-07-16 16:29 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang, qemu-devel,
	Zhengxiao.zx, shuangtai.tst, dgilbert, zhi.a.wang, mlevitsk,
	pasic, aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Tue, 9 Jul 2019 15:19:09 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> This function is used in a following patch in this series.

"This function will be used for the migration region." ?

("This series" will be a bit confusing when this has been merged :)

> Migration region is mmaped when migration starts and will be unmapped when
> migration is complete.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/common.c              | 20 ++++++++++++++++++++
>  hw/vfio/trace-events          |  1 +
>  include/hw/vfio/vfio-common.h |  1 +
>  3 files changed, 22 insertions(+)

Reviewed-by: Cornelia Huck <cohuck@redhat.com>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 03/13] vfio: Add vfio_get_object callback to VFIODeviceOps
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 03/13] vfio: Add vfio_get_object callback to VFIODeviceOps Kirti Wankhede
@ 2019-07-16 16:32   ` Cornelia Huck
  0 siblings, 0 replies; 77+ messages in thread
From: Cornelia Huck @ 2019-07-16 16:32 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang, qemu-devel,
	Zhengxiao.zx, shuangtai.tst, dgilbert, zhi.a.wang, mlevitsk,
	pasic, aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Tue, 9 Jul 2019 15:19:10 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Hook vfio_get_object callback for PCI devices.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> Suggested-by: Cornelia Huck <cohuck@redhat.com>
> ---
>  hw/vfio/pci.c                 | 8 ++++++++
>  include/hw/vfio/vfio-common.h | 1 +
>  2 files changed, 9 insertions(+)

Reviewed-by: Cornelia Huck <cohuck@redhat.com>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 01/13] vfio: KABI for migration interface
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 01/13] vfio: KABI for migration interface Kirti Wankhede
@ 2019-07-16 20:56   ` Alex Williamson
  2019-07-17 11:55     ` Cornelia Huck
                       ` (2 more replies)
  0 siblings, 3 replies; 77+ messages in thread
From: Alex Williamson @ 2019-07-16 20:56 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Tue, 9 Jul 2019 15:19:08 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> - Defined MIGRATION region type and sub-type.
> - Used 3 bits to define VFIO device states.
>     Bit 0 => _RUNNING
>     Bit 1 => _SAVING
>     Bit 2 => _RESUMING
>     Combination of these bits defines VFIO device's state during migration
>     _STOPPED => All bits 0 indicates VFIO device stopped.
>     _RUNNING => Normal VFIO device running state.
>     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but start
>                           saving state of device i.e. pre-copy state
>     _SAVING  => vCPUs are stopped, VFIO device should be stopped, and
>                           save device state,i.e. stop-n-copy state
>     _RESUMING => VFIO device resuming state.
>     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING bits are set
> - Defined vfio_device_migration_info structure which will be placed at 0th
>   offset of migration region to get/set VFIO device related information.
>   Defined members of structure and usage on read/write access:
>     * device_state: (read/write)
>         To convey VFIO device state to be transitioned to. Only 3 bits are used
>         as of now.
>     * pending bytes: (read only)
>         To get pending bytes yet to be migrated for VFIO device.
>     * data_offset: (read only)
>         To get data offset in migration from where data exist during _SAVING
>         and from where data should be written by user space application during
>          _RESUMING state
>     * data_size: (read/write)
>         To get and set size of data copied in migration region during _SAVING
>         and _RESUMING state.
>     * start_pfn, page_size, total_pfns: (write only)
>         To get bitmap of dirty pages from vendor driver from given
>         start address for total_pfns.
>     * copied_pfns: (read only)
>         To get number of pfns bitmap copied in migration region.
>         Vendor driver should copy the bitmap with bits set only for
>         pages to be marked dirty in migration region. Vendor driver
>         should return 0 if there are 0 pages dirty in requested
>         range. Vendor driver should return -1 to mark all pages in the section
>         as dirty
> 
> Migration region looks like:
>  ------------------------------------------------------------------
> |vfio_device_migration_info|    data section                      |
> |                          |     ///////////////////////////////  |
>  ------------------------------------------------------------------
>  ^                              ^                              ^
>  offset 0-trapped part        data_offset                 data_size
> 
> The vfio_device_migration_info structure is always followed by the
> data section in the region, so data_offset will always be non-zero.
> The offset from where data is copied is decided by the kernel driver;
> the data section can be trapped or mapped depending on how the kernel
> driver defines it. If mmapped, then data_offset should be page
> aligned, whereas the initial section, which contains the
> vfio_device_migration_info structure, might not end at a page-aligned
> offset.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  linux-headers/linux/vfio.h | 166 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 166 insertions(+)
> 
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 24f505199f83..6696a4600545 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -372,6 +372,172 @@ struct vfio_region_gfx_edid {
>   */
>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
>  
> +/* Migration region type and sub-type */
> +#define VFIO_REGION_TYPE_MIGRATION	        (2)

Region type #2 is already claimed by VFIO_REGION_TYPE_CCW, so this would
need to be #3 or greater (we should have a reference table somewhere in
this header as it gets easier to miss claimed entries as the sprawl
grows).

> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> +
> +/**
> + * Structure vfio_device_migration_info is placed at 0th offset of
> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> + * information. Field accesses from this structure are only supported at their
> + * native width and alignment, otherwise should return error.

This seems like a good unit test, a userspace driver that performs
unaligned accesses to this space.  I'm afraid the wording above might
suggest that if there's no error it must work though, which might put
us in sticky support situations.  Should we say:

s/should return error/the result is undefined and vendor drivers should
return an error/

> + *
> + * device_state: (read/write)
> + *      To indicate vendor driver the state VFIO device should be transitioned
> + *      to. If device state transition fails, write on this field return error.
> + *      It consists of 3 bits:
> + *      - If bit 0 set, indicates _RUNNING state. When its reset, that indicates
> + *        _STOPPED state. When device is changed to _STOPPED, driver should stop
> + *        device before write() returns.
> + *      - If bit 1 set, indicates _SAVING state.
> + *      - If bit 2 set, indicates _RESUMING state.
> + *      _SAVING and _RESUMING set at the same time is invalid state.

I think in the previous version there was a question of how we handle
yet-to-be-defined bits.  For instance, if we defined a
SUBTYPE_MIGRATIONv2 with the intention of making it backwards
compatible with this version, do we declare the undefined bits as
preserved so that the user should do a read-modify-write operation?

> + * pending bytes: (read only)
> + *      Number of pending bytes yet to be migrated from vendor driver

Is this for _SAVING, _RESUMING, or both?

> + *
> + * data_offset: (read only)
> + *      User application should read data_offset in migration region from where
> + *      user application should read device data during _SAVING state or write
> + *      device data during _RESUMING state or read dirty pages bitmap. See below
> + *      for detail of sequence to be followed.
> + *
> + * data_size: (read/write)
> + *      User application should read data_size to get size of data copied in
> + *      migration region during _SAVING state and write size of data copied in
> + *      migration region during _RESUMING state.
> + *
> + * start_pfn: (write only)
> + *      Start address pfn to get bitmap of dirty pages from vendor driver duing
> + *      _SAVING state.

There are some subtleties in PFN that I'm not sure we're accounting for
here.  Devices operate in an IOVA space, which is defined by DMA_MAP
calls.  The user says this IOVA maps to this process virtual address.
When there is no vIOMMU, we can *assume* that IOVA ~= GPA and therefore
this interface provides dirty gfns.  However when we have a vIOMMU, we
don't know the IOVA to GPA mapping, right?  So is it expected that the
user is calling this with GFNs relative to the device address space
(IOVA) or relative to the VM address space (GPA)?  For the kernel
internal mdev interface, the pin pages API is always operating in the
device view and I think never cares if those are IOVA or GPA.

> + *
> + * page_size: (write only)
> + *      User application should write the page_size of pfn.
> + *
> + * total_pfns: (write only)
> + *      Total pfn count from start_pfn for which dirty bitmap is requested.
> + *
> + * copied_pfns: (read only)
> + *      pfn count for which dirty bitmap is copied to migration region.
> + *      Vendor driver should copy the bitmap with bits set only for pages to be
> + *      marked dirty in migration region.
> + *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if none of the
> + *        pages are dirty in requested range or rest of the range.
> + *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to mark all
> + *        pages dirty in the given section.

Does this have the same semantics as _NONE in being able to use it to
report "all the remaining unreported pfns are dirty"?

> + *      - Vendor driver should return pfn count for which bitmap is written in
> + *        the region.
> + *
> + * Migration region looks like:
> + *  ------------------------------------------------------------------
> + * |vfio_device_migration_info|    data section                      |
> + * |                          |     ///////////////////////////////  |
> + * ------------------------------------------------------------------
> + *   ^                              ^                              ^
> + *  offset 0-trapped part        data_offset                 data_size
> + *
> + * Data section is always followed by vfio_device_migration_info structure
> + * in the region, so data_offset will always be none-0. Offset from where data

s/none-0/non-0/  Or better, non-zero

> + * is copied is decided by kernel driver, data section can be trapped or
> + * mapped depending on how kernel driver defines data section. If mmapped,
> + * then data_offset should be page aligned, where as initial section which
> + * contain vfio_device_migration_info structure might not end at offset which
> + * is page aligned.
> + * Data_offset can be same or different for device data and dirty page bitmap.
> + * Vendor driver should decide whether to partition data section and how to
> + * partition the data section. Vendor driver should return data_offset
> + * accordingly.

I think we also want to talk about how the mmap support within this
region is defined by a sparse mmap capability (this is required if
any of it is mmap capable to support the non-mmap'd header) and the
vendor driver can make portions of the data section mmap'able and
others not.  I believe (unless we want to require otherwise) that the
data_offset to data_offset+data_size range can arbitrarily span mmap
supported sections to meet the vendor driver's needs.

> + *
> + * Sequence to be followed:
> + * In _SAVING|_RUNNING device state or pre-copy phase:
> + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
> + * b. read data_offset, indicates kernel driver to write data to staging buffer
> + *    which is mmapped.

There's no requirement that it be mmap'd, right?  The vendor driver has
the choice whether to support mmap, the user has the choice whether to
access via mmap or read/write.

> + * c. read data_size, amount of data in bytes written by vendor driver in
> + *    migration region.
> + * d. if data section is trapped, read from data_offset of data_size.
> + * e. if data section is mmaped, read data_size bytes from mmaped buffer from
> + *    data_offset in the migration region.

Is it really necessary to specify these separately?  The user should
read from data_offset to data_offset+data_size, optionally via direct
mapped buffer as supported by the sparse mmap support within the region.

> + * f. Write data_size and data to file stream.

This is not really part of our specification, the user does whatever
they want with the data.

> + * g. iterate through steps a to f while (pending_bytes > 0)

Is the read of pending_bytes an implicit indication to the vendor
driver that the data area has been consumed?  If so, should this
sequence always end with a read of pending_bytes to indicate to the
vendor driver to flush that data?  I'm assuming there will be a gap where
the user reads save data from the device, does other things, and comes
back to read more data.

What assumptions, if any, can the user make about pending_bytes?  For
instance, if the device is _RUNNING, I assume no assumptions can be
made, maybe with the exception that it represents the minimum pending
state at that instant of time.  The rate at which we're approaching
convergence might be inferred, but any method to determine that would
be beyond the scope here.

> + * In _SAVING device state or stop-and-copy phase:
> + * a. read config space of device and save to migration file stream. This
> + *    doesn't need to be from vendor driver. Any other special config state
> + *    from driver can be saved as data in following iteration.

This is beyond the scope of the migration interface here (and config
space is PCI specific).

> + * b. read pending_bytes.
> + * c. read data_offset, indicates kernel driver to write data to staging
> + *    buffer which is mmapped.

Or not.

> + * d. read data_size, amount of data in bytes written by vendor driver in
> + *    migration region.
> + * e. if data section is trapped, read from data_offset of data_size.
> + * f. if data section is mmaped, read data_size bytes from mmaped buffer from
> + *    data_offset in the migration region.

Same comment as above.

> + * g. Write data_size and data to file stream

Outside of the scope.

> + * h. iterate through steps b to g while (pending_bytes > 0)

Same question regarding indicating to vendor driver that the buffer has
been consumed.

> + *
> + * When data region is mapped, its user's responsibility to read data from
> + * data_offset of data_size before moving to next steps.

Do we really want to condition this on being mmap'd?  This implies that
when it is not mmap'd the vendor driver tracks the accesses to make
sure that it was consumed?

> + * Dirty page tracking is part of RAM copy state, where vendor driver
> + * provides the bitmap of pages which are dirtied by vendor driver through
> + * migration region and as part of RAM copy those pages gets copied to file
> + * stream.

We're mixing QEMU/VM use cases here, this is only the kernel interface
spec, which can be used for such things, but is not tied to them.  RAM
ties to the previous question of the address space and implies we're
operating in the GFN space while the device really only knows about the
IOVA space.

> + *
> + * To get dirty page bitmap:
> + * a. write start_pfn, page_size and total_pfns.

Is it required to write every field every time?  For instance page_size
seems like it should only ever need to be written once.  Is there any
ordering required?  It seems like step b) initiates the vendor driver
to consume these fields, but that's not specified below.

> + * b. read copied_pfns.
> + *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if driver
> + *       doesn't have any page to report dirty in given range or rest of the
> + *       range. Exit loop.
> + *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to mark all
> + *       pages dirty for given range. Mark all pages in the range as dirty and
> + *       exit the loop.
> + *     - Vendor driver should return copied_pfns and provide bitmap for
> + *       copied_pfn, which means that bitmap copied for given range contains
> + *       information for all pages where some bits are 0s and some are 1s.
> + * c. read data_offset, where vendor driver has written bitmap.
> + * d. read bitmap from the region or mmaped part of the region.
> + * e. Iterate through steps a to d while (total copied_pfns < total_pfns)

I thought there was some automatic iteration built into this interface,
is that dropped?  The user is now expected to do start_pfn +=
copied_pfns and total_pfns -= copied_pfns themselves?  Does anything
indicate to the vendor driver when the data area has been consumed such
that resources can be released?

> + *
> + * In _RESUMING device state:
> + * - Load device config state.

Out of scope.

> + * - While end of data for this device is not reached, repeat below steps:
> + *      - read data_size from file stream, read data from file stream of
> + *        data_size.

Out of scope, how the user gets the data is a userspace implementation
detail.  I think the important detail here is simply that each data
transaction from the _SAVING process is indivisible and must translate
to a _RESUMING transaction here.

> + *      - read data_offset from where User application should write data.
> + *          if region is mmaped, write data of data_size to mmaped region.
> + *      - write data_size.
> + *          In case of mmapped region, write on data_size indicates kernel
> + *          driver that data is written in staging buffer.
> + *      - if region is trapped, write data of data_size from data_offset.

Gack!  We need something better here, the sequence should be the same
regardless of the mechanism used to write the data.

It still confuses me how the resuming side can know where (data_offset)
the incoming data should be written.  If we're migrating a !_RUNNING
device, then I can see how some portion of the device might be directly
mmap'd and the sequence would be very deterministic.  But if we're
migrating a _RUNNING device, wouldn't the current data block depend on
what portions of the device are active, which would be difficult to
predict?

> + *
> + * For user application, data is opaque. User should write data in the same
> + * order as received.
> + */
> +
> +struct vfio_device_migration_info {
> +        __u32 device_state;         /* VFIO device state */
> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> +                                     VFIO_DEVICE_STATE_SAVING | \
> +                                     VFIO_DEVICE_STATE_RESUMING)

Yes, we have the mask in here now, but no mention above how the user
should handle undefined bits.  Thanks,

Alex

> +#define VFIO_DEVICE_STATE_INVALID   (VFIO_DEVICE_STATE_SAVING | \
> +                                     VFIO_DEVICE_STATE_RESUMING)
> +        __u32 reserved;
> +        __u64 pending_bytes;
> +        __u64 data_offset;
> +        __u64 data_size;
> +        __u64 start_pfn;
> +        __u64 page_size;
> +        __u64 total_pfns;
> +        __u64 copied_pfns;
> +#define VFIO_DEVICE_DIRTY_PFNS_NONE     (0)
> +#define VFIO_DEVICE_DIRTY_PFNS_ALL      (~0ULL)
> +} __attribute__((packed));
> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 04/13] vfio: Add save and load functions for VFIO PCI devices
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 04/13] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
  2019-07-11 12:07   ` Dr. David Alan Gilbert
@ 2019-07-16 21:14   ` Alex Williamson
  2019-07-17  9:10     ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 77+ messages in thread
From: Alex Williamson @ 2019-07-16 21:14 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Tue, 9 Jul 2019 15:19:11 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> These functions save and restore PCI device specific data - config
> space of PCI device.
> Tested save and restore with MSI and MSIX type.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/pci.c                 | 114 ++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |   2 +
>  2 files changed, 116 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index de0d286fc9dd..5fe4f8076cac 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2395,11 +2395,125 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
>      return OBJECT(vdev);
>  }
>  
> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> +{
> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint16_t pci_cmd;
> +    int i;
> +
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        uint32_t bar;
> +
> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> +        qemu_put_be32(f, bar);
> +    }
> +
> +    qemu_put_be32(f, vdev->interrupt);
> +    if (vdev->interrupt == VFIO_INT_MSI) {
> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> +        bool msi_64bit;
> +
> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                                            2);
> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> +
> +        msi_addr_lo = pci_default_read_config(pdev,
> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> +        qemu_put_be32(f, msi_addr_lo);
> +
> +        if (msi_64bit) {
> +            msi_addr_hi = pci_default_read_config(pdev,
> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                                             4);
> +        }
> +        qemu_put_be32(f, msi_addr_hi);
> +
> +        msi_data = pci_default_read_config(pdev,
> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +                2);
> +        qemu_put_be32(f, msi_data);
> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
> +        uint16_t offset;
> +
> +        /* save enable bit and maskall bit */
> +        offset = pci_default_read_config(pdev,
> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> +        qemu_put_be16(f, offset);
> +        msix_save(pdev, f);
> +    }
> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> +    qemu_put_be16(f, pci_cmd);
> +}
> +
> +static void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> +{
> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint32_t interrupt_type;
> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> +    uint16_t pci_cmd;
> +    bool msi_64bit;
> +    int i;
> +
> +    /* retore pci bar configuration */
> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> +                        pci_cmd & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        uint32_t bar = qemu_get_be32(f);
> +
> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> +    }
> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> +                          pci_cmd | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
> +
> +    interrupt_type = qemu_get_be32(f);
> +
> +    if (interrupt_type == VFIO_INT_MSI) {
> +        /* restore msi configuration */
> +        msi_flags = pci_default_read_config(pdev,
> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> +
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                              msi_flags & (!PCI_MSI_FLAGS_ENABLE), 2);
> +
> +        msi_addr_lo = qemu_get_be32(f);
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
> +                              msi_addr_lo, 4);
> +
> +        msi_addr_hi = qemu_get_be32(f);
> +        if (msi_64bit) {
> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                                  msi_addr_hi, 4);
> +        }
> +        msi_data = qemu_get_be32(f);
> +        vfio_pci_write_config(pdev,
> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +                msi_data, 2);
> +
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
> +    } else if (interrupt_type == VFIO_INT_MSIX) {
> +        uint16_t offset = qemu_get_be16(f);
> +
> +        /* load enable bit and maskall bit */
> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> +                              offset, 2);
> +        msix_load(pdev, f);
> +    }
> +    pci_cmd = qemu_get_be16(f);
> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
> +}

Pardon my migration ignorance, but there are bound to be more fields
and capabilities that get migrated over time.  How does this get
extended to support that and maintain backwards compatibility?  Thanks,

Alex

> +
>  static VFIODeviceOps vfio_pci_ops = {
>      .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
>      .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
>      .vfio_eoi = vfio_intx_eoi,
>      .vfio_get_object = vfio_pci_get_object,
> +    .vfio_save_config = vfio_pci_save_config,
> +    .vfio_load_config = vfio_pci_load_config,
>  };
>  
>  int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 771b6d59a3db..ee72bd984a36 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -120,6 +120,8 @@ struct VFIODeviceOps {
>      int (*vfio_hot_reset_multi)(VFIODevice *vdev);
>      void (*vfio_eoi)(VFIODevice *vdev);
>      Object *(*vfio_get_object)(VFIODevice *vdev);
> +    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
> +    void (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
>  };
>  
>  typedef struct VFIOGroup {




* Re: [Qemu-devel] [PATCH v7 05/13] vfio: Add migration region initialization and finalize function
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 05/13] vfio: Add migration region initialization and finalize function Kirti Wankhede
@ 2019-07-16 21:37   ` Alex Williamson
  2019-07-18 20:19     ` Kirti Wankhede
  2019-07-23 12:52   ` Cornelia Huck
  1 sibling, 1 reply; 77+ messages in thread
From: Alex Williamson @ 2019-07-16 21:37 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Tue, 9 Jul 2019 15:19:12 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> - Migration functions are implemented for VFIO_DEVICE_TYPE_PCI device in this
>   patch series.
> - VFIO device supports migration or not is decided based of migration region
>   query. If migration region query is successful and migration region
>   initialization is successful then migration is supported else migration is
>   blocked.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/migration.c           | 145 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |   3 +
>  include/hw/vfio/vfio-common.h |  14 ++++
>  4 files changed, 163 insertions(+), 1 deletion(-)
>  create mode 100644 hw/vfio/migration.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index abad8b818c9b..36033d1437c5 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -1,4 +1,4 @@
> -obj-y += common.o spapr.o
> +obj-y += common.o spapr.o migration.o
>  obj-$(CONFIG_VFIO_PCI) += pci.o pci-quirks.o display.o
>  obj-$(CONFIG_VFIO_CCW) += ccw.o
>  obj-$(CONFIG_VFIO_PLATFORM) += platform.o
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> new file mode 100644
> index 000000000000..a2cfbd5af2e1
> --- /dev/null
> +++ b/hw/vfio/migration.c
> @@ -0,0 +1,145 @@
> +/*
> + * Migration support for VFIO devices
> + *
> + * Copyright NVIDIA, Inc. 2019
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include <linux/vfio.h>
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "cpu.h"
> +#include "migration/migration.h"
> +#include "migration/qemu-file.h"
> +#include "migration/register.h"
> +#include "migration/blocker.h"
> +#include "migration/misc.h"
> +#include "qapi/error.h"
> +#include "exec/ramlist.h"
> +#include "exec/ram_addr.h"
> +#include "pci.h"
> +#include "trace.h"
> +
> +static void vfio_migration_region_exit(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    if (!migration) {
> +        return;
> +    }
> +
> +    if (migration->region.buffer.size) {

Having a VFIORegion named buffer within a struct named region is
unnecessarily confusing.  Please fix.

> +        vfio_region_exit(&migration->region.buffer);
> +        vfio_region_finalize(&migration->region.buffer);
> +    }
> +}
> +
> +static int vfio_migration_region_init(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    Object *obj = NULL;
> +    int ret = -EINVAL;
> +
> +    if (!migration) {
> +        return ret;
> +    }
> +
> +    if (!vbasedev->ops || !vbasedev->ops->vfio_get_object) {
> +        return ret;
> +    }
> +
> +    obj = vbasedev->ops->vfio_get_object(vbasedev);
> +    if (!obj) {
> +        return ret;
> +    }
> +
> +    ret = vfio_region_setup(obj, vbasedev, &migration->region.buffer,
> +                            migration->region.index, "migration");
> +    if (ret) {
> +        error_report("%s: Failed to setup VFIO migration region %d: %s",
> +                     vbasedev->name, migration->region.index, strerror(-ret));
> +        goto err;
> +    }
> +
> +    if (!migration->region.buffer.size) {
> +        ret = -EINVAL;
> +        error_report("%s: Invalid region size of VFIO migration region %d: %s",
> +                     vbasedev->name, migration->region.index, strerror(-ret));
> +        goto err;
> +    }
> +
> +    return 0;
> +
> +err:
> +    vfio_migration_region_exit(vbasedev);
> +    return ret;
> +}
> +
> +static int vfio_migration_init(VFIODevice *vbasedev,
> +                               struct vfio_region_info *info)
> +{
> +    int ret;
> +
> +    vbasedev->migration = g_new0(VFIOMigration, 1);
> +    vbasedev->migration->region.index = info->index;

VFIORegion already caches the region index, VFIORegion.nr.

> +
> +    ret = vfio_migration_region_init(vbasedev);
> +    if (ret) {
> +        error_report("%s: Failed to initialise migration region",
> +                     vbasedev->name);
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
> +/* ---------------------------------------------------------------------- */
> +
> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
> +{
> +    struct vfio_region_info *info;
> +    Error *local_err = NULL;
> +    int ret;
> +
> +    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
> +                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
> +    if (ret) {
> +        goto add_blocker;
> +    }
> +
> +    ret = vfio_migration_init(vbasedev, info);
> +    if (ret) {
> +        goto add_blocker;
> +    }
> +
> +    trace_vfio_migration_probe(vbasedev->name, info->index);
> +    return 0;
> +
> +add_blocker:
> +    error_setg(&vbasedev->migration_blocker,
> +               "VFIO device doesn't support migration"); 
> +    ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        error_free(vbasedev->migration_blocker);
> +    }
> +    return ret;
> +}
> +
> +void vfio_migration_finalize(VFIODevice *vbasedev)
> +{
> +    if (!vbasedev->migration) {
> +        return;

Don't we allocate migration_blocker even for this case?  Thanks,

Alex

> +    }
> +
> +    if (vbasedev->migration_blocker) {
> +        migrate_del_blocker(vbasedev->migration_blocker);
> +        error_free(vbasedev->migration_blocker);
> +    }
> +
> +    vfio_migration_region_exit(vbasedev);
> +    g_free(vbasedev->migration);
> +}
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 8cdc27946cb8..191a726a1312 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -143,3 +143,6 @@ vfio_display_edid_link_up(void) ""
>  vfio_display_edid_link_down(void) ""
>  vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
>  vfio_display_edid_write_error(void) ""
> +
> +# migration.c
> +vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index ee72bd984a36..152da3f8d6f3 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -57,6 +57,15 @@ typedef struct VFIORegion {
>      uint8_t nr; /* cache the region number for debug */
>  } VFIORegion;
>  
> +typedef struct VFIOMigration {
> +    struct {
> +        VFIORegion buffer;
> +        uint32_t index;
> +    } region;
> +    uint64_t pending_bytes;
> +    QemuMutex lock;
> +} VFIOMigration;
> +
>  typedef struct VFIOAddressSpace {
>      AddressSpace *as;
>      QLIST_HEAD(, VFIOContainer) containers;
> @@ -113,6 +122,8 @@ typedef struct VFIODevice {
>      unsigned int num_irqs;
>      unsigned int num_regions;
>      unsigned int flags;
> +    VFIOMigration *migration;
> +    Error *migration_blocker;
>  } VFIODevice;
>  
>  struct VFIODeviceOps {
> @@ -204,4 +215,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
>  int vfio_spapr_remove_window(VFIOContainer *container,
>                               hwaddr offset_within_address_space);
>  
> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
> +void vfio_migration_finalize(VFIODevice *vbasedev);
> +
>  #endif /* HW_VFIO_VFIO_COMMON_H */




* Re: [Qemu-devel] [PATCH v7 06/13] vfio: Add VM state change handler to know state of VM
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 06/13] vfio: Add VM state change handler to know state of VM Kirti Wankhede
  2019-07-11 12:13   ` Dr. David Alan Gilbert
@ 2019-07-16 22:03   ` Alex Williamson
  2019-07-22  8:37   ` Yan Zhao
  2 siblings, 0 replies; 77+ messages in thread
From: Alex Williamson @ 2019-07-16 22:03 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Tue, 9 Jul 2019 15:19:13 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> VM state change handler gets called on change in VM's state. This is used to set
> VFIO device state to _RUNNING.
> VM state change handler, migration state change handler and log_sync listener
> are called asynchronously, which sometimes lead to data corruption in migration
> region. Initialised mutex that is used to serialize operations on migration data
> region during saving state.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c           | 64 +++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |  2 ++
>  include/hw/vfio/vfio-common.h |  4 +++
>  3 files changed, 70 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index a2cfbd5af2e1..c01f08b659d0 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -78,6 +78,60 @@ err:
>      return ret;
>  }
>  
> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    uint32_t device_state;
> +    int ret = 0;
> +
> +    device_state = (state & VFIO_DEVICE_STATE_MASK) |
> +                   (vbasedev->device_state & ~VFIO_DEVICE_STATE_MASK);
> +
> +    if ((device_state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_INVALID) {
> +        return -EINVAL;
> +    }
> +
> +    ret = pwrite(vbasedev->fd, &device_state, sizeof(device_state),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              device_state));
> +    if (ret < 0) {
> +        error_report("%s: Failed to set device state %d %s",
> +                     vbasedev->name, ret, strerror(errno));
> +        return ret;
> +    }
> +
> +    vbasedev->device_state = device_state;

Do we need to re-read device_state after error?  We defined _SAVING |
_RESUMING as STATE_INVALID, is that only for user writes, ie. the
device can never transition to that state to indicate a fault?  I was
thinking that was one of its use cases.  Thanks,

Alex

> +    trace_vfio_migration_set_state(vbasedev->name, device_state);
> +    return 0;
> +}
> +
> +static void vfio_vmstate_change(void *opaque, int running, RunState state)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    if ((vbasedev->vm_running != running)) {
> +        int ret;
> +        uint32_t dev_state;
> +
> +        if (running) {
> +            dev_state = VFIO_DEVICE_STATE_RUNNING;
> +        } else {
> +            dev_state = (vbasedev->device_state & VFIO_DEVICE_STATE_MASK) &
> +                     ~VFIO_DEVICE_STATE_RUNNING;
> +        }
> +
> +        ret = vfio_migration_set_state(vbasedev, dev_state);
> +        if (ret) {
> +            error_report("%s: Failed to set device state 0x%x",
> +                         vbasedev->name, dev_state);
> +        }
> +        vbasedev->vm_running = running;
> +        trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
> +                                  dev_state);
> +    }
> +}
> +
>  static int vfio_migration_init(VFIODevice *vbasedev,
>                                 struct vfio_region_info *info)
>  {
> @@ -93,6 +147,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>          return ret;
>      }
>  
> +    qemu_mutex_init(&vbasedev->migration->lock);
> +
> +    vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
> +                                                          vbasedev);
> +
>      return 0;
>  }
>  
> @@ -135,11 +194,16 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
>          return;
>      }
>  
> +    if (vbasedev->vm_state) {
> +        qemu_del_vm_change_state_handler(vbasedev->vm_state);
> +    }
> +
>      if (vbasedev->migration_blocker) {
>          migrate_del_blocker(vbasedev->migration_blocker);
>          error_free(vbasedev->migration_blocker);
>      }
>  
> +    qemu_mutex_destroy(&vbasedev->migration->lock);
>      vfio_migration_region_exit(vbasedev);
>      g_free(vbasedev->migration);
>  }
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 191a726a1312..3d15bacd031a 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -146,3 +146,5 @@ vfio_display_edid_write_error(void) ""
>  
>  # migration.c
>  vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
> +vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
> +vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 152da3f8d6f3..f6c70db3a9c1 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -29,6 +29,7 @@
>  #ifdef CONFIG_LINUX
>  #include <linux/vfio.h>
>  #endif
> +#include "sysemu/sysemu.h"
>  
>  #define VFIO_MSG_PREFIX "vfio %s: "
>  
> @@ -124,6 +125,9 @@ typedef struct VFIODevice {
>      unsigned int flags;
>      VFIOMigration *migration;
>      Error *migration_blocker;
> +    uint32_t device_state;
> +    VMChangeStateEntry *vm_state;
> +    int vm_running;
>  } VFIODevice;
>  
>  struct VFIODeviceOps {




* Re: [Qemu-devel] [PATCH v7 07/13] vfio: Add migration state change notifier
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 07/13] vfio: Add migration state change notifier Kirti Wankhede
@ 2019-07-17  2:25   ` Yan Zhao
  2019-08-20 20:24     ` Kirti Wankhede
  0 siblings, 1 reply; 77+ messages in thread
From: Yan Zhao @ 2019-07-17  2:25 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Tue, Jul 09, 2019 at 05:49:14PM +0800, Kirti Wankhede wrote:
> Added migration state change notifier to get notification on migration state
> change. These states are translated to VFIO device state and conveyed to vendor
> driver.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c           | 54 +++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |  1 +
>  include/hw/vfio/vfio-common.h |  1 +
>  3 files changed, 56 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index c01f08b659d0..e4a89a6f9bc7 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -132,6 +132,53 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state)
>      }
>  }
>  
> +static void vfio_migration_state_notifier(Notifier *notifier, void *data)
> +{
> +    MigrationState *s = data;
> +    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
> +    int ret;
> +
> +    trace_vfio_migration_state_notifier(vbasedev->name, s->state);
> +
> +    switch (s->state) {
> +    case MIGRATION_STATUS_ACTIVE:
> +        if (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING) {
> +            if (vbasedev->vm_running) {
> +                ret = vfio_migration_set_state(vbasedev,
> +                          VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING);
> +                if (ret) {
> +                    error_report("%s: Failed to set state RUNNING and SAVING",
> +                                  vbasedev->name);
> +                }
> +            } else {
> +                ret = vfio_migration_set_state(vbasedev,
> +                                               VFIO_DEVICE_STATE_SAVING);
> +                if (ret) {
> +                    error_report("%s: Failed to set state STOP and SAVING",
> +                                 vbasedev->name);
> +                }
> +            }
> +        } else {
> +            ret = vfio_migration_set_state(vbasedev,
> +                                           VFIO_DEVICE_STATE_RESUMING);
> +            if (ret) {
> +                error_report("%s: Failed to set state RESUMING",
> +                             vbasedev->name);
> +            }
> +        }
> +        return;
> +
hi Kirti
Currently, migration state notifiers are only notified from the following 3
interfaces: migrate_fd_connect, migrate_fd_cleanup, and postcopy_start, where
MIGRATION_STATUS_ACTIVE is not a valid state.
Have you tested the above code? What is its purpose?

Thanks
Yan

> +    case MIGRATION_STATUS_CANCELLING:
> +    case MIGRATION_STATUS_CANCELLED:
> +    case MIGRATION_STATUS_FAILED:
> +        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING);
> +        if (ret) {
> +            error_report("%s: Failed to set state RUNNING", vbasedev->name);
> +        }
> +        return;
> +    }
> +}
> +
>  static int vfio_migration_init(VFIODevice *vbasedev,
>                                 struct vfio_region_info *info)
>  {
> @@ -152,6 +199,9 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>      vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>                                                            vbasedev);
>  
> +    vbasedev->migration_state.notify = vfio_migration_state_notifier;
> +    add_migration_state_change_notifier(&vbasedev->migration_state);
> +
>      return 0;
>  }
>  
> @@ -194,6 +244,10 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
>          return;
>      }
>  
> +    if (vbasedev->migration_state.notify) {
> +        remove_migration_state_change_notifier(&vbasedev->migration_state);
> +    }
> +
>      if (vbasedev->vm_state) {
>          qemu_del_vm_change_state_handler(vbasedev->vm_state);
>      }
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 3d15bacd031a..69503228f20e 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -148,3 +148,4 @@ vfio_display_edid_write_error(void) ""
>  vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
>  vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
>  vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> +vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index f6c70db3a9c1..a022484d2636 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -128,6 +128,7 @@ typedef struct VFIODevice {
>      uint32_t device_state;
>      VMChangeStateEntry *vm_state;
>      int vm_running;
> +    Notifier migration_state;
>  } VFIODevice;
>  
>  struct VFIODeviceOps {
> -- 
> 2.7.0
> 


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 09/13] vfio: Add save state functions to SaveVMHandlers
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 09/13] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
  2019-07-12  2:44   ` Yan Zhao
@ 2019-07-17  2:50   ` Yan Zhao
  2019-08-20 20:30     ` Kirti Wankhede
  1 sibling, 1 reply; 77+ messages in thread
From: Yan Zhao @ 2019-07-17  2:50 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Tue, Jul 09, 2019 at 05:49:16PM +0800, Kirti Wankhede wrote:
> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
> functions. These functions handle the pre-copy and stop-and-copy phases.
> 
> In _SAVING|_RUNNING device state or pre-copy phase:
> - read pending_bytes
> - read data_offset - directs the kernel driver to write data to the staging
>   buffer, which is mmapped.
> - read data_size - amount of data in bytes written by vendor driver in migration
>   region.
> - if data section is trapped, pread() from data_offset of data_size.
> - if data section is mmaped, read mmaped buffer of data_size.
> - Write data packet to file stream as below:
> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> VFIO_MIG_FLAG_END_OF_STATE }
> 
> In _SAVING device state or stop-and-copy phase
> a. read config space of device and save to migration file stream. This
>    doesn't need to come from the vendor driver. Any other special config
>    state from the driver can be saved as data in the following iterations.
> b. read pending_bytes
> c. read data_offset - directs the kernel driver to write data to the staging
>    buffer, which is mmapped.
> d. read data_size - amount of data in bytes written by vendor driver in
>    migration region.
> e. if data section is trapped, pread() from data_offset of data_size.
> f. if data section is mmaped, read mmaped buffer of data_size.
> g. Write data packet as below:
>    {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> h. iterate through steps b to g while (pending_bytes > 0)
> i. Write {VFIO_MIG_FLAG_END_OF_STATE}
> 
> When the data region is mapped, it is the user's responsibility to read
> data_size bytes of data from data_offset before moving to the next steps.
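
To make the packet framing above concrete, here is a minimal, self-contained
sketch of one save iteration. The flag values, types and helper names are
stand-ins for illustration only, not QEMU's actual definitions:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in flag values; the real ones live in hw/vfio/migration.c. */
#define MIG_FLAG_DEV_DATA_STATE 0xffffffffef100004ULL
#define MIG_FLAG_END_OF_STATE   0xffffffffef100000ULL

/* Minimal mock of the migration stream: big-endian u64s into a buffer. */
typedef struct {
    uint8_t buf[256];
    size_t  pos;
} MockStream;

static void put_be64(MockStream *s, uint64_t v)
{
    for (int i = 7; i >= 0; i--) {
        s->buf[s->pos++] = (uint8_t)(v >> (i * 8));
    }
}

static void put_bytes(MockStream *s, const uint8_t *data, size_t len)
{
    memcpy(&s->buf[s->pos], data, len);
    s->pos += len;
}

/* Mock of the vendor-driver side of the migration region: tracks how
 * much device data remains to be staged. */
typedef struct {
    const uint8_t *data;
    uint64_t       remaining;
} MockDevice;

/* One pre-copy iteration: emit
 * { DEV_DATA_STATE, data_size, data..., END_OF_STATE }.
 * Returns the amount of device data written. */
static uint64_t save_one_iteration(MockStream *s, MockDevice *dev,
                                   uint64_t chunk)
{
    uint64_t data_size = dev->remaining < chunk ? dev->remaining : chunk;

    put_be64(s, MIG_FLAG_DEV_DATA_STATE);
    put_be64(s, data_size);          /* data_size == 0 means "no more data" */
    if (data_size) {
        put_bytes(s, dev->data, data_size);
        dev->data += data_size;
        dev->remaining -= data_size;
    }
    put_be64(s, MIG_FLAG_END_OF_STATE);
    return data_size;
}
```

A data_size of 0 in the stream is what lets .save_live_iterate (and the load
side) detect that the device has no more data for this phase.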
>
In each iteration, the vendor driver has to set data_offset once for device
data and once for dirty pages, which is really cumbersome.
Could dirty pages and device data use different offsets, e.g. data_offset
and dirty_page_offset? Or just keep them constant after they are read the
first time?

> .save_live_iterate runs outside the iothread lock in the migration case, which
> could race with an asynchronous call to get the dirty page list, causing data
> corruption in the mapped migration region. A mutex is added here to serialize
> migration buffer read operations.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c  | 246 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events |   6 ++
>  2 files changed, 252 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 0597a45fda2d..4e9b4cce230b 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -117,6 +117,138 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
>      return 0;
>  }
>  
> +static void *find_data_region(VFIORegion *region,
> +                              uint64_t data_offset,
> +                              uint64_t data_size)
> +{
> +    void *ptr = NULL;
> +    int i;
> +
> +    for (i = 0; i < region->nr_mmaps; i++) {
> +        if ((data_offset >= region->mmaps[i].offset) &&
> +            (data_offset < region->mmaps[i].offset + region->mmaps[i].size) &&
> +            (data_size <= region->mmaps[i].size)) {
> +            ptr = region->mmaps[i].mmap + (data_offset -
> +                                           region->mmaps[i].offset);
> +            break;
> +        }
> +    }
> +    return ptr;
> +}
> +
> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    uint64_t data_offset = 0, data_size = 0;
> +    int ret;
> +
> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data_offset));
> +    if (ret != sizeof(data_offset)) {
> +        error_report("%s: Failed to get migration buffer data offset %d",
> +                     vbasedev->name, ret);
> +        return -EINVAL;
> +    }
> +
> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data_size));
> +    if (ret != sizeof(data_size)) {
> +        error_report("%s: Failed to get migration buffer data size %d",
> +                     vbasedev->name, ret);
> +        return -EINVAL;
> +    }
> +
> +    if (data_size > 0) {
> +        void *buf = NULL;
> +        bool buffer_mmaped;
> +
> +        if (region->mmaps) {
> +            buf = find_data_region(region, data_offset, data_size);
> +        }
> +
> +        buffer_mmaped = (buf != NULL) ? true : false;
> +
> +        if (!buffer_mmaped) {
> +            buf = g_try_malloc0(data_size);
> +            if (!buf) {
> +                error_report("%s: Error allocating buffer ", __func__);
> +                return -ENOMEM;
> +            }
> +
> +            ret = pread(vbasedev->fd, buf, data_size,
> +                        region->fd_offset + data_offset);
> +            if (ret != data_size) {
> +                error_report("%s: Failed to get migration data %d",
> +                             vbasedev->name, ret);
> +                g_free(buf);
> +                return -EINVAL;
> +            }
> +        }
> +
> +        qemu_put_be64(f, data_size);
> +        qemu_put_buffer(f, buf, data_size);
> +
> +        if (!buffer_mmaped) {
> +            g_free(buf);
> +        }
> +        migration->pending_bytes -= data_size;
> +    } else {
> +        qemu_put_be64(f, data_size);
> +    }
> +
> +    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
> +                           migration->pending_bytes);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return data_size;
> +}
> +
> +static int vfio_update_pending(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    uint64_t pending_bytes = 0;
> +    int ret;
> +
> +    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             pending_bytes));
> +    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
> +        error_report("%s: Failed to get pending bytes %d",
> +                     vbasedev->name, ret);
> +        migration->pending_bytes = 0;
> +        return (ret < 0) ? ret : -EINVAL;
> +    }
> +
> +    migration->pending_bytes = pending_bytes;
> +    trace_vfio_update_pending(vbasedev->name, pending_bytes);
> +    return 0;
> +}
> +
> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
> +
> +    if (vbasedev->ops && vbasedev->ops->vfio_save_config) {
> +        vbasedev->ops->vfio_save_config(vbasedev, f);
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    trace_vfio_save_device_config_state(vbasedev->name);
> +
> +    return qemu_file_get_error(f);
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -178,9 +310,123 @@ static void vfio_save_cleanup(void *opaque)
>      trace_vfio_save_cleanup(vbasedev->name);
>  }
>  
> +static void vfio_save_pending(QEMUFile *f, void *opaque,
> +                              uint64_t threshold_size,
> +                              uint64_t *res_precopy_only,
> +                              uint64_t *res_compatible,
> +                              uint64_t *res_postcopy_only)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    ret = vfio_update_pending(vbasedev);
> +    if (ret) {
> +        return;
> +    }
> +
> +    *res_precopy_only += migration->pending_bytes;
> +
> +    trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
> +                            *res_postcopy_only, *res_compatible);
> +}
> +
> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret, data_size;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +
> +    qemu_mutex_lock(&migration->lock);
> +    data_size = vfio_save_buffer(f, vbasedev);
> +    qemu_mutex_unlock(&migration->lock);
> +
> +    if (data_size < 0) {
> +        error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
> +                     strerror(errno));
> +        return data_size;
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    trace_vfio_save_iterate(vbasedev->name, data_size);
> +    if (data_size == 0) {
> +        /* indicates data finished, goto complete phase */
> +        return 1;
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_SAVING);
I think this state is already set in the vm state change handler, where
~VFIO_DEVICE_STATE_RUNNING is applied to (VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING).

Why is VFIO_DEVICE_STATE_SAVING redundantly set here?

> +    if (ret) {
> +        error_report("%s: Failed to set state STOP and SAVING",
> +                     vbasedev->name);
> +        return ret;
> +    }
> +
> +    ret = vfio_save_device_config_state(f, opaque);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_update_pending(vbasedev);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    while (migration->pending_bytes > 0) {
> +        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +        ret = vfio_save_buffer(f, vbasedev);
> +        if (ret < 0) {
> +            error_report("%s: Failed to save buffer", vbasedev->name);
> +            return ret;
> +        } else if (ret == 0) {
> +            break;
> +        }
> +
> +        ret = vfio_update_pending(vbasedev);
> +        if (ret) {
> +            return ret;
> +        }
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_MASK);
This state is a little weird.

Thanks
Yan
> +    if (ret) {
> +        error_report("%s: Failed to set state STOPPED", vbasedev->name);
> +        return ret;
> +    }
> +
> +    trace_vfio_save_complete_precopy(vbasedev->name);
> +    return ret;
> +}
> +
>  static SaveVMHandlers savevm_vfio_handlers = {
>      .save_setup = vfio_save_setup,
>      .save_cleanup = vfio_save_cleanup,
> +    .save_live_pending = vfio_save_pending,
> +    .save_live_iterate = vfio_save_iterate,
> +    .save_live_complete_precopy = vfio_save_complete_precopy,
>  };
>  
>  /* ---------------------------------------------------------------------- */
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 4bb43f18f315..bdf40ba368c7 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -151,3 +151,9 @@ vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_st
>  vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
>  vfio_save_setup(char *name) " (%s)"
>  vfio_save_cleanup(char *name) " (%s)"
> +vfio_save_buffer(char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
> +vfio_update_pending(char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
> +vfio_save_device_config_state(char *name) " (%s)"
> +vfio_save_pending(char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
> +vfio_save_iterate(char *name, int data_size) " (%s) data_size %d"
> +vfio_save_complete_precopy(char *name) " (%s)"
> -- 
> 2.7.0
> 


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 04/13] vfio: Add save and load functions for VFIO PCI devices
  2019-07-16 21:14   ` Alex Williamson
@ 2019-07-17  9:10     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 77+ messages in thread
From: Dr. David Alan Gilbert @ 2019-07-17  9:10 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, Kirti Wankhede, eauger, felipe, jonathan.davies, yan.y.zhao,
	changpeng.liu, Ken.Xue

* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Tue, 9 Jul 2019 15:19:11 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > These functions save and restore PCI device specific data - config
> > space of PCI device.
> > Tested save and restore with MSI and MSIX type.
> > 
> > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > ---
> >  hw/vfio/pci.c                 | 114 ++++++++++++++++++++++++++++++++++++++++++
> >  include/hw/vfio/vfio-common.h |   2 +
> >  2 files changed, 116 insertions(+)
> > 
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index de0d286fc9dd..5fe4f8076cac 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -2395,11 +2395,125 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
> >      return OBJECT(vdev);
> >  }
> >  
> > +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> > +{
> > +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> > +    PCIDevice *pdev = &vdev->pdev;
> > +    uint16_t pci_cmd;
> > +    int i;
> > +
> > +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > +        uint32_t bar;
> > +
> > +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> > +        qemu_put_be32(f, bar);
> > +    }
> > +
> > +    qemu_put_be32(f, vdev->interrupt);
> > +    if (vdev->interrupt == VFIO_INT_MSI) {
> > +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> > +        bool msi_64bit;
> > +
> > +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > +                                            2);
> > +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> > +
> > +        msi_addr_lo = pci_default_read_config(pdev,
> > +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> > +        qemu_put_be32(f, msi_addr_lo);
> > +
> > +        if (msi_64bit) {
> > +            msi_addr_hi = pci_default_read_config(pdev,
> > +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > +                                             4);
> > +        }
> > +        qemu_put_be32(f, msi_addr_hi);
> > +
> > +        msi_data = pci_default_read_config(pdev,
> > +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > +                2);
> > +        qemu_put_be32(f, msi_data);
> > +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
> > +        uint16_t offset;
> > +
> > +        /* save enable bit and maskall bit */
> > +        offset = pci_default_read_config(pdev,
> > +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> > +        qemu_put_be16(f, offset);
> > +        msix_save(pdev, f);
> > +    }
> > +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > +    qemu_put_be16(f, pci_cmd);
> > +}
> > +
> > +static void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> > +{
> > +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> > +    PCIDevice *pdev = &vdev->pdev;
> > +    uint32_t interrupt_type;
> > +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> > +    uint16_t pci_cmd;
> > +    bool msi_64bit;
> > +    int i;
> > +
> > +    /* restore pci bar configuration */
> > +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > +    vfio_pci_write_config(pdev, PCI_COMMAND,
> > +                        pci_cmd & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
> > +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > +        uint32_t bar = qemu_get_be32(f);
> > +
> > +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> > +    }
> > +    vfio_pci_write_config(pdev, PCI_COMMAND,
> > +                          pci_cmd | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
> > +
> > +    interrupt_type = qemu_get_be32(f);
> > +
> > +    if (interrupt_type == VFIO_INT_MSI) {
> > +        /* restore msi configuration */
> > +        msi_flags = pci_default_read_config(pdev,
> > +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> > +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> > +
> > +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > +                              msi_flags & (!PCI_MSI_FLAGS_ENABLE), 2);
> > +
> > +        msi_addr_lo = qemu_get_be32(f);
> > +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
> > +                              msi_addr_lo, 4);
> > +
> > +        msi_addr_hi = qemu_get_be32(f);
> > +        if (msi_64bit) {
> > +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > +                                  msi_addr_hi, 4);
> > +        }
> > +        msi_data = qemu_get_be32(f);
> > +        vfio_pci_write_config(pdev,
> > +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > +                msi_data, 2);
> > +
> > +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
> > +    } else if (interrupt_type == VFIO_INT_MSIX) {
> > +        uint16_t offset = qemu_get_be16(f);
> > +
> > +        /* load enable bit and maskall bit */
> > +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> > +                              offset, 2);
> > +        msix_load(pdev, f);
> > +    }
> > +    pci_cmd = qemu_get_be16(f);
> > +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
> > +}
> 
> Pardon my migration ignorance, but there are bound to be more fields
> and capabilities that get migrated over time.  How does this get
> extended to support that and maintain backwards compatibility?  Thanks,

You normally tie those fields to a property on your device and set the
property in the machine type so that newer machine types send the extra
fields.

Dave
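
As a rough illustration of the pattern Dave describes, below is a toy model
in plain C (not the QEMU API; the property name and types are made up) of
gating a newly migrated field behind a device property that older machine
types turn off, mirroring a VMStateDescription subsection's .needed()
callback:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy device: a new migrated field is guarded by a property
 * ("migrate_extra" is a made-up name); older machine types would set
 * the property to false via compat defaults, so the extra subsection
 * is simply never sent to an older destination. */
typedef struct {
    bool migrate_extra;   /* property: default true, old machines false */
    uint32_t extra_state; /* the newly added field */
} ToyDevice;

/* Mirrors a VMStateDescription subsection's .needed() callback. */
static bool extra_state_needed(const ToyDevice *dev)
{
    return dev->migrate_extra;
}

/* Serialize: always the base field, optionally the subsection.
 * Returns the number of u32 words written into out[]. */
static int toy_save(const ToyDevice *dev, uint32_t base, uint32_t *out)
{
    int n = 0;
    out[n++] = base;
    if (extra_state_needed(dev)) {
        out[n++] = dev->extra_state;
    }
    return n;
}
```

In real QEMU the compat machinery sets such properties through the machine
type's compat definitions, so the stream sent to an older destination simply
omits the subsection.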

> Alex
> 
> > +
> >  static VFIODeviceOps vfio_pci_ops = {
> >      .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
> >      .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
> >      .vfio_eoi = vfio_intx_eoi,
> >      .vfio_get_object = vfio_pci_get_object,
> > +    .vfio_save_config = vfio_pci_save_config,
> > +    .vfio_load_config = vfio_pci_load_config,
> >  };
> >  
> >  int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
> > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> > index 771b6d59a3db..ee72bd984a36 100644
> > --- a/include/hw/vfio/vfio-common.h
> > +++ b/include/hw/vfio/vfio-common.h
> > @@ -120,6 +120,8 @@ struct VFIODeviceOps {
> >      int (*vfio_hot_reset_multi)(VFIODevice *vdev);
> >      void (*vfio_eoi)(VFIODevice *vdev);
> >      Object *(*vfio_get_object)(VFIODevice *vdev);
> > +    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
> > +    void (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
> >  };
> >  
> >  typedef struct VFIOGroup {
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 01/13] vfio: KABI for migration interface
  2019-07-16 20:56   ` Alex Williamson
@ 2019-07-17 11:55     ` Cornelia Huck
  2019-07-23 12:13     ` Cornelia Huck
  2019-08-21 20:31     ` Kirti Wankhede
  2 siblings, 0 replies; 77+ messages in thread
From: Cornelia Huck @ 2019-07-17 11:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang, qemu-devel,
	Zhengxiao.zx, shuangtai.tst, dgilbert, zhi.a.wang, mlevitsk,
	pasic, aik, Kirti Wankhede, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Tue, 16 Jul 2019 14:56:32 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue, 9 Jul 2019 15:19:08 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:

> > diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> > index 24f505199f83..6696a4600545 100644
> > --- a/linux-headers/linux/vfio.h
> > +++ b/linux-headers/linux/vfio.h
> > @@ -372,6 +372,172 @@ struct vfio_region_gfx_edid {
> >   */
> >  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
> >  
> > +/* Migration region type and sub-type */
> > +#define VFIO_REGION_TYPE_MIGRATION	        (2)  
> 
> Region type #2 is already claimed by VFIO_REGION_TYPE_CCW, so this would
> need to be #3 or greater (we should have a reference table somewhere in
> this header as it gets easier to miss claimed entries as the sprawl
> grows).

I agree, this is too easy to miss. I came up with "vfio: re-arrange
vfio region definitions" (<20190717114956.16263-1-cohuck@redhat.com>),
maybe that helps a bit.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device
  2019-07-12  0:32           ` Yan Zhao
@ 2019-07-18 18:32             ` Kirti Wankhede
  2019-07-19  1:23               ` Yan Zhao
  0 siblings, 1 reply; 77+ messages in thread
From: Kirti Wankhede @ 2019-07-18 18:32 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye, Ken.Xue,
	Zhengxiao.zx@Alibaba-inc.com, shuangtai.tst, qemu-devel,
	Dr. David Alan Gilbert, pasic, aik, alex.williamson, eauger,
	cohuck, jonathan.davies, felipe, mlevitsk, Liu, Changpeng, Wang,
	Zhi A


On 7/12/2019 6:02 AM, Yan Zhao wrote:
> On Fri, Jul 12, 2019 at 03:08:31AM +0800, Kirti Wankhede wrote:
>>
>>
>> On 7/11/2019 9:53 PM, Dr. David Alan Gilbert wrote:
>>> * Yan Zhao (yan.y.zhao@intel.com) wrote:
>>>> On Thu, Jul 11, 2019 at 06:50:12PM +0800, Dr. David Alan Gilbert wrote:
>>>>> * Yan Zhao (yan.y.zhao@intel.com) wrote:
>>>>>> Hi Kirti,
>>>>>> There are still unaddressed comments to your patches v4.
>>>>>> Would you mind addressing them?
>>>>>>
>>>>>> 1. should we register two migration interfaces simultaneously
>>>>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04750.html)
>>>>>
>>>>> Please don't do this.
>>>>> As far as I'm aware we currently only have one device that does that
>>>>> (vmxnet3) and a patch has just been posted that fixes/removes that.
>>>>>
>>>>> Dave
>>>>>
>>>> hi Dave,
>>>> Thanks for notifying this. But if we want to support postcopy in the
>>>> future, after the device stops, what interface could we use to transfer
>>>> data of the device state only?
>>>> For postcopy, when the source device stops, we need to transfer only the
>>>> necessary device state to the target VM before the target VM starts, and
>>>> we don't want to transfer device memory, as we'll do that after the
>>>> target VM resumes.
>>>
>>> Hmm ok, lets see; that's got to happen in the call to:
>>>     qemu_savevm_state_complete_precopy(fb, false, false);
>>> that's made from postcopy_start.
>>>  (the false's are iterable_only and inactivate_disks)
>>>
>>> and at that time I believe the state is POSTCOPY_ACTIVE, so in_postcopy
>>> is true.
>>>
>>> If you're doing postcopy, then you'll probably define a has_postcopy()
>>> function, so qemu_savevm_state_complete_precopy will skip the
>>> save_live_complete_precopy call from it's loop for at least two of the
>>> reasons in it's big if.
>>>
>>> So you're right; you need the VMSD for this to happen in the second
>>> loop in qemu_savevm_state_complete_precopy.  Hmm.
>>>
>>> Now, what worries me, and I don't know the answer, is how the section
>>> header for the vmstate and the section header for an iteration look
>>> on the stream; how are they different?
>>>
>>
>> I don't have a way to test postcopy migration - that is one of the major
>> reasons I did not include postcopy support in this patchset, as clearly
>> called out in the cover letter.
>> This patchset is thoroughly tested for precopy migration.
>> If anyone has hardware that supports faulting, then I would prefer to add
>> postcopy support later as an incremental change that can be tested before
>> submitting.
>>
>> Just a suggestion, instead of using VMSD, is it possible to have some
>> additional check to call save_live_complete_precopy from
>> qemu_savevm_state_complete_precopy?
>>
>>
>>>>
>>>>>> 2. in each save iteration, how much data is to be saved
>>>>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04683.html)
>>
>>> how big is the data_size ?
>>> if this size is too big, it may take too much time and block others.
>>
>> I do had mentioned this in the comment about the structure in vfio.h
>> header. data_size will be provided by vendor driver and obviously will
>> not be greater that migration region size. Vendor driver should be
>> responsible to keep its solution optimized.
>>
> If data_size is no bigger than the migration region size, and each
> iteration only saves data_size bytes, I'm afraid it will cause prolonged
> downtime. After all, the migration region size cannot be very big.

As I mentioned above, it is the vendor driver's responsibility to keep its
solution optimized.
A well-behaved vendor driver should not cause unnecessarily prolonged
downtime.

> Also, if the vendor driver alone determines how much data to save in each
> iteration, with no checks in QEMU, it may cause other devices'
> migration time to be squeezed.
> 

Sorry, how would that squeeze other devices' migration time?

>>
>>>>>> 3. do we need extra interface to get data for device state only
>>>>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04812.html)
>>
>> I don't think so. Opaque device data from the vendor driver can include
>> device state and device memory. The vendor driver managing the device
>> can decide how to lay out data in the stream.
>>
> I know the current design is opaque device data. Then, to support postcopy,
> we may have to add extra device state like in-postcopy. But postcopy is
> more like a QEMU state, not a device state.

One bit of device_state can be used to inform the vendor driver about
postcopy when postcopy support is added.

> To address it elegantly, could we add an extra interface besides
> vfio_save_buffer() to get data for the device state only?
> 

When in the postcopy state, based on the device_state flag, the vendor
driver can copy device state first into the migration region. I still think
there is no need to separate device state and device memory.

>>>>>> 4. definition of dirty page copied_pfn
>>>>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg05592.html)
>>>>>>
>>
>> This was inline to discussion going with Alex. I addressed the concern
>> there. Please check current patchset, which addresses the concerns raised.
>>
> OK. I saw you also updated the flow in that part. Please check my comment
> in that patch for details. But as a suggestion, I think processed_pfns is
> a better name than copied_pfns :)
> 

The vendor driver can process total_pfns, but may copy only part of the
pfns bitmap to the migration region. One reason could be that the migration
region is not able to accommodate a bitmap of total_pfns size. So it could
be that 0 < copied_pfns < total_pfns. That's why the name
'copied_pfns'. I'll continue with 'copied_pfns'.
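
A toy model of the intended userspace loop, using illustrative names only
(REGION_PFNS stands in for however much bitmap space the migration region
can hold; this is not the kernel ABI):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative cap: bitmap space for this many pages per query. */
#define REGION_PFNS 64

/* Toy dirty-page tracker: the "vendor driver" reports
 * copied_pfns <= total_pfns per query, and userspace keeps asking
 * until the whole bitmap has been fetched. */
typedef struct {
    uint64_t total_pfns;
    uint64_t fetched;   /* pfns' worth of bitmap handed out so far */
} ToyDirtyTracker;

/* One query: returns copied_pfns for this round (0 when done). */
static uint64_t get_dirty_bitmap_chunk(ToyDirtyTracker *t)
{
    uint64_t left = t->total_pfns - t->fetched;
    uint64_t copied = left < REGION_PFNS ? left : REGION_PFNS;
    t->fetched += copied;
    return copied;
}

/* Userspace loop: iterate until the whole bitmap is retrieved.
 * Returns the number of non-empty chunks fetched. */
static int fetch_all(ToyDirtyTracker *t)
{
    int iterations = 0;
    while (get_dirty_bitmap_chunk(t) > 0) {
        iterations++;
    }
    return iterations;
}
```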

Thanks,
Kirti

>>>>>> Also, I'm glad to see that you updated code by following my comments below,
>>>>>> but please don't forget to reply my comments next time:)
>>
>> I tried to reply at the top of the threads and addressed the common
>> concerns raised there. Sorry if I missed any; I'll make sure to point you
>> to my replies going forward.
>>
> ok. let's cooperate:)
> 
> Thanks
> Yan
> 
>> Thanks,
>> Kirti
>>
>>>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg05357.html
>>>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg06454.html
>>>>>>
>>>>>> Thanks
>>>>>> Yan
>>>>>>
>>>>>> On Tue, Jul 09, 2019 at 05:49:07PM +0800, Kirti Wankhede wrote:
>>>>>>> Add migration support for VFIO device
>>>>>>>
>>>>>>> This Patch set include patches as below:
>>>>>>> - Define KABI for VFIO device for migration support.
>>>>>>> - Added save and restore functions for PCI configuration space
>>>>>>> - Generic migration functionality for VFIO device.
>>>>>>>   * This patch set adds functionality only for PCI devices, but can be
>>>>>>>     extended to other VFIO devices.
>>>>>>>   * Added all the basic functions required for pre-copy, stop-and-copy and
>>>>>>>     resume phases of migration.
>>>>>>>   * Added state change notifier and from that notifier function, VFIO
>>>>>>>     device's state changed is conveyed to VFIO device driver.
>>>>>>>   * During save setup phase and resume/load setup phase, migration region
>>>>>>>     is queried and is used to read/write VFIO device data.
>>>>>>>   * .save_live_pending and .save_live_iterate are implemented to use QEMU's
>>>>>>>     functionality of iteration during pre-copy phase.
>>>>>>>   * In .save_live_complete_precopy, that is in stop-and-copy phase,
>>>>>>>     iteration to read data from VFIO device driver is implemented till pending
>>>>>>>     bytes returned by driver are not zero.
>>>>>>>   * Added function to get dirty pages bitmap for the pages which are used by
>>>>>>>     driver.
>>>>>>> - Add vfio_listerner_log_sync to mark dirty pages.
>>>>>>> - Make VFIO PCI device migration capable. If migration region is not provided by
>>>>>>>   driver, migration is blocked.
>>>>>>>
>>>>>>> Below is the flow of state change for live migration where states in brackets
>>>>>>> represent VM state, migration state and VFIO device state as:
>>>>>>>     (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)
>>>>>>>
>>>>>>> Live migration save path:
>>>>>>>         QEMU normal running state
>>>>>>>         (RUNNING, _NONE, _RUNNING)
>>>>>>>                         |
>>>>>>>     migrate_init spawns migration_thread.
>>>>>>>     (RUNNING, _SETUP, _RUNNING|_SAVING)
>>>>>>>     Migration thread then calls each device's .save_setup()
>>>>>>>                         |
>>>>>>>     (RUNNING, _ACTIVE, _RUNNING|_SAVING)
>>>>>>>     If device is active, get pending bytes by .save_live_pending()
>>>>>>>     if pending bytes >= threshold_size,  call save_live_iterate()
>>>>>>>     Data of VFIO device for pre-copy phase is copied.
>>>>>>>     Iterate till pending bytes converge and are less than threshold
>>>>>>>                         |
>>>>>>>     On migration completion, vCPUs stop and .save_live_complete_precopy is
>>>>>>>     called for each active device. VFIO device is then transitioned to
>>>>>>>      _SAVING state.
>>>>>>>     (FINISH_MIGRATE, _DEVICE, _SAVING)
>>>>>>>     For VFIO device, iterate in  .save_live_complete_precopy  until
>>>>>>>     pending data is 0.
>>>>>>>     (FINISH_MIGRATE, _DEVICE, _STOPPED)
>>>>>>>                         |
>>>>>>>     (FINISH_MIGRATE, _COMPLETED, STOPPED)
>>>>>>>     Migration thread schedules the cleanup bottom half and exits
>>>>>>>
>>>>>>> Live migration resume path:
>>>>>>>     Incoming migration calls .load_setup for each device
>>>>>>>     (RESTORE_VM, _ACTIVE, STOPPED)
>>>>>>>                         |
>>>>>>>     For each device, .load_state is called for that device section data
>>>>>>>                         |
>>>>>>>     At the end, called .load_cleanup for each device and vCPUs are started.
>>>>>>>                         |
>>>>>>>         (RUNNING, _NONE, _RUNNING)
>>>>>>>
>>>>>>> Note that:
>>>>>>> - Migration post copy is not supported.
>>>>>>>
>>>>>>> v6 -> v7:
>>>>>>> - Fix build failures.
>>>>>>>
>>>>>>> v5 -> v6:
>>>>>>> - Fix build failure.
>>>>>>>
>>>>>>> v4 -> v5:
>>>>>>> - Added descriptive comment about the sequence of access of members of structure
>>>>>>>   vfio_device_migration_info to be followed based on Alex's suggestion
>>>>>>> - Updated get dirty pages sequence.
>>>>>>> - As per Cornelia Huck's suggestion, added callbacks to VFIODeviceOps to
>>>>>>>   get_object, save_config and load_config.
>>>>>>> - Fixed multiple nit picks.
>>>>>>> - Tested live migration with multiple vfio device assigned to a VM.
>>>>>>>
>>>>>>> v3 -> v4:
>>>>>>> - Added one more bit for _RESUMING flag to be set explicitly.
>>>>>>> - data_offset field is read-only for user space application.
>>>>>>> - data_size is read on every iteration before reading data from the migration
>>>>>>>   region; this removes the assumption that data extends to the end of the region.
>>>>>>> - If vendor driver supports mappable sparse regions, map those regions during
>>>>>>>   setup state of save/load, similarly unmap those from cleanup routines.
>>>>>>> - Handles race condition that causes data corruption in migration region during
>>>>>>>   save device state by adding mutex and serializing save_buffer and
>>>>>>>   get_dirty_pages routines.
>>>>>>> - Skip calling get_dirty_pages routine for mapped MMIO region of device.
>>>>>>> - Added trace events.
>>>>>>> - Split into multiple functional patches.
>>>>>>>
>>>>>>> v2 -> v3:
>>>>>>> - Removed enum of VFIO device states. Defined VFIO device state with 2 bits.
>>>>>>> - Re-structured vfio_device_migration_info to keep it minimal and defined action
>>>>>>>   on read and write access on its members.
>>>>>>>
>>>>>>> v1 -> v2:
>>>>>>> - Defined MIGRATION region type and sub-type which should be used with region
>>>>>>>   type capability.
>>>>>>> - Re-structured vfio_device_migration_info. This structure will be placed at 0th
>>>>>>>   offset of migration region.
>>>>>>> - Replaced ioctl with read/write for trapped part of migration region.
>>>>>>> - Added both type of access support, trapped or mmapped, for data section of the
>>>>>>>   region.
>>>>>>> - Moved PCI device functions to pci file.
>>>>>>> - Added iteration to get dirty page bitmap until bitmap for all requested pages
>>>>>>>   are copied.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Kirti
>>>>>>>
>>>>>>> Kirti Wankhede (13):
>>>>>>>   vfio: KABI for migration interface
>>>>>>>   vfio: Add function to unmap VFIO region
>>>>>>>   vfio: Add vfio_get_object callback to VFIODeviceOps
>>>>>>>   vfio: Add save and load functions for VFIO PCI devices
>>>>>>>   vfio: Add migration region initialization and finalize function
>>>>>>>   vfio: Add VM state change handler to know state of VM
>>>>>>>   vfio: Add migration state change notifier
>>>>>>>   vfio: Register SaveVMHandlers for VFIO device
>>>>>>>   vfio: Add save state functions to SaveVMHandlers
>>>>>>>   vfio: Add load state functions to SaveVMHandlers
>>>>>>>   vfio: Add function to get dirty page list
>>>>>>>   vfio: Add vfio_listerner_log_sync to mark dirty pages
>>>>>>>   vfio: Make vfio-pci device migration capable.
>>>>>>>
>>>>>>>  hw/vfio/Makefile.objs         |   2 +-
>>>>>>>  hw/vfio/common.c              |  55 +++
>>>>>>>  hw/vfio/migration.c           | 874 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>>  hw/vfio/pci.c                 | 137 ++++++-
>>>>>>>  hw/vfio/trace-events          |  19 +
>>>>>>>  include/hw/vfio/vfio-common.h |  25 ++
>>>>>>>  linux-headers/linux/vfio.h    | 166 ++++++++
>>>>>>>  7 files changed, 1271 insertions(+), 7 deletions(-)
>>>>>>>  create mode 100644 hw/vfio/migration.c
>>>>>>>
>>>>>>> -- 
>>>>>>> 2.7.0
>>>>>>>
>>>>> --
>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>
> 


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 11/13] vfio: Add function to get dirty page list
  2019-07-12  0:33   ` Yan Zhao
@ 2019-07-18 18:39     ` Kirti Wankhede
  2019-07-19  1:24       ` Yan Zhao
  0 siblings, 1 reply; 77+ messages in thread
From: Kirti Wankhede @ 2019-07-18 18:39 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang,  Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue



On 7/12/2019 6:03 AM, Yan Zhao wrote:
> On Tue, Jul 09, 2019 at 05:49:18PM +0800, Kirti Wankhede wrote:
>> Dirty page tracking (.log_sync) is part of the RAM copying state: the
>> vendor driver provides the bitmap of pages it has dirtied through the
>> migration region, and as part of the RAM copy those pages get copied to
>> the file stream.
>>
>> To get dirty page bitmap:
>> - write start address, page_size and pfn count.
>> - read count of pfns copied.
>>     - Vendor driver should return 0 if driver doesn't have any page to
>>       report dirty in given range.
>>     - Vendor driver should return -1 to mark all pages dirty for given range.
>> - read data_offset, where vendor driver has written bitmap.
>> - read bitmap from the region or mmaped part of the region.
>> - Iterate above steps till page bitmap for all requested pfns are copied.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  hw/vfio/migration.c           | 123 ++++++++++++++++++++++++++++++++++++++++++
>>  hw/vfio/trace-events          |   1 +
>>  include/hw/vfio/vfio-common.h |   2 +
>>  3 files changed, 126 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 5fb4c5329ede..ca1a8c0f5f1f 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -269,6 +269,129 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>>      return qemu_file_get_error(f);
>>  }
>>  
>> +void vfio_get_dirty_page_list(VFIODevice *vbasedev,
>> +                              uint64_t start_pfn,
>> +                              uint64_t pfn_count,
>> +                              uint64_t page_size)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIORegion *region = &migration->region.buffer;
>> +    uint64_t count = 0;
>> +    int64_t copied_pfns = 0;
>> +    int64_t total_pfns = pfn_count;
>> +    int ret;
>> +
>> +    qemu_mutex_lock(&migration->lock);
>> +
>> +    while (total_pfns > 0) {
>> +        uint64_t bitmap_size, data_offset = 0;
>> +        uint64_t start = start_pfn + count;
>> +        void *buf = NULL;
>> +        bool buffer_mmaped = false;
>> +
>> +        ret = pwrite(vbasedev->fd, &start, sizeof(start),
>> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                              start_pfn));
>> +        if (ret < 0) {
>> +            error_report("%s: Failed to set dirty pages start address %d %s",
>> +                         vbasedev->name, ret, strerror(errno));
>> +            goto dpl_unlock;
>> +        }
>> +
>> +        ret = pwrite(vbasedev->fd, &page_size, sizeof(page_size),
>> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                              page_size));
>> +        if (ret < 0) {
>> +            error_report("%s: Failed to set dirty page size %d %s",
>> +                         vbasedev->name, ret, strerror(errno));
>> +            goto dpl_unlock;
>> +        }
>> +
>> +        ret = pwrite(vbasedev->fd, &total_pfns, sizeof(total_pfns),
>> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                              total_pfns));
>> +        if (ret < 0) {
>> +            error_report("%s: Failed to set dirty page total pfns %d %s",
>> +                         vbasedev->name, ret, strerror(errno));
>> +            goto dpl_unlock;
>> +        }
>> +
>> +        /* Read copied dirty pfns */
>> +        ret = pread(vbasedev->fd, &copied_pfns, sizeof(copied_pfns),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             copied_pfns));
>> +        if (ret < 0) {
>> +            error_report("%s: Failed to get dirty pages bitmap count %d %s",
>> +                         vbasedev->name, ret, strerror(errno));
>> +            goto dpl_unlock;
>> +        }
>> +
>> +        if (copied_pfns == VFIO_DEVICE_DIRTY_PFNS_NONE) {
>> +            /*
>> +             * copied_pfns could be 0 if driver doesn't have any page to
>> +             * report dirty in given range
>> +             */
>> +            break;
>> +        } else if (copied_pfns == VFIO_DEVICE_DIRTY_PFNS_ALL) {
>> +            /* Mark all pages dirty for this range */
>> +            cpu_physical_memory_set_dirty_range(start_pfn * page_size,
>> +                                                pfn_count * page_size,
>> +                                                DIRTY_MEMORY_MIGRATION);
> seesm pfn_count here is not right

Changing it to total_pfns in the next version.

Thanks,
Kirti

>> +            break;
>> +        }
>> +
>> +        bitmap_size = (BITS_TO_LONGS(copied_pfns) + 1) * sizeof(unsigned long);
>> +
>> +        ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             data_offset));
>> +        if (ret != sizeof(data_offset)) {
>> +            error_report("%s: Failed to get migration buffer data offset %d",
>> +                         vbasedev->name, ret);
>> +            goto dpl_unlock;
>> +        }
>> +
>> +        if (region->mmaps) {
>> +            buf = find_data_region(region, data_offset, bitmap_size);
>> +        }
>> +
>> +        buffer_mmaped = (buf != NULL) ? true : false;
>> +
>> +        if (!buffer_mmaped) {
>> +            buf = g_try_malloc0(bitmap_size);
>> +            if (!buf) {
>> +                error_report("%s: Error allocating buffer ", __func__);
>> +                goto dpl_unlock;
>> +            }
>> +
>> +            ret = pread(vbasedev->fd, buf, bitmap_size,
>> +                        region->fd_offset + data_offset);
>> +            if (ret != bitmap_size) {
>> +                error_report("%s: Failed to get dirty pages bitmap %d",
>> +                             vbasedev->name, ret);
>> +                g_free(buf);
>> +                goto dpl_unlock;
>> +            }
>> +        }
>> +
>> +        cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
>> +                                               (start_pfn + count) * page_size,
>> +                                                copied_pfns);
>> +        count      += copied_pfns;
>> +        total_pfns -= copied_pfns;
>> +
>> +        if (!buffer_mmaped) {
>> +            g_free(buf);
>> +        }
>> +    }
>> +
>> +    trace_vfio_get_dirty_page_list(vbasedev->name, start_pfn, pfn_count,
>> +                                   page_size);
>> +
>> +dpl_unlock:
>> +    qemu_mutex_unlock(&migration->lock);
>> +}
>> +
>>  /* ---------------------------------------------------------------------- */
>>  
>>  static int vfio_save_setup(QEMUFile *f, void *opaque)
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index ac065b559f4e..414a5e69ec5e 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -160,3 +160,4 @@ vfio_save_complete_precopy(char *name) " (%s)"
>>  vfio_load_device_config_state(char *name) " (%s)"
>>  vfio_load_state(char *name, uint64_t data) " (%s) data 0x%"PRIx64
>>  vfio_load_state_device_data(char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
>> +vfio_get_dirty_page_list(char *name, uint64_t start, uint64_t pfn_count, uint64_t page_size) " (%s) start 0x%"PRIx64" pfn_count 0x%"PRIx64 " page size 0x%"PRIx64
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index a022484d2636..dc1b83a0b4ef 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -222,5 +222,7 @@ int vfio_spapr_remove_window(VFIOContainer *container,
>>  
>>  int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
>>  void vfio_migration_finalize(VFIODevice *vbasedev);
>> +void vfio_get_dirty_page_list(VFIODevice *vbasedev, uint64_t start_pfn,
>> +                               uint64_t pfn_count, uint64_t page_size);
>>  
>>  #endif /* HW_VFIO_VFIO_COMMON_H */
>> -- 
>> 2.7.0
>>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 09/13] vfio: Add save state functions to SaveVMHandlers
  2019-07-12  2:44   ` Yan Zhao
@ 2019-07-18 18:45     ` Kirti Wankhede
  0 siblings, 0 replies; 77+ messages in thread
From: Kirti Wankhede @ 2019-07-18 18:45 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang,  Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue



On 7/12/2019 8:14 AM, Yan Zhao wrote:
> On Tue, Jul 09, 2019 at 05:49:16PM +0800, Kirti Wankhede wrote:
>> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
>> functions. These functions handle the pre-copy and stop-and-copy phases.
>>
>> In _SAVING|_RUNNING device state or pre-copy phase:
>> - read pending_bytes
>> - read data_offset - indicates kernel driver to write data to staging
>>   buffer which is mmapped.
>> - read data_size - amount of data in bytes written by vendor driver in migration
>>   region.
>> - if data section is trapped, pread() from data_offset of data_size.
>> - if data section is mmaped, read mmaped buffer of data_size.
>> - Write data packet to file stream as below:
>> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
>> VFIO_MIG_FLAG_END_OF_STATE }
>>
>> In _SAVING device state or stop-and-copy phase
>> a. read config space of device and save to migration file stream. This
>>    doesn't need to be from vendor driver. Any other special config state
>>    from driver can be saved as data in following iteration.
>> b. read pending_bytes
>> c. read data_offset - indicates kernel driver to write data to staging
>>    buffer which is mmapped.
>> d. read data_size - amount of data in bytes written by vendor driver in
>>    migration region.
>> e. if data section is trapped, pread() from data_offset of data_size.
>> f. if data section is mmaped, read mmaped buffer of data_size.
>> g. Write data packet as below:
>>    {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
>> h. iterate through steps b to g while (pending_bytes > 0)
>> i. Write {VFIO_MIG_FLAG_END_OF_STATE}
>>
>> When the data region is mapped, it is the user's responsibility to read data
>> from data_offset of data_size before moving to the next steps.
>>
>> .save_live_iterate runs outside the iothread lock in the migration case, which
>> could race with the asynchronous call to get the dirty page list, causing data
>> corruption in the mapped migration region. A mutex is added here to serialize
>> migration buffer read operations.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  hw/vfio/migration.c  | 246 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  hw/vfio/trace-events |   6 ++
>>  2 files changed, 252 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 0597a45fda2d..4e9b4cce230b 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -117,6 +117,138 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
>>      return 0;
>>  }
>>  
>> +static void *find_data_region(VFIORegion *region,
>> +                              uint64_t data_offset,
>> +                              uint64_t data_size)
>> +{
>> +    void *ptr = NULL;
>> +    int i;
>> +
>> +    for (i = 0; i < region->nr_mmaps; i++) {
>> +        if ((data_offset >= region->mmaps[i].offset) &&
>> +            (data_offset < region->mmaps[i].offset + region->mmaps[i].size) &&
>> +            (data_size <= region->mmaps[i].size)) {
>> +            ptr = region->mmaps[i].mmap + (data_offset -
>> +                                           region->mmaps[i].offset);
>> +            break;
>> +        }
>> +    }
>> +    return ptr;
>> +}
>> +
>> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIORegion *region = &migration->region.buffer;
>> +    uint64_t data_offset = 0, data_size = 0;
>> +    int ret;
>> +
>> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             data_offset));
>> +    if (ret != sizeof(data_offset)) {
>> +        error_report("%s: Failed to get migration buffer data offset %d",
>> +                     vbasedev->name, ret);
>> +        return -EINVAL;
>> +    }
>> +
>> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             data_size));
>> +    if (ret != sizeof(data_size)) {
>> +        error_report("%s: Failed to get migration buffer data size %d",
>> +                     vbasedev->name, ret);
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (data_size > 0) {
>> +        void *buf = NULL;
>> +        bool buffer_mmaped;
>> +
>> +        if (region->mmaps) {
>> +            buf = find_data_region(region, data_offset, data_size);
>> +        }
>> +
>> +        buffer_mmaped = (buf != NULL) ? true : false;
>> +
>> +        if (!buffer_mmaped) {
>> +            buf = g_try_malloc0(data_size);
>> +            if (!buf) {
>> +                error_report("%s: Error allocating buffer ", __func__);
>> +                return -ENOMEM;
>> +            }
>> +
>> +            ret = pread(vbasedev->fd, buf, data_size,
>> +                        region->fd_offset + data_offset);
>> +            if (ret != data_size) {
>> +                error_report("%s: Failed to get migration data %d",
>> +                             vbasedev->name, ret);
>> +                g_free(buf);
>> +                return -EINVAL;
>> +            }
>> +        }
>> +
>> +        qemu_put_be64(f, data_size);
>> +        qemu_put_buffer(f, buf, data_size);
>> +
>> +        if (!buffer_mmaped) {
>> +            g_free(buf);
>> +        }
>> +        migration->pending_bytes -= data_size;
> This line "migration->pending_bytes -= data_size;" is not necessary, as
> it will be updated anyway in vfio_update_pending()
> 

Right, removing it.

Thanks,
Kirti



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 02/13] vfio: Add function to unmap VFIO region
  2019-07-16 16:29   ` Cornelia Huck
@ 2019-07-18 18:54     ` Kirti Wankhede
  0 siblings, 0 replies; 77+ messages in thread
From: Kirti Wankhede @ 2019-07-18 18:54 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang, Ken.Xue,
	Zhengxiao.zx, shuangtai.tst, qemu-devel, dgilbert, pasic, aik,
	alex.williamson, eauger, felipe, jonathan.davies, yan.y.zhao,
	mlevitsk, changpeng.liu, zhi.a.wang



On 7/16/2019 9:59 PM, Cornelia Huck wrote:
> On Tue, 9 Jul 2019 15:19:09 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> This function is used in follwing patch in this series.
> 
> "This function will be used for the migration region." ?
> 
> ("This series" will be a bit confusing when this has been merged :)
> 

Sure. Changing it in next version.

Thanks,
Kirti

>> Migration region is mmaped when migration starts and will be unmapped when
>> migration is complete.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  hw/vfio/common.c              | 20 ++++++++++++++++++++
>>  hw/vfio/trace-events          |  1 +
>>  include/hw/vfio/vfio-common.h |  1 +
>>  3 files changed, 22 insertions(+)
> 
> Reviewed-by: Cornelia Huck <cohuck@redhat.com>
> 


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 10/13] vfio: Add load state functions to SaveVMHandlers
  2019-07-12  2:52   ` Yan Zhao
@ 2019-07-18 19:00     ` Kirti Wankhede
  2019-07-22  3:20       ` Yan Zhao
  0 siblings, 1 reply; 77+ messages in thread
From: Kirti Wankhede @ 2019-07-18 19:00 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye, Ken.Xue,
	Zhengxiao.zx, shuangtai.tst, qemu-devel, dgilbert, pasic, aik,
	alex.williamson, eauger, cohuck, jonathan.davies, felipe,
	mlevitsk, Liu, Changpeng, Wang, Zhi A



On 7/12/2019 8:22 AM, Yan Zhao wrote:
> On Tue, Jul 09, 2019 at 05:49:17PM +0800, Kirti Wankhede wrote:
>> Flow during _RESUMING device state:
>> - If Vendor driver defines mappable region, mmap migration region.
>> - Load config state.
>> - For data packet, till VFIO_MIG_FLAG_END_OF_STATE is not reached
>>     - read data_size from packet, read buffer of data_size
>>     - read data_offset from where QEMU should write data.
>>         if region is mmaped, write data of data_size to mmaped region.
>>     - write data_size.
>>         In case of mmapped region, write to data_size indicates kernel
>>         driver that data is written in staging buffer.
>>     - if region is trapped, pwrite() data of data_size from data_offset.
>> - Repeat above until VFIO_MIG_FLAG_END_OF_STATE.
>> - Unmap migration region.
>>
>> For user, data is opaque. User should write data in the same order as
>> received.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  hw/vfio/migration.c  | 162 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  hw/vfio/trace-events |   3 +
>>  2 files changed, 165 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 4e9b4cce230b..5fb4c5329ede 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -249,6 +249,26 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
>>      return qemu_file_get_error(f);
>>  }
>>  
>> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    uint64_t data;
>> +
>> +    if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
>> +        vbasedev->ops->vfio_load_config(vbasedev, f);
>> +    }
>> +
>> +    data = qemu_get_be64(f);
>> +    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
>> +        error_report("%s: Failed loading device config space, "
>> +                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
>> +        return -EINVAL;
>> +    }
>> +
>> +    trace_vfio_load_device_config_state(vbasedev->name);
>> +    return qemu_file_get_error(f);
>> +}
>> +
>>  /* ---------------------------------------------------------------------- */
>>  
>>  static int vfio_save_setup(QEMUFile *f, void *opaque)
>> @@ -421,12 +441,154 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>      return ret;
>>  }
>>  
>> +static int vfio_load_setup(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret = 0;
>> +
>> +    if (migration->region.buffer.mmaps) {
>> +        ret = vfio_region_mmap(&migration->region.buffer);
>> +        if (ret) {
>> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
>> +                         vbasedev->name, migration->region.index,
>> +                         strerror(-ret));
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING);
>> +    if (ret) {
>> +        error_report("%s: Failed to set state RESUMING", vbasedev->name);
>> +    }
>> +    return ret;
>> +}
>> +
>> +static int vfio_load_cleanup(void *opaque)
>> +{
>> +    vfio_save_cleanup(opaque);
>> +    return 0;
>> +}
>> +
>> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret = 0;
>> +    uint64_t data, data_size;
>> +
> I think checking of version_id is still needed.
> 

Checking version_id against what value?

Thanks,
Kirti


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 05/13] vfio: Add migration region initialization and finalize function
  2019-07-16 21:37   ` Alex Williamson
@ 2019-07-18 20:19     ` Kirti Wankhede
  0 siblings, 0 replies; 77+ messages in thread
From: Kirti Wankhede @ 2019-07-18 20:19 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang, Ken.Xue,
	Zhengxiao.zx, shuangtai.tst, qemu-devel, dgilbert, pasic, aik,
	eauger, cohuck, jonathan.davies, felipe, mlevitsk, changpeng.liu,
	zhi.a.wang, yan.y.zhao



On 7/17/2019 3:07 AM, Alex Williamson wrote:
> On Tue, 9 Jul 2019 15:19:12 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> - Migration functions are implemented for VFIO_DEVICE_TYPE_PCI device in this
>>   patch series.
>> - Whether a VFIO device supports migration is decided based on the migration
>>   region query. If the migration region query and migration region
>>   initialization are successful then migration is supported, else migration
>>   is blocked.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  hw/vfio/Makefile.objs         |   2 +-
>>  hw/vfio/migration.c           | 145 ++++++++++++++++++++++++++++++++++++++++++
>>  hw/vfio/trace-events          |   3 +
>>  include/hw/vfio/vfio-common.h |  14 ++++
>>  4 files changed, 163 insertions(+), 1 deletion(-)
>>  create mode 100644 hw/vfio/migration.c
>>
>> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
>> index abad8b818c9b..36033d1437c5 100644
>> --- a/hw/vfio/Makefile.objs
>> +++ b/hw/vfio/Makefile.objs
>> @@ -1,4 +1,4 @@
>> -obj-y += common.o spapr.o
>> +obj-y += common.o spapr.o migration.o
>>  obj-$(CONFIG_VFIO_PCI) += pci.o pci-quirks.o display.o
>>  obj-$(CONFIG_VFIO_CCW) += ccw.o
>>  obj-$(CONFIG_VFIO_PLATFORM) += platform.o
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> new file mode 100644
>> index 000000000000..a2cfbd5af2e1
>> --- /dev/null
>> +++ b/hw/vfio/migration.c
>> @@ -0,0 +1,145 @@
>> +/*
>> + * Migration support for VFIO devices
>> + *
>> + * Copyright NVIDIA, Inc. 2019
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2. See
>> + * the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include <linux/vfio.h>
>> +
>> +#include "hw/vfio/vfio-common.h"
>> +#include "cpu.h"
>> +#include "migration/migration.h"
>> +#include "migration/qemu-file.h"
>> +#include "migration/register.h"
>> +#include "migration/blocker.h"
>> +#include "migration/misc.h"
>> +#include "qapi/error.h"
>> +#include "exec/ramlist.h"
>> +#include "exec/ram_addr.h"
>> +#include "pci.h"
>> +#include "trace.h"
>> +
>> +static void vfio_migration_region_exit(VFIODevice *vbasedev)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    if (!migration) {
>> +        return;
>> +    }
>> +
>> +    if (migration->region.buffer.size) {
> 
> Having a VFIORegion named buffer within a struct named region is
> unnecessarily confusing.  Please fix.
> 

Adding this fix in next version.

>> +        vfio_region_exit(&migration->region.buffer);
>> +        vfio_region_finalize(&migration->region.buffer);
>> +    }
>> +}
>> +
>> +static int vfio_migration_region_init(VFIODevice *vbasedev)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    Object *obj = NULL;
>> +    int ret = -EINVAL;
>> +
>> +    if (!migration) {
>> +        return ret;
>> +    }
>> +
>> +    if (!vbasedev->ops || !vbasedev->ops->vfio_get_object) {
>> +        return ret;
>> +    }
>> +
>> +    obj = vbasedev->ops->vfio_get_object(vbasedev);
>> +    if (!obj) {
>> +        return ret;
>> +    }
>> +
>> +    ret = vfio_region_setup(obj, vbasedev, &migration->region.buffer,
>> +                            migration->region.index, "migration");
>> +    if (ret) {
>> +        error_report("%s: Failed to setup VFIO migration region %d: %s",
>> +                     vbasedev->name, migration->region.index, strerror(-ret));
>> +        goto err;
>> +    }
>> +
>> +    if (!migration->region.buffer.size) {
>> +        ret = -EINVAL;
>> +        error_report("%s: Invalid region size of VFIO migration region %d: %s",
>> +                     vbasedev->name, migration->region.index, strerror(-ret));
>> +        goto err;
>> +    }
>> +
>> +    return 0;
>> +
>> +err:
>> +    vfio_migration_region_exit(vbasedev);
>> +    return ret;
>> +}
>> +
>> +static int vfio_migration_init(VFIODevice *vbasedev,
>> +                               struct vfio_region_info *info)
>> +{
>> +    int ret;
>> +
>> +    vbasedev->migration = g_new0(VFIOMigration, 1);
>> +    vbasedev->migration->region.index = info->index;
> 
> VFIORegion already caches the region index, VFIORegion.nr.
> 
>> +
>> +    ret = vfio_migration_region_init(vbasedev);
>> +    if (ret) {
>> +        error_report("%s: Failed to initialise migration region",
>> +                     vbasedev->name);
>> +        return ret;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +/* ---------------------------------------------------------------------- */
>> +
>> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>> +{
>> +    struct vfio_region_info *info;
>> +    Error *local_err = NULL;
>> +    int ret;
>> +
>> +    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
>> +                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
>> +    if (ret) {
>> +        goto add_blocker;
>> +    }
>> +
>> +    ret = vfio_migration_init(vbasedev, info);
>> +    if (ret) {
>> +        goto add_blocker;
>> +    }
>> +
>> +    trace_vfio_migration_probe(vbasedev->name, info->index);
>> +    return 0;
>> +
>> +add_blocker:
>> +    error_setg(&vbasedev->migration_blocker,
>> +               "VFIO device doesn't support migration"); 
>> +    ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        error_free(vbasedev->migration_blocker);
>> +    }
>> +    return ret;
>> +}
>> +
>> +void vfio_migration_finalize(VFIODevice *vbasedev)
>> +{
>> +    if (!vbasedev->migration) {
>> +        return;
> 
> Don't we allocate migration_blocker even for this case?  Thanks,
> 

Right. Fixing this in the next version.

Thanks,
Kirti

> Alex
> 
>> +    }
>> +
>> +    if (vbasedev->migration_blocker) {
>> +        migrate_del_blocker(vbasedev->migration_blocker);
>> +        error_free(vbasedev->migration_blocker);
>> +    }
>> +
>> +    vfio_migration_region_exit(vbasedev);
>> +    g_free(vbasedev->migration);
>> +}
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 8cdc27946cb8..191a726a1312 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -143,3 +143,6 @@ vfio_display_edid_link_up(void) ""
>>  vfio_display_edid_link_down(void) ""
>>  vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
>>  vfio_display_edid_write_error(void) ""
>> +
>> +# migration.c
>> +vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index ee72bd984a36..152da3f8d6f3 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -57,6 +57,15 @@ typedef struct VFIORegion {
>>      uint8_t nr; /* cache the region number for debug */
>>  } VFIORegion;
>>  
>> +typedef struct VFIOMigration {
>> +    struct {
>> +        VFIORegion buffer;
>> +        uint32_t index;
>> +    } region;
>> +    uint64_t pending_bytes;
>> +    QemuMutex lock;
>> +} VFIOMigration;
>> +
>>  typedef struct VFIOAddressSpace {
>>      AddressSpace *as;
>>      QLIST_HEAD(, VFIOContainer) containers;
>> @@ -113,6 +122,8 @@ typedef struct VFIODevice {
>>      unsigned int num_irqs;
>>      unsigned int num_regions;
>>      unsigned int flags;
>> +    VFIOMigration *migration;
>> +    Error *migration_blocker;
>>  } VFIODevice;
>>  
>>  struct VFIODeviceOps {
>> @@ -204,4 +215,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
>>  int vfio_spapr_remove_window(VFIOContainer *container,
>>                               hwaddr offset_within_address_space);
>>  
>> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
>> +void vfio_migration_finalize(VFIODevice *vbasedev);
>> +
>>  #endif /* HW_VFIO_VFIO_COMMON_H */
> 
> 


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device
  2019-07-18 18:32             ` Kirti Wankhede
@ 2019-07-19  1:23               ` Yan Zhao
  2019-07-24 11:32                 ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 77+ messages in thread
From: Yan Zhao @ 2019-07-19  1:23 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye, Ken.Xue,
	Zhengxiao.zx@Alibaba-inc.com, shuangtai.tst, qemu-devel,
	Dr. David Alan Gilbert, pasic, aik, alex.williamson, eauger,
	cohuck, jonathan.davies, felipe, mlevitsk, Liu, Changpeng, Wang,
	Zhi A

On Fri, Jul 19, 2019 at 02:32:33AM +0800, Kirti Wankhede wrote:
> 
> On 7/12/2019 6:02 AM, Yan Zhao wrote:
> > On Fri, Jul 12, 2019 at 03:08:31AM +0800, Kirti Wankhede wrote:
> >>
> >>
> >> On 7/11/2019 9:53 PM, Dr. David Alan Gilbert wrote:
> >>> * Yan Zhao (yan.y.zhao@intel.com) wrote:
> >>>> On Thu, Jul 11, 2019 at 06:50:12PM +0800, Dr. David Alan Gilbert wrote:
> >>>>> * Yan Zhao (yan.y.zhao@intel.com) wrote:
> >>>>>> Hi Kirti,
> >>>>>> There are still unaddressed comments to your patches v4.
> >>>>>> Would you mind addressing them?
> >>>>>>
> >>>>>> 1. should we register two migration interfaces simultaneously
> >>>>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04750.html)
> >>>>>
> >>>>> Please don't do this.
> >>>>> As far as I'm aware we currently only have one device that does that
> >>>>> (vmxnet3) and a patch has just been posted that fixes/removes that.
> >>>>>
> >>>>> Dave
> >>>>>
> >>>> hi Dave,
> >>>> Thanks for notifying this. but if we want to support postcopy in future,
> >>>> after device stops, what interface could we use to transfer data of
> >>>> device state only?
> >>>> for postcopy, when source device stops, we need to transfer only
> >>>> necessary device state to target vm before target vm starts, and we
> >>>> don't want to transfer device memory as we'll do that after target vm
> >>>> resuming.
> >>>
> >>> Hmm ok, lets see; that's got to happen in the call to:
> >>>     qemu_savevm_state_complete_precopy(fb, false, false);
> >>> that's made from postcopy_start.
> >>>  (the false's are iterable_only and inactivate_disks)
> >>>
> >>> and at that time I believe the state is POSTCOPY_ACTIVE, so in_postcopy
> >>> is true.
> >>>
> >>> If you're doing postcopy, then you'll probably define a has_postcopy()
> >>> function, so qemu_savevm_state_complete_precopy will skip the
> >>> save_live_complete_precopy call from its loop for at least two of the
> >>> reasons in its big if.
> >>>
> >>> So you're right; you need the VMSD for this to happen in the second
> >>> loop in qemu_savevm_state_complete_precopy.  Hmm.
> >>>
> >>> Now, what worries me, and I don't know the answer, is how the section
> >>> header for the vmstate and the section header for an iteration look
> >>> on the stream; how are they different?
> >>>
> >>
> >> I don't have a way to test postcopy migration, which is one of the major
> >> reasons I did not include postcopy support in this patchset, as clearly
> >> called out in the cover letter.
> >> This patchset is thoroughly tested for precopy migration.
> >> If anyone has hardware that supports faulting, I would prefer to add
> >> postcopy support as an incremental change later, which can be tested
> >> before submitting.
> >>
> >> Just a suggestion, instead of using VMSD, is it possible to have some
> >> additional check to call save_live_complete_precopy from
> >> qemu_savevm_state_complete_precopy?
> >>
> >>
> >>>>
> >>>>>> 2. in each save iteration, how much data is to be saved
> >>>>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04683.html)
> >>
> >>> how big is the data_size ?
> >>> if this size is too big, it may take too much time and block others.
> >>
> >> I did mention this in the comment about the structure in the vfio.h
> >> header. data_size will be provided by the vendor driver and obviously
> >> will not be greater than the migration region size. The vendor driver
> >> is responsible for keeping its solution optimized.
> >>
> > If the data_size is no bigger than the migration region size, and each
> > iteration only saves data_size bytes, I'm afraid it will cause
> > prolonged down time. After all, the migration region size cannot be very
> > big.
> 
> As I mentioned above, it is the vendor driver's responsibility to keep its
> solution optimized.
> A well-behaved vendor driver should not cause unnecessarily prolonged
> down time.
>
I think the vendor driver can determine the data_size, but QEMU has to
decide how much data to transmit according to the actual transmission time
(or transmission speed): when the speed is high, transmit more data per
iteration; when it is low, transmit less. This timing knowledge is not
available to the vendor driver, and even if it had it, could it
dynamically change the data region size?
Maybe you would say the vendor driver can register a big enough region and
dynamically use part of it, but what is "big enough", and it really costs
memory.

> > Also, if vendor driver determines how much data to save in each
> > iteration alone, and no checks in qemu, it may cause other devices'
> > migration time be squeezed.
> > 
> 
> Sorry, how will that squeeze other device's migration time?
> 
If a vendor driver has extremely big data to transmit, other devices have
to wait until it finishes before transmitting their own data. In a given
time period, that can be seen as their time slots being squeezed.
I'm not sure whether that is a good practice.

> >>
> >>>>>> 3. do we need extra interface to get data for device state only
> >>>>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04812.html)
> >>
> >> I don't think so. Opaque device data from the vendor driver can include
> >> device state and device memory. The vendor driver managing the device
> >> can decide how to place data on the stream.
> >>
> > I know the current design is opaque device data. Then, to support
> > postcopy, we may have to add extra device state like in-postcopy, but
> > postcopy is more of a QEMU state than a device state.
> 
> One bit from device_state can be used to inform vendor driver about
> postcopy, when postcopy support will be added.
>
OK. If you insist on that, one bit in device_state is also fine, as long
as everyone agrees on it :)

> > to address it elegantly, may we add an extra interface besides
> > vfio_save_buffer() to get data for device state only?
> > 
> 
> When in postcopy state, based on device_state flag, vendor driver can
> copy device state first in migration region, I still think there is no
> need to separate device state and device memory.
>
It's the difference between more device_state flags and more interfaces.

> >>>>>> 4. definition of dirty page copied_pfn
> >>>>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg05592.html)
> >>>>>>
> >>
> >> This was inline to discussion going with Alex. I addressed the concern
> >> there. Please check current patchset, which addresses the concerns raised.
> >>
> > OK. I saw you also updated the flow in that part; please check my comment
> > in that patch for details. But as a suggestion, I think processed_pfns is
> > a better name than copied_pfns :)
> > 
> 
> The vendor driver can process total_pfns, but may copy the bitmap of only
> some pfns to the migration region. One reason could be that the migration
> region cannot accommodate a bitmap of total_pfns size. So it could be:
> 0 < copied_pfns < total_pfns. That's why the name 'copied_pfns'. I'll
> continue with 'copied_pfns'.
>
So it's the bitmap's pfn count, right?
Besides VFIO_DEVICE_DIRTY_PFNS_NONE and VFIO_DEVICE_DIRTY_PFNS_ALL, which
indicate no dirty and all dirty pages in total_pfns, why not add two more
flags, e.g.
VFIO_DEVICE_DIRTY_PFNS_ONE_ITERATION and
VFIO_DEVICE_DIRTY_PFNS_ALL_ONE_ITERATION, to skip copying the bitmap from
the kernel when the bitmap copied in one iteration is all 0s or all 1s?

Thanks
Yan
> 
> >>>>>> Also, I'm glad to see that you updated the code following my comments below,
> >>>>>> but please don't forget to reply to my comments next time :)
> >>
> >> I tried to reply at the top of the threads and addressed the common
> >> concerns raised there. Sorry if I missed any; I'll make sure to point
> >> you to my replies going forward.
> >>
> > ok. let's cooperate:)
> > 
> > Thanks
> > Yan
> > 
> >> Thanks,
> >> Kirti
> >>
> >>>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg05357.html
> >>>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg06454.html
> >>>>>>
> >>>>>> Thanks
> >>>>>> Yan
> >>>>>>
> >>>>>> On Tue, Jul 09, 2019 at 05:49:07PM +0800, Kirti Wankhede wrote:
> >>>>>>> Add migration support for VFIO device
> >>>>>>>
> >>>>>>> This Patch set include patches as below:
> >>>>>>> - Define KABI for VFIO device for migration support.
> >>>>>>> - Added save and restore functions for PCI configuration space
> >>>>>>> - Generic migration functionality for VFIO device.
> >>>>>>>   * This patch set adds functionality only for PCI devices, but can be
> >>>>>>>     extended to other VFIO devices.
> >>>>>>>   * Added all the basic functions required for pre-copy, stop-and-copy and
> >>>>>>>     resume phases of migration.
> >>>>>>>   * Added state change notifier, and from that notifier function, the
> >>>>>>>     VFIO device's state change is conveyed to the VFIO device driver.
> >>>>>>>   * During save setup phase and resume/load setup phase, migration region
> >>>>>>>     is queried and is used to read/write VFIO device data.
> >>>>>>>   * .save_live_pending and .save_live_iterate are implemented to use QEMU's
> >>>>>>>     functionality of iteration during pre-copy phase.
> >>>>>>>   * In .save_live_complete_precopy, that is, in the stop-and-copy phase,
> >>>>>>>     iteration to read data from the VFIO device driver is implemented
> >>>>>>>     until the pending bytes returned by the driver reach zero.
> >>>>>>>   * Added function to get dirty pages bitmap for the pages which are used by
> >>>>>>>     driver.
> >>>>>>> - Add vfio_listerner_log_sync to mark dirty pages.
> >>>>>>> - Make VFIO PCI device migration capable. If migration region is not provided by
> >>>>>>>   driver, migration is blocked.
> >>>>>>>
> >>>>>>> Below is the flow of state change for live migration where states in brackets
> >>>>>>> represent VM state, migration state and VFIO device state as:
> >>>>>>>     (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)
> >>>>>>>
> >>>>>>> Live migration save path:
> >>>>>>>         QEMU normal running state
> >>>>>>>         (RUNNING, _NONE, _RUNNING)
> >>>>>>>                         |
> >>>>>>>     migrate_init spawns migration_thread.
> >>>>>>>     (RUNNING, _SETUP, _RUNNING|_SAVING)
> >>>>>>>     Migration thread then calls each device's .save_setup()
> >>>>>>>                         |
> >>>>>>>     (RUNNING, _ACTIVE, _RUNNING|_SAVING)
> >>>>>>>     If device is active, get pending bytes by .save_live_pending()
> >>>>>>>     if pending bytes >= threshold_size,  call save_live_iterate()
> >>>>>>>     Data of VFIO device for pre-copy phase is copied.
> >>>>>>>     Iterate till pending bytes converge and are less than threshold
> >>>>>>>                         |
> >>>>>>>     On migration completion, vCPUs stops and calls .save_live_complete_precopy
> >>>>>>>     for each active device. VFIO device is then transitioned in
> >>>>>>>      _SAVING state.
> >>>>>>>     (FINISH_MIGRATE, _DEVICE, _SAVING)
> >>>>>>>     For VFIO device, iterate in  .save_live_complete_precopy  until
> >>>>>>>     pending data is 0.
> >>>>>>>     (FINISH_MIGRATE, _DEVICE, _STOPPED)
> >>>>>>>                         |
> >>>>>>>     (FINISH_MIGRATE, _COMPLETED, STOPPED)
> >>>>>>>     Migration thread schedules cleanup bottom half and exits
> >>>>>>>
> >>>>>>> Live migration resume path:
> >>>>>>>     Incoming migration calls .load_setup for each device
> >>>>>>>     (RESTORE_VM, _ACTIVE, STOPPED)
> >>>>>>>                         |
> >>>>>>>     For each device, .load_state is called for that device section data
> >>>>>>>                         |
> >>>>>>>     At the end, called .load_cleanup for each device and vCPUs are started.
> >>>>>>>                         |
> >>>>>>>         (RUNNING, _NONE, _RUNNING)
> >>>>>>>
> >>>>>>> Note that:
> >>>>>>> - Migration post copy is not supported.
> >>>>>>>
> >>>>>>> v6 -> v7:
> >>>>>>> - Fix build failures.
> >>>>>>>
> >>>>>>> v5 -> v6:
> >>>>>>> - Fix build failure.
> >>>>>>>
> >>>>>>> v4 -> v5:
> >>>>>>> - Added a descriptive comment about the sequence of access of members of
> >>>>>>>   struct vfio_device_migration_info to be followed, based on Alex's suggestion
> >>>>>>> - Updated get dirty pages sequence.
> >>>>>>> - As per Cornelia Huck's suggestion, added callbacks to VFIODeviceOps to
> >>>>>>>   get_object, save_config and load_config.
> >>>>>>> - Fixed multiple nit picks.
> >>>>>>> - Tested live migration with multiple vfio device assigned to a VM.
> >>>>>>>
> >>>>>>> v3 -> v4:
> >>>>>>> - Added one more bit for _RESUMING flag to be set explicitly.
> >>>>>>> - data_offset field is read-only for user space application.
> >>>>>>> - data_size is read on every iteration before reading data from the
> >>>>>>>   migration region; this removes the assumption that data extends to the
> >>>>>>>   end of the migration region.
> >>>>>>> - If the vendor driver supports mappable sparse regions, map those regions
> >>>>>>>   during the setup state of save/load; similarly, unmap them in the cleanup
> >>>>>>>   routines.
> >>>>>>> - Handled a race condition that caused data corruption in the migration
> >>>>>>>   region during device state save, by adding a mutex and serializing the
> >>>>>>>   save_buffer and get_dirty_pages routines.
> >>>>>>> - Skipped calling the get_dirty_pages routine for the device's mapped MMIO
> >>>>>>>   region.
> >>>>>>> - Added trace events.
> >>>>>>> - Split into multiple functional patches.
> >>>>>>>
> >>>>>>> v2 -> v3:
> >>>>>>> - Removed enum of VFIO device states. Defined VFIO device state with 2 bits.
> >>>>>>> - Re-structured vfio_device_migration_info to keep it minimal and defined action
> >>>>>>>   on read and write access on its members.
> >>>>>>>
> >>>>>>> v1 -> v2:
> >>>>>>> - Defined MIGRATION region type and sub-type which should be used with region
> >>>>>>>   type capability.
> >>>>>>> - Re-structured vfio_device_migration_info. This structure will be placed at 0th
> >>>>>>>   offset of migration region.
> >>>>>>> - Replaced ioctl with read/write for trapped part of migration region.
> >>>>>>> - Added both type of access support, trapped or mmapped, for data section of the
> >>>>>>>   region.
> >>>>>>> - Moved PCI device functions to pci file.
> >>>>>>> - Added iteration to get dirty page bitmap until bitmap for all requested pages
> >>>>>>>   are copied.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Kirti
> >>>>>>>
> >>>>>>> Kirti Wankhede (13):
> >>>>>>>   vfio: KABI for migration interface
> >>>>>>>   vfio: Add function to unmap VFIO region
> >>>>>>>   vfio: Add vfio_get_object callback to VFIODeviceOps
> >>>>>>>   vfio: Add save and load functions for VFIO PCI devices
> >>>>>>>   vfio: Add migration region initialization and finalize function
> >>>>>>>   vfio: Add VM state change handler to know state of VM
> >>>>>>>   vfio: Add migration state change notifier
> >>>>>>>   vfio: Register SaveVMHandlers for VFIO device
> >>>>>>>   vfio: Add save state functions to SaveVMHandlers
> >>>>>>>   vfio: Add load state functions to SaveVMHandlers
> >>>>>>>   vfio: Add function to get dirty page list
> >>>>>>>   vfio: Add vfio_listerner_log_sync to mark dirty pages
> >>>>>>>   vfio: Make vfio-pci device migration capable.
> >>>>>>>
> >>>>>>>  hw/vfio/Makefile.objs         |   2 +-
> >>>>>>>  hw/vfio/common.c              |  55 +++
> >>>>>>>  hw/vfio/migration.c           | 874 ++++++++++++++++++++++++++++++++++++++++++
> >>>>>>>  hw/vfio/pci.c                 | 137 ++++++-
> >>>>>>>  hw/vfio/trace-events          |  19 +
> >>>>>>>  include/hw/vfio/vfio-common.h |  25 ++
> >>>>>>>  linux-headers/linux/vfio.h    | 166 ++++++++
> >>>>>>>  7 files changed, 1271 insertions(+), 7 deletions(-)
> >>>>>>>  create mode 100644 hw/vfio/migration.c
> >>>>>>>
> >>>>>>> -- 
> >>>>>>> 2.7.0
> >>>>>>>
> >>>>> --
> >>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >>> --
> >>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >>>
> > 


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 11/13] vfio: Add function to get dirty page list
  2019-07-18 18:39     ` Kirti Wankhede
@ 2019-07-19  1:24       ` Yan Zhao
  0 siblings, 0 replies; 77+ messages in thread
From: Yan Zhao @ 2019-07-19  1:24 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Fri, Jul 19, 2019 at 02:39:10AM +0800, Kirti Wankhede wrote:
> 
> 
> On 7/12/2019 6:03 AM, Yan Zhao wrote:
> > On Tue, Jul 09, 2019 at 05:49:18PM +0800, Kirti Wankhede wrote:
> >> Dirty page tracking (.log_sync) is part of the RAM copying state: the
> >> vendor driver provides, through the migration region, the bitmap of
> >> pages it has dirtied, and as part of the RAM copy those pages get
> >> copied to the file stream.
> >>
> >> To get dirty page bitmap:
> >> - write start address, page_size and pfn count.
> >> - read count of pfns copied.
> >>     - Vendor driver should return 0 if driver doesn't have any page to
> >>       report dirty in given range.
> >>     - Vendor driver should return -1 to mark all pages dirty for given range.
> >> - read data_offset, where vendor driver has written bitmap.
> >> - read bitmap from the region or mmaped part of the region.
> >> - Iterate the above steps until bitmaps for all requested pfns are copied.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>  hw/vfio/migration.c           | 123 ++++++++++++++++++++++++++++++++++++++++++
> >>  hw/vfio/trace-events          |   1 +
> >>  include/hw/vfio/vfio-common.h |   2 +
> >>  3 files changed, 126 insertions(+)
> >>
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >> index 5fb4c5329ede..ca1a8c0f5f1f 100644
> >> --- a/hw/vfio/migration.c
> >> +++ b/hw/vfio/migration.c
> >> @@ -269,6 +269,129 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> >>      return qemu_file_get_error(f);
> >>  }
> >>  
> >> +void vfio_get_dirty_page_list(VFIODevice *vbasedev,
> >> +                              uint64_t start_pfn,
> >> +                              uint64_t pfn_count,
> >> +                              uint64_t page_size)
> >> +{
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    VFIORegion *region = &migration->region.buffer;
> >> +    uint64_t count = 0;
> >> +    int64_t copied_pfns = 0;
> >> +    int64_t total_pfns = pfn_count;
> >> +    int ret;
> >> +
> >> +    qemu_mutex_lock(&migration->lock);
> >> +
> >> +    while (total_pfns > 0) {
> >> +        uint64_t bitmap_size, data_offset = 0;
> >> +        uint64_t start = start_pfn + count;
> >> +        void *buf = NULL;
> >> +        bool buffer_mmaped = false;
> >> +
> >> +        ret = pwrite(vbasedev->fd, &start, sizeof(start),
> >> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> >> +                                              start_pfn));
> >> +        if (ret < 0) {
> >> +            error_report("%s: Failed to set dirty pages start address %d %s",
> >> +                         vbasedev->name, ret, strerror(errno));
> >> +            goto dpl_unlock;
> >> +        }
> >> +
> >> +        ret = pwrite(vbasedev->fd, &page_size, sizeof(page_size),
> >> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> >> +                                              page_size));
> >> +        if (ret < 0) {
> >> +            error_report("%s: Failed to set dirty page size %d %s",
> >> +                         vbasedev->name, ret, strerror(errno));
> >> +            goto dpl_unlock;
> >> +        }
> >> +
> >> +        ret = pwrite(vbasedev->fd, &total_pfns, sizeof(total_pfns),
> >> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> >> +                                              total_pfns));
> >> +        if (ret < 0) {
> >> +            error_report("%s: Failed to set dirty page total pfns %d %s",
> >> +                         vbasedev->name, ret, strerror(errno));
> >> +            goto dpl_unlock;
> >> +        }
> >> +
> >> +        /* Read copied dirty pfns */
> >> +        ret = pread(vbasedev->fd, &copied_pfns, sizeof(copied_pfns),
> >> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> >> +                                             copied_pfns));
> >> +        if (ret < 0) {
> >> +            error_report("%s: Failed to get dirty pages bitmap count %d %s",
> >> +                         vbasedev->name, ret, strerror(errno));
> >> +            goto dpl_unlock;
> >> +        }
> >> +
> >> +        if (copied_pfns == VFIO_DEVICE_DIRTY_PFNS_NONE) {
> >> +            /*
> >> +             * copied_pfns could be 0 if driver doesn't have any page to
> >> +             * report dirty in given range
> >> +             */
> >> +            break;
> >> +        } else if (copied_pfns == VFIO_DEVICE_DIRTY_PFNS_ALL) {
> >> +            /* Mark all pages dirty for this range */
> >> +            cpu_physical_memory_set_dirty_range(start_pfn * page_size,
> >> +                                                pfn_count * page_size,
> >> +                                                DIRTY_MEMORY_MIGRATION);
> > seems pfn_count here is not right
> 
> Changing it to total_pfns in next version
>
If it's total_pfns, then it cannot be in the loop, right?

Thanks
Yan

> Thanks,
> Kirti
> 
> >> +            break;
> >> +        }
> >> +
> >> +        bitmap_size = (BITS_TO_LONGS(copied_pfns) + 1) * sizeof(unsigned long);
> >> +
> >> +        ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> >> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> >> +                                             data_offset));
> >> +        if (ret != sizeof(data_offset)) {
> >> +            error_report("%s: Failed to get migration buffer data offset %d",
> >> +                         vbasedev->name, ret);
> >> +            goto dpl_unlock;
> >> +        }
> >> +
> >> +        if (region->mmaps) {
> >> +            buf = find_data_region(region, data_offset, bitmap_size);
> >> +        }
> >> +
> >> +        buffer_mmaped = (buf != NULL) ? true : false;
> >> +
> >> +        if (!buffer_mmaped) {
> >> +            buf = g_try_malloc0(bitmap_size);
> >> +            if (!buf) {
> >> +                error_report("%s: Error allocating buffer ", __func__);
> >> +                goto dpl_unlock;
> >> +            }
> >> +
> >> +            ret = pread(vbasedev->fd, buf, bitmap_size,
> >> +                        region->fd_offset + data_offset);
> >> +            if (ret != bitmap_size) {
> >> +                error_report("%s: Failed to get dirty pages bitmap %d",
> >> +                             vbasedev->name, ret);
> >> +                g_free(buf);
> >> +                goto dpl_unlock;
> >> +            }
> >> +        }
> >> +
> >> +        cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
> >> +                                               (start_pfn + count) * page_size,
> >> +                                                copied_pfns);
> >> +        count      += copied_pfns;
> >> +        total_pfns -= copied_pfns;
> >> +
> >> +        if (!buffer_mmaped) {
> >> +            g_free(buf);
> >> +        }
> >> +    }
> >> +
> >> +    trace_vfio_get_dirty_page_list(vbasedev->name, start_pfn, pfn_count,
> >> +                                   page_size);
> >> +
> >> +dpl_unlock:
> >> +    qemu_mutex_unlock(&migration->lock);
> >> +}
> >> +
> >>  /* ---------------------------------------------------------------------- */
> >>  
> >>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> >> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> >> index ac065b559f4e..414a5e69ec5e 100644
> >> --- a/hw/vfio/trace-events
> >> +++ b/hw/vfio/trace-events
> >> @@ -160,3 +160,4 @@ vfio_save_complete_precopy(char *name) " (%s)"
> >>  vfio_load_device_config_state(char *name) " (%s)"
> >>  vfio_load_state(char *name, uint64_t data) " (%s) data 0x%"PRIx64
> >>  vfio_load_state_device_data(char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
> >> +vfio_get_dirty_page_list(char *name, uint64_t start, uint64_t pfn_count, uint64_t page_size) " (%s) start 0x%"PRIx64" pfn_count 0x%"PRIx64 " page size 0x%"PRIx64
> >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >> index a022484d2636..dc1b83a0b4ef 100644
> >> --- a/include/hw/vfio/vfio-common.h
> >> +++ b/include/hw/vfio/vfio-common.h
> >> @@ -222,5 +222,7 @@ int vfio_spapr_remove_window(VFIOContainer *container,
> >>  
> >>  int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
> >>  void vfio_migration_finalize(VFIODevice *vbasedev);
> >> +void vfio_get_dirty_page_list(VFIODevice *vbasedev, uint64_t start_pfn,
> >> +                               uint64_t pfn_count, uint64_t page_size);
> >>  
> >>  #endif /* HW_VFIO_VFIO_COMMON_H */
> >> -- 
> >> 2.7.0
> >>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 10/13] vfio: Add load state functions to SaveVMHandlers
  2019-07-18 19:00     ` Kirti Wankhede
@ 2019-07-22  3:20       ` Yan Zhao
  2019-07-22 19:07         ` Alex Williamson
  0 siblings, 1 reply; 77+ messages in thread
From: Yan Zhao @ 2019-07-22  3:20 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye, Ken.Xue,
	Zhengxiao.zx, shuangtai.tst, qemu-devel, dgilbert, pasic, aik,
	alex.williamson, eauger, cohuck, jonathan.davies, felipe,
	mlevitsk, Liu, Changpeng, Wang, Zhi A

On Fri, Jul 19, 2019 at 03:00:13AM +0800, Kirti Wankhede wrote:
> 
> 
> On 7/12/2019 8:22 AM, Yan Zhao wrote:
> > On Tue, Jul 09, 2019 at 05:49:17PM +0800, Kirti Wankhede wrote:
> >> Flow during _RESUMING device state:
> >> - If Vendor driver defines mappable region, mmap migration region.
> >> - Load config state.
> >> - For each data packet, until VFIO_MIG_FLAG_END_OF_STATE is reached:
> >>     - read data_size from packet, read buffer of data_size
> >>     - read data_offset from where QEMU should write data.
> >>         if region is mmaped, write data of data_size to mmaped region.
> >>     - write data_size.
> >>         In the case of an mmapped region, the write to data_size indicates
> >>         to the kernel driver that data has been written to the staging buffer.
> >>     - if region is trapped, pwrite() data of data_size from data_offset.
> >> - Repeat above until VFIO_MIG_FLAG_END_OF_STATE.
> >> - Unmap migration region.
> >>
> >> For user, data is opaque. User should write data in the same order as
> >> received.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>  hw/vfio/migration.c  | 162 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >>  hw/vfio/trace-events |   3 +
> >>  2 files changed, 165 insertions(+)
> >>
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >> index 4e9b4cce230b..5fb4c5329ede 100644
> >> --- a/hw/vfio/migration.c
> >> +++ b/hw/vfio/migration.c
> >> @@ -249,6 +249,26 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> >>      return qemu_file_get_error(f);
> >>  }
> >>  
> >> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +    uint64_t data;
> >> +
> >> +    if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
> >> +        vbasedev->ops->vfio_load_config(vbasedev, f);
> >> +    }
> >> +
> >> +    data = qemu_get_be64(f);
> >> +    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
> >> +        error_report("%s: Failed loading device config space, "
> >> +                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
> >> +        return -EINVAL;
> >> +    }
> >> +
> >> +    trace_vfio_load_device_config_state(vbasedev->name);
> >> +    return qemu_file_get_error(f);
> >> +}
> >> +
> >>  /* ---------------------------------------------------------------------- */
> >>  
> >>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> >> @@ -421,12 +441,154 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> >>      return ret;
> >>  }
> >>  
> >> +static int vfio_load_setup(QEMUFile *f, void *opaque)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    int ret = 0;
> >> +
> >> +    if (migration->region.buffer.mmaps) {
> >> +        ret = vfio_region_mmap(&migration->region.buffer);
> >> +        if (ret) {
> >> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> >> +                         vbasedev->name, migration->region.index,
> >> +                         strerror(-ret));
> >> +            return ret;
> >> +        }
> >> +    }
> >> +
> >> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING);
> >> +    if (ret) {
> >> +        error_report("%s: Failed to set state RESUMING", vbasedev->name);
> >> +    }
> >> +    return ret;
> >> +}
> >> +
> >> +static int vfio_load_cleanup(void *opaque)
> >> +{
> >> +    vfio_save_cleanup(opaque);
> >> +    return 0;
> >> +}
> >> +
> >> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    int ret = 0;
> >> +    uint64_t data, data_size;
> >> +
> > I think checking of version_id is still needed.
> > 
> 
> Checking version_id with what value?
>
The version_id passed in here is the source's VFIO software interface id.
It needs to be checked against the value on the target side, right?

Though we previously discussed a sysfs node interface for checking the live
migration version even before launching live migration, I think we still
need this runtime software version check in QEMU to ensure the VFIO
software interfaces on both sides are compatible.

Thanks
Yan



* Re: [Qemu-devel] [PATCH v7 06/13] vfio: Add VM state change handler to know state of VM
  2019-07-11 19:14     ` Kirti Wankhede
@ 2019-07-22  8:23       ` Yan Zhao
  2019-08-20 20:31         ` Kirti Wankhede
  0 siblings, 1 reply; 77+ messages in thread
From: Yan Zhao @ 2019-07-22  8:23 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, Dr. David Alan Gilbert, Wang,
	Zhi A, mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Fri, Jul 12, 2019 at 03:14:03AM +0800, Kirti Wankhede wrote:
> 
> 
> On 7/11/2019 5:43 PM, Dr. David Alan Gilbert wrote:
> > * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> >> VM state change handler gets called on change in VM's state. This is used to set
> >> VFIO device state to _RUNNING.
> >> VM state change handler, migration state change handler and log_sync listener
> >> are called asynchronously, which sometimes lead to data corruption in migration
> >> region. Initialised mutex that is used to serialize operations on migration data
> >> region during saving state.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>  hw/vfio/migration.c           | 64 +++++++++++++++++++++++++++++++++++++++++++
> >>  hw/vfio/trace-events          |  2 ++
> >>  include/hw/vfio/vfio-common.h |  4 +++
> >>  3 files changed, 70 insertions(+)
> >>
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >> index a2cfbd5af2e1..c01f08b659d0 100644
> >> --- a/hw/vfio/migration.c
> >> +++ b/hw/vfio/migration.c
> >> @@ -78,6 +78,60 @@ err:
> >>      return ret;
> >>  }
> >>  
> >> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
> >> +{
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    VFIORegion *region = &migration->region.buffer;
> >> +    uint32_t device_state;
> >> +    int ret = 0;
> >> +
> >> +    device_state = (state & VFIO_DEVICE_STATE_MASK) |
> >> +                   (vbasedev->device_state & ~VFIO_DEVICE_STATE_MASK);
> >> +
> >> +    if ((device_state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_INVALID) {
> >> +        return -EINVAL;
> >> +    }
> >> +
> >> +    ret = pwrite(vbasedev->fd, &device_state, sizeof(device_state),
> >> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> >> +                                              device_state));
> >> +    if (ret < 0) {
> >> +        error_report("%s: Failed to set device state %d %s",
> >> +                     vbasedev->name, ret, strerror(errno));
> >> +        return ret;
> >> +    }
> >> +
> >> +    vbasedev->device_state = device_state;
> >> +    trace_vfio_migration_set_state(vbasedev->name, device_state);
> >> +    return 0;
> >> +}
> >> +
> >> +static void vfio_vmstate_change(void *opaque, int running, RunState state)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +
> >> +    if ((vbasedev->vm_running != running)) {
> >> +        int ret;
> >> +        uint32_t dev_state;
> >> +
> >> +        if (running) {
> >> +            dev_state = VFIO_DEVICE_STATE_RUNNING;
> >> +        } else {
> >> +            dev_state = (vbasedev->device_state & VFIO_DEVICE_STATE_MASK) &
> >> +                     ~VFIO_DEVICE_STATE_RUNNING;
> >> +        }
> >> +
> >> +        ret = vfio_migration_set_state(vbasedev, dev_state);
> >> +        if (ret) {
> >> +            error_report("%s: Failed to set device state 0x%x",
> >> +                         vbasedev->name, dev_state);
> >> +        }
> >> +        vbasedev->vm_running = running;
> >> +        trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
> >> +                                  dev_state);
> >> +    }
> >> +}
> >> +
> >>  static int vfio_migration_init(VFIODevice *vbasedev,
> >>                                 struct vfio_region_info *info)
> >>  {
> >> @@ -93,6 +147,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
> >>          return ret;
> >>      }
> >>  
> >> +    qemu_mutex_init(&vbasedev->migration->lock);
> > 
> > Does this and it's friend below belong in this patch?  As far as I can
> > tell you init/deinit the lock here but don't use it which is strange.
> > 
> 
> This lock is used in
> 0009-vfio-Add-save-state-functions-to-SaveVMHandlers.patch and
> 0011-vfio-Add-function-to-get-dirty-page-list.patch
> 
> Hm. I'll move this init/deinit to patch 0009 in next iteration.
> 
> Thanks,
> Kirti
>
This lock is used to protect vfio_save_buffer() and the dirty page read,
right?
If the data subregion and the bitmap subregion did not reuse "data_offset"
in vfio_device_migration_info, it seems this lock could be avoided.

Thanks
Yan


> 
> > Dave
> > 
> >> +    vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
> >> +                                                          vbasedev);
> >> +
> >>      return 0;
> >>  }
> >>  
> >> @@ -135,11 +194,16 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
> >>          return;
> >>      }
> >>  
> >> +    if (vbasedev->vm_state) {
> >> +        qemu_del_vm_change_state_handler(vbasedev->vm_state);
> >> +    }
> >> +
> >>      if (vbasedev->migration_blocker) {
> >>          migrate_del_blocker(vbasedev->migration_blocker);
> >>          error_free(vbasedev->migration_blocker);
> >>      }
> >>  
> >> +    qemu_mutex_destroy(&vbasedev->migration->lock);
> >>      vfio_migration_region_exit(vbasedev);
> >>      g_free(vbasedev->migration);
> >>  }
> >> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> >> index 191a726a1312..3d15bacd031a 100644
> >> --- a/hw/vfio/trace-events
> >> +++ b/hw/vfio/trace-events
> >> @@ -146,3 +146,5 @@ vfio_display_edid_write_error(void) ""
> >>  
> >>  # migration.c
> >>  vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
> >> +vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
> >> +vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >> index 152da3f8d6f3..f6c70db3a9c1 100644
> >> --- a/include/hw/vfio/vfio-common.h
> >> +++ b/include/hw/vfio/vfio-common.h
> >> @@ -29,6 +29,7 @@
> >>  #ifdef CONFIG_LINUX
> >>  #include <linux/vfio.h>
> >>  #endif
> >> +#include "sysemu/sysemu.h"
> >>  
> >>  #define VFIO_MSG_PREFIX "vfio %s: "
> >>  
> >> @@ -124,6 +125,9 @@ typedef struct VFIODevice {
> >>      unsigned int flags;
> >>      VFIOMigration *migration;
> >>      Error *migration_blocker;
> >> +    uint32_t device_state;
> >> +    VMChangeStateEntry *vm_state;
> >> +    int vm_running;
> >>  } VFIODevice;
> >>  
> >>  struct VFIODeviceOps {
> >> -- 
> >> 2.7.0
> >>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 



* Re: [Qemu-devel] [PATCH v7 08/13] vfio: Register SaveVMHandlers for VFIO device
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 08/13] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
@ 2019-07-22  8:34   ` Yan Zhao
  2019-08-20 20:33     ` Kirti Wankhede
  0 siblings, 1 reply; 77+ messages in thread
From: Yan Zhao @ 2019-07-22  8:34 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Tue, Jul 09, 2019 at 05:49:15PM +0800, Kirti Wankhede wrote:
> Define flags to be used as delimiters in the migration file stream.
> Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
> region from these functions at source during saving or pre-copy phase.
> Set VFIO device state depending on VM's state. During live migration, VM is
> running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
> device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c  | 82 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  hw/vfio/trace-events |  2 ++
>  2 files changed, 83 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index e4a89a6f9bc7..0597a45fda2d 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -23,6 +23,17 @@
>  #include "pci.h"
>  #include "trace.h"
>  
> +/*
> + * Flags used as delimiter:
> + * 0xffffffff => MSB 32-bit all 1s
> + * 0xef10     => emulated (virtual) function IO
> + * 0x0000     => 16-bits reserved for flags
> + */
> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> +
>  static void vfio_migration_region_exit(VFIODevice *vbasedev)
>  {
>      VFIOMigration *migration = vbasedev->migration;
> @@ -106,6 +117,74 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
>      return 0;
>  }
>  
> +/* ---------------------------------------------------------------------- */
> +
> +static int vfio_save_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> +
> +    if (migration->region.buffer.mmaps) {
> +        qemu_mutex_lock_iothread();
> +        ret = vfio_region_mmap(&migration->region.buffer);
> +        qemu_mutex_unlock_iothread();
> +        if (ret) {
> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> +                         vbasedev->name, migration->region.index,
> +                         strerror(-ret));
> +            return ret;
> +        }
> +    }
> +
> +    if (vbasedev->vm_running) {
> +        ret = vfio_migration_set_state(vbasedev,
> +                         VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING);
> +        if (ret) {
> +            error_report("%s: Failed to set state RUNNING and SAVING",
> +                         vbasedev->name);
> +            return ret;
> +        }
> +    } else {
hi Kirti
May I know under which condition this "else" case will happen?

Thanks
Yan

> +        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_SAVING);
> +        if (ret) {
> +            error_report("%s: Failed to set state STOP and SAVING",
> +                         vbasedev->name);
> +            return ret;
> +        }
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    trace_vfio_save_setup(vbasedev->name);
> +    return 0;
> +}
> +
> +static void vfio_save_cleanup(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    if (migration->region.buffer.mmaps) {
> +        vfio_region_unmap(&migration->region.buffer);
> +    }
> +    trace_vfio_save_cleanup(vbasedev->name);
> +}
> +
> +static SaveVMHandlers savevm_vfio_handlers = {
> +    .save_setup = vfio_save_setup,
> +    .save_cleanup = vfio_save_cleanup,
> +};
> +
> +/* ---------------------------------------------------------------------- */
> +
>  static void vfio_vmstate_change(void *opaque, int running, RunState state)
>  {
>      VFIODevice *vbasedev = opaque;
> @@ -195,7 +274,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>      }
>  
>      qemu_mutex_init(&vbasedev->migration->lock);
> -
> +    register_savevm_live(vbasedev->dev, "vfio", -1, 1, &savevm_vfio_handlers,
> +                         vbasedev);
>      vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>                                                            vbasedev);
>  
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 69503228f20e..4bb43f18f315 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -149,3 +149,5 @@ vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
>  vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
>  vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>  vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
> +vfio_save_setup(char *name) " (%s)"
> +vfio_save_cleanup(char *name) " (%s)"
> -- 
> 2.7.0
> 



* Re: [Qemu-devel] [PATCH v7 06/13] vfio: Add VM state change handler to know state of VM
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 06/13] vfio: Add VM state change handler to know state of VM Kirti Wankhede
  2019-07-11 12:13   ` Dr. David Alan Gilbert
  2019-07-16 22:03   ` Alex Williamson
@ 2019-07-22  8:37   ` Yan Zhao
  2019-08-20 20:33     ` Kirti Wankhede
  2 siblings, 1 reply; 77+ messages in thread
From: Yan Zhao @ 2019-07-22  8:37 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Tue, Jul 09, 2019 at 05:49:13PM +0800, Kirti Wankhede wrote:
> VM state change handler gets called on change in VM's state. This is used to set
> VFIO device state to _RUNNING.
> VM state change handler, migration state change handler and log_sync listener
> are called asynchronously, which sometimes lead to data corruption in migration
> region. Initialised mutex that is used to serialize operations on migration data
> region during saving state.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c           | 64 +++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |  2 ++
>  include/hw/vfio/vfio-common.h |  4 +++
>  3 files changed, 70 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index a2cfbd5af2e1..c01f08b659d0 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -78,6 +78,60 @@ err:
>      return ret;
>  }
>  
> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    uint32_t device_state;
> +    int ret = 0;
> +
> +    device_state = (state & VFIO_DEVICE_STATE_MASK) |
> +                   (vbasedev->device_state & ~VFIO_DEVICE_STATE_MASK);
> +
> +    if ((device_state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_INVALID) {
> +        return -EINVAL;
> +    }
> +
> +    ret = pwrite(vbasedev->fd, &device_state, sizeof(device_state),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              device_state));
> +    if (ret < 0) {
> +        error_report("%s: Failed to set device state %d %s",
> +                     vbasedev->name, ret, strerror(errno));
> +        return ret;
> +    }
> +
> +    vbasedev->device_state = device_state;
> +    trace_vfio_migration_set_state(vbasedev->name, device_state);
> +    return 0;
> +}
> +
> +static void vfio_vmstate_change(void *opaque, int running, RunState state)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    if ((vbasedev->vm_running != running)) {
> +        int ret;
> +        uint32_t dev_state;
> +
> +        if (running) {
> +            dev_state = VFIO_DEVICE_STATE_RUNNING;
should be
dev_state |= VFIO_DEVICE_STATE_RUNNING; ?

> +        } else {
> +            dev_state = (vbasedev->device_state & VFIO_DEVICE_STATE_MASK) &
> +                     ~VFIO_DEVICE_STATE_RUNNING;
> +        }
> +
> +        ret = vfio_migration_set_state(vbasedev, dev_state);
> +        if (ret) {
> +            error_report("%s: Failed to set device state 0x%x",
> +                         vbasedev->name, dev_state);
> +        }
> +        vbasedev->vm_running = running;
> +        trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
> +                                  dev_state);
> +    }
> +}
> +
>  static int vfio_migration_init(VFIODevice *vbasedev,
>                                 struct vfio_region_info *info)
>  {
> @@ -93,6 +147,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>          return ret;
>      }
>  
> +    qemu_mutex_init(&vbasedev->migration->lock);
> +
> +    vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
> +                                                          vbasedev);
> +
>      return 0;
>  }
>  
> @@ -135,11 +194,16 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
>          return;
>      }
>  
> +    if (vbasedev->vm_state) {
> +        qemu_del_vm_change_state_handler(vbasedev->vm_state);
> +    }
> +
>      if (vbasedev->migration_blocker) {
>          migrate_del_blocker(vbasedev->migration_blocker);
>          error_free(vbasedev->migration_blocker);
>      }
>  
> +    qemu_mutex_destroy(&vbasedev->migration->lock);
>      vfio_migration_region_exit(vbasedev);
>      g_free(vbasedev->migration);
>  }
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 191a726a1312..3d15bacd031a 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -146,3 +146,5 @@ vfio_display_edid_write_error(void) ""
>  
>  # migration.c
>  vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
> +vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
> +vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 152da3f8d6f3..f6c70db3a9c1 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -29,6 +29,7 @@
>  #ifdef CONFIG_LINUX
>  #include <linux/vfio.h>
>  #endif
> +#include "sysemu/sysemu.h"
>  
>  #define VFIO_MSG_PREFIX "vfio %s: "
>  
> @@ -124,6 +125,9 @@ typedef struct VFIODevice {
>      unsigned int flags;
>      VFIOMigration *migration;
>      Error *migration_blocker;
> +    uint32_t device_state;
> +    VMChangeStateEntry *vm_state;
> +    int vm_running;
>  } VFIODevice;
>  
>  struct VFIODeviceOps {
> -- 
> 2.7.0
> 



* Re: [Qemu-devel] [PATCH v7 11/13] vfio: Add function to get dirty page list
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 11/13] vfio: Add function to get dirty page list Kirti Wankhede
  2019-07-12  0:33   ` Yan Zhao
@ 2019-07-22  8:39   ` Yan Zhao
  2019-08-20 20:34     ` Kirti Wankhede
  1 sibling, 1 reply; 77+ messages in thread
From: Yan Zhao @ 2019-07-22  8:39 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Tue, Jul 09, 2019 at 05:49:18PM +0800, Kirti Wankhede wrote:
> Dirty page tracking (.log_sync) is part of RAM copying state, where
> vendor driver provides the bitmap of pages which are dirtied by vendor
> driver through migration region and as part of RAM copy, those pages
> gets copied to file stream.
> 
> To get dirty page bitmap:
> - write start address, page_size and pfn count.
> - read count of pfns copied.
>     - Vendor driver should return 0 if driver doesn't have any page to
>       report dirty in given range.
>     - Vendor driver should return -1 to mark all pages dirty for given range.
> - read data_offset, where vendor driver has written bitmap.
> - read bitmap from the region or mmaped part of the region.
> - Iterate above steps till page bitmap for all requested pfns are copied.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c           | 123 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |   1 +
>  include/hw/vfio/vfio-common.h |   2 +
>  3 files changed, 126 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 5fb4c5329ede..ca1a8c0f5f1f 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -269,6 +269,129 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>      return qemu_file_get_error(f);
>  }
>  
> +void vfio_get_dirty_page_list(VFIODevice *vbasedev,
> +                              uint64_t start_pfn,
> +                              uint64_t pfn_count,
> +                              uint64_t page_size)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region.buffer;
> +    uint64_t count = 0;
> +    int64_t copied_pfns = 0;
> +    int64_t total_pfns = pfn_count;
> +    int ret;
> +
> +    qemu_mutex_lock(&migration->lock);
> +
> +    while (total_pfns > 0) {
> +        uint64_t bitmap_size, data_offset = 0;
> +        uint64_t start = start_pfn + count;
> +        void *buf = NULL;
> +        bool buffer_mmaped = false;
> +
> +        ret = pwrite(vbasedev->fd, &start, sizeof(start),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              start_pfn));
> +        if (ret < 0) {
> +            error_report("%s: Failed to set dirty pages start address %d %s",
> +                         vbasedev->name, ret, strerror(errno));
> +            goto dpl_unlock;
> +        }
> +
> +        ret = pwrite(vbasedev->fd, &page_size, sizeof(page_size),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              page_size));
> +        if (ret < 0) {
> +            error_report("%s: Failed to set dirty page size %d %s",
> +                         vbasedev->name, ret, strerror(errno));
> +            goto dpl_unlock;
> +        }
> +
> +        ret = pwrite(vbasedev->fd, &total_pfns, sizeof(total_pfns),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              total_pfns));
> +        if (ret < 0) {
> +            error_report("%s: Failed to set dirty page total pfns %d %s",
> +                         vbasedev->name, ret, strerror(errno));
> +            goto dpl_unlock;
> +        }
> +
> +        /* Read copied dirty pfns */
> +        ret = pread(vbasedev->fd, &copied_pfns, sizeof(copied_pfns),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             copied_pfns));
> +        if (ret < 0) {
> +            error_report("%s: Failed to get dirty pages bitmap count %d %s",
> +                         vbasedev->name, ret, strerror(errno));
> +            goto dpl_unlock;
> +        }
> +
> +        if (copied_pfns == VFIO_DEVICE_DIRTY_PFNS_NONE) {
> +            /*
> +             * copied_pfns could be 0 if driver doesn't have any page to
> +             * report dirty in given range
> +             */
> +            break;
> +        } else if (copied_pfns == VFIO_DEVICE_DIRTY_PFNS_ALL) {
> +            /* Mark all pages dirty for this range */
> +            cpu_physical_memory_set_dirty_range(start_pfn * page_size,
> +                                                pfn_count * page_size,
> +                                                DIRTY_MEMORY_MIGRATION);
> +            break;
> +        }
> +
> +        bitmap_size = (BITS_TO_LONGS(copied_pfns) + 1) * sizeof(unsigned long);
hi Kirti

why is bitmap_size
(BITS_TO_LONGS(copied_pfns) + 1) * sizeof(unsigned long)
and not
BITS_TO_LONGS(copied_pfns) * sizeof(unsigned long) ?

Thanks
Yan

> +        ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data_offset));
> +        if (ret != sizeof(data_offset)) {
> +            error_report("%s: Failed to get migration buffer data offset %d",
> +                         vbasedev->name, ret);
> +            goto dpl_unlock;
> +        }
> +
> +        if (region->mmaps) {
> +            buf = find_data_region(region, data_offset, bitmap_size);
> +        }
> +
> +        buffer_mmaped = (buf != NULL) ? true : false;
> +
> +        if (!buffer_mmaped) {
> +            buf = g_try_malloc0(bitmap_size);
> +            if (!buf) {
> +                error_report("%s: Error allocating buffer ", __func__);
> +                goto dpl_unlock;
> +            }
> +
> +            ret = pread(vbasedev->fd, buf, bitmap_size,
> +                        region->fd_offset + data_offset);
> +            if (ret != bitmap_size) {
> +                error_report("%s: Failed to get dirty pages bitmap %d",
> +                             vbasedev->name, ret);
> +                g_free(buf);
> +                goto dpl_unlock;
> +            }
> +        }
> +
> +        cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
> +                                               (start_pfn + count) * page_size,
> +                                                copied_pfns);
> +        count      += copied_pfns;
> +        total_pfns -= copied_pfns;
> +
> +        if (!buffer_mmaped) {
> +            g_free(buf);
> +        }
> +    }
> +
> +    trace_vfio_get_dirty_page_list(vbasedev->name, start_pfn, pfn_count,
> +                                   page_size);
> +
> +dpl_unlock:
> +    qemu_mutex_unlock(&migration->lock);
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index ac065b559f4e..414a5e69ec5e 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -160,3 +160,4 @@ vfio_save_complete_precopy(char *name) " (%s)"
>  vfio_load_device_config_state(char *name) " (%s)"
>  vfio_load_state(char *name, uint64_t data) " (%s) data 0x%"PRIx64
>  vfio_load_state_device_data(char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
> +vfio_get_dirty_page_list(char *name, uint64_t start, uint64_t pfn_count, uint64_t page_size) " (%s) start 0x%"PRIx64" pfn_count 0x%"PRIx64 " page size 0x%"PRIx64
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index a022484d2636..dc1b83a0b4ef 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -222,5 +222,7 @@ int vfio_spapr_remove_window(VFIOContainer *container,
>  
>  int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
>  void vfio_migration_finalize(VFIODevice *vbasedev);
> +void vfio_get_dirty_page_list(VFIODevice *vbasedev, uint64_t start_pfn,
> +                               uint64_t pfn_count, uint64_t page_size);
>  
>  #endif /* HW_VFIO_VFIO_COMMON_H */
> -- 
> 2.7.0
> 



* Re: [Qemu-devel] [PATCH v7 10/13] vfio: Add load state functions to SaveVMHandlers
  2019-07-22  3:20       ` Yan Zhao
@ 2019-07-22 19:07         ` Alex Williamson
  2019-07-22 21:50           ` Yan Zhao
  0 siblings, 1 reply; 77+ messages in thread
From: Alex Williamson @ 2019-07-22 19:07 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye, Ken.Xue,
	Zhengxiao.zx, shuangtai.tst, qemu-devel, dgilbert, pasic, aik,
	Kirti Wankhede, eauger, cohuck, jonathan.davies, felipe,
	mlevitsk, Liu,  Changpeng, Wang, Zhi A

On Sun, 21 Jul 2019 23:20:28 -0400
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Fri, Jul 19, 2019 at 03:00:13AM +0800, Kirti Wankhede wrote:
> > 
> > 
> > On 7/12/2019 8:22 AM, Yan Zhao wrote:  
> > > On Tue, Jul 09, 2019 at 05:49:17PM +0800, Kirti Wankhede wrote:  
> > >> Flow during _RESUMING device state:
> > >> - If Vendor driver defines mappable region, mmap migration region.
> > >> - Load config state.
> > >> - For each data packet, until VFIO_MIG_FLAG_END_OF_STATE is reached:
> > >>     - read data_size from packet, read buffer of data_size
> > >>     - read data_offset from where QEMU should write data.
> > >>         if region is mmaped, write data of data_size to mmaped region.
> > >>     - write data_size.
> > >>         In case of an mmapped region, writing data_size indicates to the
> > >>         kernel driver that data has been written to the staging buffer.
> > >>     - if region is trapped, pwrite() data of data_size from data_offset.
> > >> - Repeat above until VFIO_MIG_FLAG_END_OF_STATE.
> > >> - Unmap migration region.
> > >>
> > >> For user, data is opaque. User should write data in the same order as
> > >> received.
> > >>
> > >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> > >> ---
> > >>  hw/vfio/migration.c  | 162 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > >>  hw/vfio/trace-events |   3 +
> > >>  2 files changed, 165 insertions(+)
> > >>
> > >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > >> index 4e9b4cce230b..5fb4c5329ede 100644
> > >> --- a/hw/vfio/migration.c
> > >> +++ b/hw/vfio/migration.c
> > >> @@ -249,6 +249,26 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> > >>      return qemu_file_get_error(f);
> > >>  }
> > >>  
> > >> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> > >> +{
> > >> +    VFIODevice *vbasedev = opaque;
> > >> +    uint64_t data;
> > >> +
> > >> +    if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
> > >> +        vbasedev->ops->vfio_load_config(vbasedev, f);
> > >> +    }
> > >> +
> > >> +    data = qemu_get_be64(f);
> > >> +    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
> > >> +        error_report("%s: Failed loading device config space, "
> > >> +                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
> > >> +        return -EINVAL;
> > >> +    }
> > >> +
> > >> +    trace_vfio_load_device_config_state(vbasedev->name);
> > >> +    return qemu_file_get_error(f);
> > >> +}
> > >> +
> > >>  /* ---------------------------------------------------------------------- */
> > >>  
> > >>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> > >> @@ -421,12 +441,154 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> > >>      return ret;
> > >>  }
> > >>  
> > >> +static int vfio_load_setup(QEMUFile *f, void *opaque)
> > >> +{
> > >> +    VFIODevice *vbasedev = opaque;
> > >> +    VFIOMigration *migration = vbasedev->migration;
> > >> +    int ret = 0;
> > >> +
> > >> +    if (migration->region.buffer.mmaps) {
> > >> +        ret = vfio_region_mmap(&migration->region.buffer);
> > >> +        if (ret) {
> > >> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> > >> +                         vbasedev->name, migration->region.index,
> > >> +                         strerror(-ret));
> > >> +            return ret;
> > >> +        }
> > >> +    }
> > >> +
> > >> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING);
> > >> +    if (ret) {
> > >> +        error_report("%s: Failed to set state RESUMING", vbasedev->name);
> > >> +    }
> > >> +    return ret;
> > >> +}
> > >> +
> > >> +static int vfio_load_cleanup(void *opaque)
> > >> +{
> > >> +    vfio_save_cleanup(opaque);
> > >> +    return 0;
> > >> +}
> > >> +
> > >> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> > >> +{
> > >> +    VFIODevice *vbasedev = opaque;
> > >> +    VFIOMigration *migration = vbasedev->migration;
> > >> +    int ret = 0;
> > >> +    uint64_t data, data_size;
> > >> +  
> > > I think checking of version_id is still needed.
> > >   
> > 
> > Checking version_id with what value?
> >  
> this version_id passed-in is the source VFIO software interface id.
> need to check it with the value in target side, right?
> 
> Though we previously discussed the sysfs node interface to check live
> migration version even before launching live migration, I think we still
> need this runtime software version check in qemu to ensure software
> interfaces in QEMU VFIO are compatible.

Do we want QEMU to interact directly with sysfs for that, which would
require write privileges to sysfs, or do we want to suggest that vendor
drivers should include equivalent information early in their migration
data stream to force a migration failure as early as possible for
incompatible data?  I think we need the latter regardless because the
vendor driver should never trust userspace like that, but does that
make any QEMU use of the sysfs version test itself redundant?  Thanks,

Alex



* Re: [Qemu-devel] [PATCH v7 10/13] vfio: Add load state functions to SaveVMHandlers
  2019-07-22 19:07         ` Alex Williamson
@ 2019-07-22 21:50           ` Yan Zhao
  2019-08-20 20:35             ` Kirti Wankhede
  0 siblings, 1 reply; 77+ messages in thread
From: Yan Zhao @ 2019-07-22 21:50 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye, Ken.Xue,
	Zhengxiao.zx, shuangtai.tst, qemu-devel, dgilbert, pasic, aik,
	Kirti Wankhede, eauger, cohuck, jonathan.davies, felipe,
	mlevitsk, Liu, Changpeng, Wang, Zhi A

On Tue, Jul 23, 2019 at 03:07:13AM +0800, Alex Williamson wrote:
> On Sun, 21 Jul 2019 23:20:28 -0400
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Fri, Jul 19, 2019 at 03:00:13AM +0800, Kirti Wankhede wrote:
> > > 
> > > 
> > > On 7/12/2019 8:22 AM, Yan Zhao wrote:  
> > > > On Tue, Jul 09, 2019 at 05:49:17PM +0800, Kirti Wankhede wrote:  
> > > >> Flow during _RESUMING device state:
> > > >> - If Vendor driver defines mappable region, mmap migration region.
> > > >> - Load config state.
> > > >> - For data packet, till VFIO_MIG_FLAG_END_OF_STATE is not reached
> > > >>     - read data_size from packet, read buffer of data_size
> > > >>     - read data_offset from where QEMU should write data.
> > > >>         if region is mmaped, write data of data_size to mmaped region.
> > > >>     - write data_size.
> > > >>         In case of mmapped region, write to data_size indicates kernel
> > > >>         driver that data is written in staging buffer.
> > > >>     - if region is trapped, pwrite() data of data_size from data_offset.
> > > >> - Repeat above until VFIO_MIG_FLAG_END_OF_STATE.
> > > >> - Unmap migration region.
> > > >>
> > > >> For user, data is opaque. User should write data in the same order as
> > > >> received.
> > > >>
> > > >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> > > >> ---
> > > >>  hw/vfio/migration.c  | 162 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >>  hw/vfio/trace-events |   3 +
> > > >>  2 files changed, 165 insertions(+)
> > > >>
> > > >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > > >> index 4e9b4cce230b..5fb4c5329ede 100644
> > > >> --- a/hw/vfio/migration.c
> > > >> +++ b/hw/vfio/migration.c
> > > >> @@ -249,6 +249,26 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> > > >>      return qemu_file_get_error(f);
> > > >>  }
> > > >>  
> > > >> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> > > >> +{
> > > >> +    VFIODevice *vbasedev = opaque;
> > > >> +    uint64_t data;
> > > >> +
> > > >> +    if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
> > > >> +        vbasedev->ops->vfio_load_config(vbasedev, f);
> > > >> +    }
> > > >> +
> > > >> +    data = qemu_get_be64(f);
> > > >> +    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
> > > >> +        error_report("%s: Failed loading device config space, "
> > > >> +                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
> > > >> +        return -EINVAL;
> > > >> +    }
> > > >> +
> > > >> +    trace_vfio_load_device_config_state(vbasedev->name);
> > > >> +    return qemu_file_get_error(f);
> > > >> +}
> > > >> +
> > > >>  /* ---------------------------------------------------------------------- */
> > > >>  
> > > >>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> > > >> @@ -421,12 +441,154 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> > > >>      return ret;
> > > >>  }
> > > >>  
> > > >> +static int vfio_load_setup(QEMUFile *f, void *opaque)
> > > >> +{
> > > >> +    VFIODevice *vbasedev = opaque;
> > > >> +    VFIOMigration *migration = vbasedev->migration;
> > > >> +    int ret = 0;
> > > >> +
> > > >> +    if (migration->region.buffer.mmaps) {
> > > >> +        ret = vfio_region_mmap(&migration->region.buffer);
> > > >> +        if (ret) {
> > > >> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> > > >> +                         vbasedev->name, migration->region.index,
> > > >> +                         strerror(-ret));
> > > >> +            return ret;
> > > >> +        }
> > > >> +    }
> > > >> +
> > > >> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING);
> > > >> +    if (ret) {
> > > >> +        error_report("%s: Failed to set state RESUMING", vbasedev->name);
> > > >> +    }
> > > >> +    return ret;
> > > >> +}
> > > >> +
> > > >> +static int vfio_load_cleanup(void *opaque)
> > > >> +{
> > > >> +    vfio_save_cleanup(opaque);
> > > >> +    return 0;
> > > >> +}
> > > >> +
> > > >> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> > > >> +{
> > > >> +    VFIODevice *vbasedev = opaque;
> > > >> +    VFIOMigration *migration = vbasedev->migration;
> > > >> +    int ret = 0;
> > > >> +    uint64_t data, data_size;
> > > >> +  
> > > > I think checking of version_id is still needed.
> > > >   
> > > 
> > > Checking version_id with what value?
> > >  
> > this version_id passed-in is the source VFIO software interface id.
> > need to check it with the value in target side, right?
> > 
> > Though we previously discussed the sysfs node interface to check live
> > migration version even before launching live migration, I think we still
> > need this runtime software version check in qemu to ensure software
> > interfaces in QEMU VFIO are compatible.
> 
> Do we want QEMU to interact directly with sysfs for that, which would
> require write privileges to sysfs, or do we want to suggest that vendor
> drivers should include equivalent information early in their migration
> data stream to force a migration failure as early as possible for
> incompatible data?  I think we need the latter regardless because the
> vendor driver should never trust userspace like that, but does that
> make any QEMU use of the sysfs version test itself redundant?  Thanks,
> 
> Alex

hi Alex
I think QEMU needs to check at least the version of the software interface on
the QEMU side, i.e. the format of the migration region and the details of the
migration protocol; in other words, the version of the software interface QEMU
uses to interact with the vendor driver.
This information may not be known to the vendor driver until migration has
reached a certain phase.
E.g. if the saving flow or format in the source QEMU is changed slightly as a
result of a software upgrade, the target QEMU has to detect that from this
version_id check, since the vendor driver has no knowledge of the change.
Does that make sense?


Thanks
Yan



* Re: [Qemu-devel] [PATCH v7 01/13] vfio: KABI for migration interface
  2019-07-16 20:56   ` Alex Williamson
  2019-07-17 11:55     ` Cornelia Huck
@ 2019-07-23 12:13     ` Cornelia Huck
  2019-08-21 20:32       ` Kirti Wankhede
  2019-08-21 20:31     ` Kirti Wankhede
  2 siblings, 1 reply; 77+ messages in thread
From: Cornelia Huck @ 2019-07-23 12:13 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang, qemu-devel,
	Zhengxiao.zx, shuangtai.tst, dgilbert, zhi.a.wang, mlevitsk,
	pasic, aik, Kirti Wankhede, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Tue, 16 Jul 2019 14:56:32 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue, 9 Jul 2019 15:19:08 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:

I'm still a bit unsure about the device_state bit handling as well.

> > + * device_state: (read/write)
> > + *      To indicate vendor driver the state VFIO device should be transitioned
> > + *      to. If device state transition fails, write on this field return error.

Does 'device state transition fails' include 'the device state written
was invalid'?

> > + *      It consists of 3 bits:
> > + *      - If bit 0 set, indicates _RUNNING state. When its reset, that indicates
> > + *        _STOPPED state. When device is changed to _STOPPED, driver should stop
> > + *        device before write() returns.

So _STOPPED is always !_RUNNING, regardless of which other bits are set?

> > + *      - If bit 1 set, indicates _SAVING state.
> > + *      - If bit 2 set, indicates _RESUMING state.
> > + *      _SAVING and _RESUMING set at the same time is invalid state.  

What about _RUNNING | _RESUMING -- does that make sense?

> 
> I think in the previous version there was a question of how we handle
> yet-to-be-defined bits.  For instance, if we defined a
> SUBTYPE_MIGRATIONv2 with the intention of making it backwards
> compatible with this version, do we declare the undefined bits as
> preserved so that the user should do a read-modify-write operation?

Or can we state that undefined bits are ignored, and may or may not be
preserved, so that we can skip the read-modify-write requirement? v1
and v2 can hopefully be distinguished in a different way.

(...)

> > +struct vfio_device_migration_info {
> > +        __u32 device_state;         /* VFIO device state */
> > +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> > +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> > +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> > +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> > +                                     VFIO_DEVICE_STATE_SAVING | \
> > +                                     VFIO_DEVICE_STATE_RESUMING)  
> 
> Yes, we have the mask in here now, but no mention above how the user
> should handle undefined bits.  Thanks,
> 
> Alex
> 
> > +#define VFIO_DEVICE_STATE_INVALID   (VFIO_DEVICE_STATE_SAVING | \
> > +                                     VFIO_DEVICE_STATE_RESUMING)

As mentioned above, does _RESUMING | _RUNNING make sense?



* Re: [Qemu-devel] [PATCH v7 05/13] vfio: Add migration region initialization and finalize function
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 05/13] vfio: Add migration region initialization and finalize function Kirti Wankhede
  2019-07-16 21:37   ` Alex Williamson
@ 2019-07-23 12:52   ` Cornelia Huck
  1 sibling, 0 replies; 77+ messages in thread
From: Cornelia Huck @ 2019-07-23 12:52 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang, qemu-devel,
	Zhengxiao.zx, shuangtai.tst, dgilbert, zhi.a.wang, mlevitsk,
	pasic, aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Tue, 9 Jul 2019 15:19:12 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> - Migration functions are implemented for VFIO_DEVICE_TYPE_PCI device in this
>   patch series.
> - Whether a VFIO device supports migration is decided based on the migration
>   region query. If the migration region query is successful and migration
>   region initialization is successful then migration is supported, else
>   migration is blocked.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/migration.c           | 145 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |   3 +
>  include/hw/vfio/vfio-common.h |  14 ++++
>  4 files changed, 163 insertions(+), 1 deletion(-)
>  create mode 100644 hw/vfio/migration.c
> 
(...)
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> new file mode 100644
> index 000000000000..a2cfbd5af2e1
> --- /dev/null
> +++ b/hw/vfio/migration.c
> @@ -0,0 +1,145 @@
> +/*
> + * Migration support for VFIO devices
> + *
> + * Copyright NVIDIA, Inc. 2019
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include <linux/vfio.h>
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "cpu.h"
> +#include "migration/migration.h"
> +#include "migration/qemu-file.h"
> +#include "migration/register.h"
> +#include "migration/blocker.h"
> +#include "migration/misc.h"
> +#include "qapi/error.h"
> +#include "exec/ramlist.h"
> +#include "exec/ram_addr.h"
> +#include "pci.h"
> +#include "trace.h"
> +
> +static void vfio_migration_region_exit(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    if (!migration) {
> +        return;
> +    }
> +
> +    if (migration->region.buffer.size) {
> +        vfio_region_exit(&migration->region.buffer);
> +        vfio_region_finalize(&migration->region.buffer);
> +    }
> +}
> +
> +static int vfio_migration_region_init(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    Object *obj = NULL;
> +    int ret = -EINVAL;
> +
> +    if (!migration) {

You're checking for vbasedev->migration here...

> +        return ret;
> +    }
> +
> +    if (!vbasedev->ops || !vbasedev->ops->vfio_get_object) {
> +        return ret;
> +    }
> +
> +    obj = vbasedev->ops->vfio_get_object(vbasedev);
> +    if (!obj) {
> +        return ret;
> +    }
> +
> +    ret = vfio_region_setup(obj, vbasedev, &migration->region.buffer,
> +                            migration->region.index, "migration");
> +    if (ret) {
> +        error_report("%s: Failed to setup VFIO migration region %d: %s",
> +                     vbasedev->name, migration->region.index, strerror(-ret));
> +        goto err;
> +    }
> +
> +    if (!migration->region.buffer.size) {
> +        ret = -EINVAL;
> +        error_report("%s: Invalid region size of VFIO migration region %d: %s",
> +                     vbasedev->name, migration->region.index, strerror(-ret));
> +        goto err;
> +    }
> +
> +    return 0;
> +
> +err:
> +    vfio_migration_region_exit(vbasedev);
> +    return ret;
> +}
> +
> +static int vfio_migration_init(VFIODevice *vbasedev,
> +                               struct vfio_region_info *info)
> +{
> +    int ret;
> +
> +    vbasedev->migration = g_new0(VFIOMigration, 1);

...but always allocate it before calling the function above here. What
am I missing?

> +    vbasedev->migration->region.index = info->index;
> +
> +    ret = vfio_migration_region_init(vbasedev);
> +    if (ret) {
> +        error_report("%s: Failed to initialise migration region",
> +                     vbasedev->name);
> +        return ret;

It feels a bit odd that you don't free ->migration again here, but
delay it until finalize.

> +    }
> +
> +    return 0;
> +}
> +
> +/* ---------------------------------------------------------------------- */
> +
> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
> +{
> +    struct vfio_region_info *info;
> +    Error *local_err = NULL;
> +    int ret;
> +
> +    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
> +                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
> +    if (ret) {
> +        goto add_blocker;

So you don't even call init if the region is not present (which seems
reasonable)...

> +    }
> +
> +    ret = vfio_migration_init(vbasedev, info);
> +    if (ret) {
> +        goto add_blocker;
> +    }
> +
> +    trace_vfio_migration_probe(vbasedev->name, info->index);
> +    return 0;
> +
> +add_blocker:
> +    error_setg(&vbasedev->migration_blocker,
> +               "VFIO device doesn't support migration");
> +    ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        error_free(vbasedev->migration_blocker);
> +    }
> +    return ret;
> +}
> +
> +void vfio_migration_finalize(VFIODevice *vbasedev)
> +{
> +    if (!vbasedev->migration) {

...but you're doing a quick exit here in that case. Shouldn't you get
rid of the blocker here?

> +        return;
> +    }
> +
> +    if (vbasedev->migration_blocker) {
> +        migrate_del_blocker(vbasedev->migration_blocker);
> +        error_free(vbasedev->migration_blocker);
> +    }
> +
> +    vfio_migration_region_exit(vbasedev);
> +    g_free(vbasedev->migration);
> +}



* Re: [Qemu-devel] [PATCH v7 12/13] vfio: Add vfio_listerner_log_sync to mark dirty pages
  2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 12/13] vfio: Add vfio_listerner_log_sync to mark dirty pages Kirti Wankhede
@ 2019-07-23 13:18   ` Cornelia Huck
  0 siblings, 0 replies; 77+ messages in thread
From: Cornelia Huck @ 2019-07-23 13:18 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang, qemu-devel,
	Zhengxiao.zx, shuangtai.tst, dgilbert, zhi.a.wang, mlevitsk,
	pasic, aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Tue, 9 Jul 2019 15:19:19 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> vfio_listerner_log_sync gets list of dirty pages from vendor driver and mark

s/listerner/listener/ (here and in the code)

> those pages dirty when in _SAVING state.
> Return early for the RAM block section of mapped MMIO region.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/common.c | 35 +++++++++++++++++++++++++++++++++++
>  1 file changed, 35 insertions(+)



* Re: [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device
  2019-07-19  1:23               ` Yan Zhao
@ 2019-07-24 11:32                 ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 77+ messages in thread
From: Dr. David Alan Gilbert @ 2019-07-24 11:32 UTC (permalink / raw)
  To: Yan Zhao
  Cc: alex.williamson, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang,
	Ziye, Ken.Xue, Zhengxiao.zx@Alibaba-inc.com, shuangtai.tst,
	qemu-devel, mlevitsk, pasic, aik, Kirti Wankhede, eauger, cohuck,
	jonathan.davies, felipe, Liu, Changpeng, Wang, Zhi A

* Yan Zhao (yan.y.zhao@intel.com) wrote:
> On Fri, Jul 19, 2019 at 02:32:33AM +0800, Kirti Wankhede wrote:
> > 
> > On 7/12/2019 6:02 AM, Yan Zhao wrote:
> > > On Fri, Jul 12, 2019 at 03:08:31AM +0800, Kirti Wankhede wrote:
> > >>
> > >>
> > >> On 7/11/2019 9:53 PM, Dr. David Alan Gilbert wrote:
> > >>> * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > >>>> On Thu, Jul 11, 2019 at 06:50:12PM +0800, Dr. David Alan Gilbert wrote:
> > >>>>> * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > >>>>>> Hi Kirti,
> > >>>>>> There are still unaddressed comments to your patches v4.
> > >>>>>> Would you mind addressing them?
> > >>>>>>
> > >>>>>> 1. should we register two migration interfaces simultaneously
> > >>>>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04750.html)
> > >>>>>
> > >>>>> Please don't do this.
> > >>>>> As far as I'm aware we currently only have one device that does that
> > >>>>> (vmxnet3) and a patch has just been posted that fixes/removes that.
> > >>>>>
> > >>>>> Dave
> > >>>>>
> > >>>> hi Dave,
> > >>>> Thanks for notifying this. but if we want to support postcopy in future,
> > >>>> after device stops, what interface could we use to transfer data of
> > >>>> device state only?
> > >>>> for postcopy, when source device stops, we need to transfer only
> > >>>> necessary device state to target vm before target vm starts, and we
> > >>>> don't want to transfer device memory as we'll do that after target vm
> > >>>> resuming.
> > >>>
> > >>> Hmm ok, lets see; that's got to happen in the call to:
> > >>>     qemu_savevm_state_complete_precopy(fb, false, false);
> > >>> that's made from postcopy_start.
> > >>>  (the false's are iterable_only and inactivate_disks)
> > >>>
> > >>> and at that time I believe the state is POSTCOPY_ACTIVE, so in_postcopy
> > >>> is true.
> > >>>
> > >>> If you're doing postcopy, then you'll probably define a has_postcopy()
> > >>> function, so qemu_savevm_state_complete_precopy will skip the
> > >>> save_live_complete_precopy call from it's loop for at least two of the
> > >>> reasons in it's big if.
> > >>>
> > >>> So you're right; you need the VMSD for this to happen in the second
> > >>> loop in qemu_savevm_state_complete_precopy.  Hmm.
> > >>>
> > >>> Now, what worries me, and I don't know the answer, is how the section
> > >>> header for the vmstate and the section header for an iteration look
> > >>> on the stream; how are they different?
> > >>>
> > >>
> > >> I don't have way to test postcopy migration - is one of the major reason
> > >> I had not included postcopy support in this patchset and clearly called
> > >> out in cover letter.
> > >> This patchset is thoroughly tested for precopy migration.
> > >> If anyone have hardware that supports fault, then I would prefer to add
> > >> postcopy support as incremental change later which can be tested before
> > >> submitting.
> > >>
> > >> Just a suggestion, instead of using VMSD, is it possible to have some
> > >> additional check to call save_live_complete_precopy from
> > >> qemu_savevm_state_complete_precopy?
> > >>
> > >>
> > >>>>
> > >>>>>> 2. in each save iteration, how much data is to be saved
> > >>>>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04683.html)
> > >>
> > >>> how big is the data_size ?
> > >>> if this size is too big, it may take too much time and block others.
> > >>
> > >> I do had mentioned this in the comment about the structure in vfio.h
> > >> header. data_size will be provided by vendor driver and obviously will
> > >> not be greater that migration region size. Vendor driver should be
> > >> responsible to keep its solution optimized.
> > >>
> > > if the data_size is no big than migration region size, and each
> > > iteration only saves data of data_size, i'm afraid it will cause
> > > prolonged down time. after all, the migration region size cannot be very
> > > big.
> > 
> > As I mentioned above, its vendor driver responsibility to keep its
> > solution optimized.
> > A good behaving vendor driver should not cause unnecessary prolonged
> > down time.
> >
> I think vendor data can determine the data_size, but qemu has to decide
> how much data to transmit according to the actual transmitting time (or
> transmitting speed). When the transmitting speed is high, transmit more data
> per iteration; at low speed, less data per iteration. This transmitting
> time knowledge is not available to vendor driver, and even it has this
> knowledge, can it dynamically change data region size?
> maybe you can say vendor driver can register a big enough region and
> dynamically use part of it, but what is big enough and it really costs
> memory.
> 
> > > Also, if vendor driver determines how much data to save in each
> > > iteration alone, and no checks in qemu, it may cause other devices'
> > > migration time be squeezed.
> > > 
> > 
> > Sorry, how will that squeeze other device's migration time?
> > 
> if a vendor driver has extremely big data to transmit, other devices
> have to wait until it finishes to transmit their own data. In a given
> time period, it can be considered as their time slot being squeezed.
> not sure whether it's a good practice.

I don't think we have anything for fairly allocating bandwidth/time
among devices; that would be pretty hard.

I think the only thing we currently have is the final downtime
heuristics; i.e. the threshold you should be below at the point
we decide to stop iterating.

Dave

> > >>
> > >>>>>> 3. do we need extra interface to get data for device state only
> > >>>>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04812.html)
> > >>
> > >> I don't think so. Opaque Device data from vendor driver can include
> > >> device state and device memory. Vendor driver who is managing his device
> > >> can decide how to place data over the stream.
> > >>
> > > I know current design is opaque device data. then to support postcopy,
> > > we may have to add extra device state like in-postcopy. but postcopy is
> > > more like a qemu state and is not a device state.
> > 
> > One bit from device_state can be used to inform vendor driver about
> > postcopy, when postcopy support will be added.
> >
> ok. if you insist on that, one bit in device_state is also good, as long
> as everyone agrees on it:)
> 
> > > to address it elegantly, may we add an extra interface besides
> > > vfio_save_buffer() to get data for device state only?
> > > 
> > 
> > When in postcopy state, based on device_state flag, vendor driver can
> > copy device state first in migration region, I still think there is no
> > need to separate device state and device memory.
> >
> it's the difference between more device_state flags and more interfaces.
> 
> > >>>>>> 4. definition of dirty page copied_pfn
> > >>>>>> (https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg05592.html)
> > >>>>>>
> > >>
> > >> This was inline to discussion going with Alex. I addressed the concern
> > >> there. Please check current patchset, which addresses the concerns raised.
> > >>
> > > ok. I saw you also updated the flow in the part. please check my comment
> > > in that patch for detail. but as a suggestion, I think processed_pfns is
> > > a better name compared to copied_pfns :)
> > > 
> > 
> > Vendor driver can process total_pfns, but can copy only some pfns bitmap
> > to migration region. One of the reason could be the size of migration
> > region is not able to accommodate bitmap of total_pfns size. So it could
> > be like: 0 < copied_pfns < total_pfns. That's why the name
> > 'copied_pfns'. I'll continue with 'copied_pfns'.
> >
> so it's bitmap's pfn count, right? 
> besides VFIO_DEVICE_DIRTY_PFNS_NONE, and VFIO_DEVICE_DIRTY_PFNS_ALL to
> indicate no dirty and all dirty in total_pfns, why not add two more
> flags, e.g.
> VFIO_DEVICE_DIRTY_PFNS_ONE_ITERATION,
> VFIO_DEVICE_DIRTY_PFNS_ALL_ONE_ITERATION, to skip copying bitmap from
> kernel if the copied bitmap in one iteration is all 0 or all 1?
> 
> Thanks
> Yan
> > 
> > >>>>>> Also, I'm glad to see that you updated code by following my comments below,
> > >>>>>> but please don't forget to reply my comments next time:)
> > >>
> > >> I tried to reply top of threads and addressed common concerns raised in
> > >> that. Sorry If I missed any, I'll make sure to point you to my replies
> > >> going ahead.
> > >>
> > > ok. let's cooperate:)
> > > 
> > > Thanks
> > > Yan
> > > 
> > >> Thanks,
> > >> Kirti
> > >>
> > >>>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg05357.html
> > >>>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg06454.html
> > >>>>>>
> > >>>>>> Thanks
> > >>>>>> Yan
> > >>>>>>
> > >>>>>> On Tue, Jul 09, 2019 at 05:49:07PM +0800, Kirti Wankhede wrote:
> > >>>>>>> Add migration support for VFIO device
> > >>>>>>>
> > >>>>>>> This Patch set include patches as below:
> > >>>>>>> - Define KABI for VFIO device for migration support.
> > >>>>>>> - Added save and restore functions for PCI configuration space
> > >>>>>>> - Generic migration functionality for VFIO device.
> > >>>>>>>   * This patch set adds functionality only for PCI devices, but can be
> > >>>>>>>     extended to other VFIO devices.
> > >>>>>>>   * Added all the basic functions required for pre-copy, stop-and-copy and
> > >>>>>>>     resume phases of migration.
> > >>>>>>>   * Added state change notifier and from that notifier function, VFIO
> > >>>>>>>     device's state changed is conveyed to VFIO device driver.
> > >>>>>>>   * During save setup phase and resume/load setup phase, migration region
> > >>>>>>>     is queried and is used to read/write VFIO device data.
> > >>>>>>>   * .save_live_pending and .save_live_iterate are implemented to use QEMU's
> > >>>>>>>     functionality of iteration during pre-copy phase.
> > >>>>>>>   * In .save_live_complete_precopy, that is in the stop-and-copy phase,
> > >>>>>>>     iteration to read data from the VFIO device driver is implemented until
> > >>>>>>>     the pending bytes returned by the driver reach zero.
> > >>>>>>>   * Added function to get dirty pages bitmap for the pages which are used by
> > >>>>>>>     driver.
> > >>>>>>> - Add vfio_listerner_log_sync to mark dirty pages.
> > >>>>>>> - Make VFIO PCI device migration capable. If migration region is not provided by
> > >>>>>>>   driver, migration is blocked.
> > >>>>>>>
> > >>>>>>> Below is the flow of state change for live migration where states in brackets
> > >>>>>>> represent VM state, migration state and VFIO device state as:
> > >>>>>>>     (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)
> > >>>>>>>
> > >>>>>>> Live migration save path:
> > >>>>>>>         QEMU normal running state
> > >>>>>>>         (RUNNING, _NONE, _RUNNING)
> > >>>>>>>                         |
> > >>>>>>>     migrate_init spawns migration_thread.
> > >>>>>>>     (RUNNING, _SETUP, _RUNNING|_SAVING)
> > >>>>>>>     Migration thread then calls each device's .save_setup()
> > >>>>>>>                         |
> > >>>>>>>     (RUNNING, _ACTIVE, _RUNNING|_SAVING)
> > >>>>>>>     If device is active, get pending bytes by .save_live_pending()
> > >>>>>>>     if pending bytes >= threshold_size,  call save_live_iterate()
> > >>>>>>>     Data of VFIO device for pre-copy phase is copied.
> > >>>>>>>     Iterate till pending bytes converge and are less than threshold
> > >>>>>>>                         |
> > >>>>>>>     On migration completion, vCPUs stop and .save_live_complete_precopy is
> > >>>>>>>     called for each active device. The VFIO device is then transitioned into
> > >>>>>>>     the _SAVING state.
> > >>>>>>>     (FINISH_MIGRATE, _DEVICE, _SAVING)
> > >>>>>>>     For VFIO device, iterate in  .save_live_complete_precopy  until
> > >>>>>>>     pending data is 0.
> > >>>>>>>     (FINISH_MIGRATE, _DEVICE, _STOPPED)
> > >>>>>>>                         |
> > >>>>>>>     (FINISH_MIGRATE, _COMPLETED, STOPPED)
> > >>>>>>>     Migration thread schedules cleanup bottom half and exits
> > >>>>>>>
> > >>>>>>> Live migration resume path:
> > >>>>>>>     Incoming migration calls .load_setup for each device
> > >>>>>>>     (RESTORE_VM, _ACTIVE, STOPPED)
> > >>>>>>>                         |
> > >>>>>>>     For each device, .load_state is called for that device section data
> > >>>>>>>                         |
> > >>>>>>>     At the end, .load_cleanup is called for each device and vCPUs are started.
> > >>>>>>>                         |
> > >>>>>>>         (RUNNING, _NONE, _RUNNING)
> > >>>>>>>
> > >>>>>>> Note that:
> > >>>>>>> - Migration post copy is not supported.
> > >>>>>>>
> > >>>>>>> v6 -> v7:
> > >>>>>>> - Fix build failures.
> > >>>>>>>
> > >>>>>>> v5 -> v6:
> > >>>>>>> - Fix build failure.
> > >>>>>>>
> > >>>>>>> v4 -> v5:
> > >>>>>>> - Added a descriptive comment about the sequence in which members of struct
> > >>>>>>>   vfio_device_migration_info must be accessed, based on Alex's suggestion.
> > >>>>>>> - Updated get dirty pages sequence.
> > >>>>>>> - As per Cornelia Huck's suggestion, added callbacks to VFIODeviceOps to
> > >>>>>>>   get_object, save_config and load_config.
> > >>>>>>> - Fixed multiple nit picks.
> > >>>>>>> - Tested live migration with multiple vfio devices assigned to a VM.
> > >>>>>>>
> > >>>>>>> v3 -> v4:
> > >>>>>>> - Added one more bit for _RESUMING flag to be set explicitly.
> > >>>>>>> - data_offset field is read-only for user space application.
> > >>>>>>> - data_size is read on every iteration before reading data from the migration
> > >>>>>>>   region; this removes the assumption that data extends to the end of the region.
> > >>>>>>> - If the vendor driver supports mappable sparse regions, map those regions during
> > >>>>>>>   the setup state of save/load, and similarly unmap them in the cleanup routines.
> > >>>>>>> - Handled a race condition that caused data corruption in the migration region
> > >>>>>>>   during device state save, by adding a mutex and serializing the save_buffer and
> > >>>>>>>   get_dirty_pages routines.
> > >>>>>>> - Skipped calling the get_dirty_pages routine for mapped MMIO regions of the device.
> > >>>>>>> - Added trace events.
> > >>>>>>> - Split into multiple functional patches.
> > >>>>>>>
> > >>>>>>> v2 -> v3:
> > >>>>>>> - Removed enum of VFIO device states. Defined VFIO device state with 2 bits.
> > >>>>>>> - Re-structured vfio_device_migration_info to keep it minimal and defined action
> > >>>>>>>   on read and write access on its members.
> > >>>>>>>
> > >>>>>>> v1 -> v2:
> > >>>>>>> - Defined MIGRATION region type and sub-type which should be used with region
> > >>>>>>>   type capability.
> > >>>>>>> - Re-structured vfio_device_migration_info. This structure will be placed at 0th
> > >>>>>>>   offset of migration region.
> > >>>>>>> - Replaced ioctl with read/write for trapped part of migration region.
> > >>>>>>> - Added both type of access support, trapped or mmapped, for data section of the
> > >>>>>>>   region.
> > >>>>>>> - Moved PCI device functions to pci file.
> > >>>>>>> - Added iteration to get the dirty page bitmap until the bitmap for all
> > >>>>>>>   requested pages is copied.
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> Kirti
> > >>>>>>>
> > >>>>>>> Kirti Wankhede (13):
> > >>>>>>>   vfio: KABI for migration interface
> > >>>>>>>   vfio: Add function to unmap VFIO region
> > >>>>>>>   vfio: Add vfio_get_object callback to VFIODeviceOps
> > >>>>>>>   vfio: Add save and load functions for VFIO PCI devices
> > >>>>>>>   vfio: Add migration region initialization and finalize function
> > >>>>>>>   vfio: Add VM state change handler to know state of VM
> > >>>>>>>   vfio: Add migration state change notifier
> > >>>>>>>   vfio: Register SaveVMHandlers for VFIO device
> > >>>>>>>   vfio: Add save state functions to SaveVMHandlers
> > >>>>>>>   vfio: Add load state functions to SaveVMHandlers
> > >>>>>>>   vfio: Add function to get dirty page list
> > >>>>>>>   vfio: Add vfio_listerner_log_sync to mark dirty pages
> > >>>>>>>   vfio: Make vfio-pci device migration capable.
> > >>>>>>>
> > >>>>>>>  hw/vfio/Makefile.objs         |   2 +-
> > >>>>>>>  hw/vfio/common.c              |  55 +++
> > >>>>>>>  hw/vfio/migration.c           | 874 ++++++++++++++++++++++++++++++++++++++++++
> > >>>>>>>  hw/vfio/pci.c                 | 137 ++++++-
> > >>>>>>>  hw/vfio/trace-events          |  19 +
> > >>>>>>>  include/hw/vfio/vfio-common.h |  25 ++
> > >>>>>>>  linux-headers/linux/vfio.h    | 166 ++++++++
> > >>>>>>>  7 files changed, 1271 insertions(+), 7 deletions(-)
> > >>>>>>>  create mode 100644 hw/vfio/migration.c
> > >>>>>>>
> > >>>>>>> -- 
> > >>>>>>> 2.7.0
> > >>>>>>>
> > >>>>> --
> > >>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > >>> --
> > >>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > >>>
> > > 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 07/13] vfio: Add migration state change notifier
  2019-07-17  2:25   ` Yan Zhao
@ 2019-08-20 20:24     ` Kirti Wankhede
  2019-08-23  0:54       ` Yan Zhao
  0 siblings, 1 reply; 77+ messages in thread
From: Kirti Wankhede @ 2019-08-20 20:24 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang,  Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue



On 7/17/2019 7:55 AM, Yan Zhao wrote:
> On Tue, Jul 09, 2019 at 05:49:14PM +0800, Kirti Wankhede wrote:
>> Added migration state change notifier to get notification on migration state
>> change. These states are translated to VFIO device state and conveyed to vendor
>> driver.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  hw/vfio/migration.c           | 54 +++++++++++++++++++++++++++++++++++++++++++
>>  hw/vfio/trace-events          |  1 +
>>  include/hw/vfio/vfio-common.h |  1 +
>>  3 files changed, 56 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index c01f08b659d0..e4a89a6f9bc7 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -132,6 +132,53 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state)
>>      }
>>  }
>>  
>> +static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>> +{
>> +    MigrationState *s = data;
>> +    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
>> +    int ret;
>> +
>> +    trace_vfio_migration_state_notifier(vbasedev->name, s->state);
>> +
>> +    switch (s->state) {
>> +    case MIGRATION_STATUS_ACTIVE:
>> +        if (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING) {
>> +            if (vbasedev->vm_running) {
>> +                ret = vfio_migration_set_state(vbasedev,
>> +                          VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING);
>> +                if (ret) {
>> +                    error_report("%s: Failed to set state RUNNING and SAVING",
>> +                                  vbasedev->name);
>> +                }
>> +            } else {
>> +                ret = vfio_migration_set_state(vbasedev,
>> +                                               VFIO_DEVICE_STATE_SAVING);
>> +                if (ret) {
>> +                    error_report("%s: Failed to set state STOP and SAVING",
>> +                                 vbasedev->name);
>> +                }
>> +            }
>> +        } else {
>> +            ret = vfio_migration_set_state(vbasedev,
>> +                                           VFIO_DEVICE_STATE_RESUMING);
>> +            if (ret) {
>> +                error_report("%s: Failed to set state RESUMING",
>> +                             vbasedev->name);
>> +            }
>> +        }
>> +        return;
>> +
> hi Kirti
> currently, migration state notifiers are only notified in below 3 interfaces:
> migrate_fd_connect, migrate_fd_cleanup, postcopy_start, where
> MIGRATION_STATUS_ACTIVE is not a valid state.
> Have you tested the above code? What's the purpose of the code?
> 

Sorry for the delayed response.

migration_iteration_finish() -> qemu_bh_schedule(s->cleanup_bh), which
runs migrate_fd_cleanup().

migration_iteration_finish() can be called while the migration state is
still MIGRATION_STATUS_ACTIVE, so the migration state notifiers can be
invoked with MIGRATION_STATUS_ACTIVE as well. That case is handled here.


Thanks,
Kirti


> Thanks
> Yan
> 
>> +    case MIGRATION_STATUS_CANCELLING:
>> +    case MIGRATION_STATUS_CANCELLED:
>> +    case MIGRATION_STATUS_FAILED:
>> +        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING);
>> +        if (ret) {
>> +            error_report("%s: Failed to set state RUNNING", vbasedev->name);
>> +        }
>> +        return;
>> +    }
>> +}
>> +
>>  static int vfio_migration_init(VFIODevice *vbasedev,
>>                                 struct vfio_region_info *info)
>>  {
>> @@ -152,6 +199,9 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>>      vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>>                                                            vbasedev);
>>  
>> +    vbasedev->migration_state.notify = vfio_migration_state_notifier;
>> +    add_migration_state_change_notifier(&vbasedev->migration_state);
>> +
>>      return 0;
>>  }
>>  
>> @@ -194,6 +244,10 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
>>          return;
>>      }
>>  
>> +    if (vbasedev->migration_state.notify) {
>> +        remove_migration_state_change_notifier(&vbasedev->migration_state);
>> +    }
>> +
>>      if (vbasedev->vm_state) {
>>          qemu_del_vm_change_state_handler(vbasedev->vm_state);
>>      }
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 3d15bacd031a..69503228f20e 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -148,3 +148,4 @@ vfio_display_edid_write_error(void) ""
>>  vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
>>  vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
>>  vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>> +vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index f6c70db3a9c1..a022484d2636 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -128,6 +128,7 @@ typedef struct VFIODevice {
>>      uint32_t device_state;
>>      VMChangeStateEntry *vm_state;
>>      int vm_running;
>> +    Notifier migration_state;
>>  } VFIODevice;
>>  
>>  struct VFIODeviceOps {
>> -- 
>> 2.7.0
>>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 09/13] vfio: Add save state functions to SaveVMHandlers
  2019-07-17  2:50   ` Yan Zhao
@ 2019-08-20 20:30     ` Kirti Wankhede
  0 siblings, 0 replies; 77+ messages in thread
From: Kirti Wankhede @ 2019-08-20 20:30 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang,  Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue



On 7/17/2019 8:20 AM, Yan Zhao wrote:
> On Tue, Jul 09, 2019 at 05:49:16PM +0800, Kirti Wankhede wrote:
>> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
>> functions. These functions handle the pre-copy and stop-and-copy phases.
>>
>> In _SAVING|_RUNNING device state or pre-copy phase:
>> - read pending_bytes
>> - read data_offset - indicates kernel driver to write data to staging
>>   buffer which is mmapped.
>> - read data_size - amount of data in bytes written by vendor driver in migration
>>   region.
>> - if data section is trapped, pread() from data_offset of data_size.
>> - if data section is mmaped, read mmaped buffer of data_size.
>> - Write data packet to file stream as below:
>> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
>> VFIO_MIG_FLAG_END_OF_STATE }
>>
>> In _SAVING device state or stop-and-copy phase
>> a. read config space of device and save to migration file stream. This
>>    doesn't need to be from vendor driver. Any other special config state
>>    from driver can be saved as data in following iteration.
>> b. read pending_bytes
>> c. read data_offset - indicates kernel driver to write data to staging
>>    buffer which is mmapped.
>> d. read data_size - amount of data in bytes written by vendor driver in
>>    migration region.
>> e. if data section is trapped, pread() from data_offset of data_size.
>> f. if data section is mmaped, read mmaped buffer of data_size.
>> g. Write data packet as below:
>>    {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
>> h. iterate through steps b to g while (pending_bytes > 0)
>> i. Write {VFIO_MIG_FLAG_END_OF_STATE}
>>
>> When the data region is mapped, it is the user's responsibility to read data
>> from data_offset of data_size before moving to the next steps.
>>
> In each iteration, the vendor driver has to set data_offset once for device
> data and once for dirty pages. It's really cumbersome.

This is done so that the vendor driver has the flexibility to decide
whether to place data in the trapped or the mmapped part of the region. For
example, device data can be in the mmapped part and dirty page data in the
trapped part.

> could dirty page and device data use different data_offset? e.g.
> data_offset, dirty_page_offset? or just keep them constant after being
> read first time?
> 

Reading device data and reading the dirty page bitmap are atomic
operations since the same region can be shared between them, so having a
separate variable for each seems redundant.

The latter option was discussed in the v4 thread, and we decided not to
assume a constant data_offset for each iteration:
https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg05223.html

Thanks,
Kirti

>> .save_live_iterate runs outside the iothread lock in the migration case, which
>> could race with the asynchronous call to get the dirty page list, causing data
>> corruption in the mapped migration region. A mutex is added here to serialize
>> migration buffer read operations.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  hw/vfio/migration.c  | 246 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  hw/vfio/trace-events |   6 ++
>>  2 files changed, 252 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 0597a45fda2d..4e9b4cce230b 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -117,6 +117,138 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
>>      return 0;
>>  }
>>  
>> +static void *find_data_region(VFIORegion *region,
>> +                              uint64_t data_offset,
>> +                              uint64_t data_size)
>> +{
>> +    void *ptr = NULL;
>> +    int i;
>> +
>> +    for (i = 0; i < region->nr_mmaps; i++) {
>> +        if ((data_offset >= region->mmaps[i].offset) &&
>> +            (data_offset < region->mmaps[i].offset + region->mmaps[i].size) &&
>> +            (data_size <= region->mmaps[i].size)) {
>> +            ptr = region->mmaps[i].mmap + (data_offset -
>> +                                           region->mmaps[i].offset);
>> +            break;
>> +        }
>> +    }
>> +    return ptr;
>> +}
>> +
>> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIORegion *region = &migration->region.buffer;
>> +    uint64_t data_offset = 0, data_size = 0;
>> +    int ret;
>> +
>> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             data_offset));
>> +    if (ret != sizeof(data_offset)) {
>> +        error_report("%s: Failed to get migration buffer data offset %d",
>> +                     vbasedev->name, ret);
>> +        return -EINVAL;
>> +    }
>> +
>> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             data_size));
>> +    if (ret != sizeof(data_size)) {
>> +        error_report("%s: Failed to get migration buffer data size %d",
>> +                     vbasedev->name, ret);
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (data_size > 0) {
>> +        void *buf = NULL;
>> +        bool buffer_mmaped;
>> +
>> +        if (region->mmaps) {
>> +            buf = find_data_region(region, data_offset, data_size);
>> +        }
>> +
>> +        buffer_mmaped = (buf != NULL) ? true : false;
>> +
>> +        if (!buffer_mmaped) {
>> +            buf = g_try_malloc0(data_size);
>> +            if (!buf) {
>> +                error_report("%s: Error allocating buffer ", __func__);
>> +                return -ENOMEM;
>> +            }
>> +
>> +            ret = pread(vbasedev->fd, buf, data_size,
>> +                        region->fd_offset + data_offset);
>> +            if (ret != data_size) {
>> +                error_report("%s: Failed to get migration data %d",
>> +                             vbasedev->name, ret);
>> +                g_free(buf);
>> +                return -EINVAL;
>> +            }
>> +        }
>> +
>> +        qemu_put_be64(f, data_size);
>> +        qemu_put_buffer(f, buf, data_size);
>> +
>> +        if (!buffer_mmaped) {
>> +            g_free(buf);
>> +        }
>> +        migration->pending_bytes -= data_size;
>> +    } else {
>> +        qemu_put_be64(f, data_size);
>> +    }
>> +
>> +    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
>> +                           migration->pending_bytes);
>> +
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    return data_size;
>> +}
>> +
>> +static int vfio_update_pending(VFIODevice *vbasedev)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIORegion *region = &migration->region.buffer;
>> +    uint64_t pending_bytes = 0;
>> +    int ret;
>> +
>> +    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             pending_bytes));
>> +    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
>> +        error_report("%s: Failed to get pending bytes %d",
>> +                     vbasedev->name, ret);
>> +        migration->pending_bytes = 0;
>> +        return (ret < 0) ? ret : -EINVAL;
>> +    }
>> +
>> +    migration->pending_bytes = pending_bytes;
>> +    trace_vfio_update_pending(vbasedev->name, pending_bytes);
>> +    return 0;
>> +}
>> +
>> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
>> +
>> +    if (vbasedev->ops && vbasedev->ops->vfio_save_config) {
>> +        vbasedev->ops->vfio_save_config(vbasedev, f);
>> +    }
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    trace_vfio_save_device_config_state(vbasedev->name);
>> +
>> +    return qemu_file_get_error(f);
>> +}
>> +
>>  /* ---------------------------------------------------------------------- */
>>  
>>  static int vfio_save_setup(QEMUFile *f, void *opaque)
>> @@ -178,9 +310,123 @@ static void vfio_save_cleanup(void *opaque)
>>      trace_vfio_save_cleanup(vbasedev->name);
>>  }
>>  
>> +static void vfio_save_pending(QEMUFile *f, void *opaque,
>> +                              uint64_t threshold_size,
>> +                              uint64_t *res_precopy_only,
>> +                              uint64_t *res_compatible,
>> +                              uint64_t *res_postcopy_only)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret;
>> +
>> +    ret = vfio_update_pending(vbasedev);
>> +    if (ret) {
>> +        return;
>> +    }
>> +
>> +    *res_precopy_only += migration->pending_bytes;
>> +
>> +    trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
>> +                            *res_postcopy_only, *res_compatible);
>> +}
>> +
>> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret, data_size;
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
>> +
>> +    qemu_mutex_lock(&migration->lock);
>> +    data_size = vfio_save_buffer(f, vbasedev);
>> +    qemu_mutex_unlock(&migration->lock);
>> +
>> +    if (data_size < 0) {
>> +        error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
>> +                     strerror(errno));
>> +        return data_size;
>> +    }
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    trace_vfio_save_iterate(vbasedev->name, data_size);
>> +    if (data_size == 0) {
>> +        /* indicates data finished, goto complete phase */
>> +        return 1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret;
>> +
>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_SAVING);
> I think this state is already set in vm state change handler, where
> ~VFIO_DEVICE_STATE_RUNNING is applied to (VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING).
> 
> Why VFIO_DEVICE_STATE_SAVING is redundantly set here?

Ok, I'll remove it and verify.

Thanks,
Kirti

> 
>> +    if (ret) {
>> +        error_report("%s: Failed to set state STOP and SAVING",
>> +                     vbasedev->name);
>> +        return ret;
>> +    }
>> +
>> +    ret = vfio_save_device_config_state(f, opaque);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    ret = vfio_update_pending(vbasedev);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    while (migration->pending_bytes > 0) {
>> +        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
>> +        ret = vfio_save_buffer(f, vbasedev);
>> +        if (ret < 0) {
>> +            error_report("%s: Failed to save buffer", vbasedev->name);
>> +            return ret;
>> +        } else if (ret == 0) {
>> +            break;
>> +        }
>> +
>> +        ret = vfio_update_pending(vbasedev);
>> +        if (ret) {
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_MASK);
> This state is a little weird.
> 
> Thanks
> Yan
>> +    if (ret) {
>> +        error_report("%s: Failed to set state STOPPED", vbasedev->name);
>> +        return ret;
>> +    }
>> +
>> +    trace_vfio_save_complete_precopy(vbasedev->name);
>> +    return ret;
>> +}
>> +
>>  static SaveVMHandlers savevm_vfio_handlers = {
>>      .save_setup = vfio_save_setup,
>>      .save_cleanup = vfio_save_cleanup,
>> +    .save_live_pending = vfio_save_pending,
>> +    .save_live_iterate = vfio_save_iterate,
>> +    .save_live_complete_precopy = vfio_save_complete_precopy,
>>  };
>>  
>>  /* ---------------------------------------------------------------------- */
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 4bb43f18f315..bdf40ba368c7 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -151,3 +151,9 @@ vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_st
>>  vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
>>  vfio_save_setup(char *name) " (%s)"
>>  vfio_save_cleanup(char *name) " (%s)"
>> +vfio_save_buffer(char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
>> +vfio_update_pending(char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
>> +vfio_save_device_config_state(char *name) " (%s)"
>> +vfio_save_pending(char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
>> +vfio_save_iterate(char *name, int data_size) " (%s) data_size %d"
>> +vfio_save_complete_precopy(char *name) " (%s)"
>> -- 
>> 2.7.0
>>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 06/13] vfio: Add VM state change handler to know state of VM
  2019-07-22  8:23       ` Yan Zhao
@ 2019-08-20 20:31         ` Kirti Wankhede
  0 siblings, 0 replies; 77+ messages in thread
From: Kirti Wankhede @ 2019-08-20 20:31 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, Dr. David Alan Gilbert, Wang,
	 Zhi A, mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue



On 7/22/2019 1:53 PM, Yan Zhao wrote:
> On Fri, Jul 12, 2019 at 03:14:03AM +0800, Kirti Wankhede wrote:
>>
>>
>> On 7/11/2019 5:43 PM, Dr. David Alan Gilbert wrote:
>>> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
>>>> VM state change handler gets called on change in VM's state. This is used to set
>>>> VFIO device state to _RUNNING.
>>>> VM state change handler, migration state change handler and log_sync listener
>>>> are called asynchronously, which sometimes lead to data corruption in migration
>>>> region. Initialised mutex that is used to serialize operations on migration data
>>>> region during saving state.
>>>>
>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>>> ---
>>>>  hw/vfio/migration.c           | 64 +++++++++++++++++++++++++++++++++++++++++++
>>>>  hw/vfio/trace-events          |  2 ++
>>>>  include/hw/vfio/vfio-common.h |  4 +++
>>>>  3 files changed, 70 insertions(+)
>>>>
>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>> index a2cfbd5af2e1..c01f08b659d0 100644
>>>> --- a/hw/vfio/migration.c
>>>> +++ b/hw/vfio/migration.c
>>>> @@ -78,6 +78,60 @@ err:
>>>>      return ret;
>>>>  }
>>>>  
>>>> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
>>>> +{
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>> +    VFIORegion *region = &migration->region.buffer;
>>>> +    uint32_t device_state;
>>>> +    int ret = 0;
>>>> +
>>>> +    device_state = (state & VFIO_DEVICE_STATE_MASK) |
>>>> +                   (vbasedev->device_state & ~VFIO_DEVICE_STATE_MASK);
>>>> +
>>>> +    if ((device_state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_INVALID) {
>>>> +        return -EINVAL;
>>>> +    }
>>>> +
>>>> +    ret = pwrite(vbasedev->fd, &device_state, sizeof(device_state),
>>>> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
>>>> +                                              device_state));
>>>> +    if (ret < 0) {
>>>> +        error_report("%s: Failed to set device state %d %s",
>>>> +                     vbasedev->name, ret, strerror(errno));
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    vbasedev->device_state = device_state;
>>>> +    trace_vfio_migration_set_state(vbasedev->name, device_state);
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static void vfio_vmstate_change(void *opaque, int running, RunState state)
>>>> +{
>>>> +    VFIODevice *vbasedev = opaque;
>>>> +
>>>> +    if ((vbasedev->vm_running != running)) {
>>>> +        int ret;
>>>> +        uint32_t dev_state;
>>>> +
>>>> +        if (running) {
>>>> +            dev_state = VFIO_DEVICE_STATE_RUNNING;
>>>> +        } else {
>>>> +            dev_state = (vbasedev->device_state & VFIO_DEVICE_STATE_MASK) &
>>>> +                     ~VFIO_DEVICE_STATE_RUNNING;
>>>> +        }
>>>> +
>>>> +        ret = vfio_migration_set_state(vbasedev, dev_state);
>>>> +        if (ret) {
>>>> +            error_report("%s: Failed to set device state 0x%x",
>>>> +                         vbasedev->name, dev_state);
>>>> +        }
>>>> +        vbasedev->vm_running = running;
>>>> +        trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
>>>> +                                  dev_state);
>>>> +    }
>>>> +}
>>>> +
>>>>  static int vfio_migration_init(VFIODevice *vbasedev,
>>>>                                 struct vfio_region_info *info)
>>>>  {
>>>> @@ -93,6 +147,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>>>>          return ret;
>>>>      }
>>>>  
>>>> +    qemu_mutex_init(&vbasedev->migration->lock);
>>>
>>> Does this and it's friend below belong in this patch?  As far as I can
>>> tell you init/deinit the lock here but don't use it which is strange.
>>>
>>
>> This lock is used in
>> 0009-vfio-Add-save-state-functions-to-SaveVMHandlers.patch and
>> 0011-vfio-Add-function-to-get-dirty-page-list.patch
>>
Hm. I'll move this init/deinit to patch 0009 in the next iteration.
>>
>> Thanks,
>> Kirti
>>
> This lock is used to protect vfio_save_buffer and read dirty page,
> right?
> if data subregion and bitmap subregion do not reuse "data_offset" in
> vfio_device_migration_info.
> It seems that this lock can be avoided.
> 

The same migration region, either trapped or mmapped, can be used for both
device data and the dirty page bitmap; it's not just "data_offset" that is
shared.

Thanks,
Kirti


> Thanks
> Yan
> 
> 
>>
>>> Dave
>>>
>>>> +    vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>>>> +                                                          vbasedev);
>>>> +
>>>>      return 0;
>>>>  }
>>>>  
>>>> @@ -135,11 +194,16 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
>>>>          return;
>>>>      }
>>>>  
>>>> +    if (vbasedev->vm_state) {
>>>> +        qemu_del_vm_change_state_handler(vbasedev->vm_state);
>>>> +    }
>>>> +
>>>>      if (vbasedev->migration_blocker) {
>>>>          migrate_del_blocker(vbasedev->migration_blocker);
>>>>          error_free(vbasedev->migration_blocker);
>>>>      }
>>>>  
>>>> +    qemu_mutex_destroy(&vbasedev->migration->lock);
>>>>      vfio_migration_region_exit(vbasedev);
>>>>      g_free(vbasedev->migration);
>>>>  }
>>>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>>>> index 191a726a1312..3d15bacd031a 100644
>>>> --- a/hw/vfio/trace-events
>>>> +++ b/hw/vfio/trace-events
>>>> @@ -146,3 +146,5 @@ vfio_display_edid_write_error(void) ""
>>>>  
>>>>  # migration.c
>>>>  vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
>>>> +vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
>>>> +vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>> index 152da3f8d6f3..f6c70db3a9c1 100644
>>>> --- a/include/hw/vfio/vfio-common.h
>>>> +++ b/include/hw/vfio/vfio-common.h
>>>> @@ -29,6 +29,7 @@
>>>>  #ifdef CONFIG_LINUX
>>>>  #include <linux/vfio.h>
>>>>  #endif
>>>> +#include "sysemu/sysemu.h"
>>>>  
>>>>  #define VFIO_MSG_PREFIX "vfio %s: "
>>>>  
>>>> @@ -124,6 +125,9 @@ typedef struct VFIODevice {
>>>>      unsigned int flags;
>>>>      VFIOMigration *migration;
>>>>      Error *migration_blocker;
>>>> +    uint32_t device_state;
>>>> +    VMChangeStateEntry *vm_state;
>>>> +    int vm_running;
>>>>  } VFIODevice;
>>>>  
>>>>  struct VFIODeviceOps {
>>>> -- 
>>>> 2.7.0
>>>>
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 08/13] vfio: Register SaveVMHandlers for VFIO device
  2019-07-22  8:34   ` Yan Zhao
@ 2019-08-20 20:33     ` Kirti Wankhede
  2019-08-23  1:23       ` Yan Zhao
  0 siblings, 1 reply; 77+ messages in thread
From: Kirti Wankhede @ 2019-08-20 20:33 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang,  Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue



On 7/22/2019 2:04 PM, Yan Zhao wrote:
> On Tue, Jul 09, 2019 at 05:49:15PM +0800, Kirti Wankhede wrote:
>> Define flags to be used as delimiters in the migration file stream.
>> Added .save_setup and .save_cleanup functions, which map and unmap the
>> migration region at the source during the saving/pre-copy phase.
>> Set the VFIO device state depending on the VM's state. During live migration
>> the VM is running when .save_setup is called, so _SAVING | _RUNNING is set for
>> the VFIO device. During save-restore the VM is paused, so only _SAVING is set.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  hw/vfio/migration.c  | 82 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>>  hw/vfio/trace-events |  2 ++
>>  2 files changed, 83 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index e4a89a6f9bc7..0597a45fda2d 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -23,6 +23,17 @@
>>  #include "pci.h"
>>  #include "trace.h"
>>  
>> +/*
>> + * Flags used as delimiter:
>> + * 0xffffffff => MSB 32-bit all 1s
>> + * 0xef10     => emulated (virtual) function IO
>> + * 0x0000     => 16-bits reserved for flags
>> + */
>> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
>> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
>> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
>> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
>> +
>>  static void vfio_migration_region_exit(VFIODevice *vbasedev)
>>  {
>>      VFIOMigration *migration = vbasedev->migration;
>> @@ -106,6 +117,74 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
>>      return 0;
>>  }
>>  
>> +/* ---------------------------------------------------------------------- */
>> +
>> +static int vfio_save_setup(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret;
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>> +
>> +    if (migration->region.buffer.mmaps) {
>> +        qemu_mutex_lock_iothread();
>> +        ret = vfio_region_mmap(&migration->region.buffer);
>> +        qemu_mutex_unlock_iothread();
>> +        if (ret) {
>> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
>> +                         vbasedev->name, migration->region.index,
>> +                         strerror(-ret));
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    if (vbasedev->vm_running) {
>> +        ret = vfio_migration_set_state(vbasedev,
>> +                         VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING);
>> +        if (ret) {
>> +            error_report("%s: Failed to set state RUNNING and SAVING",
>> +                         vbasedev->name);
>> +            return ret;
>> +        }
>> +    } else {
> hi Kirti
> May I know in which condition will this "else" case happen?
> 

This can happen in the savevm case.

Thanks,
Kirti
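
As a sketch of that distinction (illustrative only; the helper name is made
up, and the state bits are taken from patch 01/13 of this series):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VFIO_DEVICE_STATE_RUNNING  (1 << 0)
#define VFIO_DEVICE_STATE_SAVING   (1 << 1)

/* Illustrative helper: the state .save_setup requests depends on whether
 * vCPUs are running (live migration pre-copy) or already paused (savevm). */
static uint32_t save_setup_state(bool vm_running)
{
    if (vm_running) {
        /* live migration: device keeps running while saving (pre-copy) */
        return VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING;
    }
    /* savevm / snapshot: VM already paused, stop-and-copy only */
    return VFIO_DEVICE_STATE_SAVING;
}
```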

> Thanks
> Yan
> 
>> +        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_SAVING);
>> +        if (ret) {
>> +            error_report("%s: Failed to set state STOP and SAVING",
>> +                         vbasedev->name);
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    trace_vfio_save_setup(vbasedev->name);
>> +    return 0;
>> +}
>> +
>> +static void vfio_save_cleanup(void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    if (migration->region.buffer.mmaps) {
>> +        vfio_region_unmap(&migration->region.buffer);
>> +    }
>> +    trace_vfio_save_cleanup(vbasedev->name);
>> +}
>> +
>> +static SaveVMHandlers savevm_vfio_handlers = {
>> +    .save_setup = vfio_save_setup,
>> +    .save_cleanup = vfio_save_cleanup,
>> +};
>> +
>> +/* ---------------------------------------------------------------------- */
>> +
>>  static void vfio_vmstate_change(void *opaque, int running, RunState state)
>>  {
>>      VFIODevice *vbasedev = opaque;
>> @@ -195,7 +274,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>>      }
>>  
>>      qemu_mutex_init(&vbasedev->migration->lock);
>> -
>> +    register_savevm_live(vbasedev->dev, "vfio", -1, 1, &savevm_vfio_handlers,
>> +                         vbasedev);
>>      vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>>                                                            vbasedev);
>>  
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 69503228f20e..4bb43f18f315 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -149,3 +149,5 @@ vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
>>  vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
>>  vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>>  vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
>> +vfio_save_setup(char *name) " (%s)"
>> +vfio_save_cleanup(char *name) " (%s)"
>> -- 
>> 2.7.0
>>



* Re: [Qemu-devel] [PATCH v7 06/13] vfio: Add VM state change handler to know state of VM
  2019-07-22  8:37   ` Yan Zhao
@ 2019-08-20 20:33     ` Kirti Wankhede
  2019-08-23  1:32       ` Yan Zhao
  0 siblings, 1 reply; 77+ messages in thread
From: Kirti Wankhede @ 2019-08-20 20:33 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang,  Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue



On 7/22/2019 2:07 PM, Yan Zhao wrote:
> On Tue, Jul 09, 2019 at 05:49:13PM +0800, Kirti Wankhede wrote:
>> VM state change handler gets called on change in VM's state. This is used to set
>> VFIO device state to _RUNNING.
>> VM state change handler, migration state change handler and log_sync listener
>> are called asynchronously, which sometimes lead to data corruption in migration
>> region. Initialised mutex that is used to serialize operations on migration data
>> region during saving state.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  hw/vfio/migration.c           | 64 +++++++++++++++++++++++++++++++++++++++++++
>>  hw/vfio/trace-events          |  2 ++
>>  include/hw/vfio/vfio-common.h |  4 +++
>>  3 files changed, 70 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index a2cfbd5af2e1..c01f08b659d0 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -78,6 +78,60 @@ err:
>>      return ret;
>>  }
>>  
>> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIORegion *region = &migration->region.buffer;
>> +    uint32_t device_state;
>> +    int ret = 0;
>> +
>> +    device_state = (state & VFIO_DEVICE_STATE_MASK) |
>> +                   (vbasedev->device_state & ~VFIO_DEVICE_STATE_MASK);
>> +
>> +    if ((device_state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_INVALID) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    ret = pwrite(vbasedev->fd, &device_state, sizeof(device_state),
>> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                              device_state));
>> +    if (ret < 0) {
>> +        error_report("%s: Failed to set device state %d %s",
>> +                     vbasedev->name, ret, strerror(errno));
>> +        return ret;
>> +    }
>> +
>> +    vbasedev->device_state = device_state;
>> +    trace_vfio_migration_set_state(vbasedev->name, device_state);
>> +    return 0;
>> +}
>> +
>> +static void vfio_vmstate_change(void *opaque, int running, RunState state)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    if ((vbasedev->vm_running != running)) {
>> +        int ret;
>> +        uint32_t dev_state;
>> +
>> +        if (running) {
>> +            dev_state = VFIO_DEVICE_STATE_RUNNING;
> should be
> dev_state |= VFIO_DEVICE_STATE_RUNNING; ?
> 

vfio_migration_set_state() takes care of the ORing.

Thanks,
Kirti
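
The masking done in the quoted vfio_migration_set_state() hunk can be
restated as a standalone sketch (the helper name is invented; the constants
mirror the bit definitions proposed in patch 01/13):

```c
#include <assert.h>
#include <stdint.h>

#define VFIO_DEVICE_STATE_RUNNING  (1 << 0)
#define VFIO_DEVICE_STATE_SAVING   (1 << 1)
#define VFIO_DEVICE_STATE_RESUMING (1 << 2)
#define VFIO_DEVICE_STATE_MASK \
    (VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING | \
     VFIO_DEVICE_STATE_RESUMING)

/* The requested state replaces only the masked state bits; any bits the
 * current device_state carries outside the mask are preserved. */
static uint32_t merge_device_state(uint32_t current, uint32_t state)
{
    return (state & VFIO_DEVICE_STATE_MASK) |
           (current & ~VFIO_DEVICE_STATE_MASK);
}
```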

>> +        } else {
>> +            dev_state = (vbasedev->device_state & VFIO_DEVICE_STATE_MASK) &
>> +                     ~VFIO_DEVICE_STATE_RUNNING;
>> +        }
>> +
>> +        ret = vfio_migration_set_state(vbasedev, dev_state);
>> +        if (ret) {
>> +            error_report("%s: Failed to set device state 0x%x",
>> +                         vbasedev->name, dev_state);
>> +        }
>> +        vbasedev->vm_running = running;
>> +        trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
>> +                                  dev_state);
>> +    }
>> +}
>> +
>>  static int vfio_migration_init(VFIODevice *vbasedev,
>>                                 struct vfio_region_info *info)
>>  {
>> @@ -93,6 +147,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>>          return ret;
>>      }
>>  
>> +    qemu_mutex_init(&vbasedev->migration->lock);
>> +
>> +    vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>> +                                                          vbasedev);
>> +
>>      return 0;
>>  }
>>  
>> @@ -135,11 +194,16 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
>>          return;
>>      }
>>  
>> +    if (vbasedev->vm_state) {
>> +        qemu_del_vm_change_state_handler(vbasedev->vm_state);
>> +    }
>> +
>>      if (vbasedev->migration_blocker) {
>>          migrate_del_blocker(vbasedev->migration_blocker);
>>          error_free(vbasedev->migration_blocker);
>>      }
>>  
>> +    qemu_mutex_destroy(&vbasedev->migration->lock);
>>      vfio_migration_region_exit(vbasedev);
>>      g_free(vbasedev->migration);
>>  }
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 191a726a1312..3d15bacd031a 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -146,3 +146,5 @@ vfio_display_edid_write_error(void) ""
>>  
>>  # migration.c
>>  vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
>> +vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
>> +vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 152da3f8d6f3..f6c70db3a9c1 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -29,6 +29,7 @@
>>  #ifdef CONFIG_LINUX
>>  #include <linux/vfio.h>
>>  #endif
>> +#include "sysemu/sysemu.h"
>>  
>>  #define VFIO_MSG_PREFIX "vfio %s: "
>>  
>> @@ -124,6 +125,9 @@ typedef struct VFIODevice {
>>      unsigned int flags;
>>      VFIOMigration *migration;
>>      Error *migration_blocker;
>> +    uint32_t device_state;
>> +    VMChangeStateEntry *vm_state;
>> +    int vm_running;
>>  } VFIODevice;
>>  
>>  struct VFIODeviceOps {
>> -- 
>> 2.7.0
>>



* Re: [Qemu-devel] [PATCH v7 11/13] vfio: Add function to get dirty page list
  2019-07-22  8:39   ` Yan Zhao
@ 2019-08-20 20:34     ` Kirti Wankhede
  0 siblings, 0 replies; 77+ messages in thread
From: Kirti Wankhede @ 2019-08-20 20:34 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye, Ken.Xue,
	Zhengxiao.zx, shuangtai.tst, qemu-devel, dgilbert, pasic, aik,
	alex.williamson, eauger, cohuck, jonathan.davies, felipe,
	mlevitsk, Liu, Changpeng, Wang, Zhi A



On 7/22/2019 2:09 PM, Yan Zhao wrote:
> On Tue, Jul 09, 2019 at 05:49:18PM +0800, Kirti Wankhede wrote:
>> Dirty page tracking (.log_sync) is part of the RAM copying state: the
>> vendor driver provides, through the migration region, a bitmap of the pages
>> it has dirtied, and as part of the RAM copy those pages get copied to the
>> file stream.
>>
>> To get dirty page bitmap:
>> - write start address, page_size and pfn count.
>> - read count of pfns copied.
>>     - Vendor driver should return 0 if driver doesn't have any page to
>>       report dirty in given range.
>>     - Vendor driver should return -1 to mark all pages dirty for given range.
>> - read data_offset, where vendor driver has written bitmap.
>> - read bitmap from the region or mmaped part of the region.
>> - Iterate above steps till page bitmap for all requested pfns are copied.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  hw/vfio/migration.c           | 123 ++++++++++++++++++++++++++++++++++++++++++
>>  hw/vfio/trace-events          |   1 +
>>  include/hw/vfio/vfio-common.h |   2 +
>>  3 files changed, 126 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 5fb4c5329ede..ca1a8c0f5f1f 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -269,6 +269,129 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>>      return qemu_file_get_error(f);
>>  }
>>  
>> +void vfio_get_dirty_page_list(VFIODevice *vbasedev,
>> +                              uint64_t start_pfn,
>> +                              uint64_t pfn_count,
>> +                              uint64_t page_size)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIORegion *region = &migration->region.buffer;
>> +    uint64_t count = 0;
>> +    int64_t copied_pfns = 0;
>> +    int64_t total_pfns = pfn_count;
>> +    int ret;
>> +
>> +    qemu_mutex_lock(&migration->lock);
>> +
>> +    while (total_pfns > 0) {
>> +        uint64_t bitmap_size, data_offset = 0;
>> +        uint64_t start = start_pfn + count;
>> +        void *buf = NULL;
>> +        bool buffer_mmaped = false;
>> +
>> +        ret = pwrite(vbasedev->fd, &start, sizeof(start),
>> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                              start_pfn));
>> +        if (ret < 0) {
>> +            error_report("%s: Failed to set dirty pages start address %d %s",
>> +                         vbasedev->name, ret, strerror(errno));
>> +            goto dpl_unlock;
>> +        }
>> +
>> +        ret = pwrite(vbasedev->fd, &page_size, sizeof(page_size),
>> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                              page_size));
>> +        if (ret < 0) {
>> +            error_report("%s: Failed to set dirty page size %d %s",
>> +                         vbasedev->name, ret, strerror(errno));
>> +            goto dpl_unlock;
>> +        }
>> +
>> +        ret = pwrite(vbasedev->fd, &total_pfns, sizeof(total_pfns),
>> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                              total_pfns));
>> +        if (ret < 0) {
>> +            error_report("%s: Failed to set dirty page total pfns %d %s",
>> +                         vbasedev->name, ret, strerror(errno));
>> +            goto dpl_unlock;
>> +        }
>> +
>> +        /* Read copied dirty pfns */
>> +        ret = pread(vbasedev->fd, &copied_pfns, sizeof(copied_pfns),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             copied_pfns));
>> +        if (ret < 0) {
>> +            error_report("%s: Failed to get dirty pages bitmap count %d %s",
>> +                         vbasedev->name, ret, strerror(errno));
>> +            goto dpl_unlock;
>> +        }
>> +
>> +        if (copied_pfns == VFIO_DEVICE_DIRTY_PFNS_NONE) {
>> +            /*
>> +             * copied_pfns could be 0 if driver doesn't have any page to
>> +             * report dirty in given range
>> +             */
>> +            break;
>> +        } else if (copied_pfns == VFIO_DEVICE_DIRTY_PFNS_ALL) {
>> +            /* Mark all pages dirty for this range */
>> +            cpu_physical_memory_set_dirty_range(start_pfn * page_size,
>> +                                                pfn_count * page_size,
>> +                                                DIRTY_MEMORY_MIGRATION);
>> +            break;
>> +        }
>> +
>> +        bitmap_size = (BITS_TO_LONGS(copied_pfns) + 1) * sizeof(unsigned long);
> hi Kirti
> 
> why bitmap_size is 
> (BITS_TO_LONGS(copied_pfns) + 1) * sizeof(unsigned long).
> why it's not
> BITS_TO_LONGS(copied_pfns) * sizeof(unsigned long) ?
> 

It should be the latter. I'll update it in the next version.

Thanks,
Kirti
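
The corrected calculation Yan suggests can be sketched as follows, with
BITS_TO_LONGS defined as in the kernel (rounding the bit count up to whole
longs, but without the stray "+ 1"):

```c
#include <assert.h>
#include <stdint.h>

/* BITS_TO_LONGS as in the kernel: longs needed to hold n bits. */
#define BITS_PER_LONG    (8 * sizeof(unsigned long))
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Bitmap size in bytes covering copied_pfns pages, one bit per page. */
static uint64_t bitmap_size_bytes(uint64_t copied_pfns)
{
    return BITS_TO_LONGS(copied_pfns) * sizeof(unsigned long);
}
```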


> Thanks
> Yan
> 
>> +        ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             data_offset));
>> +        if (ret != sizeof(data_offset)) {
>> +            error_report("%s: Failed to get migration buffer data offset %d",
>> +                         vbasedev->name, ret);
>> +            goto dpl_unlock;
>> +        }
>> +
>> +        if (region->mmaps) {
>> +            buf = find_data_region(region, data_offset, bitmap_size);
>> +        }
>> +
>> +        buffer_mmaped = (buf != NULL) ? true : false;
>> +
>> +        if (!buffer_mmaped) {
>> +            buf = g_try_malloc0(bitmap_size);
>> +            if (!buf) {
>> +                error_report("%s: Error allocating buffer ", __func__);
>> +                goto dpl_unlock;
>> +            }
>> +
>> +            ret = pread(vbasedev->fd, buf, bitmap_size,
>> +                        region->fd_offset + data_offset);
>> +            if (ret != bitmap_size) {
>> +                error_report("%s: Failed to get dirty pages bitmap %d",
>> +                             vbasedev->name, ret);
>> +                g_free(buf);
>> +                goto dpl_unlock;
>> +            }
>> +        }
>> +
>> +        cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
>> +                                               (start_pfn + count) * page_size,
>> +                                                copied_pfns);
>> +        count      += copied_pfns;
>> +        total_pfns -= copied_pfns;
>> +
>> +        if (!buffer_mmaped) {
>> +            g_free(buf);
>> +        }
>> +    }
>> +
>> +    trace_vfio_get_dirty_page_list(vbasedev->name, start_pfn, pfn_count,
>> +                                   page_size);
>> +
>> +dpl_unlock:
>> +    qemu_mutex_unlock(&migration->lock);
>> +}
>> +
>>  /* ---------------------------------------------------------------------- */
>>  
>>  static int vfio_save_setup(QEMUFile *f, void *opaque)
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index ac065b559f4e..414a5e69ec5e 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -160,3 +160,4 @@ vfio_save_complete_precopy(char *name) " (%s)"
>>  vfio_load_device_config_state(char *name) " (%s)"
>>  vfio_load_state(char *name, uint64_t data) " (%s) data 0x%"PRIx64
>>  vfio_load_state_device_data(char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
>> +vfio_get_dirty_page_list(char *name, uint64_t start, uint64_t pfn_count, uint64_t page_size) " (%s) start 0x%"PRIx64" pfn_count 0x%"PRIx64 " page size 0x%"PRIx64
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index a022484d2636..dc1b83a0b4ef 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -222,5 +222,7 @@ int vfio_spapr_remove_window(VFIOContainer *container,
>>  
>>  int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
>>  void vfio_migration_finalize(VFIODevice *vbasedev);
>> +void vfio_get_dirty_page_list(VFIODevice *vbasedev, uint64_t start_pfn,
>> +                               uint64_t pfn_count, uint64_t page_size);
>>  
>>  #endif /* HW_VFIO_VFIO_COMMON_H */
>> -- 
>> 2.7.0
>>
> 



* Re: [Qemu-devel] [PATCH v7 10/13] vfio: Add load state functions to SaveVMHandlers
  2019-07-22 21:50           ` Yan Zhao
@ 2019-08-20 20:35             ` Kirti Wankhede
  0 siblings, 0 replies; 77+ messages in thread
From: Kirti Wankhede @ 2019-08-20 20:35 UTC (permalink / raw)
  To: Yan Zhao, Alex Williamson
  Cc: Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye, Ken.Xue,
	Zhengxiao.zx, shuangtai.tst, qemu-devel, dgilbert, pasic, aik,
	eauger, cohuck, jonathan.davies, felipe, mlevitsk, Liu,
	Changpeng, Wang, Zhi A



On 7/23/2019 3:20 AM, Yan Zhao wrote:
> On Tue, Jul 23, 2019 at 03:07:13AM +0800, Alex Williamson wrote:
>> On Sun, 21 Jul 2019 23:20:28 -0400
>> Yan Zhao <yan.y.zhao@intel.com> wrote:
>>
>>> On Fri, Jul 19, 2019 at 03:00:13AM +0800, Kirti Wankhede wrote:
>>>>
>>>>
>>>> On 7/12/2019 8:22 AM, Yan Zhao wrote:  
>>>>> On Tue, Jul 09, 2019 at 05:49:17PM +0800, Kirti Wankhede wrote:  
>>>>>> Flow during _RESUMING device state:
>>>>>> - If Vendor driver defines mappable region, mmap migration region.
>>>>>> - Load config state.
>>>>>> - For data packet, till VFIO_MIG_FLAG_END_OF_STATE is not reached
>>>>>>     - read data_size from packet, read buffer of data_size
>>>>>>     - read data_offset from where QEMU should write data.
>>>>>>         if region is mmaped, write data of data_size to mmaped region.
>>>>>>     - write data_size.
>>>>>>         In case of mmapped region, write to data_size indicates kernel
>>>>>>         driver that data is written in staging buffer.
>>>>>>     - if region is trapped, pwrite() data of data_size from data_offset.
>>>>>> - Repeat above until VFIO_MIG_FLAG_END_OF_STATE.
>>>>>> - Unmap migration region.
>>>>>>
>>>>>> For user, data is opaque. User should write data in the same order as
>>>>>> received.
>>>>>>
>>>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>>>>> ---
>>>>>>  hw/vfio/migration.c  | 162 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>  hw/vfio/trace-events |   3 +
>>>>>>  2 files changed, 165 insertions(+)
>>>>>>
>>>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>>>> index 4e9b4cce230b..5fb4c5329ede 100644
>>>>>> --- a/hw/vfio/migration.c
>>>>>> +++ b/hw/vfio/migration.c
>>>>>> @@ -249,6 +249,26 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
>>>>>>      return qemu_file_get_error(f);
>>>>>>  }
>>>>>>  
>>>>>> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>>>>>> +{
>>>>>> +    VFIODevice *vbasedev = opaque;
>>>>>> +    uint64_t data;
>>>>>> +
>>>>>> +    if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
>>>>>> +        vbasedev->ops->vfio_load_config(vbasedev, f);
>>>>>> +    }
>>>>>> +
>>>>>> +    data = qemu_get_be64(f);
>>>>>> +    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
>>>>>> +        error_report("%s: Failed loading device config space, "
>>>>>> +                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
>>>>>> +        return -EINVAL;
>>>>>> +    }
>>>>>> +
>>>>>> +    trace_vfio_load_device_config_state(vbasedev->name);
>>>>>> +    return qemu_file_get_error(f);
>>>>>> +}
>>>>>> +
>>>>>>  /* ---------------------------------------------------------------------- */
>>>>>>  
>>>>>>  static int vfio_save_setup(QEMUFile *f, void *opaque)
>>>>>> @@ -421,12 +441,154 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>>>>>      return ret;
>>>>>>  }
>>>>>>  
>>>>>> +static int vfio_load_setup(QEMUFile *f, void *opaque)
>>>>>> +{
>>>>>> +    VFIODevice *vbasedev = opaque;
>>>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>>>> +    int ret = 0;
>>>>>> +
>>>>>> +    if (migration->region.buffer.mmaps) {
>>>>>> +        ret = vfio_region_mmap(&migration->region.buffer);
>>>>>> +        if (ret) {
>>>>>> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
>>>>>> +                         vbasedev->name, migration->region.index,
>>>>>> +                         strerror(-ret));
>>>>>> +            return ret;
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING);
>>>>>> +    if (ret) {
>>>>>> +        error_report("%s: Failed to set state RESUMING", vbasedev->name);
>>>>>> +    }
>>>>>> +    return ret;
>>>>>> +}
>>>>>> +
>>>>>> +static int vfio_load_cleanup(void *opaque)
>>>>>> +{
>>>>>> +    vfio_save_cleanup(opaque);
>>>>>> +    return 0;
>>>>>> +}
>>>>>> +
>>>>>> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>>>>> +{
>>>>>> +    VFIODevice *vbasedev = opaque;
>>>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>>>> +    int ret = 0;
>>>>>> +    uint64_t data, data_size;
>>>>>> +  
>>>>> I think checking of version_id is still needed.
>>>>>   
>>>>
>>>> Checking version_id with what value?
>>>>  
>>> this version_id passed-in is the source VFIO software interface id.
>>> need to check it with the value in target side, right?
>>>
>>> Though we previously discussed the sysfs node interface to check live
>>> migration version even before launching live migration, I think we still
>>> need this runtime software version check in qemu to ensure software
>>> interfaces in QEMU VFIO are compatible.
>>
>> Do we want QEMU to interact directly with sysfs for that, which would
>> require write privileges to sysfs, or do we want to suggest that vendor
>> drivers should include equivalent information early in their migration
>> data stream to force a migration failure as early as possible for
>> incompatible data?  I think we need the latter regardless because the
>> vendor driver should never trust userspace like that, but does that
>> make any QEMU use of the sysfs version test itself redundant?  Thanks,
>>
>> Alex
> 
> hi Alex
> I think QEMU needs to check at least the code version of software interface in
> QEMU, like format of migration region, details of migration protocol,
> IOW, the software version QEMU interacts with vendor driver.
> This information should not be known to vendor driver until migration
> running to certain phase.
> e.g. if saving flow or format in source qemu is changed a little as a result
> of software upgrading, target qemu has to detect that from this
> version_id check, as vendor driver has no knowledge of that.
> Does that make sense?
> 

That is already done in qemu_loadvm_section_start_full()

    /* Validate version */
    if (version_id > se->version_id) {
        error_report("savevm: unsupported version %d for '%s' v%d",
                     version_id, idstr, se->version_id);
        return -EINVAL;
    }
    se->load_version_id = version_id;

Thanks,
Kirti



* Re: [Qemu-devel] [PATCH v7 01/13] vfio: KABI for migration interface
  2019-07-16 20:56   ` Alex Williamson
  2019-07-17 11:55     ` Cornelia Huck
  2019-07-23 12:13     ` Cornelia Huck
@ 2019-08-21 20:31     ` Kirti Wankhede
  2 siblings, 0 replies; 77+ messages in thread
From: Kirti Wankhede @ 2019-08-21 20:31 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue


Sorry for the delay.

On 7/17/2019 2:26 AM, Alex Williamson wrote:
> On Tue, 9 Jul 2019 15:19:08 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> - Defined MIGRATION region type and sub-type.
>> - Used 3 bits to define VFIO device states.
>>     Bit 0 => _RUNNING
>>     Bit 1 => _SAVING
>>     Bit 2 => _RESUMING
>>     Combination of these bits defines VFIO device's state during migration
>>     _STOPPED => All bits 0 indicates VFIO device stopped.
>>     _RUNNING => Normal VFIO device running state.
>>     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but start
>>                           saving state of device i.e. pre-copy state
>>     _SAVING  => vCPUs are stopped, VFIO device should be stopped, and
>>                           save device state,i.e. stop-n-copy state
>>     _RESUMING => VFIO device resuming state.
>>     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING bits are set
>> - Defined vfio_device_migration_info structure which will be placed at 0th
>>   offset of migration region to get/set VFIO device related information.
>>   Defined members of structure and usage on read/write access:
>>     * device_state: (read/write)
>>         To convey VFIO device state to be transitioned to. Only 3 bits are used
>>         as of now.
>>     * pending bytes: (read only)
>>         To get pending bytes yet to be migrated for VFIO device.
>>     * data_offset: (read only)
>>         To get data offset in migration from where data exist during _SAVING
>>         and from where data should be written by user space application during
>>          _RESUMING state
>>     * data_size: (read/write)
>>         To get and set size of data copied in migration region during _SAVING
>>         and _RESUMING state.
>>     * start_pfn, page_size, total_pfns: (write only)
>>         To get bitmap of dirty pages from vendor driver from given
>>         start address for total_pfns.
>>     * copied_pfns: (read only)
>>         To get number of pfns bitmap copied in migration region.
>>         Vendor driver should copy the bitmap with bits set only for
>>         pages to be marked dirty in migration region. Vendor driver
>>         should return 0 if there are 0 pages dirty in requested
>>         range. Vendor driver should return -1 to mark all pages in the section
>>         as dirty
>>
>> Migration region looks like:
>>  ------------------------------------------------------------------
>> |vfio_device_migration_info|    data section                      |
>> |                          |     ///////////////////////////////  |
>>  ------------------------------------------------------------------
>>  ^                              ^                              ^
>>  offset 0-trapped part        data_offset                 data_size
>>
>> Data section is always followed by vfio_device_migration_info
>> structure in the region, so data_offset will always be none-0.
>> Offset from where data is copied is decided by kernel driver, data
>> section can be trapped or mapped depending on how kernel driver
>> defines data section. If mmapped, then data_offset should be page
>> aligned, where as initial section which contain
>> vfio_device_migration_info structure might not end at offset which
>> is page aligned.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  linux-headers/linux/vfio.h | 166 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 166 insertions(+)
>>
>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>> index 24f505199f83..6696a4600545 100644
>> --- a/linux-headers/linux/vfio.h
>> +++ b/linux-headers/linux/vfio.h
>> @@ -372,6 +372,172 @@ struct vfio_region_gfx_edid {
>>   */
>>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
>>  
>> +/* Migration region type and sub-type */
>> +#define VFIO_REGION_TYPE_MIGRATION	        (2)
> 
> Region type #2 is already claimed by VFIO_REGION_TYPE_CCW, so this would
> need to be #3 or greater (we should have a reference table somewhere in
> this header as it gets easier to miss claimed entries as the sprawl
> grows).
> 
>> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
>> +
>> +/**
>> + * Structure vfio_device_migration_info is placed at 0th offset of
>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
>> + * information. Field accesses from this structure are only supported at their
>> + * native width and alignment, otherwise should return error.
> 
> This seems like a good unit test, a userspace driver that performs
> unaligned accesses to this space.  I'm afraid the wording above might
> suggest that if there's no error it must work though, which might put
> us in sticky support situations.  Should we say:
> 
> s/should return error/the result is undefined and vendor drivers should
> return an error/
> 
>> + *
>> + * device_state: (read/write)
>> + *      To indicate vendor driver the state VFIO device should be transitioned
>> + *      to. If device state transition fails, write on this field return error.
>> + *      It consists of 3 bits:
>> + *      - If bit 0 set, indicates _RUNNING state. When its reset, that indicates
>> + *        _STOPPED state. When device is changed to _STOPPED, driver should stop
>> + *        device before write() returns.
>> + *      - If bit 1 set, indicates _SAVING state.
>> + *      - If bit 2 set, indicates _RESUMING state.
>> + *      _SAVING and _RESUMING set at the same time is invalid state.
> 
> I think in the previous version there was a question of how we handle
> yet-to-be-defined bits.  For instance, if we defined a
> SUBTYPE_MIGRATIONv2 with the intention of making it backwards
> compatible with this version, do we declare the undefined bits as
> preserved so that the user should do a read-modify-write operation?
> 

Yes. Updating the comment accordingly.

>> + * pending bytes: (read only)
>> + *      Number of pending bytes yet to be migrated from vendor driver
> 
> Is this for _SAVING, _RESUMING, or both?
> 

It is for _SAVING only.
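[Editorial sketch: since pending_bytes drives the _SAVING loop, a userspace mirror of the structure proposed in this patch looks as below. The field offsets fall out of the packed declaration order; `precopy_done` is a hypothetical helper naming the loop's exit condition, not part of the proposed ABI.]

```c
#include <stdint.h>
#include <stddef.h>

/* Userspace mirror of the proposed v7 structure; packed, so the
 * field offsets are fixed by the declaration order. */
struct vfio_device_migration_info {
    uint32_t device_state;
    uint32_t reserved;
    uint64_t pending_bytes;   /* read-only; meaningful while _SAVING */
    uint64_t data_offset;
    uint64_t data_size;
    uint64_t start_pfn;
    uint64_t page_size;
    uint64_t total_pfns;
    uint64_t copied_pfns;
} __attribute__((packed));

/* The pre-copy loop ends once the vendor driver reports no pending
 * state; while the device is _RUNNING this is only a point-in-time
 * minimum, not a guarantee of convergence. */
static inline int precopy_done(uint64_t pending_bytes)
{
    return pending_bytes == 0;
}
```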

>> + *
>> + * data_offset: (read only)
>> + *      User application should read data_offset in migration region from where
>> + *      user application should read device data during _SAVING state or write
>> + *      device data during _RESUMING state or read dirty pages bitmap. See below
>> + *      for detail of sequence to be followed.
>> + *
>> + * data_size: (read/write)
>> + *      User application should read data_size to get size of data copied in
>> + *      migration region during _SAVING state and write size of data copied in
>> + *      migration region during _RESUMING state.
>> + *
>> + * start_pfn: (write only)
>> + *      Start address pfn to get bitmap of dirty pages from vendor driver duing
>> + *      _SAVING state.
> 
> There are some subtleties in PFN that I'm not sure we're accounting for
> here.  Devices operate in an IOVA space, which is defined by DMA_MAP
> calls.  The user says this IOVA maps to this process virtual address.
> When there is no vIOMMU, we can \assume\ that IOVA ~= GPA and therefore
> this interface provides dirty gfns.  However when we have a vIOMMU, we
> don't know the IOVA to GPA mapping, right?  So is it expected that the
> user is calling this with GFNs relative to the device address space
> (IOVA) or relative to the VM address space (GPA)?  For the kernel
> internal mdev interface, the pin pages API is always operating in the
> device view and I think never cares if those are IOVA or GPA.
> 

It's the IOVA.

>> + *
>> + * page_size: (write only)
>> + *      User application should write the page_size of pfn.
>> + *
>> + * total_pfns: (write only)
>> + *      Total pfn count from start_pfn for which dirty bitmap is requested.
>> + *
>> + * copied_pfns: (read only)
>> + *      pfn count for which dirty bitmap is copied to migration region.
>> + *      Vendor driver should copy the bitmap with bits set only for pages to be
>> + *      marked dirty in migration region.
>> + *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if none of the
>> + *        pages are dirty in requested range or rest of the range.
>> + *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to mark all
>> + *        pages dirty in the given section.
> 
> Does this have the same semantics as _NONE in being able to use it to
> report "all the remaining unreported pfns are dirty"?

Yes. Correcting logic accordingly in next version.

> 
>> + *      - Vendor driver should return pfn count for which bitmap is written in
>> + *        the region.
>> + *
>> + * Migration region looks like:
>> + *  ------------------------------------------------------------------
>> + * |vfio_device_migration_info|    data section                      |
>> + * |                          |     ///////////////////////////////  |
>> + * ------------------------------------------------------------------
>> + *   ^                              ^                              ^
>> + *  offset 0-trapped part        data_offset                 data_size
>> + *
>> + * Data section is always followed by vfio_device_migration_info structure
>> + * in the region, so data_offset will always be none-0. Offset from where data
> 
> s/none-0/non-0/  Or better, non-zero
> 
>> + * is copied is decided by kernel driver, data section can be trapped or
>> + * mapped depending on how kernel driver defines data section. If mmapped,
>> + * then data_offset should be page aligned, where as initial section which
>> + * contain vfio_device_migration_info structure might not end at offset which
>> + * is page aligned.
>> + * Data_offset can be same or different for device data and dirty page bitmap.
>> + * Vendor driver should decide whether to partition data section and how to
>> + * partition the data section. Vendor driver should return data_offset
>> + * accordingly.
> 
> I think we also want to talk about how the mmap support within this
> region is defined by a sparse mmap capability (this is required if
> any of it is mmap capable to support the non-mmap'd header) and the
> vendor driver can make portions of the data section mmap'able and
> others not.  I believe (unless we want to require otherwise) that the
> data_offset to data_offset+data_size range can arbitrarily span mmap
> supported sections to meet the vendor driver's needs.
> 
>> + *
>> + * Sequence to be followed:
>> + * In _SAVING|_RUNNING device state or pre-copy phase:
>> + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
>> + * b. read data_offset, indicates kernel driver to write data to staging buffer
>> + *    which is mmapped.
> 
> There's no requirement that it be mmap'd, right?  The vendor driver has
> the choice whether to support mmap, the user has the choice whether to
> access via mmap or read/write.
> 
>> + * c. read data_size, amount of data in bytes written by vendor driver in
>> + *    migration region.
>> + * d. if data section is trapped, read from data_offset of data_size.
>> + * e. if data section is mmaped, read data_size bytes from mmaped buffer from
>> + *    data_offset in the migration region.
> 
> Is it really necessary to specify these separately?  The user should
> read from data_offset to data_offset+data_size, optionally via direct
> mapped buffer as supported by the sparse mmap support within the region.
> 
>> + * f. Write data_size and data to file stream.
> 
> This is not really part of our specification, the user does whatever
> they want with the data.
> 
>> + * g. iterate through steps a to f while (pending_bytes > 0)
> 
> Is the read of pending_bytes an implicit indication to the vendor
> driver that the data area has been consumed?  If so, should this
> sequence always end with a read of pending_bytes to indicate to the
> vendor driver to flush that data?  I'm assuming there will be gap where
> the user reads save data from the device, does other things, and comes
> back to read more data.
> 
> What assumptions, if any, can the user make about pending_bytes?  For
> instance, if the device is _RUNNING, I assume no assumptions can be
> made, maybe with the exception that it represents the minimum pending
> state at that instant of time.

That's right.

>  The rate at which we're approaching
> convergence might be inferred, but any method to determine that would
> be beyond the scope here.
> 
>> + * In _SAVING device state or stop-and-copy phase:
>> + * a. read config space of device and save to migration file stream. This
>> + *    doesn't need to be from vendor driver. Any other special config state
>> + *    from driver can be saved as data in following iteration.
> 
> This is beyond the scope of the migration interface here (and config
> space is PCI specific).
> 
>> + * b. read pending_bytes.
>> + * c. read data_offset, indicates kernel driver to write data to staging
>> + *    buffer which is mmapped.
> 
> Or not.
> 
>> + * d. read data_size, amount of data in bytes written by vendor driver in
>> + *    migration region.
>> + * e. if data section is trapped, read from data_offset of data_size.
>> + * f. if data section is mmaped, read data_size bytes from mmaped buffer from
>> + *    data_offset in the migration region.
> 
> Same comment as above.
> 
>> + * g. Write data_size and data to file stream
> 
> Outside of the scope.
> 
>> + * h. iterate through steps b to g while (pending_bytes > 0)
> 
> Same question regarding indicating to vendor driver that the buffer has
> been consumed.
> 
>> + *
>> + * When data region is mapped, its user's responsibility to read data from
>> + * data_offset of data_size before moving to next steps.
> 
> Do we really want to condition this on being mmap'd?  This implies that
> when it is not mmap'd the vendor driver tracks the accesses to make
> sure that it was consumed?
> 
>> + * Dirty page tracking is part of RAM copy state, where vendor driver
>> + * provides the bitmap of pages which are dirtied by vendor driver through
>> + * migration region and as part of RAM copy those pages gets copied to file
>> + * stream.
> 
> We're mixing QEMU/VM use cases here, this is only the kernel interface
> spec, which can be used for such things, but is not tied to them.  RAM
> ties to the previous question of the address space and implies we're
> operating in the GFN space while the device really only knows about the
> IOVA space.
> 
>> + *
>> + * To get dirty page bitmap:
>> + * a. write start_pfn, page_size and total_pfns.
> 
> Is it required to write every field every time?  For instance page_size
> seems like it should only ever need to be written once.  Is there any
> ordering required?  It seems like step b) initiates the vendor driver
> to consume these fields, but that's not specified below.
> 
>> + * b. read copied_pfns.
>> + *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if driver
>> + *       doesn't have any page to report dirty in given range or rest of the
>> + *       range. Exit loop.
>> + *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to mark all
>> + *       pages dirty for given range. Mark all pages in the range as dirty and
>> + *       exit the loop.
>> + *     - Vendor driver should return copied_pfns and provide bitmap for
>> + *       copied_pfn, which means that bitmap copied for given range contains
>> + *       information for all pages where some bits are 0s and some are 1s.
>> + * c. read data_offset, where vendor driver has written bitmap.
>> + * d. read bitmap from the region or mmaped part of the region.
>> + * e. Iterate through steps a to d while (total copied_pfns < total_pfns)
> 
> I thought there was some automatic iteration built into this interface,
> is that dropped? 

Yes, that was discussed in previous version review.

> The user is now expected to do start_pf +=
> copied_pfns and total_pfns -= copied_pfns themsevles?

Yes.

>  Does anything
> indicate to the vendor driver when the data area has been consumed such
> that resources can be released?
> 

If the data section is trapped, then the read callback on that region gives
confirmation that the data has been consumed.
If the data section is mmapped, then there has to be a staging buffer. On
read(data_offset), data should be copied to the staging buffer, but the
vendor driver doesn't get any indication of when the read is done. In this
case, the vendor driver should copy the data to the staging buffer and
release resources.
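[Editorial sketch: the user-driven iteration agreed above (start_pfn += copied_pfns, total_pfns -= copied_pfns) can be written out as below. `dirty_window_advance` is a hypothetical helper; the _NONE/_ALL sentinel values are taken from this patch.]

```c
#include <stdint.h>

#define VFIO_DEVICE_DIRTY_PFNS_NONE     (0)
#define VFIO_DEVICE_DIRTY_PFNS_ALL      (~0ULL)

/* One step of the user-driven dirty-bitmap loop: given what the
 * vendor driver returned in copied_pfns, advance the query window.
 * Returns 1 when the loop should exit (the sentinels cover the
 * remaining range as all-clean or all-dirty). */
static int dirty_window_advance(uint64_t copied_pfns,
                                uint64_t *start_pfn, uint64_t *total_pfns)
{
    if (copied_pfns == VFIO_DEVICE_DIRTY_PFNS_NONE ||
        copied_pfns == VFIO_DEVICE_DIRTY_PFNS_ALL) {
        /* Rest of the range needs no further bitmap reads. */
        return 1;
    }
    *start_pfn += copied_pfns;   /* next window begins after copied pfns */
    *total_pfns -= copied_pfns;
    return *total_pfns == 0;
}
```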

>> + *
>> + * In _RESUMING device state:
>> + * - Load device config state.
> 
> Out of scope.
> 
>> + * - While end of data for this device is not reached, repeat below steps:
>> + *      - read data_size from file stream, read data from file stream of
>> + *        data_size.
> 
> Out of scope, how the user gets the data is a userspace implementation
> detail.  I think the important detail here is simply that each data
> transaction from the _SAVING process is indivisible and must translate
> to a _RESUMING transaction here.
> 
>> + *      - read data_offset from where User application should write data.
>> + *          if region is mmaped, write data of data_size to mmaped region.
>> + *      - write data_size.
>> + *          In case of mmapped region, write on data_size indicates kernel
>> + *          driver that data is written in staging buffer.
>> + *      - if region is trapped, write data of data_size from data_offset.
> 
> Gack!  We need something better here, the sequence should be the same
> regardless of the mechanism used to write the data.
> 

ok.

> It still confuses me how the resuming side can know where (data_offset)
> the incoming data should be written. 

The user space application reads data_offset; the vendor driver provides
this data_offset.

> If we're migrating a !_RUNNING
> device, then I can see how some portion of the device might be directly
> mmap'd and the sequence would be very deterministic.  But if we're
> migrating a _RUNNING device, wouldn't the current data block depend on
> what portions of the device are active, which would be difficult to
> predict?
> 

The vendor driver should be able to predict this from the data; for the
user space application, the data is opaque. See comment below.
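[Editorial sketch: the _RESUMING write ordering being described can be illustrated with an in-memory stand-in for the migration region. Everything here, the mock `mig_info` header and `resume_write_chunk`, is hypothetical illustration rather than the actual QEMU code; the point is that data_size is written last, which is what signals the vendor driver that the chunk is complete.]

```c
#include <stdint.h>
#include <string.h>

struct mig_info {           /* mock header at offset 0 of the region */
    uint64_t data_offset;
    uint64_t data_size;
};

static void resume_write_chunk(uint8_t *region, const void *chunk,
                               uint64_t len)
{
    struct mig_info *info = (struct mig_info *)region;

    /* 1. read data_offset: the vendor driver says where to put the data */
    uint64_t off = info->data_offset;

    /* 2. write the opaque chunk into the data section */
    memcpy(region + off, chunk, len);

    /* 3. write data_size LAST, signalling the chunk is complete */
    info->data_size = len;
}
```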

>> + *
>> + * For user application, data is opaque. User should write data in the same
>> + * order as received.
>> + */
>> +
>> +struct vfio_device_migration_info {
>> +        __u32 device_state;         /* VFIO device state */
>> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
>> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
>> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
>> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
>> +                                     VFIO_DEVICE_STATE_SAVING | \
>> +                                     VFIO_DEVICE_STATE_RESUMING)
> 
> Yes, we have the mask in here now, but no mention above how the user
> should handle undefined bits.

Updating the comment.

Thanks,
Kirti


>  Thanks,
> 
> Alex
> 
>> +#define VFIO_DEVICE_STATE_INVALID   (VFIO_DEVICE_STATE_SAVING | \
>> +                                     VFIO_DEVICE_STATE_RESUMING)
>> +        __u32 reserved;
>> +        __u64 pending_bytes;
>> +        __u64 data_offset;
>> +        __u64 data_size;
>> +        __u64 start_pfn;
>> +        __u64 page_size;
>> +        __u64 total_pfns;
>> +        __u64 copied_pfns;
>> +#define VFIO_DEVICE_DIRTY_PFNS_NONE     (0)
>> +#define VFIO_DEVICE_DIRTY_PFNS_ALL      (~0ULL)
>> +} __attribute__((packed));
>> +
>>  /*
>>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>>   * which allows direct access to non-MSIX registers which happened to be within
> 


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 01/13] vfio: KABI for migration interface
  2019-07-23 12:13     ` Cornelia Huck
@ 2019-08-21 20:32       ` Kirti Wankhede
  0 siblings, 0 replies; 77+ messages in thread
From: Kirti Wankhede @ 2019-08-21 20:32 UTC (permalink / raw)
  To: Cornelia Huck, Alex Williamson
  Cc: kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang, Ken.Xue,
	Zhengxiao.zx, shuangtai.tst, qemu-devel, dgilbert, pasic, aik,
	eauger, felipe, jonathan.davies, yan.y.zhao, mlevitsk,
	changpeng.liu, zhi.a.wang



On 7/23/2019 5:43 PM, Cornelia Huck wrote:
> On Tue, 16 Jul 2019 14:56:32 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
>> On Tue, 9 Jul 2019 15:19:08 +0530
>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> I'm still a bit unsure about the device_state bit handling as well.
> 
>>> + * device_state: (read/write)
>>> + *      To indicate vendor driver the state VFIO device should be transitioned
>>> + *      to. If device state transition fails, write on this field return error.
> 
> Does 'device state transition fails' include 'the device state written
> was invalid'?
> 

Yes.

>>> + *      It consists of 3 bits:
>>> + *      - If bit 0 set, indicates _RUNNING state. When its reset, that indicates
>>> + *        _STOPPED state. When device is changed to _STOPPED, driver should stop
>>> + *        device before write() returns.
> 
> So _STOPPED is always !_RUNNING, regardless of which other bits are set?
>

Yes.

>>> + *      - If bit 1 set, indicates _SAVING state.
>>> + *      - If bit 2 set, indicates _RESUMING state.
>>> + *      _SAVING and _RESUMING set at the same time is invalid state.  
> 
> What about _RUNNING | _RESUMING -- does that make sense?
>

I think this will be a valid state in the postcopy case, though I'm not very sure.
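[Editorial sketch: the validity rules implied by this exchange can be written as the hypothetical check below. _SAVING | _RESUMING is the one combination the patch declares invalid; since whether _RUNNING | _RESUMING is legal (post-copy) is still open at this point in the thread, it is accepted here, and undefined bits are rejected pending the read-modify-write discussion.]

```c
#include <stdint.h>

#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
                                     VFIO_DEVICE_STATE_SAVING | \
                                     VFIO_DEVICE_STATE_RESUMING)

static int device_state_valid(uint32_t state)
{
    if (state & ~VFIO_DEVICE_STATE_MASK)
        return 0;   /* undefined bits: handling still under discussion */
    if ((state & (VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING)) ==
        (VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING))
        return 0;   /* _SAVING | _RESUMING is declared invalid */
    return 1;       /* includes all-bits-clear, i.e. _STOPPED */
}
```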


>>
>> I think in the previous version there was a question of how we handle
>> yet-to-be-defined bits.  For instance, if we defined a
>> SUBTYPE_MIGRATIONv2 with the intention of making it backwards
>> compatible with this version, do we declare the undefined bits as
>> preserved so that the user should do a read-modify-write operation?
> 
> Or can we state that undefined bits are ignored, and may or may not
> preserved, so that we can skip the read-modify-write requirement? v1
> and v2 can hopefully be distinguished in a different way.
> 

Updating the comment in the next version.

Thanks,
Kirti

> (...)
> 
>>> +struct vfio_device_migration_info {
>>> +        __u32 device_state;         /* VFIO device state */
>>> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
>>> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
>>> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
>>> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
>>> +                                     VFIO_DEVICE_STATE_SAVING | \
>>> +                                     VFIO_DEVICE_STATE_RESUMING)  
>>
>> Yes, we have the mask in here now, but no mention above how the user
>> should handle undefined bits.  Thanks,
>>
>> Alex
>>
>>> +#define VFIO_DEVICE_STATE_INVALID   (VFIO_DEVICE_STATE_SAVING | \
>>> +                                     VFIO_DEVICE_STATE_RESUMING)
> 
> As mentioned above, does _RESUMING | _RUNNING make sense?
> 



* Re: [Qemu-devel] [PATCH v7 04/13] vfio: Add save and load functions for VFIO PCI devices
  2019-07-11 12:07   ` Dr. David Alan Gilbert
@ 2019-08-22  4:50     ` Kirti Wankhede
  2019-08-22  9:32       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 77+ messages in thread
From: Kirti Wankhede @ 2019-08-22  4:50 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

Sorry for the delay in responding.

On 7/11/2019 5:37 PM, Dr. David Alan Gilbert wrote:
> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
>> These functions save and restore PCI device specific data - config
>> space of PCI device.
>> Tested save and restore with MSI and MSIX type.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  hw/vfio/pci.c                 | 114 ++++++++++++++++++++++++++++++++++++++++++
>>  include/hw/vfio/vfio-common.h |   2 +
>>  2 files changed, 116 insertions(+)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index de0d286fc9dd..5fe4f8076cac 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -2395,11 +2395,125 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
>>      return OBJECT(vdev);
>>  }
>>  
>> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
>> +{
>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    uint16_t pci_cmd;
>> +    int i;
>> +
>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
>> +        uint32_t bar;
>> +
>> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
>> +        qemu_put_be32(f, bar);
>> +    }
>> +
>> +    qemu_put_be32(f, vdev->interrupt);
>> +    if (vdev->interrupt == VFIO_INT_MSI) {
>> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
>> +        bool msi_64bit;
>> +
>> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>> +                                            2);
>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
>> +
>> +        msi_addr_lo = pci_default_read_config(pdev,
>> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
>> +        qemu_put_be32(f, msi_addr_lo);
>> +
>> +        if (msi_64bit) {
>> +            msi_addr_hi = pci_default_read_config(pdev,
>> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
>> +                                             4);
>> +        }
>> +        qemu_put_be32(f, msi_addr_hi);
>> +
>> +        msi_data = pci_default_read_config(pdev,
>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
>> +                2);
>> +        qemu_put_be32(f, msi_data);
>> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
>> +        uint16_t offset;
>> +
>> +        /* save enable bit and maskall bit */
>> +        offset = pci_default_read_config(pdev,
>> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
>> +        qemu_put_be16(f, offset);
>> +        msix_save(pdev, f);
>> +    }
>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
>> +    qemu_put_be16(f, pci_cmd);
>> +}
>> +
>> +static void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
>> +{
>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    uint32_t interrupt_type;
>> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
>> +    uint16_t pci_cmd;
>> +    bool msi_64bit;
>> +    int i;
>> +
>> +    /* retore pci bar configuration */
>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
>> +    vfio_pci_write_config(pdev, PCI_COMMAND,
>> +                        pci_cmd & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
>> +        uint32_t bar = qemu_get_be32(f);
>> +
>> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
>> +    }
> 
> Is it possible to validate the bar's at all?  We just had a bug on a
> virtual device where one version was asking for a larger bar than the
> other; our validation caught this in some cases so we could tell that
> the guest had a BAR that was aligned at the wrong alignment.
> 

"Validate the bars" - does that mean validate the size of the BARs?

>> +    vfio_pci_write_config(pdev, PCI_COMMAND,
>> +                          pci_cmd | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
> 
> Can you explain what this is for?  You write the command register at the
> end of the function with the original value; there's no guarantee that
> the device is using IO for example, so ORing it seems odd.
> 

IO space and memory space accesses are disabled before writing the BAR
addresses; only those bits are enabled here.
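[Editorial note with a sketch: clearing only the decode bits requires a bitwise NOT. As written, the patch's `pci_cmd & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY))` evaluates to `pci_cmd & 0` in C, which clears the entire command register, not just the two decode bits. A minimal sketch of the presumably intended masking, using hypothetical helper names:]

```c
#include <stdint.h>

#define PCI_COMMAND_IO      0x1   /* enable I/O space decode */
#define PCI_COMMAND_MEMORY  0x2   /* enable memory space decode */

/* Clear only the I/O and memory decode bits (bitwise NOT, not the
 * logical NOT used in the patch). */
static uint16_t pci_cmd_disable_decode(uint16_t pci_cmd)
{
    return pci_cmd & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY);
}

/* Re-enable decode after the BARs have been restored. */
static uint16_t pci_cmd_enable_decode(uint16_t pci_cmd)
{
    return pci_cmd | PCI_COMMAND_IO | PCI_COMMAND_MEMORY;
}
```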

> Also, are the other flags in COMMAND safe at this point - e.g. what
> about interrupts and stuff?
> 

The COMMAND register is saved during the stop-and-copy phase, when
interrupts should be disabled, and it is restored here while vCPUs are
not yet running.

>> +    interrupt_type = qemu_get_be32(f);
>> +
>> +    if (interrupt_type == VFIO_INT_MSI) {
>> +        /* restore msi configuration */
>> +        msi_flags = pci_default_read_config(pdev,
>> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
>> +
>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>> +                              msi_flags & (!PCI_MSI_FLAGS_ENABLE), 2);
>> +
>> +        msi_addr_lo = qemu_get_be32(f);
>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
>> +                              msi_addr_lo, 4);
>> +
>> +        msi_addr_hi = qemu_get_be32(f);
>> +        if (msi_64bit) {
>> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
>> +                                  msi_addr_hi, 4);
>> +        }
>> +        msi_data = qemu_get_be32(f);
>> +        vfio_pci_write_config(pdev,
>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
>> +                msi_data, 2);
>> +
>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
>> +    } else if (interrupt_type == VFIO_INT_MSIX) {
>> +        uint16_t offset = qemu_get_be16(f);
>> +
>> +        /* load enable bit and maskall bit */
>> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
>> +                              offset, 2);
>> +        msix_load(pdev, f);
>> +    }
>> +    pci_cmd = qemu_get_be16(f);
>> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
>> +}
>> +
>>  static VFIODeviceOps vfio_pci_ops = {
>>      .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
>>      .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
>>      .vfio_eoi = vfio_intx_eoi,
>>      .vfio_get_object = vfio_pci_get_object,
>> +    .vfio_save_config = vfio_pci_save_config,
>> +    .vfio_load_config = vfio_pci_load_config,
>>  };
>>  
>>  int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 771b6d59a3db..ee72bd984a36 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -120,6 +120,8 @@ struct VFIODeviceOps {
>>      int (*vfio_hot_reset_multi)(VFIODevice *vdev);
>>      void (*vfio_eoi)(VFIODevice *vdev);
>>      Object *(*vfio_get_object)(VFIODevice *vdev);
>> +    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
>> +    void (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
>>  };
>>  
>>  typedef struct VFIOGroup {
>> -- 
>> 2.7.0
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 



* Re: [Qemu-devel] [PATCH v7 04/13] vfio: Add save and load functions for VFIO PCI devices
  2019-08-22  4:50     ` Kirti Wankhede
@ 2019-08-22  9:32       ` Dr. David Alan Gilbert
  2019-08-22 19:10         ` Kirti Wankhede
  0 siblings, 1 reply; 77+ messages in thread
From: Dr. David Alan Gilbert @ 2019-08-22  9:32 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> Sorry for delay to respond.
> 
> On 7/11/2019 5:37 PM, Dr. David Alan Gilbert wrote:
> > * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> >> These functions save and restore PCI device specific data - config
> >> space of PCI device.
> >> Tested save and restore with MSI and MSIX type.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>  hw/vfio/pci.c                 | 114 ++++++++++++++++++++++++++++++++++++++++++
> >>  include/hw/vfio/vfio-common.h |   2 +
> >>  2 files changed, 116 insertions(+)
> >>
> >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >> index de0d286fc9dd..5fe4f8076cac 100644
> >> --- a/hw/vfio/pci.c
> >> +++ b/hw/vfio/pci.c
> >> @@ -2395,11 +2395,125 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
> >>      return OBJECT(vdev);
> >>  }
> >>  
> >> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> >> +{
> >> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> >> +    PCIDevice *pdev = &vdev->pdev;
> >> +    uint16_t pci_cmd;
> >> +    int i;
> >> +
> >> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> >> +        uint32_t bar;
> >> +
> >> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> >> +        qemu_put_be32(f, bar);
> >> +    }
> >> +
> >> +    qemu_put_be32(f, vdev->interrupt);
> >> +    if (vdev->interrupt == VFIO_INT_MSI) {
> >> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> >> +        bool msi_64bit;
> >> +
> >> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> >> +                                            2);
> >> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> >> +
> >> +        msi_addr_lo = pci_default_read_config(pdev,
> >> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> >> +        qemu_put_be32(f, msi_addr_lo);
> >> +
> >> +        if (msi_64bit) {
> >> +            msi_addr_hi = pci_default_read_config(pdev,
> >> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> >> +                                             4);
> >> +        }
> >> +        qemu_put_be32(f, msi_addr_hi);
> >> +
> >> +        msi_data = pci_default_read_config(pdev,
> >> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> >> +                2);
> >> +        qemu_put_be32(f, msi_data);
> >> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
> >> +        uint16_t offset;
> >> +
> >> +        /* save enable bit and maskall bit */
> >> +        offset = pci_default_read_config(pdev,
> >> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> >> +        qemu_put_be16(f, offset);
> >> +        msix_save(pdev, f);
> >> +    }
> >> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> >> +    qemu_put_be16(f, pci_cmd);
> >> +}
> >> +
> >> +static void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> >> +{
> >> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> >> +    PCIDevice *pdev = &vdev->pdev;
> >> +    uint32_t interrupt_type;
> >> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> >> +    uint16_t pci_cmd;
> >> +    bool msi_64bit;
> >> +    int i;
> >> +
> >> +    /* restore pci bar configuration */
> >> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> >> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> >> +                        pci_cmd & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
> >> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> >> +        uint32_t bar = qemu_get_be32(f);
> >> +
> >> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> >> +    }
> > 
> > Is it possible to validate the bar's at all?  We just had a bug on a
> > virtual device where one version was asking for a larger bar than the
> > other; our validation caught this in some cases so we could tell that
> > the guest had a BAR that was aligned at the wrong alignment.
> > 
> 
> By "validate the bars", do you mean validating the size of the BARs?

I meant validate the address programmed into the BAR against the size,
assuming you know the size; e.g. if it's a 128MB BAR, then make sure the
address programmed in is 128MB aligned.
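
A minimal sketch of that alignment check (the helper name, and the idea that
the destination already knows each BAR's size, e.g. from
VFIO_DEVICE_GET_REGION_INFO, are assumptions here, not part of the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_BASE_ADDRESS_SPACE_IO  0x01u
#define PCI_BASE_ADDRESS_IO_MASK   (~0x03u)
#define PCI_BASE_ADDRESS_MEM_MASK  (~0x0fu)

/*
 * PCI BARs are naturally aligned: a 128MB BAR must sit on a 128MB
 * boundary.  A restored BAR value whose address bits violate that
 * alignment indicates corrupt or incompatible migration data, so the
 * caller should fail the resume.
 */
static bool bar_addr_valid(uint32_t bar, uint64_t size)
{
    uint64_t addr;

    if (size == 0) {        /* unimplemented BAR: nothing to validate */
        return true;
    }
    if (bar & PCI_BASE_ADDRESS_SPACE_IO) {
        addr = bar & PCI_BASE_ADDRESS_IO_MASK;
    } else {
        addr = bar & PCI_BASE_ADDRESS_MEM_MASK;
    }
    return (addr & (size - 1)) == 0;
}
```
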

> >> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> >> +                          pci_cmd | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
> > 
> > Can you explain what this is for?  You write the command register at the
> > end of the function with the original value; there's no guarantee that
> > the device is using IO for example, so ORing it seems odd.
> > 
> 
> IO space and memory space accesses are disabled before writing the BAR
> addresses; only those bits are enabled here.

But do you need to enable them here, or can it wait until the pci_cmd
write at the end of the function?
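
A toy model of that ordering, with the BAR writes bracketed by a decode
disable and a single final COMMAND restore (the struct and helper are
hypothetical, not QEMU API; note also that clearing decode needs bitwise ~,
while the hunk above uses logical !, which evaluates to 0):

```c
#include <assert.h>
#include <stdint.h>

#define PCI_COMMAND_IO     0x1u
#define PCI_COMMAND_MEMORY 0x2u

struct toy_pci {
    uint16_t command;
    uint32_t bar[6];
};

/*
 * Restore BARs with decoding off, then put the saved COMMAND value
 * back in one final write.  No intermediate OR-in of IO/MEMORY: if
 * the device never used IO decoding, it never gets enabled.
 */
static void toy_load_config(struct toy_pci *d, uint16_t saved_cmd,
                            const uint32_t saved_bars[6])
{
    int i;

    d->command &= (uint16_t)~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY);
    for (i = 0; i < 6; i++) {
        d->bar[i] = saved_bars[i];
    }
    d->command = saved_cmd;     /* decode comes back only here */
}
```
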

> > Also, are the other flags in COMMAND safe at this point - e.g. what
> > about interrupts and stuff?
> > 
> 
> The COMMAND register is saved during the stop-and-copy phase, when
> interrupts should already be disabled; it is restored here while the
> vCPUs are not yet running.

Dave

> >> +    interrupt_type = qemu_get_be32(f);
> >> +
> >> +    if (interrupt_type == VFIO_INT_MSI) {
> >> +        /* restore msi configuration */
> >> +        msi_flags = pci_default_read_config(pdev,
> >> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> >> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> >> +
> >> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> >> +                              msi_flags & (!PCI_MSI_FLAGS_ENABLE), 2);
> >> +
> >> +        msi_addr_lo = qemu_get_be32(f);
> >> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
> >> +                              msi_addr_lo, 4);
> >> +
> >> +        msi_addr_hi = qemu_get_be32(f);
> >> +        if (msi_64bit) {
> >> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> >> +                                  msi_addr_hi, 4);
> >> +        }
> >> +        msi_data = qemu_get_be32(f);
> >> +        vfio_pci_write_config(pdev,
> >> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> >> +                msi_data, 2);
> >> +
> >> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> >> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
> >> +    } else if (interrupt_type == VFIO_INT_MSIX) {
> >> +        uint16_t offset = qemu_get_be16(f);
> >> +
> >> +        /* load enable bit and maskall bit */
> >> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> >> +                              offset, 2);
> >> +        msix_load(pdev, f);
> >> +    }
> >> +    pci_cmd = qemu_get_be16(f);
> >> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
> >> +}
> >> +
> >>  static VFIODeviceOps vfio_pci_ops = {
> >>      .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
> >>      .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
> >>      .vfio_eoi = vfio_intx_eoi,
> >>      .vfio_get_object = vfio_pci_get_object,
> >> +    .vfio_save_config = vfio_pci_save_config,
> >> +    .vfio_load_config = vfio_pci_load_config,
> >>  };
> >>  
> >>  int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
> >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >> index 771b6d59a3db..ee72bd984a36 100644
> >> --- a/include/hw/vfio/vfio-common.h
> >> +++ b/include/hw/vfio/vfio-common.h
> >> @@ -120,6 +120,8 @@ struct VFIODeviceOps {
> >>      int (*vfio_hot_reset_multi)(VFIODevice *vdev);
> >>      void (*vfio_eoi)(VFIODevice *vdev);
> >>      Object *(*vfio_get_object)(VFIODevice *vdev);
> >> +    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
> >> +    void (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
> >>  };
> >>  
> >>  typedef struct VFIOGroup {
> >> -- 
> >> 2.7.0
> >>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Qemu-devel] [PATCH v7 04/13] vfio: Add save and load functions for VFIO PCI devices
  2019-08-22  9:32       ` Dr. David Alan Gilbert
@ 2019-08-22 19:10         ` Kirti Wankhede
  2019-08-22 19:13           ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 77+ messages in thread
From: Kirti Wankhede @ 2019-08-22 19:10 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue



On 8/22/2019 3:02 PM, Dr. David Alan Gilbert wrote:
> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
>> Sorry for the delay in responding.
>>
>> On 7/11/2019 5:37 PM, Dr. David Alan Gilbert wrote:
>>> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
>>>> These functions save and restore PCI device specific data - config
>>>> space of PCI device.
>>>> Tested save and restore with MSI and MSIX type.
>>>>
>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>>> ---
>>>>  hw/vfio/pci.c                 | 114 ++++++++++++++++++++++++++++++++++++++++++
>>>>  include/hw/vfio/vfio-common.h |   2 +
>>>>  2 files changed, 116 insertions(+)
>>>>
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index de0d286fc9dd..5fe4f8076cac 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -2395,11 +2395,125 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
>>>>      return OBJECT(vdev);
>>>>  }
>>>>  
>>>> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
>>>> +{
>>>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>>>> +    PCIDevice *pdev = &vdev->pdev;
>>>> +    uint16_t pci_cmd;
>>>> +    int i;
>>>> +
>>>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
>>>> +        uint32_t bar;
>>>> +
>>>> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
>>>> +        qemu_put_be32(f, bar);
>>>> +    }
>>>> +
>>>> +    qemu_put_be32(f, vdev->interrupt);
>>>> +    if (vdev->interrupt == VFIO_INT_MSI) {
>>>> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
>>>> +        bool msi_64bit;
>>>> +
>>>> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>>>> +                                            2);
>>>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
>>>> +
>>>> +        msi_addr_lo = pci_default_read_config(pdev,
>>>> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
>>>> +        qemu_put_be32(f, msi_addr_lo);
>>>> +
>>>> +        if (msi_64bit) {
>>>> +            msi_addr_hi = pci_default_read_config(pdev,
>>>> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
>>>> +                                             4);
>>>> +        }
>>>> +        qemu_put_be32(f, msi_addr_hi);
>>>> +
>>>> +        msi_data = pci_default_read_config(pdev,
>>>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
>>>> +                2);
>>>> +        qemu_put_be32(f, msi_data);
>>>> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
>>>> +        uint16_t offset;
>>>> +
>>>> +        /* save enable bit and maskall bit */
>>>> +        offset = pci_default_read_config(pdev,
>>>> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
>>>> +        qemu_put_be16(f, offset);
>>>> +        msix_save(pdev, f);
>>>> +    }
>>>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
>>>> +    qemu_put_be16(f, pci_cmd);
>>>> +}
>>>> +
>>>> +static void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
>>>> +{
>>>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>>>> +    PCIDevice *pdev = &vdev->pdev;
>>>> +    uint32_t interrupt_type;
>>>> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
>>>> +    uint16_t pci_cmd;
>>>> +    bool msi_64bit;
>>>> +    int i;
>>>> +
>>>> +    /* restore pci bar configuration */
>>>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
>>>> +    vfio_pci_write_config(pdev, PCI_COMMAND,
>>>> +                        pci_cmd & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
>>>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
>>>> +        uint32_t bar = qemu_get_be32(f);
>>>> +
>>>> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
>>>> +    }
>>>
>>> Is it possible to validate the bar's at all?  We just had a bug on a
>>> virtual device where one version was asking for a larger bar than the
>>> other; our validation caught this in some cases so we could tell that
>>> the guest had a BAR that was aligned at the wrong alignment.
>>>
>>
>> By "validate the bars", do you mean validating the size of the BARs?
> 
> I meant validate the address programmed into the BAR against the size,
> assuming you know the size; e.g. if it's a 128MB BAR, then make sure the
> address programmed in is 128MB aligned.
> 

If this validation fails, migration resume should fail, right?


>>>> +    vfio_pci_write_config(pdev, PCI_COMMAND,
>>>> +                          pci_cmd | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
>>>
>>> Can you explain what this is for?  You write the command register at the
>>> end of the function with the original value; there's no guarantee that
>>> the device is using IO for example, so ORing it seems odd.
>>>
>>
>> IO space and memory space accesses are disabled before writing the BAR
>> addresses; only those bits are enabled here.
> 
> But do you need to enable them here, or can it wait until the pci_cmd
> write at the end of the function?
>

Ok, it can wait.

Thanks,
Kirti


>>> Also, are the other flags in COMMAND safe at this point - e.g. what
>>> about interrupts and stuff?
>>>
>>
>> The COMMAND register is saved during the stop-and-copy phase, when
>> interrupts should already be disabled; it is restored here while the
>> vCPUs are not yet running.
> 
> Dave
> 
>>>> +    interrupt_type = qemu_get_be32(f);
>>>> +
>>>> +    if (interrupt_type == VFIO_INT_MSI) {
>>>> +        /* restore msi configuration */
>>>> +        msi_flags = pci_default_read_config(pdev,
>>>> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
>>>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
>>>> +
>>>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>>>> +                              msi_flags & (!PCI_MSI_FLAGS_ENABLE), 2);
>>>> +
>>>> +        msi_addr_lo = qemu_get_be32(f);
>>>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
>>>> +                              msi_addr_lo, 4);
>>>> +
>>>> +        msi_addr_hi = qemu_get_be32(f);
>>>> +        if (msi_64bit) {
>>>> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
>>>> +                                  msi_addr_hi, 4);
>>>> +        }
>>>> +        msi_data = qemu_get_be32(f);
>>>> +        vfio_pci_write_config(pdev,
>>>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
>>>> +                msi_data, 2);
>>>> +
>>>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>>>> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
>>>> +    } else if (interrupt_type == VFIO_INT_MSIX) {
>>>> +        uint16_t offset = qemu_get_be16(f);
>>>> +
>>>> +        /* load enable bit and maskall bit */
>>>> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
>>>> +                              offset, 2);
>>>> +        msix_load(pdev, f);
>>>> +    }
>>>> +    pci_cmd = qemu_get_be16(f);
>>>> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
>>>> +}
>>>> +
>>>>  static VFIODeviceOps vfio_pci_ops = {
>>>>      .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
>>>>      .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
>>>>      .vfio_eoi = vfio_intx_eoi,
>>>>      .vfio_get_object = vfio_pci_get_object,
>>>> +    .vfio_save_config = vfio_pci_save_config,
>>>> +    .vfio_load_config = vfio_pci_load_config,
>>>>  };
>>>>  
>>>>  int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>> index 771b6d59a3db..ee72bd984a36 100644
>>>> --- a/include/hw/vfio/vfio-common.h
>>>> +++ b/include/hw/vfio/vfio-common.h
>>>> @@ -120,6 +120,8 @@ struct VFIODeviceOps {
>>>>      int (*vfio_hot_reset_multi)(VFIODevice *vdev);
>>>>      void (*vfio_eoi)(VFIODevice *vdev);
>>>>      Object *(*vfio_get_object)(VFIODevice *vdev);
>>>> +    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
>>>> +    void (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
>>>>  };
>>>>  
>>>>  typedef struct VFIOGroup {
>>>> -- 
>>>> 2.7.0
>>>>
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 



* Re: [Qemu-devel] [PATCH v7 04/13] vfio: Add save and load functions for VFIO PCI devices
  2019-08-22 19:10         ` Kirti Wankhede
@ 2019-08-22 19:13           ` Dr. David Alan Gilbert
  2019-08-22 23:57             ` Tian, Kevin
  0 siblings, 1 reply; 77+ messages in thread
From: Dr. David Alan Gilbert @ 2019-08-22 19:13 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> 
> 
> On 8/22/2019 3:02 PM, Dr. David Alan Gilbert wrote:
> > * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> >> Sorry for the delay in responding.
> >>
> >> On 7/11/2019 5:37 PM, Dr. David Alan Gilbert wrote:
> >>> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> >>>> These functions save and restore PCI device specific data - config
> >>>> space of PCI device.
> >>>> Tested save and restore with MSI and MSIX type.
> >>>>
> >>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >>>> ---
> >>>>  hw/vfio/pci.c                 | 114 ++++++++++++++++++++++++++++++++++++++++++
> >>>>  include/hw/vfio/vfio-common.h |   2 +
> >>>>  2 files changed, 116 insertions(+)
> >>>>
> >>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >>>> index de0d286fc9dd..5fe4f8076cac 100644
> >>>> --- a/hw/vfio/pci.c
> >>>> +++ b/hw/vfio/pci.c
> >>>> @@ -2395,11 +2395,125 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
> >>>>      return OBJECT(vdev);
> >>>>  }
> >>>>  
> >>>> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> >>>> +{
> >>>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> >>>> +    PCIDevice *pdev = &vdev->pdev;
> >>>> +    uint16_t pci_cmd;
> >>>> +    int i;
> >>>> +
> >>>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> >>>> +        uint32_t bar;
> >>>> +
> >>>> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> >>>> +        qemu_put_be32(f, bar);
> >>>> +    }
> >>>> +
> >>>> +    qemu_put_be32(f, vdev->interrupt);
> >>>> +    if (vdev->interrupt == VFIO_INT_MSI) {
> >>>> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> >>>> +        bool msi_64bit;
> >>>> +
> >>>> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> >>>> +                                            2);
> >>>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> >>>> +
> >>>> +        msi_addr_lo = pci_default_read_config(pdev,
> >>>> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> >>>> +        qemu_put_be32(f, msi_addr_lo);
> >>>> +
> >>>> +        if (msi_64bit) {
> >>>> +            msi_addr_hi = pci_default_read_config(pdev,
> >>>> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> >>>> +                                             4);
> >>>> +        }
> >>>> +        qemu_put_be32(f, msi_addr_hi);
> >>>> +
> >>>> +        msi_data = pci_default_read_config(pdev,
> >>>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> >>>> +                2);
> >>>> +        qemu_put_be32(f, msi_data);
> >>>> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
> >>>> +        uint16_t offset;
> >>>> +
> >>>> +        /* save enable bit and maskall bit */
> >>>> +        offset = pci_default_read_config(pdev,
> >>>> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> >>>> +        qemu_put_be16(f, offset);
> >>>> +        msix_save(pdev, f);
> >>>> +    }
> >>>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> >>>> +    qemu_put_be16(f, pci_cmd);
> >>>> +}
> >>>> +
> >>>> +static void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> >>>> +{
> >>>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> >>>> +    PCIDevice *pdev = &vdev->pdev;
> >>>> +    uint32_t interrupt_type;
> >>>> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> >>>> +    uint16_t pci_cmd;
> >>>> +    bool msi_64bit;
> >>>> +    int i;
> >>>> +
> >>>> +    /* restore pci bar configuration */
> >>>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> >>>> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> >>>> +                        pci_cmd & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
> >>>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> >>>> +        uint32_t bar = qemu_get_be32(f);
> >>>> +
> >>>> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> >>>> +    }
> >>>
> >>> Is it possible to validate the bar's at all?  We just had a bug on a
> >>> virtual device where one version was asking for a larger bar than the
> >>> other; our validation caught this in some cases so we could tell that
> >>> the guest had a BAR that was aligned at the wrong alignment.
> >>>
> >>
> >> By "validate the bars", do you mean validating the size of the BARs?
> > 
> > I meant validate the address programmed into the BAR against the size,
> > assuming you know the size; e.g. if it's a 128MB BAR, then make sure the
> > address programmed in is 128MB aligned.
> > 
> 
> If this validation fails, migration resume should fail, right?

Yes I think so; if you've got a device that wants 128MB alignment and
someone gives you a non-aligned address, who knows what will happen.

> 
> >>>> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> >>>> +                          pci_cmd | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
> >>>
> >>> Can you explain what this is for?  You write the command register at the
> >>> end of the function with the original value; there's no guarantee that
> >>> the device is using IO for example, so ORing it seems odd.
> >>>
> >>
> >> IO space and memory space accesses are disabled before writing the BAR
> >> addresses; only those bits are enabled here.
> > 
> > But do you need to enable them here, or can it wait until the pci_cmd
> > write at the end of the function?
> >
> 
> Ok, it can wait.

Great.

Dave

> Thanks,
> Kirti
> 
> 
> >>> Also, are the other flags in COMMAND safe at this point - e.g. what
> >>> about interrupts and stuff?
> >>>
> >>
> >> The COMMAND register is saved during the stop-and-copy phase, when
> >> interrupts should already be disabled; it is restored here while the
> >> vCPUs are not yet running.
> > 
> > Dave
> > 
> >>>> +    interrupt_type = qemu_get_be32(f);
> >>>> +
> >>>> +    if (interrupt_type == VFIO_INT_MSI) {
> >>>> +        /* restore msi configuration */
> >>>> +        msi_flags = pci_default_read_config(pdev,
> >>>> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> >>>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> >>>> +
> >>>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> >>>> +                              msi_flags & (!PCI_MSI_FLAGS_ENABLE), 2);
> >>>> +
> >>>> +        msi_addr_lo = qemu_get_be32(f);
> >>>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
> >>>> +                              msi_addr_lo, 4);
> >>>> +
> >>>> +        msi_addr_hi = qemu_get_be32(f);
> >>>> +        if (msi_64bit) {
> >>>> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> >>>> +                                  msi_addr_hi, 4);
> >>>> +        }
> >>>> +        msi_data = qemu_get_be32(f);
> >>>> +        vfio_pci_write_config(pdev,
> >>>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> >>>> +                msi_data, 2);
> >>>> +
> >>>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> >>>> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
> >>>> +    } else if (interrupt_type == VFIO_INT_MSIX) {
> >>>> +        uint16_t offset = qemu_get_be16(f);
> >>>> +
> >>>> +        /* load enable bit and maskall bit */
> >>>> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> >>>> +                              offset, 2);
> >>>> +        msix_load(pdev, f);
> >>>> +    }
> >>>> +    pci_cmd = qemu_get_be16(f);
> >>>> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
> >>>> +}
> >>>> +
> >>>>  static VFIODeviceOps vfio_pci_ops = {
> >>>>      .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
> >>>>      .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
> >>>>      .vfio_eoi = vfio_intx_eoi,
> >>>>      .vfio_get_object = vfio_pci_get_object,
> >>>> +    .vfio_save_config = vfio_pci_save_config,
> >>>> +    .vfio_load_config = vfio_pci_load_config,
> >>>>  };
> >>>>  
> >>>>  int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
> >>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >>>> index 771b6d59a3db..ee72bd984a36 100644
> >>>> --- a/include/hw/vfio/vfio-common.h
> >>>> +++ b/include/hw/vfio/vfio-common.h
> >>>> @@ -120,6 +120,8 @@ struct VFIODeviceOps {
> >>>>      int (*vfio_hot_reset_multi)(VFIODevice *vdev);
> >>>>      void (*vfio_eoi)(VFIODevice *vdev);
> >>>>      Object *(*vfio_get_object)(VFIODevice *vdev);
> >>>> +    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
> >>>> +    void (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
> >>>>  };
> >>>>  
> >>>>  typedef struct VFIOGroup {
> >>>> -- 
> >>>> 2.7.0
> >>>>
> >>> --
> >>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >>>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Qemu-devel] [PATCH v7 04/13] vfio: Add save and load functions for VFIO PCI devices
  2019-08-22 19:13           ` Dr. David Alan Gilbert
@ 2019-08-22 23:57             ` Tian, Kevin
  2019-08-23  9:26               ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 77+ messages in thread
From: Tian, Kevin @ 2019-08-22 23:57 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Kirti Wankhede
  Cc: Zhengxiao.zx, Liu, Yi L, cjia, eskultet, Yang, Ziye, cohuck,
	shuangtai.tst, qemu-devel, Wang,  Zhi A, mlevitsk, pasic, aik,
	alex.williamson, eauger, felipe, jonathan.davies, Zhao, Yan Y,
	Liu, Changpeng, Ken.Xue

> From: Dr. David Alan Gilbert [mailto:dgilbert@redhat.com]
> Sent: Friday, August 23, 2019 3:13 AM
> 
> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> >
> >
> > On 8/22/2019 3:02 PM, Dr. David Alan Gilbert wrote:
> > > * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> > >> Sorry for the delay in responding.
> > >>
> > >> On 7/11/2019 5:37 PM, Dr. David Alan Gilbert wrote:
> > >>> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> > >>>> These functions save and restore PCI device specific data - config
> > >>>> space of PCI device.
> > >>>> Tested save and restore with MSI and MSIX type.
> > >>>>
> > >>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > >>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> > >>>> ---
> > >>>>  hw/vfio/pci.c                 | 114
> ++++++++++++++++++++++++++++++++++++++++++
> > >>>>  include/hw/vfio/vfio-common.h |   2 +
> > >>>>  2 files changed, 116 insertions(+)
> > >>>>
> > >>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > >>>> index de0d286fc9dd..5fe4f8076cac 100644
> > >>>> --- a/hw/vfio/pci.c
> > >>>> +++ b/hw/vfio/pci.c
> > >>>> @@ -2395,11 +2395,125 @@ static Object
> *vfio_pci_get_object(VFIODevice *vbasedev)
> > >>>>      return OBJECT(vdev);
> > >>>>  }
> > >>>>
> > >>>> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> > >>>> +{
> > >>>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice,
> vbasedev);
> > >>>> +    PCIDevice *pdev = &vdev->pdev;
> > >>>> +    uint16_t pci_cmd;
> > >>>> +    int i;
> > >>>> +
> > >>>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > >>>> +        uint32_t bar;
> > >>>> +
> > >>>> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i *
> 4, 4);
> > >>>> +        qemu_put_be32(f, bar);
> > >>>> +    }
> > >>>> +
> > >>>> +    qemu_put_be32(f, vdev->interrupt);
> > >>>> +    if (vdev->interrupt == VFIO_INT_MSI) {
> > >>>> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> > >>>> +        bool msi_64bit;
> > >>>> +
> > >>>> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap +
> PCI_MSI_FLAGS,
> > >>>> +                                            2);
> > >>>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> > >>>> +
> > >>>> +        msi_addr_lo = pci_default_read_config(pdev,
> > >>>> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> > >>>> +        qemu_put_be32(f, msi_addr_lo);
> > >>>> +
> > >>>> +        if (msi_64bit) {
> > >>>> +            msi_addr_hi = pci_default_read_config(pdev,
> > >>>> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > >>>> +                                             4);
> > >>>> +        }
> > >>>> +        qemu_put_be32(f, msi_addr_hi);
> > >>>> +
> > >>>> +        msi_data = pci_default_read_config(pdev,
> > >>>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 :
> PCI_MSI_DATA_32),
> > >>>> +                2);
> > >>>> +        qemu_put_be32(f, msi_data);
> > >>>> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
> > >>>> +        uint16_t offset;
> > >>>> +
> > >>>> +        /* save enable bit and maskall bit */
> > >>>> +        offset = pci_default_read_config(pdev,
> > >>>> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> > >>>> +        qemu_put_be16(f, offset);
> > >>>> +        msix_save(pdev, f);
> > >>>> +    }
> > >>>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > >>>> +    qemu_put_be16(f, pci_cmd);
> > >>>> +}
> > >>>> +
> > >>>> +static void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> > >>>> +{
> > >>>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice,
> vbasedev);
> > >>>> +    PCIDevice *pdev = &vdev->pdev;
> > >>>> +    uint32_t interrupt_type;
> > >>>> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> > >>>> +    uint16_t pci_cmd;
> > >>>> +    bool msi_64bit;
> > >>>> +    int i;
> > >>>> +
> > >>>> +    /* restore pci bar configuration */
> > >>>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > >>>> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> > >>>> +                        pci_cmd & (!(PCI_COMMAND_IO |
> PCI_COMMAND_MEMORY)), 2);
> > >>>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > >>>> +        uint32_t bar = qemu_get_be32(f);
> > >>>> +
> > >>>> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> > >>>> +    }
> > >>>
> > >>> Is it possible to validate the bar's at all?  We just had a bug on a
> > >>> virtual device where one version was asking for a larger bar than the
> > >>> other; our validation caught this in some cases so we could tell that
> > >>> the guest had a BAR that was aligned at the wrong alignment.

I'm a bit confused here. Did you mean that the src and dest include
different versions of the virtual device, implementing different BAR
sizes? If that is the case, shouldn't the migration fail at the start,
during the compatibility check?

> > >>>
> > >>
> > >> By "validate the bars", do you mean validating the size of the BARs?
> > >
> > > I meant validate the address programmed into the BAR against the size,
> > > assuming you know the size; e.g. if it's a 128MB BAR, then make sure the
> > > address programmed in is 128MB aligned.
> > >
> >
> > If this validation fails, migration resume should fail, right?
> 
> Yes I think so; if you've got a device that wants 128MB alignment and
> someone gives you a non-aligned address, who knows what will happen.

If the misalignment is really caused by the guest, shouldn't we just
follow the hardware behavior, i.e. hard-wire the lower bits to 0 before
updating the cfg space?
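
That hard-wiring behavior can be modelled in a few lines (hypothetical
helper, not QEMU code; it assumes the BAR's size and its read-only low type
bits are known):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model a physical BAR register write: address bits below the size
 * boundary are hard-wired to zero, and the low type bits (IO flag,
 * memory type, prefetchable) always read back as the device defines
 * them, whatever the guest wrote.
 */
static uint32_t bar_hw_write(uint32_t val, uint64_t size, uint32_t type_bits)
{
    uint32_t addr_mask = (uint32_t)~(size - 1);

    return (val & addr_mask) | type_bits;
}
```
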

Thanks
Kevin



* Re: [Qemu-devel] [PATCH v7 07/13] vfio: Add migration state change notifier
  2019-08-20 20:24     ` Kirti Wankhede
@ 2019-08-23  0:54       ` Yan Zhao
  0 siblings, 0 replies; 77+ messages in thread
From: Yan Zhao @ 2019-08-23  0:54 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Wed, Aug 21, 2019 at 04:24:27AM +0800, Kirti Wankhede wrote:
> 
> 
> On 7/17/2019 7:55 AM, Yan Zhao wrote:
> > On Tue, Jul 09, 2019 at 05:49:14PM +0800, Kirti Wankhede wrote:
> >> Added migration state change notifier to get notification on migration state
> >> change. These states are translated to VFIO device state and conveyed to vendor
> >> driver.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>  hw/vfio/migration.c           | 54 +++++++++++++++++++++++++++++++++++++++++++
> >>  hw/vfio/trace-events          |  1 +
> >>  include/hw/vfio/vfio-common.h |  1 +
> >>  3 files changed, 56 insertions(+)
> >>
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >> index c01f08b659d0..e4a89a6f9bc7 100644
> >> --- a/hw/vfio/migration.c
> >> +++ b/hw/vfio/migration.c
> >> @@ -132,6 +132,53 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state)
> >>      }
> >>  }
> >>  
> >> +static void vfio_migration_state_notifier(Notifier *notifier, void *data)
> >> +{
> >> +    MigrationState *s = data;
> >> +    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
> >> +    int ret;
> >> +
> >> +    trace_vfio_migration_state_notifier(vbasedev->name, s->state);
> >> +
> >> +    switch (s->state) {
> >> +    case MIGRATION_STATUS_ACTIVE:
> >> +        if (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING) {
> >> +            if (vbasedev->vm_running) {
> >> +                ret = vfio_migration_set_state(vbasedev,
> >> +                          VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING);
> >> +                if (ret) {
> >> +                    error_report("%s: Failed to set state RUNNING and SAVING",
> >> +                                  vbasedev->name);
> >> +                }
> >> +            } else {
> >> +                ret = vfio_migration_set_state(vbasedev,
> >> +                                               VFIO_DEVICE_STATE_SAVING);
> >> +                if (ret) {
> >> +                    error_report("%s: Failed to set state STOP and SAVING",
> >> +                                 vbasedev->name);
> >> +                }
> >> +            }
> >> +        } else {
> >> +            ret = vfio_migration_set_state(vbasedev,
> >> +                                           VFIO_DEVICE_STATE_RESUMING);
> >> +            if (ret) {
> >> +                error_report("%s: Failed to set state RESUMING",
> >> +                             vbasedev->name);
> >> +            }
> >> +        }
> >> +        return;
> >> +
> > hi Kirti
> > currently, migration state notifiers are only notified in below 3 interfaces:
> > migrate_fd_connect, migrate_fd_cleanup, postcopy_start, where
> > MIGRATION_STATUS_ACTIVE is not a valid state.
> > Have you tested the above code? what's the purpose of the code?
> > 
> 
> Sorry for delayed response.
> 
> migration_iteration_finish() -> qemu_bh_schedule(s->cleanup_bh) which is
> migrate_fd_cleanup().
> 
> migration_iteration_finish() can be called with MIGRATION_STATUS_ACTIVE
> state. So migration state notifiers can be called with
> MIGRATION_STATUS_ACTIVE. So handled that case here.
>
hi Kirti

I checked the code; the MIGRATION_STATUS_ACTIVE case you mention is
COLO-only, and there is actually an assert in migrate_fd_cleanup()

	assert((s->state != MIGRATION_STATUS_ACTIVE) &&
		(s->state != MIGRATION_STATUS_POSTCOPY_ACTIVE));

before it calls notifier_list_notify(&migration_state_notifiers, s).

Thanks
Yan

> 
> 
> > 
> >> +    case MIGRATION_STATUS_CANCELLING:
> >> +    case MIGRATION_STATUS_CANCELLED:
> >> +    case MIGRATION_STATUS_FAILED:
> >> +        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING);
> >> +        if (ret) {
> >> +            error_report("%s: Failed to set state RUNNING", vbasedev->name);
> >> +        }
> >> +        return;
> >> +    }
> >> +}
> >> +
> >>  static int vfio_migration_init(VFIODevice *vbasedev,
> >>                                 struct vfio_region_info *info)
> >>  {
> >> @@ -152,6 +199,9 @@ static int vfio_migration_init(VFIODevice *vbasedev,
> >>      vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
> >>                                                            vbasedev);
> >>  
> >> +    vbasedev->migration_state.notify = vfio_migration_state_notifier;
> >> +    add_migration_state_change_notifier(&vbasedev->migration_state);
> >> +
> >>      return 0;
> >>  }
> >>  
> >> @@ -194,6 +244,10 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
> >>          return;
> >>      }
> >>  
> >> +    if (vbasedev->migration_state.notify) {
> >> +        remove_migration_state_change_notifier(&vbasedev->migration_state);
> >> +    }
> >> +
> >>      if (vbasedev->vm_state) {
> >>          qemu_del_vm_change_state_handler(vbasedev->vm_state);
> >>      }
> >> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> >> index 3d15bacd031a..69503228f20e 100644
> >> --- a/hw/vfio/trace-events
> >> +++ b/hw/vfio/trace-events
> >> @@ -148,3 +148,4 @@ vfio_display_edid_write_error(void) ""
> >>  vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
> >>  vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
> >>  vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> >> +vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
> >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >> index f6c70db3a9c1..a022484d2636 100644
> >> --- a/include/hw/vfio/vfio-common.h
> >> +++ b/include/hw/vfio/vfio-common.h
> >> @@ -128,6 +128,7 @@ typedef struct VFIODevice {
> >>      uint32_t device_state;
> >>      VMChangeStateEntry *vm_state;
> >>      int vm_running;
> >> +    Notifier migration_state;
> >>  } VFIODevice;
> >>  
> >>  struct VFIODeviceOps {
> >> -- 
> >> 2.7.0
> >>



* Re: [Qemu-devel] [PATCH v7 08/13] vfio: Register SaveVMHandlers for VFIO device
  2019-08-20 20:33     ` Kirti Wankhede
@ 2019-08-23  1:23       ` Yan Zhao
  0 siblings, 0 replies; 77+ messages in thread
From: Yan Zhao @ 2019-08-23  1:23 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Wed, Aug 21, 2019 at 04:33:06AM +0800, Kirti Wankhede wrote:
> 
> 
> On 7/22/2019 2:04 PM, Yan Zhao wrote:
> > On Tue, Jul 09, 2019 at 05:49:15PM +0800, Kirti Wankhede wrote:
> >> Define flags to be used as delimeter in migration file stream.
> >> Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
> >> region from these functions at source during saving or pre-copy phase.
> >> Set VFIO device state depending on VM's state. During live migration, VM is
> >> running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
> >> device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>  hw/vfio/migration.c  | 82 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> >>  hw/vfio/trace-events |  2 ++
> >>  2 files changed, 83 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >> index e4a89a6f9bc7..0597a45fda2d 100644
> >> --- a/hw/vfio/migration.c
> >> +++ b/hw/vfio/migration.c
> >> @@ -23,6 +23,17 @@
> >>  #include "pci.h"
> >>  #include "trace.h"
> >>  
> >> +/*
> >> + * Flags used as delimiter:
> >> + * 0xffffffff => MSB 32-bit all 1s
> >> + * 0xef10     => emulated (virtual) function IO
> >> + * 0x0000     => 16-bits reserved for flags
> >> + */
> >> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> >> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> >> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> >> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> >> +
> >>  static void vfio_migration_region_exit(VFIODevice *vbasedev)
> >>  {
> >>      VFIOMigration *migration = vbasedev->migration;
> >> @@ -106,6 +117,74 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
> >>      return 0;
> >>  }
> >>  
> >> +/* ---------------------------------------------------------------------- */
> >> +
> >> +static int vfio_save_setup(QEMUFile *f, void *opaque)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    int ret;
> >> +
> >> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> >> +
> >> +    if (migration->region.buffer.mmaps) {
> >> +        qemu_mutex_lock_iothread();
> >> +        ret = vfio_region_mmap(&migration->region.buffer);
> >> +        qemu_mutex_unlock_iothread();
> >> +        if (ret) {
> >> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> >> +                         vbasedev->name, migration->region.index,
> >> +                         strerror(-ret));
> >> +            return ret;
> >> +        }
> >> +    }
> >> +
> >> +    if (vbasedev->vm_running) {
> >> +        ret = vfio_migration_set_state(vbasedev,
> >> +                         VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING);
> >> +        if (ret) {
> >> +            error_report("%s: Failed to set state RUNNING and SAVING",
> >> +                         vbasedev->name);
> >> +            return ret;
> >> +        }
> >> +    } else {
> > hi Kirti
> > May I know in which condition will this "else" case happen?
> > 
> 
> This can happen in savevm case.

OK, I see it, thanks.
Could we simplify the logic and just OR VFIO_DEVICE_STATE_SAVING into
the current device state here?
Because device state was already set to RUNNING or STOP in
vfio_vmstate_change().

Thanks
Yan
> 
> Thanks,
> Kirti
> 
> > Thanks
> > Yan
> > 
> >> +        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_SAVING);
> >> +        if (ret) {
> >> +            error_report("%s: Failed to set state STOP and SAVING",
> >> +                         vbasedev->name);
> >> +            return ret;
> >> +        }
> >> +    }
> >> +
> >> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> >> +
> >> +    ret = qemu_file_get_error(f);
> >> +    if (ret) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    trace_vfio_save_setup(vbasedev->name);
> >> +    return 0;
> >> +}
> >> +
> >> +static void vfio_save_cleanup(void *opaque)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +
> >> +    if (migration->region.buffer.mmaps) {
> >> +        vfio_region_unmap(&migration->region.buffer);
> >> +    }
> >> +    trace_vfio_save_cleanup(vbasedev->name);
> >> +}
> >> +
> >> +static SaveVMHandlers savevm_vfio_handlers = {
> >> +    .save_setup = vfio_save_setup,
> >> +    .save_cleanup = vfio_save_cleanup,
> >> +};
> >> +
> >> +/* ---------------------------------------------------------------------- */
> >> +
> >>  static void vfio_vmstate_change(void *opaque, int running, RunState state)
> >>  {
> >>      VFIODevice *vbasedev = opaque;
> >> @@ -195,7 +274,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
> >>      }
> >>  
> >>      qemu_mutex_init(&vbasedev->migration->lock);
> >> -
> >> +    register_savevm_live(vbasedev->dev, "vfio", -1, 1, &savevm_vfio_handlers,
> >> +                         vbasedev);
> >>      vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
> >>                                                            vbasedev);
> >>  
> >> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> >> index 69503228f20e..4bb43f18f315 100644
> >> --- a/hw/vfio/trace-events
> >> +++ b/hw/vfio/trace-events
> >> @@ -149,3 +149,5 @@ vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
> >>  vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
> >>  vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> >>  vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
> >> +vfio_save_setup(char *name) " (%s)"
> >> +vfio_save_cleanup(char *name) " (%s)"
> >> -- 
> >> 2.7.0
> >>



* Re: [Qemu-devel] [PATCH v7 06/13] vfio: Add VM state change handler to know state of VM
  2019-08-20 20:33     ` Kirti Wankhede
@ 2019-08-23  1:32       ` Yan Zhao
  0 siblings, 0 replies; 77+ messages in thread
From: Yan Zhao @ 2019-08-23  1:32 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Wed, Aug 21, 2019 at 04:33:50AM +0800, Kirti Wankhede wrote:
> 
> 
> On 7/22/2019 2:07 PM, Yan Zhao wrote:
> > On Tue, Jul 09, 2019 at 05:49:13PM +0800, Kirti Wankhede wrote:
> >> VM state change handler gets called on change in VM's state. This is used to set
> >> VFIO device state to _RUNNING.
> >> VM state change handler, migration state change handler and log_sync listener
> >> are called asynchronously, which sometimes lead to data corruption in migration
> >> region. Initialised mutex that is used to serialize operations on migration data
> >> region during saving state.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>  hw/vfio/migration.c           | 64 +++++++++++++++++++++++++++++++++++++++++++
> >>  hw/vfio/trace-events          |  2 ++
> >>  include/hw/vfio/vfio-common.h |  4 +++
> >>  3 files changed, 70 insertions(+)
> >>
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >> index a2cfbd5af2e1..c01f08b659d0 100644
> >> --- a/hw/vfio/migration.c
> >> +++ b/hw/vfio/migration.c
> >> @@ -78,6 +78,60 @@ err:
> >>      return ret;
> >>  }
> >>  
> >> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
> >> +{
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    VFIORegion *region = &migration->region.buffer;
> >> +    uint32_t device_state;
> >> +    int ret = 0;
> >> +
> >> +    device_state = (state & VFIO_DEVICE_STATE_MASK) |
> >> +                   (vbasedev->device_state & ~VFIO_DEVICE_STATE_MASK);
> >> +
> >> +    if ((device_state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_INVALID) {
> >> +        return -EINVAL;
> >> +    }
> >> +
> >> +    ret = pwrite(vbasedev->fd, &device_state, sizeof(device_state),
> >> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> >> +                                              device_state));
> >> +    if (ret < 0) {
> >> +        error_report("%s: Failed to set device state %d %s",
> >> +                     vbasedev->name, ret, strerror(errno));
> >> +        return ret;
> >> +    }
> >> +
> >> +    vbasedev->device_state = device_state;
> >> +    trace_vfio_migration_set_state(vbasedev->name, device_state);
> >> +    return 0;
> >> +}
> >> +
> >> +static void vfio_vmstate_change(void *opaque, int running, RunState state)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +
> >> +    if ((vbasedev->vm_running != running)) {
> >> +        int ret;
> >> +        uint32_t dev_state;
> >> +
> >> +        if (running) {
> >> +            dev_state = VFIO_DEVICE_STATE_RUNNING;
> > should be
> > dev_state |= VFIO_DEVICE_STATE_RUNNING; ?
> > 
> 
> vfio_migration_set_state() takes care of the ORing.
>
if previous dev_state is VFIO_DEVICE_STATE_SAVING (without RUNNING), and
vfio_migration_set_state(VFIO_DEVICE_STATE_RUNNING) is called here, do
you mean vfio_migration_set_state() will change the device state to
VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING ?

Thanks
Yan


> Thanks,
> Kirti
> 
> >> +        } else {
> >> +            dev_state = (vbasedev->device_state & VFIO_DEVICE_STATE_MASK) &
> >> +                     ~VFIO_DEVICE_STATE_RUNNING;
> >> +        }
> >> +
> >> +        ret = vfio_migration_set_state(vbasedev, dev_state);
> >> +        if (ret) {
> >> +            error_report("%s: Failed to set device state 0x%x",
> >> +                         vbasedev->name, dev_state);
> >> +        }
> >> +        vbasedev->vm_running = running;
> >> +        trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
> >> +                                  dev_state);
> >> +    }
> >> +}
> >> +
> >>  static int vfio_migration_init(VFIODevice *vbasedev,
> >>                                 struct vfio_region_info *info)
> >>  {
> >> @@ -93,6 +147,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
> >>          return ret;
> >>      }
> >>  
> >> +    qemu_mutex_init(&vbasedev->migration->lock);
> >> +
> >> +    vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
> >> +                                                          vbasedev);
> >> +
> >>      return 0;
> >>  }
> >>  
> >> @@ -135,11 +194,16 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
> >>          return;
> >>      }
> >>  
> >> +    if (vbasedev->vm_state) {
> >> +        qemu_del_vm_change_state_handler(vbasedev->vm_state);
> >> +    }
> >> +
> >>      if (vbasedev->migration_blocker) {
> >>          migrate_del_blocker(vbasedev->migration_blocker);
> >>          error_free(vbasedev->migration_blocker);
> >>      }
> >>  
> >> +    qemu_mutex_destroy(&vbasedev->migration->lock);
> >>      vfio_migration_region_exit(vbasedev);
> >>      g_free(vbasedev->migration);
> >>  }
> >> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> >> index 191a726a1312..3d15bacd031a 100644
> >> --- a/hw/vfio/trace-events
> >> +++ b/hw/vfio/trace-events
> >> @@ -146,3 +146,5 @@ vfio_display_edid_write_error(void) ""
> >>  
> >>  # migration.c
> >>  vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
> >> +vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
> >> +vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >> index 152da3f8d6f3..f6c70db3a9c1 100644
> >> --- a/include/hw/vfio/vfio-common.h
> >> +++ b/include/hw/vfio/vfio-common.h
> >> @@ -29,6 +29,7 @@
> >>  #ifdef CONFIG_LINUX
> >>  #include <linux/vfio.h>
> >>  #endif
> >> +#include "sysemu/sysemu.h"
> >>  
> >>  #define VFIO_MSG_PREFIX "vfio %s: "
> >>  
> >> @@ -124,6 +125,9 @@ typedef struct VFIODevice {
> >>      unsigned int flags;
> >>      VFIOMigration *migration;
> >>      Error *migration_blocker;
> >> +    uint32_t device_state;
> >> +    VMChangeStateEntry *vm_state;
> >> +    int vm_running;
> >>  } VFIODevice;
> >>  
> >>  struct VFIODeviceOps {
> >> -- 
> >> 2.7.0
> >>
> 



* Re: [Qemu-devel] [PATCH v7 04/13] vfio: Add save and load functions for VFIO PCI devices
  2019-08-22 23:57             ` Tian, Kevin
@ 2019-08-23  9:26               ` Dr. David Alan Gilbert
  2019-08-23  9:49                 ` Tian, Kevin
  0 siblings, 1 reply; 77+ messages in thread
From: Dr. David Alan Gilbert @ 2019-08-23  9:26 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Zhengxiao.zx, qemu-devel, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	cohuck, shuangtai.tst, alex.williamson, Wang, Zhi A, mlevitsk,
	pasic, aik, Kirti Wankhede, eauger, felipe, jonathan.davies,
	Zhao, Yan Y, Liu, Changpeng, Ken.Xue

* Tian, Kevin (kevin.tian@intel.com) wrote:
> > From: Dr. David Alan Gilbert [mailto:dgilbert@redhat.com]
> > Sent: Friday, August 23, 2019 3:13 AM
> > 
> > * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> > >
> > >
> > > On 8/22/2019 3:02 PM, Dr. David Alan Gilbert wrote:
> > > > * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> > > >> Sorry for delay to respond.
> > > >>
> > > >> On 7/11/2019 5:37 PM, Dr. David Alan Gilbert wrote:
> > > >>> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> > > >>>> These functions save and restore PCI device specific data - config
> > > >>>> space of PCI device.
> > > >>>> Tested save and restore with MSI and MSIX type.
> > > >>>>
> > > >>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > >>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> > > >>>> ---
> > > >>>>  hw/vfio/pci.c                 | 114
> > ++++++++++++++++++++++++++++++++++++++++++
> > > >>>>  include/hw/vfio/vfio-common.h |   2 +
> > > >>>>  2 files changed, 116 insertions(+)
> > > >>>>
> > > >>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > > >>>> index de0d286fc9dd..5fe4f8076cac 100644
> > > >>>> --- a/hw/vfio/pci.c
> > > >>>> +++ b/hw/vfio/pci.c
> > > >>>> @@ -2395,11 +2395,125 @@ static Object
> > *vfio_pci_get_object(VFIODevice *vbasedev)
> > > >>>>      return OBJECT(vdev);
> > > >>>>  }
> > > >>>>
> > > >>>> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> > > >>>> +{
> > > >>>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice,
> > vbasedev);
> > > >>>> +    PCIDevice *pdev = &vdev->pdev;
> > > >>>> +    uint16_t pci_cmd;
> > > >>>> +    int i;
> > > >>>> +
> > > >>>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > > >>>> +        uint32_t bar;
> > > >>>> +
> > > >>>> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i *
> > 4, 4);
> > > >>>> +        qemu_put_be32(f, bar);
> > > >>>> +    }
> > > >>>> +
> > > >>>> +    qemu_put_be32(f, vdev->interrupt);
> > > >>>> +    if (vdev->interrupt == VFIO_INT_MSI) {
> > > >>>> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> > > >>>> +        bool msi_64bit;
> > > >>>> +
> > > >>>> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap +
> > PCI_MSI_FLAGS,
> > > >>>> +                                            2);
> > > >>>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> > > >>>> +
> > > >>>> +        msi_addr_lo = pci_default_read_config(pdev,
> > > >>>> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> > > >>>> +        qemu_put_be32(f, msi_addr_lo);
> > > >>>> +
> > > >>>> +        if (msi_64bit) {
> > > >>>> +            msi_addr_hi = pci_default_read_config(pdev,
> > > >>>> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > > >>>> +                                             4);
> > > >>>> +        }
> > > >>>> +        qemu_put_be32(f, msi_addr_hi);
> > > >>>> +
> > > >>>> +        msi_data = pci_default_read_config(pdev,
> > > >>>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 :
> > PCI_MSI_DATA_32),
> > > >>>> +                2);
> > > >>>> +        qemu_put_be32(f, msi_data);
> > > >>>> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
> > > >>>> +        uint16_t offset;
> > > >>>> +
> > > >>>> +        /* save enable bit and maskall bit */
> > > >>>> +        offset = pci_default_read_config(pdev,
> > > >>>> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> > > >>>> +        qemu_put_be16(f, offset);
> > > >>>> +        msix_save(pdev, f);
> > > >>>> +    }
> > > >>>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > > >>>> +    qemu_put_be16(f, pci_cmd);
> > > >>>> +}
> > > >>>> +
> > > >>>> +static void vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> > > >>>> +{
> > > >>>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice,
> > vbasedev);
> > > >>>> +    PCIDevice *pdev = &vdev->pdev;
> > > >>>> +    uint32_t interrupt_type;
> > > >>>> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> > > >>>> +    uint16_t pci_cmd;
> > > >>>> +    bool msi_64bit;
> > > >>>> +    int i;
> > > >>>> +
> > > >>>> +    /* retore pci bar configuration */
> > > >>>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > > >>>> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> > > >>>> +                        pci_cmd & (!(PCI_COMMAND_IO |
> > PCI_COMMAND_MEMORY)), 2);
> > > >>>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > > >>>> +        uint32_t bar = qemu_get_be32(f);
> > > >>>> +
> > > >>>> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> > > >>>> +    }
> > > >>>
> > > >>> Is it possible to validate the bar's at all?  We just had a bug on a
> > > >>> virtual device where one version was asking for a larger bar than the
> > > >>> other; our validation caught this in some cases so we could tell that
> > > >>> the guest had a BAR that was aligned at the wrong alignment.
> 
> I'm a bit confused here. Did you mean that src and dest include
> different versions of the virtual device which implements different
> BAR size? If that is the case, shouldn't the migration fail at the start
> when doing compatibility check?

It was a mistake where the destination had accidentally changed the BAR
size; checking the alignment was the only check that failed.

> > > >>>
> > > >>
> > > >> "Validate the bars" does that means validate size of bars?
> > > >
> > > > I meant validate the address programmed into the BAR against the size,
> > > > assuming you know the size; e.g. if it's a 128MB BAR, then make sure the
> > > > address programmed in is 128MB aligned.
> > > >
> > >
> > > If this validation fails, migration resume should fail, right?
> > 
> > Yes I think so; if you've got a device that wants 128MB alignment and
> > someone gives you a non-aligned address, who knows what will happen.
> 
> If misalignment is really caused by the guest, shouldn't we just follow
> the hardware behavior, i.e. hard-wiring the lower bits to 0 before
> updating the cfg space? 

That should already happen on the source; but when loading a migration
stream I try to be very untrusting, so it's good to check that the
destination device's idea of the BAR matches what the register has.

Dave

> Thanks
> Kevin
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Qemu-devel] [PATCH v7 04/13] vfio: Add save and load functions for VFIO PCI devices
  2019-08-23  9:26               ` Dr. David Alan Gilbert
@ 2019-08-23  9:49                 ` Tian, Kevin
  0 siblings, 0 replies; 77+ messages in thread
From: Tian, Kevin @ 2019-08-23  9:49 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Zhengxiao.zx, qemu-devel, Liu, Yi L, cjia, eskultet, Yang,  Ziye,
	cohuck, shuangtai.tst, alex.williamson, Wang,  Zhi A, mlevitsk,
	pasic, aik, Kirti Wankhede, eauger, felipe, jonathan.davies,
	Zhao, Yan Y, Liu, Changpeng, Ken.Xue

> From: Dr. David Alan Gilbert [mailto:dgilbert@redhat.com]
> Sent: Friday, August 23, 2019 5:27 PM
> 
> * Tian, Kevin (kevin.tian@intel.com) wrote:
> > > From: Dr. David Alan Gilbert [mailto:dgilbert@redhat.com]
> > > Sent: Friday, August 23, 2019 3:13 AM
> > >
> > > * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> > > >
> > > >
> > > > On 8/22/2019 3:02 PM, Dr. David Alan Gilbert wrote:
> > > > > * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> > > > >> Sorry for delay to respond.
> > > > >>
> > > > >> On 7/11/2019 5:37 PM, Dr. David Alan Gilbert wrote:
> > > > >>> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> > > > >>>> These functions save and restore PCI device specific data - config
> > > > >>>> space of PCI device.
> > > > >>>> Tested save and restore with MSI and MSIX type.
> > > > >>>>
> > > > >>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > > >>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> > > > >>>> ---
> > > > >>>>  hw/vfio/pci.c                 | 114
> > > ++++++++++++++++++++++++++++++++++++++++++
> > > > >>>>  include/hw/vfio/vfio-common.h |   2 +
> > > > >>>>  2 files changed, 116 insertions(+)
> > > > >>>>
> > > > >>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > > > >>>> index de0d286fc9dd..5fe4f8076cac 100644
> > > > >>>> --- a/hw/vfio/pci.c
> > > > >>>> +++ b/hw/vfio/pci.c
> > > > >>>> @@ -2395,11 +2395,125 @@ static Object
> > > *vfio_pci_get_object(VFIODevice *vbasedev)
> > > > >>>>      return OBJECT(vdev);
> > > > >>>>  }
> > > > >>>>
> > > > >>>> +static void vfio_pci_save_config(VFIODevice *vbasedev,
> QEMUFile *f)
> > > > >>>> +{
> > > > >>>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice,
> > > vbasedev);
> > > > >>>> +    PCIDevice *pdev = &vdev->pdev;
> > > > >>>> +    uint16_t pci_cmd;
> > > > >>>> +    int i;
> > > > >>>> +
> > > > >>>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > > > >>>> +        uint32_t bar;
> > > > >>>> +
> > > > >>>> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 +
> i *
> > > 4, 4);
> > > > >>>> +        qemu_put_be32(f, bar);
> > > > >>>> +    }
> > > > >>>> +
> > > > >>>> +    qemu_put_be32(f, vdev->interrupt);
> > > > >>>> +    if (vdev->interrupt == VFIO_INT_MSI) {
> > > > >>>> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> > > > >>>> +        bool msi_64bit;
> > > > >>>> +
> > > > >>>> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap +
> > > PCI_MSI_FLAGS,
> > > > >>>> +                                            2);
> > > > >>>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> > > > >>>> +
> > > > >>>> +        msi_addr_lo = pci_default_read_config(pdev,
> > > > >>>> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> > > > >>>> +        qemu_put_be32(f, msi_addr_lo);
> > > > >>>> +
> > > > >>>> +        if (msi_64bit) {
> > > > >>>> +            msi_addr_hi = pci_default_read_config(pdev,
> > > > >>>> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > > > >>>> +                                             4);
> > > > >>>> +        }
> > > > >>>> +        qemu_put_be32(f, msi_addr_hi);
> > > > >>>> +
> > > > >>>> +        msi_data = pci_default_read_config(pdev,
> > > > >>>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 :
> > > PCI_MSI_DATA_32),
> > > > >>>> +                2);
> > > > >>>> +        qemu_put_be32(f, msi_data);
> > > > >>>> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
> > > > >>>> +        uint16_t offset;
> > > > >>>> +
> > > > >>>> +        /* save enable bit and maskall bit */
> > > > >>>> +        offset = pci_default_read_config(pdev,
> > > > >>>> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> > > > >>>> +        qemu_put_be16(f, offset);
> > > > >>>> +        msix_save(pdev, f);
> > > > >>>> +    }
> > > > >>>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > > > >>>> +    qemu_put_be16(f, pci_cmd);
> > > > >>>> +}
> > > > >>>> +
> > > > >>>> +static void vfio_pci_load_config(VFIODevice *vbasedev,
> QEMUFile *f)
> > > > >>>> +{
> > > > >>>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice,
> > > vbasedev);
> > > > >>>> +    PCIDevice *pdev = &vdev->pdev;
> > > > >>>> +    uint32_t interrupt_type;
> > > > >>>> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> > > > >>>> +    uint16_t pci_cmd;
> > > > >>>> +    bool msi_64bit;
> > > > >>>> +    int i;
> > > > >>>> +
> > > > >>>> +    /* retore pci bar configuration */
> > > > >>>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > > > >>>> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> > > > >>>> +                        pci_cmd & (!(PCI_COMMAND_IO |
> > > PCI_COMMAND_MEMORY)), 2);
> > > > >>>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > > > >>>> +        uint32_t bar = qemu_get_be32(f);
> > > > >>>> +
> > > > >>>> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar,
> 4);
> > > > >>>> +    }
> > > > >>>
> > > > >>> Is it possible to validate the bar's at all?  We just had a bug on a
> > > > >>> virtual device where one version was asking for a larger bar than the
> > > > >>> other; our validation caught this in some cases so we could tell that
> > > > >>> the guest had a BAR that was aligned at the wrong alignment.
> >
> > I'm a bit confused here. Did you mean that the src and dest include
> > different versions of the virtual device which implement different
> > BAR sizes? If that is the case, shouldn't the migration fail at the
> > start when doing the compatibility check?
> 
> It was a mistake where the destination had accidentally changed the BAR
> size; checking the alignment was the only check that failed.
> 
> > > > >>>
> > > > >>
> > > > >> By "validate the BARs", do you mean validate the size of the BARs?
> > > > >
> > > > > I meant validate the address programmed into the BAR against the
> > > > > size, assuming you know the size; e.g. if it's a 128MB BAR, then
> > > > > make sure the address programmed in is 128MB aligned.
> > > > >
> > > >
> > > > If this validation fails, migration resume should fail, right?
> > >
> > > Yes I think so; if you've got a device that wants 128MB alignment and
> > > someone gives you a non-aligned address, who knows what will happen.
> >
> > If misalignment is really caused by the guest, shouldn't we just follow
> > the hardware behavior, i.e. hard-wiring the lower bits to 0 before
> > updating the cfg space?
> 
> That should already happen on the source; but when loading a migration
> stream I try to be very untrusting, so it's good to check that the
> destination device's idea of the BAR matches what the register has.
> 

OK, it makes sense. 

Thanks
Kevin




Thread overview: 77+ messages
2019-07-09  9:49 [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device Kirti Wankhede
2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 01/13] vfio: KABI for migration interface Kirti Wankhede
2019-07-16 20:56   ` Alex Williamson
2019-07-17 11:55     ` Cornelia Huck
2019-07-23 12:13     ` Cornelia Huck
2019-08-21 20:32       ` Kirti Wankhede
2019-08-21 20:31     ` Kirti Wankhede
2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 02/13] vfio: Add function to unmap VFIO region Kirti Wankhede
2019-07-16 16:29   ` Cornelia Huck
2019-07-18 18:54     ` Kirti Wankhede
2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 03/13] vfio: Add vfio_get_object callback to VFIODeviceOps Kirti Wankhede
2019-07-16 16:32   ` Cornelia Huck
2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 04/13] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
2019-07-11 12:07   ` Dr. David Alan Gilbert
2019-08-22  4:50     ` Kirti Wankhede
2019-08-22  9:32       ` Dr. David Alan Gilbert
2019-08-22 19:10         ` Kirti Wankhede
2019-08-22 19:13           ` Dr. David Alan Gilbert
2019-08-22 23:57             ` Tian, Kevin
2019-08-23  9:26               ` Dr. David Alan Gilbert
2019-08-23  9:49                 ` Tian, Kevin
2019-07-16 21:14   ` Alex Williamson
2019-07-17  9:10     ` Dr. David Alan Gilbert
2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 05/13] vfio: Add migration region initialization and finalize function Kirti Wankhede
2019-07-16 21:37   ` Alex Williamson
2019-07-18 20:19     ` Kirti Wankhede
2019-07-23 12:52   ` Cornelia Huck
2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 06/13] vfio: Add VM state change handler to know state of VM Kirti Wankhede
2019-07-11 12:13   ` Dr. David Alan Gilbert
2019-07-11 19:14     ` Kirti Wankhede
2019-07-22  8:23       ` Yan Zhao
2019-08-20 20:31         ` Kirti Wankhede
2019-07-16 22:03   ` Alex Williamson
2019-07-22  8:37   ` Yan Zhao
2019-08-20 20:33     ` Kirti Wankhede
2019-08-23  1:32       ` Yan Zhao
2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 07/13] vfio: Add migration state change notifier Kirti Wankhede
2019-07-17  2:25   ` Yan Zhao
2019-08-20 20:24     ` Kirti Wankhede
2019-08-23  0:54       ` Yan Zhao
2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 08/13] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
2019-07-22  8:34   ` Yan Zhao
2019-08-20 20:33     ` Kirti Wankhede
2019-08-23  1:23       ` Yan Zhao
2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 09/13] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
2019-07-12  2:44   ` Yan Zhao
2019-07-18 18:45     ` Kirti Wankhede
2019-07-17  2:50   ` Yan Zhao
2019-08-20 20:30     ` Kirti Wankhede
2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 10/13] vfio: Add load " Kirti Wankhede
2019-07-12  2:52   ` Yan Zhao
2019-07-18 19:00     ` Kirti Wankhede
2019-07-22  3:20       ` Yan Zhao
2019-07-22 19:07         ` Alex Williamson
2019-07-22 21:50           ` Yan Zhao
2019-08-20 20:35             ` Kirti Wankhede
2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 11/13] vfio: Add function to get dirty page list Kirti Wankhede
2019-07-12  0:33   ` Yan Zhao
2019-07-18 18:39     ` Kirti Wankhede
2019-07-19  1:24       ` Yan Zhao
2019-07-22  8:39   ` Yan Zhao
2019-08-20 20:34     ` Kirti Wankhede
2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 12/13] vfio: Add vfio_listerner_log_sync to mark dirty pages Kirti Wankhede
2019-07-23 13:18   ` Cornelia Huck
2019-07-09  9:49 ` [Qemu-devel] [PATCH v7 13/13] vfio: Make vfio-pci device migration capable Kirti Wankhede
2019-07-11  2:55 ` [Qemu-devel] [PATCH v7 00/13] Add migration support for VFIO device Yan Zhao
2019-07-11 10:50   ` Dr. David Alan Gilbert
2019-07-11 11:47     ` Yan Zhao
2019-07-11 16:23       ` Dr. David Alan Gilbert
2019-07-11 19:08         ` Kirti Wankhede
2019-07-12  0:32           ` Yan Zhao
2019-07-18 18:32             ` Kirti Wankhede
2019-07-19  1:23               ` Yan Zhao
2019-07-24 11:32                 ` Dr. David Alan Gilbert
2019-07-12 17:42           ` Dr. David Alan Gilbert
2019-07-15  0:35             ` Yan Zhao
2019-07-12  0:14         ` Yan Zhao
