* [Qemu-devel] [PATCH v8 00/13] Add migration support for VFIO device
@ 2019-08-26 18:55 Kirti Wankhede
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 01/13] vfio: KABI for migration interface Kirti Wankhede
                   ` (13 more replies)
  0 siblings, 14 replies; 34+ messages in thread
From: Kirti Wankhede @ 2019-08-26 18:55 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Add migration support for VFIO device

This patch set includes the following patches:
- Define KABI for VFIO device for migration support.
- Added save and restore functions for PCI configuration space
- Generic migration functionality for VFIO device.
  * This patch set adds functionality only for PCI devices, but can be
    extended to other VFIO devices.
  * Added all the basic functions required for pre-copy, stop-and-copy and
    resume phases of migration.
  * Added a state change notifier; from that notifier function, the VFIO
    device's state change is conveyed to the VFIO device driver.
  * During save setup phase and resume/load setup phase, migration region
    is queried and is used to read/write VFIO device data.
  * .save_live_pending and .save_live_iterate are implemented to use QEMU's
    functionality of iteration during pre-copy phase.
  * In .save_live_complete_precopy, that is, in the stop-and-copy phase,
    reading data from the VFIO device driver is iterated until the pending
    bytes returned by the driver reach zero.
  * Added function to get dirty pages bitmap for the pages which are used by
    driver.
- Add vfio_listener_log_sync to mark dirty pages.
- Make VFIO PCI device migration capable. If migration region is not provided by
  driver, migration is blocked.

Below is the flow of state change for live migration where states in brackets
represent VM state, migration state and VFIO device state as:
    (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)

Live migration save path:
        QEMU normal running state
        (RUNNING, _NONE, _RUNNING)
                        |
    migrate_init spawns migration_thread.
    (RUNNING, _SETUP, _RUNNING|_SAVING)
    Migration thread then calls each device's .save_setup()
                        |
    (RUNNING, _ACTIVE, _RUNNING|_SAVING)
    If the device is active, get pending bytes by .save_live_pending()
    If pending bytes >= threshold_size, call .save_live_iterate()
    Data of the VFIO device for the pre-copy phase is copied.
    Iterate until pending bytes converge and are less than the threshold
                        |
    On migration completion, vCPUs are stopped and .save_live_complete_precopy
    is called for each active device. The VFIO device is then transitioned
    into _SAVING state.
    (FINISH_MIGRATE, _DEVICE, _SAVING)
    For the VFIO device, iterate in .save_live_complete_precopy until
    pending data is 0.
    (FINISH_MIGRATE, _DEVICE, _STOPPED)
                        |
    (FINISH_MIGRATE, _COMPLETED, _STOPPED)
    Migration thread schedules the cleanup bottom half and exits

Live migration resume path:
    Incoming migration calls .load_setup for each device
    (RESTORE_VM, _ACTIVE, _STOPPED)
                        |
    For each device, .load_state is called for that device section data
                        |
    At the end, .load_cleanup is called for each device and vCPUs are started.
                        |
        (RUNNING, _NONE, _RUNNING)

Note that:
- Migration post copy is not supported.
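
For reference, the VFIO device states named in the tuples above are
combinations of the device_state flag bits introduced in patch 01 of this
series. A small illustrative sketch (the PHASE_* names below are only for this
example and are not part of the series):

    /* Illustrative only: device_state values behind the tuples above */
    enum {
        PHASE_STOPPED  = 0,                                                    /* all bits clear */
        PHASE_RUNNING  = VFIO_DEVICE_STATE_RUNNING,                            /* normal running */
        PHASE_PRECOPY  = VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING, /* pre-copy */
        PHASE_STOPCOPY = VFIO_DEVICE_STATE_SAVING,                             /* stop-and-copy */
        PHASE_RESUMING = VFIO_DEVICE_STATE_RESUMING,                           /* resume at target */
    };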

v7 -> v8:
- Updated comments for KABI
- Added BAR address validation check during PCI device's config space load as
  suggested by Dr. David Alan Gilbert.
- Changed vfio_migration_set_state() to set or clear device state flags.
- Some nit fixes.

v6 -> v7:
- Fix build failures.

v5 -> v6:
- Fix build failure.

v4 -> v5:
- Added a descriptive comment about the sequence of access to members of the
  vfio_device_migration_info structure, based on Alex's suggestion.
- Updated get dirty pages sequence.
- As per Cornelia Huck's suggestion, added callbacks to VFIODeviceOps to
  get_object, save_config and load_config.
- Fixed multiple nit picks.
- Tested live migration with multiple vfio devices assigned to a VM.

v3 -> v4:
- Added one more bit for _RESUMING flag to be set explicitly.
- data_offset field is read-only for user space application.
- data_size is read on every iteration before reading data from the migration
  region; this removes the assumption that data extends to the end of the
  migration region.
- If the vendor driver supports mappable sparse regions, map those regions
  during the setup state of save/load and unmap them in the cleanup routines.
- Handled a race condition that caused data corruption in the migration region
  while saving device state, by adding a mutex and serializing the save_buffer
  and get_dirty_pages routines.
- Skip calling the get_dirty_pages routine for the device's mapped MMIO region.
- Added trace events.
- Split into multiple functional patches.

v2 -> v3:
- Removed enum of VFIO device states. Defined VFIO device state with 2 bits.
- Re-structured vfio_device_migration_info to keep it minimal and defined action
  on read and write access on its members.

v1 -> v2:
- Defined MIGRATION region type and sub-type which should be used with region
  type capability.
- Re-structured vfio_device_migration_info. This structure will be placed at 0th
  offset of migration region.
- Replaced ioctl with read/write for trapped part of migration region.
- Added support for both types of access, trapped or mmapped, for the data
  section of the region.
- Moved PCI device functions to pci file.
- Added iteration to get the dirty page bitmap until the bitmap for all
  requested pages is copied.

Thanks,
Kirti


Kirti Wankhede (13):
  vfio: KABI for migration interface
  vfio: Add function to unmap VFIO region
  vfio: Add vfio_get_object callback to VFIODeviceOps
  vfio: Add save and load functions for VFIO PCI devices
  vfio: Add migration region initialization and finalize function
  vfio: Add VM state change handler to know state of VM
  vfio: Add migration state change notifier
  vfio: Register SaveVMHandlers for VFIO device
  vfio: Add save state functions to SaveVMHandlers
  vfio: Add load state functions to SaveVMHandlers
  vfio: Add function to get dirty page list
  vfio: Add vfio_listener_log_sync to mark dirty pages
  vfio: Make vfio-pci device migration capable.

 hw/vfio/Makefile.objs         |   2 +-
 hw/vfio/common.c              |  55 +++
 hw/vfio/migration.c           | 848 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                 | 191 +++++++++-
 hw/vfio/trace-events          |  19 +
 include/hw/vfio/vfio-common.h |  22 ++
 linux-headers/linux/vfio.h    | 148 ++++++++
 7 files changed, 1278 insertions(+), 7 deletions(-)
 create mode 100644 hw/vfio/migration.c

-- 
2.7.0




* [Qemu-devel] [PATCH v8 01/13] vfio: KABI for migration interface
  2019-08-26 18:55 [Qemu-devel] [PATCH v8 00/13] Add migration support for VFIO device Kirti Wankhede
@ 2019-08-26 18:55 ` Kirti Wankhede
  2019-08-28 20:50   ` Alex Williamson
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 02/13] vfio: Add function to unmap VFIO region Kirti Wankhede
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 34+ messages in thread
From: Kirti Wankhede @ 2019-08-26 18:55 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

- Defined MIGRATION region type and sub-type.
- Used 3 bits to define VFIO device states.
    Bit 0 => _RUNNING
    Bit 1 => _SAVING
    Bit 2 => _RESUMING
    Combinations of these bits define the VFIO device's state during migration:
    _STOPPED => All bits 0 indicates the VFIO device is stopped.
    _RUNNING => Normal VFIO device running state.
    _SAVING | _RUNNING => vCPUs are running, VFIO device is running but starts
                          saving device state, i.e. pre-copy state
    _SAVING  => vCPUs are stopped, VFIO device should be stopped and should
                          save device state, i.e. stop-and-copy state
    _RESUMING => VFIO device resuming state.
    _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING bits are set
    Bits 3 - 31 are reserved for future use. User should perform
    read-modify-write operation on this field.
- Defined vfio_device_migration_info structure which will be placed at 0th
  offset of migration region to get/set VFIO device related information.
  Defined members of structure and usage on read/write access:
    * device_state: (read/write)
        To convey VFIO device state to be transitioned to. Only 3 bits are used
        as of now, Bits 3 - 31 are reserved for future use.
    * pending_bytes: (read only)
        To get the pending bytes yet to be migrated for the VFIO device.
    * data_offset: (read only)
        To get the offset in the migration region from where data exists
        during _SAVING, where data should be written by the user space
        application during _RESUMING, and from where the dirty pages bitmap
        should be read.
    * data_size: (read/write)
        To get and set size of data copied in migration region during _SAVING
        and _RESUMING state.
    * start_pfn, page_size, total_pfns: (write only)
        To request the bitmap of dirty pages from the vendor driver for
        total_pfns starting at the given start address.
    * copied_pfns: (read only)
        To get the number of pfns for which the bitmap was copied to the
        migration region. Vendor driver should copy the bitmap with bits set
        only for pages to be marked dirty in the migration region. Vendor
        driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if there are 0 pages
        dirty in the requested range, and VFIO_DEVICE_DIRTY_PFNS_ALL to mark
        all pages in the section as dirty. (A query sketch follows this list.)
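
A minimal sketch of the dirty-page bitmap query built from these fields,
assuming the header from patch 01 is available in linux/vfio.h, the region's
control header is trapped, and region_offset is the migration region's offset
within the device fd (error handling elided; all names are placeholders):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>
    #include <unistd.h>
    #include <linux/vfio.h>

    /* Illustrative only: query dirty bitmap for [start, start + count) pages */
    static void sketch_get_dirty_bitmap(int device_fd, off_t region_offset,
                                        uint64_t start, uint64_t pgsize,
                                        uint64_t count, uint8_t *bitmap)
    {
        uint64_t copied, data_offset;

        /* write start_pfn, page_size and total_pfns to request the bitmap */
        pwrite(device_fd, &start, sizeof(start), region_offset +
               offsetof(struct vfio_device_migration_info, start_pfn));
        pwrite(device_fd, &pgsize, sizeof(pgsize), region_offset +
               offsetof(struct vfio_device_migration_info, page_size));
        pwrite(device_fd, &count, sizeof(count), region_offset +
               offsetof(struct vfio_device_migration_info, total_pfns));

        /* read copied_pfns to learn how the vendor driver answered */
        pread(device_fd, &copied, sizeof(copied), region_offset +
              offsetof(struct vfio_device_migration_info, copied_pfns));

        if (copied == VFIO_DEVICE_DIRTY_PFNS_NONE) {
            return;                                 /* nothing dirty in range */
        } else if (copied == VFIO_DEVICE_DIRTY_PFNS_ALL) {
            memset(bitmap, 0xff, (count + 7) / 8);  /* whole range is dirty */
            return;
        }

        /* the bitmap for 'copied' pages is staged at data_offset */
        pread(device_fd, &data_offset, sizeof(data_offset), region_offset +
              offsetof(struct vfio_device_migration_info, data_offset));
        pread(device_fd, bitmap, (copied + 7) / 8, region_offset + data_offset);

        /* the caller loops, advancing start, until it has covered count pfns */
    }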

Migration region looks like:
 ------------------------------------------------------------------
|vfio_device_migration_info|    data section                      |
|                          |     ///////////////////////////////  |
 ------------------------------------------------------------------
 ^                              ^                              ^
 offset 0-trapped part        data_offset                 data_size

The data section always follows the vfio_device_migration_info
structure in the region, so data_offset will always be non-0.
The offset from which data is copied is decided by the kernel driver; the
data section can be trapped or mapped depending on how the kernel driver
defines it. If mmapped, data_offset should be page aligned, whereas the
initial section which contains the vfio_device_migration_info structure
might not end at a page-aligned offset.
data_offset can be the same or different for device data and the dirty
pages bitmap. The vendor driver should decide whether and how to partition
the data section and should return data_offset accordingly.

For user application, data is opaque. User should write data in the same
order as received.
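
To make that concrete, a hypothetical sketch of one save iteration (pre-copy or
stop-and-copy) as seen from user space, using the same assumptions and
placeholder names as the dirty-bitmap sketch above, a trapped data section, and
with buffer sizing and error handling elided:

    /* Illustrative only: one pass of the pending_bytes/data_offset/data_size loop */
    static ssize_t sketch_save_one_iteration(int device_fd, off_t region_offset,
                                             void *buf)
    {
        uint64_t pending, data_offset, data_size;

        /* read pending_bytes; zero means nothing left for this phase */
        pread(device_fd, &pending, sizeof(pending), region_offset +
              offsetof(struct vfio_device_migration_info, pending_bytes));
        if (!pending) {
            return 0;
        }

        /* reading data_offset asks the vendor driver to stage device data */
        pread(device_fd, &data_offset, sizeof(data_offset), region_offset +
              offsetof(struct vfio_device_migration_info, data_offset));

        /* data_size is the number of bytes the vendor driver staged */
        pread(device_fd, &data_size, sizeof(data_size), region_offset +
              offsetof(struct vfio_device_migration_info, data_size));

        /* read data_size bytes of opaque device data from data_offset */
        pread(device_fd, buf, data_size, region_offset + data_offset);

        /* the caller forwards the data and repeats until pending reaches 0 */
        return data_size;
    }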

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 linux-headers/linux/vfio.h | 148 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 148 insertions(+)

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 24f505199f83..4bc0236b0898 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -372,6 +372,154 @@ struct vfio_region_gfx_edid {
  */
 #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
 
+/* Migration region type and sub-type */
+#define VFIO_REGION_TYPE_MIGRATION	        (3)
+#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
+
+/**
+ * Structure vfio_device_migration_info is placed at 0th offset of
+ * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
+ * information. Field accesses from this structure are only supported at their
+ * native width and alignment, otherwise the result is undefined and vendor
+ * drivers should return an error.
+ *
+ * device_state: (read/write)
+ *      To indicate to the vendor driver the state the VFIO device should be
+ *      transitioned to. If the state transition fails, a write to this field
+ *      returns an error. It consists of 3 bits:
+ *      - If bit 0 is set, it indicates _RUNNING state. When it is cleared, that
+ *        indicates _STOPPED state. When the device is changed to _STOPPED, the
+ *        driver should stop the device before write() returns.
+ *      - If bit 1 set, indicates _SAVING state.
+ *      - If bit 2 set, indicates _RESUMING state.
+ *      Bits 3 - 31 are reserved for future use. User should perform
+ *      read-modify-write operation on this field.
+ *      _SAVING and _RESUMING bits set at the same time is invalid state.
+ *
+ * pending_bytes: (read only)
+ *      Number of pending bytes yet to be migrated from vendor driver
+ *
+ * data_offset: (read only)
+ *      User application should read data_offset in the migration region from
+ *      where it should read device data during _SAVING state, write device
+ *      data during _RESUMING state, or read the dirty pages bitmap. See below
+ *      for details of the sequence to be followed.
+ *
+ * data_size: (read/write)
+ *      User application should read data_size to get size of data copied in
+ *      migration region during _SAVING state and write size of data copied in
+ *      migration region during _RESUMING state.
+ *
+ * start_pfn: (write only)
+ *      Start address pfn to get bitmap of dirty pages from vendor driver during
+ *      _SAVING state.
+ *
+ * page_size: (write only)
+ *      User application should write the page_size of pfn.
+ *
+ * total_pfns: (write only)
+ *      Total pfn count from start_pfn for which dirty bitmap is requested.
+ *
+ * copied_pfns: (read only)
+ *      pfn count for which dirty bitmap is copied to migration region.
+ *      Vendor driver should copy the bitmap with bits set only for pages to be
+ *      marked dirty in migration region.
+ *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if none of the
+ *        pages are dirty in requested range or rest of the range.
+ *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to mark all
+ *        pages dirty in the given range or rest of the range.
+ *      - Vendor driver should return pfn count for which bitmap is written in
+ *        the region.
+ *
+ * Migration region looks like:
+ *  ------------------------------------------------------------------
+ * |vfio_device_migration_info|    data section                      |
+ * |                          |     ///////////////////////////////  |
+ * ------------------------------------------------------------------
+ *   ^                              ^                             ^
+ *  offset 0-trapped part        data_offset                 data_size
+ *
+ * The data section always follows the vfio_device_migration_info structure
+ * in the region, so data_offset will always be non-0. The offset from which
+ * data is copied is decided by the kernel driver; the data section can be
+ * trapped, mapped or partitioned, depending on how the kernel driver defines it.
+ * A data section partition can be defined as mapped by sparse mmap capability.
+ * If mmapped, then data_offset should be page aligned, whereas the initial
+ * section which contains the vfio_device_migration_info structure might not
+ * end at an offset which is page aligned.
+ * Data_offset can be same or different for device data and dirty pages bitmap.
+ * Vendor driver should decide whether to partition data section and how to
+ * partition the data section. Vendor driver should return data_offset
+ * accordingly.
+ *
+ * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
+ * and for _SAVING device state or stop-and-copy phase:
+ * a. read pending_bytes. If pending_bytes > 0, go through below steps.
+ * b. read data_offset, indicates kernel driver to write data to staging buffer.
+ * c. read data_size, amount of data in bytes written by vendor driver in
+ *    migration region.
+ * d. read data_size bytes of data from data_offset in the migration region.
+ * e. process data.
+ * f. Loop through a to e.
+ *
+ * To copy system memory content during migration, vendor driver should be able
+ * to report system memory pages which are dirtied by that driver. For such
+ * dirty page reporting, user application should query for a range of GFNs
+ * relative to device address space (IOVA), then vendor driver should provide
+ * the bitmap of pages from this range which are dirtied by it through the
+ * migration region, where each bit represents a page and a bit set to 1 means
+ * that the page is dirty.
+ * User space application should take care of copying content of system memory
+ * for those pages.
+ *
+ * Steps to get dirty page bitmap:
+ * a. write start_pfn, page_size and total_pfns.
+ * b. read copied_pfns. Vendor driver should take one of the below action:
+ *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if driver
+ *       doesn't have any page to report dirty in given range or rest of the
+ *       range. Exit the loop.
+ *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to mark all
+ *       pages dirty for given range or rest of the range. User space application
+ *       should mark all pages in the range as dirty and exit the loop.
+ *     - Vendor driver should return copied_pfns and provide bitmap for
+ *       copied_pfn in migration region.
+ * c. read data_offset, where vendor driver has written bitmap.
+ * d. read bitmap from the migration region from data_offset.
+ * e. Iterate through steps a to d while (total copied_pfns < total_pfns)
+ *
+ * Sequence to be followed while _RESUMING device state:
+ * While data for this device is available, repeat below steps:
+ * a. read data_offset from where user application should write data.
+ * b. write data of data_size to migration region from data_offset.
+ * c. write data_size, which indicates to the vendor driver that data has been
+ *    written to the staging buffer.
+ *
+ * For user application, data is opaque. User should write data in the same
+ * order as received.
+ */
+
+struct vfio_device_migration_info {
+        __u32 device_state;         /* VFIO device state */
+#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
+#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
+#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
+#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
+                                     VFIO_DEVICE_STATE_SAVING | \
+                                     VFIO_DEVICE_STATE_RESUMING)
+#define VFIO_DEVICE_STATE_INVALID   (VFIO_DEVICE_STATE_SAVING | \
+                                     VFIO_DEVICE_STATE_RESUMING)
+        __u32 reserved;
+        __u64 pending_bytes;
+        __u64 data_offset;
+        __u64 data_size;
+        __u64 start_pfn;
+        __u64 page_size;
+        __u64 total_pfns;
+        __u64 copied_pfns;
+#define VFIO_DEVICE_DIRTY_PFNS_NONE     (0)
+#define VFIO_DEVICE_DIRTY_PFNS_ALL      (~0ULL)
+} __attribute__((packed));
+
 /*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within
-- 
2.7.0




* [Qemu-devel] [PATCH v8 02/13] vfio: Add function to unmap VFIO region
  2019-08-26 18:55 [Qemu-devel] [PATCH v8 00/13] Add migration support for VFIO device Kirti Wankhede
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 01/13] vfio: KABI for migration interface Kirti Wankhede
@ 2019-08-26 18:55 ` Kirti Wankhede
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 03/13] vfio: Add vfio_get_object callback to VFIODeviceOps Kirti Wankhede
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Kirti Wankhede @ 2019-08-26 18:55 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

This function will be used for the migration region.
The migration region is mmapped when migration starts and will be unmapped
when migration completes.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
---
 hw/vfio/common.c              | 20 ++++++++++++++++++++
 hw/vfio/trace-events          |  1 +
 include/hw/vfio/vfio-common.h |  1 +
 3 files changed, 22 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 3e03c495d868..c33c6684c06f 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -966,6 +966,26 @@ int vfio_region_mmap(VFIORegion *region)
     return 0;
 }
 
+void vfio_region_unmap(VFIORegion *region)
+{
+    int i;
+
+    if (!region->mem) {
+        return;
+    }
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        trace_vfio_region_unmap(memory_region_name(&region->mmaps[i].mem),
+                                region->mmaps[i].offset,
+                                region->mmaps[i].offset +
+                                region->mmaps[i].size - 1);
+        memory_region_del_subregion(region->mem, &region->mmaps[i].mem);
+        munmap(region->mmaps[i].mmap, region->mmaps[i].size);
+        object_unparent(OBJECT(&region->mmaps[i].mem));
+        region->mmaps[i].mmap = NULL;
+    }
+}
+
 void vfio_region_exit(VFIORegion *region)
 {
     int i;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index b1ef55a33ffd..8cdc27946cb8 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -111,6 +111,7 @@ vfio_region_mmap(const char *name, unsigned long offset, unsigned long end) "Reg
 vfio_region_exit(const char *name, int index) "Device %s, region %d"
 vfio_region_finalize(const char *name, int index) "Device %s, region %d"
 vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
+vfio_region_unmap(const char *name, unsigned long offset, unsigned long end) "Region %s unmap [0x%lx - 0x%lx]"
 vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Device %s region %d: %d sparse mmap entries"
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 9107bd41c030..93493891ba40 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -171,6 +171,7 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
                       int index, const char *name);
 int vfio_region_mmap(VFIORegion *region);
 void vfio_region_mmaps_set_enabled(VFIORegion *region, bool enabled);
+void vfio_region_unmap(VFIORegion *region);
 void vfio_region_exit(VFIORegion *region);
 void vfio_region_finalize(VFIORegion *region);
 void vfio_reset_handler(void *opaque);
-- 
2.7.0




* [Qemu-devel] [PATCH v8 03/13] vfio: Add vfio_get_object callback to VFIODeviceOps
  2019-08-26 18:55 [Qemu-devel] [PATCH v8 00/13] Add migration support for VFIO device Kirti Wankhede
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 01/13] vfio: KABI for migration interface Kirti Wankhede
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 02/13] vfio: Add function to unmap VFIO region Kirti Wankhede
@ 2019-08-26 18:55 ` Kirti Wankhede
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 04/13] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Kirti Wankhede @ 2019-08-26 18:55 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Hook vfio_get_object callback for PCI devices.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Suggested-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
---
 hw/vfio/pci.c                 | 8 ++++++++
 include/hw/vfio/vfio-common.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index dc3479c374e3..56166cae824f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2393,10 +2393,18 @@ static void vfio_pci_compute_needs_reset(VFIODevice *vbasedev)
     }
 }
 
+static Object *vfio_pci_get_object(VFIODevice *vbasedev)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+
+    return OBJECT(vdev);
+}
+
 static VFIODeviceOps vfio_pci_ops = {
     .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
     .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
     .vfio_eoi = vfio_intx_eoi,
+    .vfio_get_object = vfio_pci_get_object,
 };
 
 int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 93493891ba40..771b6d59a3db 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -119,6 +119,7 @@ struct VFIODeviceOps {
     void (*vfio_compute_needs_reset)(VFIODevice *vdev);
     int (*vfio_hot_reset_multi)(VFIODevice *vdev);
     void (*vfio_eoi)(VFIODevice *vdev);
+    Object *(*vfio_get_object)(VFIODevice *vdev);
 };
 
 typedef struct VFIOGroup {
-- 
2.7.0




* [Qemu-devel] [PATCH v8 04/13] vfio: Add save and load functions for VFIO PCI devices
  2019-08-26 18:55 [Qemu-devel] [PATCH v8 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (2 preceding siblings ...)
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 03/13] vfio: Add vfio_get_object callback to VFIODeviceOps Kirti Wankhede
@ 2019-08-26 18:55 ` Kirti Wankhede
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 05/13] vfio: Add migration region initialization and finalize function Kirti Wankhede
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Kirti Wankhede @ 2019-08-26 18:55 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

These functions save and restore PCI device specific data, i.e. the config
space of the PCI device.
Save and restore were tested with MSI and MSI-X interrupt types.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/pci.c                 | 168 ++++++++++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h |   2 +
 2 files changed, 170 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 56166cae824f..161068286592 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -40,6 +40,7 @@
 #include "pci.h"
 #include "trace.h"
 #include "qapi/error.h"
+#include "migration/qemu-file.h"
 
 #define TYPE_VFIO_PCI "vfio-pci"
 #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
@@ -1618,6 +1619,55 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
     }
 }
 
+static int vfio_bar_validate(VFIOPCIDevice *vdev, int nr)
+{
+    PCIDevice *pdev = &vdev->pdev;
+    VFIOBAR *bar = &vdev->bars[nr];
+    uint64_t addr;
+    uint32_t addr_lo, addr_hi = 0;
+
+    /* Skip unimplemented BARs and the upper half of 64bit BARS. */
+    if (!bar->size) {
+        return 0;
+    }
+
+    /* skip IO BAR */
+    if (bar->ioport) {
+        return 0;
+    }
+
+    addr_lo = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + nr * 4, 4);
+
+    addr_lo = addr_lo & (bar->ioport ? PCI_BASE_ADDRESS_IO_MASK :
+                                       PCI_BASE_ADDRESS_MEM_MASK);
+    if (bar->type == PCI_BASE_ADDRESS_MEM_TYPE_64) {
+        addr_hi = pci_default_read_config(pdev,
+                                         PCI_BASE_ADDRESS_0 + (nr + 1) * 4, 4);
+    }
+
+    addr = ((uint64_t)addr_hi << 32) | addr_lo;
+
+    if (!QEMU_IS_ALIGNED(addr, bar->size)) {
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
+static int vfio_bars_validate(VFIOPCIDevice *vdev)
+{
+    int i, ret;
+
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        ret = vfio_bar_validate(vdev, i);
+        if (ret) {
+            error_report("vfio: BAR address %d validation failed", i);
+            return ret;
+        }
+    }
+    return 0;
+}
+
 static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
 {
     VFIOBAR *bar = &vdev->bars[nr];
@@ -2400,11 +2450,129 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
     return OBJECT(vdev);
 }
 
+static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+    PCIDevice *pdev = &vdev->pdev;
+    uint16_t pci_cmd;
+    int i;
+
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        uint32_t bar;
+
+        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
+        qemu_put_be32(f, bar);
+    }
+
+    qemu_put_be32(f, vdev->interrupt);
+    if (vdev->interrupt == VFIO_INT_MSI) {
+        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
+        bool msi_64bit;
+
+        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+                                            2);
+        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
+
+        msi_addr_lo = pci_default_read_config(pdev,
+                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
+        qemu_put_be32(f, msi_addr_lo);
+
+        if (msi_64bit) {
+            msi_addr_hi = pci_default_read_config(pdev,
+                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
+                                             4);
+        }
+        qemu_put_be32(f, msi_addr_hi);
+
+        msi_data = pci_default_read_config(pdev,
+                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
+                2);
+        qemu_put_be32(f, msi_data);
+    } else if (vdev->interrupt == VFIO_INT_MSIX) {
+        uint16_t offset;
+
+        /* save enable bit and maskall bit */
+        offset = pci_default_read_config(pdev,
+                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
+        qemu_put_be16(f, offset);
+        msix_save(pdev, f);
+    }
+    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
+    qemu_put_be16(f, pci_cmd);
+}
+
+static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+    PCIDevice *pdev = &vdev->pdev;
+    uint32_t interrupt_type;
+    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
+    uint16_t pci_cmd;
+    bool msi_64bit;
+    int i, ret;
+
+    /* restore pci bar configuration */
+    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
+    vfio_pci_write_config(pdev, PCI_COMMAND,
+                        pci_cmd & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        uint32_t bar = qemu_get_be32(f);
+
+        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
+    }
+
+    ret = vfio_bars_validate(vdev);
+    if (ret) {
+        return ret;
+    }
+
+    interrupt_type = qemu_get_be32(f);
+
+    if (interrupt_type == VFIO_INT_MSI) {
+        /* restore msi configuration */
+        msi_flags = pci_default_read_config(pdev,
+                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
+        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
+
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+                              msi_flags & (!PCI_MSI_FLAGS_ENABLE), 2);
+
+        msi_addr_lo = qemu_get_be32(f);
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
+                              msi_addr_lo, 4);
+
+        msi_addr_hi = qemu_get_be32(f);
+        if (msi_64bit) {
+            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
+                                  msi_addr_hi, 4);
+        }
+        msi_data = qemu_get_be32(f);
+        vfio_pci_write_config(pdev,
+                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
+                msi_data, 2);
+
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
+    } else if (interrupt_type == VFIO_INT_MSIX) {
+        uint16_t offset = qemu_get_be16(f);
+
+        /* load enable bit and maskall bit */
+        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
+                              offset, 2);
+        msix_load(pdev, f);
+    }
+    pci_cmd = qemu_get_be16(f);
+    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
+    return 0;
+}
+
 static VFIODeviceOps vfio_pci_ops = {
     .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
     .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
     .vfio_eoi = vfio_intx_eoi,
     .vfio_get_object = vfio_pci_get_object,
+    .vfio_save_config = vfio_pci_save_config,
+    .vfio_load_config = vfio_pci_load_config,
 };
 
 int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 771b6d59a3db..6ea4898c4d7e 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -120,6 +120,8 @@ struct VFIODeviceOps {
     int (*vfio_hot_reset_multi)(VFIODevice *vdev);
     void (*vfio_eoi)(VFIODevice *vdev);
     Object *(*vfio_get_object)(VFIODevice *vdev);
+    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
+    int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
 };
 
 typedef struct VFIOGroup {
-- 
2.7.0




* [Qemu-devel] [PATCH v8 05/13] vfio: Add migration region initialization and finalize function
  2019-08-26 18:55 [Qemu-devel] [PATCH v8 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (3 preceding siblings ...)
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 04/13] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
@ 2019-08-26 18:55 ` Kirti Wankhede
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 06/13] vfio: Add VM state change handler to know state of VM Kirti Wankhede
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Kirti Wankhede @ 2019-08-26 18:55 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

- Migration functions are implemented for VFIO_DEVICE_TYPE_PCI devices in this
  patch series.
- Whether a VFIO device supports migration is decided based on the migration
  region query. If the migration region query and the migration region
  initialization are successful, then migration is supported; otherwise
  migration is blocked.
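
As a usage sketch only: a device front end would pair these entry points around
its own lifetime (the real vfio-pci wiring arrives in the last patch of this
series; the example_* names below are placeholders, not part of the series):

    /* Hypothetical hookup of the probe/finalize pair */
    static void example_realize(VFIODevice *vbasedev, Error **errp)
    {
        /* Queries the migration region; adds a migration blocker if absent */
        vfio_migration_probe(vbasedev, errp);
    }

    static void example_exit(VFIODevice *vbasedev)
    {
        /* Drops the blocker, if any, and releases the migration region */
        vfio_migration_finalize(vbasedev);
    }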

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/Makefile.objs         |   2 +-
 hw/vfio/migration.c           | 140 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events          |   3 +
 include/hw/vfio/vfio-common.h |  11 ++++
 4 files changed, 155 insertions(+), 1 deletion(-)
 create mode 100644 hw/vfio/migration.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index abad8b818c9b..36033d1437c5 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,4 +1,4 @@
-obj-y += common.o spapr.o
+obj-y += common.o spapr.o migration.o
 obj-$(CONFIG_VFIO_PCI) += pci.o pci-quirks.o display.o
 obj-$(CONFIG_VFIO_CCW) += ccw.o
 obj-$(CONFIG_VFIO_PLATFORM) += platform.o
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
new file mode 100644
index 000000000000..a1feeb7e1a5a
--- /dev/null
+++ b/hw/vfio/migration.c
@@ -0,0 +1,140 @@
+/*
+ * Migration support for VFIO devices
+ *
+ * Copyright NVIDIA, Inc. 2019
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <linux/vfio.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "cpu.h"
+#include "migration/migration.h"
+#include "migration/qemu-file.h"
+#include "migration/register.h"
+#include "migration/blocker.h"
+#include "migration/misc.h"
+#include "qapi/error.h"
+#include "exec/ramlist.h"
+#include "exec/ram_addr.h"
+#include "pci.h"
+#include "trace.h"
+
+static void vfio_migration_region_exit(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+
+    if (!migration) {
+        return;
+    }
+
+    if (migration->region.size) {
+        vfio_region_exit(&migration->region);
+        vfio_region_finalize(&migration->region);
+    }
+}
+
+static int vfio_migration_region_init(VFIODevice *vbasedev, int index)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    Object *obj = NULL;
+    int ret = -EINVAL;
+
+    if (!vbasedev->ops || !vbasedev->ops->vfio_get_object) {
+        return ret;
+    }
+
+    obj = vbasedev->ops->vfio_get_object(vbasedev);
+    if (!obj) {
+        return ret;
+    }
+
+    ret = vfio_region_setup(obj, vbasedev, &migration->region, index,
+                            "migration");
+    if (ret) {
+        error_report("%s: Failed to setup VFIO migration region %d: %s",
+                     vbasedev->name, index, strerror(-ret));
+        goto err;
+    }
+
+    if (!migration->region.size) {
+        ret = -EINVAL;
+        error_report("%s: Invalid region size of VFIO migration region %d: %s",
+                     vbasedev->name, index, strerror(-ret));
+        goto err;
+    }
+
+    return 0;
+
+err:
+    vfio_migration_region_exit(vbasedev);
+    return ret;
+}
+
+static int vfio_migration_init(VFIODevice *vbasedev,
+                               struct vfio_region_info *info)
+{
+    int ret;
+
+    vbasedev->migration = g_new0(VFIOMigration, 1);
+
+    ret = vfio_migration_region_init(vbasedev, info->index);
+    if (ret) {
+        error_report("%s: Failed to initialise migration region",
+                     vbasedev->name);
+        g_free(vbasedev->migration);
+        return ret;
+    }
+
+    return 0;
+}
+
+/* ---------------------------------------------------------------------- */
+
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
+{
+    struct vfio_region_info *info;
+    Error *local_err = NULL;
+    int ret;
+
+    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
+                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
+    if (ret) {
+        goto add_blocker;
+    }
+
+    ret = vfio_migration_init(vbasedev, info);
+    if (ret) {
+        goto add_blocker;
+    }
+
+    trace_vfio_migration_probe(vbasedev->name, info->index);
+    return 0;
+
+add_blocker:
+    error_setg(&vbasedev->migration_blocker,
+               "VFIO device doesn't support migration");
+    ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        error_free(vbasedev->migration_blocker);
+    }
+    return ret;
+}
+
+void vfio_migration_finalize(VFIODevice *vbasedev)
+{
+    if (vbasedev->migration_blocker) {
+        migrate_del_blocker(vbasedev->migration_blocker);
+        error_free(vbasedev->migration_blocker);
+    }
+
+    vfio_migration_region_exit(vbasedev);
+
+    if (vbasedev->migration) {
+        g_free(vbasedev->migration);
+    }
+}
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 8cdc27946cb8..191a726a1312 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -143,3 +143,6 @@ vfio_display_edid_link_up(void) ""
 vfio_display_edid_link_down(void) ""
 vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
 vfio_display_edid_write_error(void) ""
+
+# migration.c
+vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 6ea4898c4d7e..f80e04e26e1f 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -57,6 +57,12 @@ typedef struct VFIORegion {
     uint8_t nr; /* cache the region number for debug */
 } VFIORegion;
 
+typedef struct VFIOMigration {
+    VFIORegion region;
+    uint64_t pending_bytes;
+    QemuMutex lock;
+} VFIOMigration;
+
 typedef struct VFIOAddressSpace {
     AddressSpace *as;
     QLIST_HEAD(, VFIOContainer) containers;
@@ -113,6 +119,8 @@ typedef struct VFIODevice {
     unsigned int num_irqs;
     unsigned int num_regions;
     unsigned int flags;
+    VFIOMigration *migration;
+    Error *migration_blocker;
 } VFIODevice;
 
 struct VFIODeviceOps {
@@ -204,4 +212,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
 int vfio_spapr_remove_window(VFIOContainer *container,
                              hwaddr offset_within_address_space);
 
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
+void vfio_migration_finalize(VFIODevice *vbasedev);
+
 #endif /* HW_VFIO_VFIO_COMMON_H */
-- 
2.7.0




* [Qemu-devel] [PATCH v8 06/13] vfio: Add VM state change handler to know state of VM
  2019-08-26 18:55 [Qemu-devel] [PATCH v8 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (4 preceding siblings ...)
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 05/13] vfio: Add migration region initialization and finalize function Kirti Wankhede
@ 2019-08-26 18:55 ` Kirti Wankhede
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 07/13] vfio: Add migration state change notifier Kirti Wankhede
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Kirti Wankhede @ 2019-08-26 18:55 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

The VM state change handler gets called on a change in the VM's state. This is
used to set the VFIO device state to _RUNNING.
The VM state change handler, migration state change handler and log_sync
listener are called asynchronously, which can sometimes lead to data corruption
in the migration region. Initialized a mutex that is used to serialize
operations on the migration data region while saving state.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c           | 67 +++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events          |  2 ++
 include/hw/vfio/vfio-common.h |  4 +++
 3 files changed, 73 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index a1feeb7e1a5a..83057d909d49 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -10,6 +10,7 @@
 #include "qemu/osdep.h"
 #include <linux/vfio.h>
 
+#include "sysemu/runstate.h"
 #include "hw/vfio/vfio-common.h"
 #include "cpu.h"
 #include "migration/migration.h"
@@ -74,6 +75,65 @@ err:
     return ret;
 }
 
+static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t set_flags,
+                                    uint32_t clear_flags)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region;
+    uint32_t device_state;
+    int ret = 0;
+
+    /* same flags should not be set or clear */
+    assert(!(set_flags & clear_flags));
+
+    device_state = (vbasedev->device_state | set_flags) & ~clear_flags;
+
+    if ((device_state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_INVALID) {
+        return -EINVAL;
+    }
+
+    ret = pwrite(vbasedev->fd, &device_state, sizeof(device_state),
+                 region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                              device_state));
+    if (ret < 0) {
+        error_report("%s: Failed to set device state %d %s",
+                     vbasedev->name, ret, strerror(errno));
+        return ret;
+    }
+
+    vbasedev->device_state = device_state;
+    trace_vfio_migration_set_state(vbasedev->name, device_state);
+    return 0;
+}
+
+static void vfio_vmstate_change(void *opaque, int running, RunState state)
+{
+    VFIODevice *vbasedev = opaque;
+
+    if ((vbasedev->vm_running != running)) {
+        int ret;
+        uint32_t set_flags = 0, clear_flags = 0;
+
+        if (running) {
+            set_flags = VFIO_DEVICE_STATE_RUNNING;
+            if (vbasedev->device_state & VFIO_DEVICE_STATE_RESUMING) {
+                clear_flags = VFIO_DEVICE_STATE_RESUMING;
+            }
+        } else {
+            clear_flags = VFIO_DEVICE_STATE_RUNNING;
+        }
+
+        ret = vfio_migration_set_state(vbasedev, set_flags, clear_flags);
+        if (ret) {
+            error_report("%s: Failed to set device state 0x%x",
+                         vbasedev->name, set_flags & ~clear_flags);
+        }
+        vbasedev->vm_running = running;
+        trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
+                                  set_flags & ~clear_flags);
+    }
+}
+
 static int vfio_migration_init(VFIODevice *vbasedev,
                                struct vfio_region_info *info)
 {
@@ -89,6 +149,9 @@ static int vfio_migration_init(VFIODevice *vbasedev,
         return ret;
     }
 
+    vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
+                                                          vbasedev);
+
     return 0;
 }
 
@@ -127,6 +190,10 @@ add_blocker:
 
 void vfio_migration_finalize(VFIODevice *vbasedev)
 {
+    if (vbasedev->vm_state) {
+        qemu_del_vm_change_state_handler(vbasedev->vm_state);
+    }
+
     if (vbasedev->migration_blocker) {
         migrate_del_blocker(vbasedev->migration_blocker);
         error_free(vbasedev->migration_blocker);
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 191a726a1312..3d15bacd031a 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -146,3 +146,5 @@ vfio_display_edid_write_error(void) ""
 
 # migration.c
 vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
+vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
+vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index f80e04e26e1f..15be0358845b 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -29,6 +29,7 @@
 #ifdef CONFIG_LINUX
 #include <linux/vfio.h>
 #endif
+#include "sysemu/sysemu.h"
 
 #define VFIO_MSG_PREFIX "vfio %s: "
 
@@ -121,6 +122,9 @@ typedef struct VFIODevice {
     unsigned int flags;
     VFIOMigration *migration;
     Error *migration_blocker;
+    uint32_t device_state;
+    VMChangeStateEntry *vm_state;
+    int vm_running;
 } VFIODevice;
 
 struct VFIODeviceOps {
-- 
2.7.0




* [Qemu-devel] [PATCH v8 07/13] vfio: Add migration state change notifier
  2019-08-26 18:55 [Qemu-devel] [PATCH v8 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (5 preceding siblings ...)
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 06/13] vfio: Add VM state change handler to know state of VM Kirti Wankhede
@ 2019-08-26 18:55 ` Kirti Wankhede
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 08/13] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Kirti Wankhede @ 2019-08-26 18:55 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Added a migration state change notifier to get notifications on migration state
changes. These states are translated to the VFIO device state and conveyed to
the vendor driver.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c           | 28 ++++++++++++++++++++++++++++
 hw/vfio/trace-events          |  1 +
 include/hw/vfio/vfio-common.h |  1 +
 3 files changed, 30 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 83057d909d49..e97f1b0fe803 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -134,6 +134,26 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state)
     }
 }
 
+static void vfio_migration_state_notifier(Notifier *notifier, void *data)
+{
+    MigrationState *s = data;
+    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
+    int ret;
+
+    trace_vfio_migration_state_notifier(vbasedev->name, s->state);
+
+    switch (s->state) {
+    case MIGRATION_STATUS_CANCELLING:
+    case MIGRATION_STATUS_CANCELLED:
+    case MIGRATION_STATUS_FAILED:
+        ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING,
+                       VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING);
+        if (ret) {
+            error_report("%s: Failed to set state RUNNING", vbasedev->name);
+        }
+    }
+}
+
 static int vfio_migration_init(VFIODevice *vbasedev,
                                struct vfio_region_info *info)
 {
@@ -152,6 +172,9 @@ static int vfio_migration_init(VFIODevice *vbasedev,
     vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
                                                           vbasedev);
 
+    vbasedev->migration_state.notify = vfio_migration_state_notifier;
+    add_migration_state_change_notifier(&vbasedev->migration_state);
+
     return 0;
 }
 
@@ -190,6 +213,11 @@ add_blocker:
 
 void vfio_migration_finalize(VFIODevice *vbasedev)
 {
+
+    if (vbasedev->migration_state.notify) {
+        remove_migration_state_change_notifier(&vbasedev->migration_state);
+    }
+
     if (vbasedev->vm_state) {
         qemu_del_vm_change_state_handler(vbasedev->vm_state);
     }
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 3d15bacd031a..69503228f20e 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -148,3 +148,4 @@ vfio_display_edid_write_error(void) ""
 vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
 vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
 vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
+vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 15be0358845b..dcab8a4ae0f9 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -125,6 +125,7 @@ typedef struct VFIODevice {
     uint32_t device_state;
     VMChangeStateEntry *vm_state;
     int vm_running;
+    Notifier migration_state;
 } VFIODevice;
 
 struct VFIODeviceOps {
-- 
2.7.0




* [Qemu-devel] [PATCH v8 08/13] vfio: Register SaveVMHandlers for VFIO device
  2019-08-26 18:55 [Qemu-devel] [PATCH v8 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (6 preceding siblings ...)
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 07/13] vfio: Add migration state change notifier Kirti Wankhede
@ 2019-08-26 18:55 ` Kirti Wankhede
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 09/13] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Kirti Wankhede @ 2019-08-26 18:55 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Define flags to be used as delimiters in the migration file stream.
Added .save_setup and .save_cleanup functions. The migration region is mapped
and unmapped from these functions at the source during the saving or pre-copy
phase.
Set the VFIO device state depending on the VM's state. During live migration,
the VM is running when .save_setup is called, so the _SAVING | _RUNNING state
is set for the VFIO device. During save-restore, the VM is paused, so the
_SAVING state is set for the VFIO device.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c  | 71 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events |  2 ++
 2 files changed, 73 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index e97f1b0fe803..1910a913cde2 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -8,6 +8,7 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/main-loop.h"
 #include <linux/vfio.h>
 
 #include "sysemu/runstate.h"
@@ -24,6 +25,17 @@
 #include "pci.h"
 #include "trace.h"
 
+/*
+ * Flags used as delimiter:
+ * 0xffffffff => MSB 32-bit all 1s
+ * 0xef10     => emulated (virtual) function IO
+ * 0x0000     => 16-bits reserved for flags
+ */
+#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
+#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
+#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
+#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
+
 static void vfio_migration_region_exit(VFIODevice *vbasedev)
 {
     VFIOMigration *migration = vbasedev->migration;
@@ -106,6 +118,63 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t set_flags,
     return 0;
 }
 
+/* ---------------------------------------------------------------------- */
+
+static int vfio_save_setup(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
+
+    if (migration->region.mmaps) {
+        qemu_mutex_lock_iothread();
+        ret = vfio_region_mmap(&migration->region);
+        qemu_mutex_unlock_iothread();
+        if (ret) {
+            error_report("%s: Failed to mmap VFIO migration region %d: %s",
+                         vbasedev->name, migration->region.index,
+                         strerror(-ret));
+            return ret;
+        }
+    }
+
+    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_SAVING, 0);
+    if (ret) {
+        error_report("%s: Failed to set state SAVING", vbasedev->name);
+        return ret;
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    trace_vfio_save_setup(vbasedev->name);
+    return 0;
+}
+
+static void vfio_save_cleanup(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    if (migration->region.mmaps) {
+        vfio_region_unmap(&migration->region);
+    }
+    trace_vfio_save_cleanup(vbasedev->name);
+}
+
+static SaveVMHandlers savevm_vfio_handlers = {
+    .save_setup = vfio_save_setup,
+    .save_cleanup = vfio_save_cleanup,
+};
+
+/* ---------------------------------------------------------------------- */
+
 static void vfio_vmstate_change(void *opaque, int running, RunState state)
 {
     VFIODevice *vbasedev = opaque;
@@ -169,6 +238,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
         return ret;
     }
 
+    register_savevm_live(vbasedev->dev, "vfio", -1, 1, &savevm_vfio_handlers,
+                         vbasedev);
     vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
                                                           vbasedev);
 
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 69503228f20e..4bb43f18f315 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -149,3 +149,5 @@ vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
 vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
 vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
 vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
+vfio_save_setup(char *name) " (%s)"
+vfio_save_cleanup(char *name) " (%s)"
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [Qemu-devel] [PATCH v8 09/13] vfio: Add save state functions to SaveVMHandlers
  2019-08-26 18:55 [Qemu-devel] [PATCH v8 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (7 preceding siblings ...)
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 08/13] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
@ 2019-08-26 18:55 ` Kirti Wankhede
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 10/13] vfio: Add load " Kirti Wankhede
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Kirti Wankhede @ 2019-08-26 18:55 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
functions. These functions handle the pre-copy and stop-and-copy phases.

In the _SAVING|_RUNNING device state or pre-copy phase:
- read pending_bytes. If pending_bytes > 0, go through the steps below.
- read data_offset - this indicates to the kernel driver that it should write
  data to the staging buffer.
- read data_size - the amount of data in bytes written by the vendor driver in
  the migration region.
- read data_size bytes of data from data_offset in the migration region.
- write the data packet to the file stream as below:
  {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
   VFIO_MIG_FLAG_END_OF_STATE}

In the _SAVING device state or stop-and-copy phase:
a. read the config space of the device and save it to the migration file
   stream. This doesn't need to come from the vendor driver; any other special
   config state from the driver can be saved as data in a following iteration.
b. read pending_bytes. If pending_bytes > 0, go through the steps below.
c. read data_offset - this indicates to the kernel driver that it should write
   data to the staging buffer.
d. read data_size - the amount of data in bytes written by the vendor driver in
   the migration region.
e. read data_size bytes of data from data_offset in the migration region.
f. write the data packet as below:
   {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
g. iterate through steps b to f while (pending_bytes > 0)
h. write {VFIO_MIG_FLAG_END_OF_STATE}

When the data region is mapped, it is the user's responsibility to read
data_size bytes of data from data_offset before moving on to the next step.
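
To make the read side of this sequence concrete, below is a minimal user-space
sketch of one pre-copy iteration. It is illustrative only: the structure layout
mirrors the KABI proposed in patch 1 of this series, the save_one_iteration()
and region_fd_offset names are invented for the example, and the stream framing
QEMU adds around each chunk (the VFIO_MIG_FLAG_* delimiters, big-endian
integers) is omitted for brevity.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Trapped header of the migration region, as proposed in patch 1 of this
 * series (illustrative copy, not an upstream linux/vfio.h definition). */
struct vfio_device_migration_info {
    uint32_t device_state;
    uint32_t reserved;
    uint64_t pending_bytes;
    uint64_t data_offset;
    uint64_t data_size;
    uint64_t start_pfn;
    uint64_t page_size;
    uint64_t total_pfns;
    uint64_t copied_pfns;
};

/* One pre-copy iteration: device_fd is the VFIO device fd, region_fd_offset is
 * the file offset of the migration region, out is the migration stream.
 * Returns bytes transferred, 0 if nothing is pending, -1 on error. */
static int64_t save_one_iteration(int device_fd, off_t region_fd_offset, FILE *out)
{
    uint64_t pending = 0, data_offset = 0, data_size = 0;
    off_t info = region_fd_offset;

    /* a. read pending_bytes; stop if the vendor driver reports nothing */
    if (pread(device_fd, &pending, sizeof(pending),
              info + offsetof(struct vfio_device_migration_info, pending_bytes))
            != sizeof(pending)) {
        return -1;
    }
    if (pending == 0) {
        return 0;
    }

    /* b. reading data_offset asks the driver to stage the next chunk */
    if (pread(device_fd, &data_offset, sizeof(data_offset),
              info + offsetof(struct vfio_device_migration_info, data_offset))
            != sizeof(data_offset)) {
        return -1;
    }

    /* c. data_size is how much the driver actually staged */
    if (pread(device_fd, &data_size, sizeof(data_size),
              info + offsetof(struct vfio_device_migration_info, data_size))
            != sizeof(data_size)) {
        return -1;
    }

    /* d. copy data_size bytes from data_offset in the region to the stream;
     *    when the data section is mmap'ed this becomes a plain memcpy */
    void *buf = malloc(data_size);
    if (!buf ||
        pread(device_fd, buf, data_size, info + data_offset) != (ssize_t)data_size) {
        free(buf);
        return -1;
    }
    fwrite(&data_size, sizeof(data_size), 1, out); /* QEMU uses qemu_put_be64() */
    fwrite(buf, 1, data_size, out);
    free(buf);

    return (int64_t)data_size;
}

In the stop-and-copy phase the same loop is simply repeated, preceded by the
device config space, until pending_bytes reads back as zero.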

.save_live_iterate runs outside the iothread lock in the migration case, which
could race with an asynchronous call to get the dirty page list, causing data
corruption in the mapped migration region. A mutex is added here to serialize
the migration buffer read operations.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c  | 251 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 hw/vfio/trace-events |   6 ++
 2 files changed, 256 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 1910a913cde2..3b81c1d6f5b3 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -118,6 +118,137 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t set_flags,
     return 0;
 }
 
+static void *find_data_region(VFIORegion *region,
+                              uint64_t data_offset,
+                              uint64_t data_size)
+{
+    void *ptr = NULL;
+    int i;
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        if ((data_offset >= region->mmaps[i].offset) &&
+            (data_offset < region->mmaps[i].offset + region->mmaps[i].size) &&
+            (data_size <= region->mmaps[i].size)) {
+            ptr = region->mmaps[i].mmap + (data_offset -
+                                           region->mmaps[i].offset);
+            break;
+        }
+    }
+    return ptr;
+}
+
+static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region;
+    uint64_t data_offset = 0, data_size = 0;
+    int ret;
+
+    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             data_offset));
+    if (ret != sizeof(data_offset)) {
+        error_report("%s: Failed to get migration buffer data offset %d",
+                     vbasedev->name, ret);
+        return -EINVAL;
+    }
+
+    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             data_size));
+    if (ret != sizeof(data_size)) {
+        error_report("%s: Failed to get migration buffer data size %d",
+                     vbasedev->name, ret);
+        return -EINVAL;
+    }
+
+    if (data_size > 0) {
+        void *buf = NULL;
+        bool buffer_mmaped;
+
+        if (region->mmaps) {
+            buf = find_data_region(region, data_offset, data_size);
+        }
+
+        buffer_mmaped = (buf != NULL) ? true : false;
+
+        if (!buffer_mmaped) {
+            buf = g_try_malloc0(data_size);
+            if (!buf) {
+                error_report("%s: Error allocating buffer ", __func__);
+                return -ENOMEM;
+            }
+
+            ret = pread(vbasedev->fd, buf, data_size,
+                        region->fd_offset + data_offset);
+            if (ret != data_size) {
+                error_report("%s: Failed to get migration data %d",
+                             vbasedev->name, ret);
+                g_free(buf);
+                return -EINVAL;
+            }
+        }
+
+        qemu_put_be64(f, data_size);
+        qemu_put_buffer(f, buf, data_size);
+
+        if (!buffer_mmaped) {
+            g_free(buf);
+        }
+    } else {
+        qemu_put_be64(f, data_size);
+    }
+
+    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
+                           migration->pending_bytes);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    return data_size;
+}
+
+static int vfio_update_pending(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region;
+    uint64_t pending_bytes = 0;
+    int ret;
+
+    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             pending_bytes));
+    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
+        error_report("%s: Failed to get pending bytes %d",
+                     vbasedev->name, ret);
+        migration->pending_bytes = 0;
+        return (ret < 0) ? ret : -EINVAL;
+    }
+
+    migration->pending_bytes = pending_bytes;
+    trace_vfio_update_pending(vbasedev->name, pending_bytes);
+    return 0;
+}
+
+static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
+
+    if (vbasedev->ops && vbasedev->ops->vfio_save_config) {
+        vbasedev->ops->vfio_save_config(vbasedev, f);
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    trace_vfio_save_device_config_state(vbasedev->name);
+
+    return qemu_file_get_error(f);
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -134,7 +265,7 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
         qemu_mutex_unlock_iothread();
         if (ret) {
             error_report("%s: Failed to mmap VFIO migration region %d: %s",
-                         vbasedev->name, migration->region.index,
+                         vbasedev->name, migration->region.nr,
                          strerror(-ret));
             return ret;
         }
@@ -168,9 +299,124 @@ static void vfio_save_cleanup(void *opaque)
     trace_vfio_save_cleanup(vbasedev->name);
 }
 
+static void vfio_save_pending(QEMUFile *f, void *opaque,
+                              uint64_t threshold_size,
+                              uint64_t *res_precopy_only,
+                              uint64_t *res_compatible,
+                              uint64_t *res_postcopy_only)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    ret = vfio_update_pending(vbasedev);
+    if (ret) {
+        return;
+    }
+
+    *res_precopy_only += migration->pending_bytes;
+
+    trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
+                            *res_postcopy_only, *res_compatible);
+}
+
+static int vfio_save_iterate(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret, data_size;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
+
+    qemu_mutex_lock(&migration->lock);
+    data_size = vfio_save_buffer(f, vbasedev);
+    qemu_mutex_unlock(&migration->lock);
+
+    if (data_size < 0) {
+        error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
+                     strerror(errno));
+        return data_size;
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    trace_vfio_save_iterate(vbasedev->name, data_size);
+    if (data_size == 0) {
+        /* indicates data finished, goto complete phase */
+        return 1;
+    }
+
+    return 0;
+}
+
+static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_SAVING,
+                                   VFIO_DEVICE_STATE_RUNNING);
+    if (ret) {
+        error_report("%s: Failed to set state STOP and SAVING",
+                     vbasedev->name);
+        return ret;
+    }
+
+    ret = vfio_save_device_config_state(f, opaque);
+    if (ret) {
+        return ret;
+    }
+
+    ret = vfio_update_pending(vbasedev);
+    if (ret) {
+        return ret;
+    }
+
+    while (migration->pending_bytes > 0) {
+        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
+        ret = vfio_save_buffer(f, vbasedev);
+        if (ret < 0) {
+            error_report("%s: Failed to save buffer", vbasedev->name);
+            return ret;
+        } else if (ret == 0) {
+            break;
+        }
+
+        ret = vfio_update_pending(vbasedev);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    ret = vfio_migration_set_state(vbasedev, 0, VFIO_DEVICE_STATE_SAVING);
+    if (ret) {
+        error_report("%s: Failed to set state STOPPED", vbasedev->name);
+        return ret;
+    }
+
+    trace_vfio_save_complete_precopy(vbasedev->name);
+    return ret;
+}
+
 static SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
+    .save_live_pending = vfio_save_pending,
+    .save_live_iterate = vfio_save_iterate,
+    .save_live_complete_precopy = vfio_save_complete_precopy,
 };
 
 /* ---------------------------------------------------------------------- */
@@ -238,6 +484,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
         return ret;
     }
 
+    qemu_mutex_init(&vbasedev->migration->lock);
+
     register_savevm_live(vbasedev->dev, "vfio", -1, 1, &savevm_vfio_handlers,
                          vbasedev);
     vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
@@ -298,6 +546,7 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
         error_free(vbasedev->migration_blocker);
     }
 
+    qemu_mutex_destroy(&vbasedev->migration->lock);
     vfio_migration_region_exit(vbasedev);
 
     if (vbasedev->migration) {
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 4bb43f18f315..bdf40ba368c7 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -151,3 +151,9 @@ vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_st
 vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
 vfio_save_setup(char *name) " (%s)"
 vfio_save_cleanup(char *name) " (%s)"
+vfio_save_buffer(char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
+vfio_update_pending(char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
+vfio_save_device_config_state(char *name) " (%s)"
+vfio_save_pending(char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
+vfio_save_iterate(char *name, int data_size) " (%s) data_size %d"
+vfio_save_complete_precopy(char *name) " (%s)"
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [Qemu-devel] [PATCH v8 10/13] vfio: Add load state functions to SaveVMHandlers
  2019-08-26 18:55 [Qemu-devel] [PATCH v8 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (8 preceding siblings ...)
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 09/13] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
@ 2019-08-26 18:55 ` Kirti Wankhede
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 11/13] vfio: Add function to get dirty page list Kirti Wankhede
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Kirti Wankhede @ 2019-08-26 18:55 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Sequence during _RESUMING device state:
While data for this device is available, repeat the steps below:
a. read data_offset, which tells where the user application should write data.
b. write data_size bytes of data to the migration region starting at
   data_offset.
c. write data_size, which indicates to the vendor driver that data has been
   written to the staging buffer.

For the user, the data is opaque. The user should write data in the same order
as it was received.
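
A rough user-side sketch of one such step (reusing the
vfio_device_migration_info layout shown in the sketch under patch 09; the
restore_one_chunk() name and its arguments are invented for the example):

#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Write one opaque chunk, read from the migration stream in the order it was
 * produced on the source, into the device's migration region. */
static int restore_one_chunk(int device_fd, off_t region_fd_offset,
                             const void *buf, uint64_t len)
{
    uint64_t data_offset = 0;
    off_t info = region_fd_offset;

    /* a. ask the vendor driver where in the region this chunk should go */
    if (pread(device_fd, &data_offset, sizeof(data_offset),
              info + offsetof(struct vfio_device_migration_info, data_offset))
            != sizeof(data_offset)) {
        return -1;
    }

    /* b. write len bytes of data at data_offset (a memcpy when the data
     *    section is mmap'ed) */
    if (pwrite(device_fd, buf, len, info + data_offset) != (ssize_t)len) {
        return -1;
    }

    /* c. writing data_size tells the vendor driver that the staging buffer is
     *    filled and may be consumed */
    if (pwrite(device_fd, &len, sizeof(len),
               info + offsetof(struct vfio_device_migration_info, data_size))
            != sizeof(len)) {
        return -1;
    }

    return 0;
}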

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c  | 170 +++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events |   3 +
 2 files changed, 173 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 3b81c1d6f5b3..765015fdc2dd 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -249,6 +249,33 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
     return qemu_file_get_error(f);
 }
 
+static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    uint64_t data;
+
+    if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
+        int ret;
+
+        ret = vbasedev->ops->vfio_load_config(vbasedev, f);
+        if (ret) {
+            error_report("%s: Failed to load device config space",
+                         vbasedev->name);
+            return ret;
+        }
+    }
+
+    data = qemu_get_be64(f);
+    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
+        error_report("%s: Failed loading device config space, "
+                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
+        return -EINVAL;
+    }
+
+    trace_vfio_load_device_config_state(vbasedev->name);
+    return qemu_file_get_error(f);
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -411,12 +438,155 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     return ret;
 }
 
+static int vfio_load_setup(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret = 0;
+
+    if (migration->region.mmaps) {
+        ret = vfio_region_mmap(&migration->region);
+        if (ret) {
+            error_report("%s: Failed to mmap VFIO migration region %d: %s",
+                         vbasedev->name, migration->region.nr,
+                         strerror(-ret));
+            return ret;
+        }
+    }
+
+    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING, 0);
+    if (ret) {
+        error_report("%s: Failed to set state RESUMING", vbasedev->name);
+    }
+    return ret;
+}
+
+static int vfio_load_cleanup(void *opaque)
+{
+    vfio_save_cleanup(opaque);
+    return 0;
+}
+
+static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret = 0;
+    uint64_t data, data_size;
+
+    data = qemu_get_be64(f);
+    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
+
+        trace_vfio_load_state(vbasedev->name, data);
+
+        switch (data) {
+        case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
+        {
+            ret = vfio_load_device_config_state(f, opaque);
+            if (ret) {
+                return ret;
+            }
+            break;
+        }
+        case VFIO_MIG_FLAG_DEV_SETUP_STATE:
+        {
+            data = qemu_get_be64(f);
+            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
+                return ret;
+            } else {
+                error_report("%s: SETUP STATE: EOS not found 0x%"PRIx64,
+                             vbasedev->name, data);
+                return -EINVAL;
+            }
+            break;
+        }
+        case VFIO_MIG_FLAG_DEV_DATA_STATE:
+        {
+            VFIORegion *region = &migration->region;
+            void *buf = NULL;
+            bool buffer_mmaped = false;
+            uint64_t data_offset = 0;
+
+            data_size = qemu_get_be64(f);
+            if (data_size == 0) {
+                break;
+            }
+
+            ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
+                        region->fd_offset +
+                        offsetof(struct vfio_device_migration_info,
+                        data_offset));
+            if (ret != sizeof(data_offset)) {
+                error_report("%s:Failed to get migration buffer data offset %d",
+                             vbasedev->name, ret);
+                return -EINVAL;
+            }
+
+            if (region->mmaps) {
+                buf = find_data_region(region, data_offset, data_size);
+            }
+
+            buffer_mmaped = (buf != NULL) ? true : false;
+
+            if (!buffer_mmaped) {
+                buf = g_try_malloc0(data_size);
+                if (!buf) {
+                    error_report("%s: Error allocating buffer ", __func__);
+                    return -ENOMEM;
+                }
+            }
+
+            qemu_get_buffer(f, buf, data_size);
+
+            if (!buffer_mmaped) {
+                ret = pwrite(vbasedev->fd, buf, data_size,
+                             region->fd_offset + data_offset);
+                g_free(buf);
+
+                if (ret != data_size) {
+                    error_report("%s: Failed to set migration buffer %d",
+                                 vbasedev->name, ret);
+                    return -EINVAL;
+                }
+            }
+
+            ret = pwrite(vbasedev->fd, &data_size, sizeof(data_size),
+                         region->fd_offset +
+                       offsetof(struct vfio_device_migration_info, data_size));
+            if (ret != sizeof(data_size)) {
+                error_report("%s: Failed to set migration buffer data size %d",
+                             vbasedev->name, ret);
+                if (!buffer_mmaped) {
+                    g_free(buf);
+                }
+                return -EINVAL;
+            }
+
+            trace_vfio_load_state_device_data(vbasedev->name, data_offset,
+                                              data_size);
+            break;
+        }
+        }
+
+        ret = qemu_file_get_error(f);
+        if (ret) {
+            return ret;
+        }
+        data = qemu_get_be64(f);
+    }
+
+    return ret;
+}
+
 static SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
     .save_live_pending = vfio_save_pending,
     .save_live_iterate = vfio_save_iterate,
     .save_live_complete_precopy = vfio_save_complete_precopy,
+    .load_setup = vfio_load_setup,
+    .load_cleanup = vfio_load_cleanup,
+    .load_state = vfio_load_state,
 };
 
 /* ---------------------------------------------------------------------- */
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index bdf40ba368c7..ac065b559f4e 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -157,3 +157,6 @@ vfio_save_device_config_state(char *name) " (%s)"
 vfio_save_pending(char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
 vfio_save_iterate(char *name, int data_size) " (%s) data_size %d"
 vfio_save_complete_precopy(char *name) " (%s)"
+vfio_load_device_config_state(char *name) " (%s)"
+vfio_load_state(char *name, uint64_t data) " (%s) data 0x%"PRIx64
+vfio_load_state_device_data(char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [Qemu-devel] [PATCH v8 11/13] vfio: Add function to get dirty page list
  2019-08-26 18:55 [Qemu-devel] [PATCH v8 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (9 preceding siblings ...)
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 10/13] vfio: Add load " Kirti Wankhede
@ 2019-08-26 18:55 ` Kirti Wankhede
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 12/13] vfio: Add vfio_listener_log_sync to mark dirty pages Kirti Wankhede
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Kirti Wankhede @ 2019-08-26 18:55 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Dirty page tracking (.log_sync) is part of the RAM copying state: the vendor
driver provides, through the migration region, the bitmap of pages that it has
dirtied, and as part of the RAM copy those pages get copied to the file stream.

To get the dirty page bitmap:
- write the start address, page_size and pfn count.
- read the count of pfns copied. The vendor driver should take one of the
  actions below:
    - return VFIO_DEVICE_DIRTY_PFNS_NONE if the driver has no page to report
      dirty in the given range or the rest of the range.
    - return VFIO_DEVICE_DIRTY_PFNS_ALL to mark all pages dirty for the given
      range or the rest of the range.
    - return copied_pfns and provide the bitmap for copied_pfns in the
      migration region.
- read data_offset, where the vendor driver has written the bitmap.
- read the bitmap from the migration region at data_offset.
- iterate through the above steps until the page bitmap for all requested pfns
  has been copied.
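
A user-side sketch of this loop follows, again reusing the
vfio_device_migration_info layout from the earlier sketch. mark_all_dirty() and
mark_dirty_lebitmap() are placeholders for QEMU's
cpu_physical_memory_set_dirty_range() and
cpu_physical_memory_set_dirty_lebitmap(); the VFIO_DEVICE_DIRTY_PFNS_* values
are those proposed in patch 1.

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define VFIO_DEVICE_DIRTY_PFNS_NONE  (0)
#define VFIO_DEVICE_DIRTY_PFNS_ALL   (~0ULL)

/* Placeholders for QEMU's dirty-bitmap helpers, supplied by the caller. */
void mark_all_dirty(uint64_t start_pfn, uint64_t pfn_count, uint64_t page_size);
void mark_dirty_lebitmap(const unsigned long *bitmap, uint64_t start_pfn,
                         uint64_t pfn_count, uint64_t page_size);

static int get_dirty_page_list(int device_fd, off_t region_fd_offset,
                               uint64_t start_pfn, uint64_t pfn_count,
                               uint64_t page_size)
{
    off_t info = region_fd_offset;
    uint64_t done = 0;

    while (done < pfn_count) {
        uint64_t start = start_pfn + done;
        uint64_t remaining = pfn_count - done;
        uint64_t copied = 0, data_offset = 0;

        /* tell the vendor driver which range should be reported */
        pwrite(device_fd, &start, sizeof(start),
               info + offsetof(struct vfio_device_migration_info, start_pfn));
        pwrite(device_fd, &page_size, sizeof(page_size),
               info + offsetof(struct vfio_device_migration_info, page_size));
        pwrite(device_fd, &remaining, sizeof(remaining),
               info + offsetof(struct vfio_device_migration_info, total_pfns));

        /* the driver answers with how many pfns worth of bitmap it staged */
        if (pread(device_fd, &copied, sizeof(copied),
                  info + offsetof(struct vfio_device_migration_info, copied_pfns))
                != sizeof(copied)) {
            return -1;
        }

        if (copied == VFIO_DEVICE_DIRTY_PFNS_NONE) {
            break;                    /* nothing dirty in the rest of the range */
        }
        if (copied == VFIO_DEVICE_DIRTY_PFNS_ALL) {
            mark_all_dirty(start, remaining, page_size);
            break;
        }

        /* the bitmap, one bit per pfn rounded up to longs (64-bit longs
         * assumed here), is staged at data_offset in the migration region */
        if (pread(device_fd, &data_offset, sizeof(data_offset),
                  info + offsetof(struct vfio_device_migration_info, data_offset))
                != sizeof(data_offset)) {
            return -1;
        }

        size_t bitmap_size = ((copied + 63) / 64) * sizeof(uint64_t);
        unsigned long *bitmap = calloc(1, bitmap_size);
        if (!bitmap ||
            pread(device_fd, bitmap, bitmap_size, info + data_offset)
                != (ssize_t)bitmap_size) {
            free(bitmap);
            return -1;
        }
        mark_dirty_lebitmap(bitmap, start, copied, page_size);
        free(bitmap);

        done += copied;
    }

    return 0;
}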

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c           | 123 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events          |   1 +
 include/hw/vfio/vfio-common.h |   2 +
 3 files changed, 126 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 765015fdc2dd..eff4b2a4a6e8 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -276,6 +276,129 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
     return qemu_file_get_error(f);
 }
 
+void vfio_get_dirty_page_list(VFIODevice *vbasedev,
+                              uint64_t start_pfn,
+                              uint64_t pfn_count,
+                              uint64_t page_size)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region;
+    uint64_t count = 0;
+    int64_t copied_pfns = 0;
+    int64_t total_pfns = pfn_count;
+    int ret;
+
+    qemu_mutex_lock(&migration->lock);
+
+    while (total_pfns > 0) {
+        uint64_t bitmap_size, data_offset = 0;
+        uint64_t start = start_pfn + count;
+        void *buf = NULL;
+        bool buffer_mmaped = false;
+
+        ret = pwrite(vbasedev->fd, &start, sizeof(start),
+                 region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                              start_pfn));
+        if (ret < 0) {
+            error_report("%s: Failed to set dirty pages start address %d %s",
+                         vbasedev->name, ret, strerror(errno));
+            goto dpl_unlock;
+        }
+
+        ret = pwrite(vbasedev->fd, &page_size, sizeof(page_size),
+                 region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                              page_size));
+        if (ret < 0) {
+            error_report("%s: Failed to set dirty page size %d %s",
+                         vbasedev->name, ret, strerror(errno));
+            goto dpl_unlock;
+        }
+
+        ret = pwrite(vbasedev->fd, &total_pfns, sizeof(total_pfns),
+                 region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                              total_pfns));
+        if (ret < 0) {
+            error_report("%s: Failed to set dirty page total pfns %d %s",
+                         vbasedev->name, ret, strerror(errno));
+            goto dpl_unlock;
+        }
+
+        /* Read copied dirty pfns */
+        ret = pread(vbasedev->fd, &copied_pfns, sizeof(copied_pfns),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             copied_pfns));
+        if (ret < 0) {
+            error_report("%s: Failed to get dirty pages bitmap count %d %s",
+                         vbasedev->name, ret, strerror(errno));
+            goto dpl_unlock;
+        }
+
+        if (copied_pfns == VFIO_DEVICE_DIRTY_PFNS_NONE) {
+            /*
+             * copied_pfns could be 0 if driver doesn't have any page to
+             * report dirty in given range
+             */
+            break;
+        } else if (copied_pfns == VFIO_DEVICE_DIRTY_PFNS_ALL) {
+            /* Mark all pages dirty for this range */
+            cpu_physical_memory_set_dirty_range(start * page_size,
+                                                total_pfns * page_size,
+                                                DIRTY_MEMORY_MIGRATION);
+            break;
+        }
+
+        bitmap_size = BITS_TO_LONGS(copied_pfns) * sizeof(unsigned long);
+
+        ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             data_offset));
+        if (ret != sizeof(data_offset)) {
+            error_report("%s: Failed to get migration buffer data offset %d",
+                         vbasedev->name, ret);
+            goto dpl_unlock;
+        }
+
+        if (region->mmaps) {
+            buf = find_data_region(region, data_offset, bitmap_size);
+        }
+
+        buffer_mmaped = (buf != NULL) ? true : false;
+
+        if (!buffer_mmaped) {
+            buf = g_try_malloc0(bitmap_size);
+            if (!buf) {
+                error_report("%s: Error allocating buffer ", __func__);
+                goto dpl_unlock;
+            }
+
+            ret = pread(vbasedev->fd, buf, bitmap_size,
+                        region->fd_offset + data_offset);
+            if (ret != bitmap_size) {
+                error_report("%s: Failed to get dirty pages bitmap %d",
+                             vbasedev->name, ret);
+                g_free(buf);
+                goto dpl_unlock;
+            }
+        }
+
+        cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
+                                               start * page_size,
+                                               copied_pfns);
+        count      += copied_pfns;
+        total_pfns -= copied_pfns;
+
+        if (!buffer_mmaped) {
+            g_free(buf);
+        }
+    }
+
+    trace_vfio_get_dirty_page_list(vbasedev->name, start_pfn, pfn_count,
+                                   page_size);
+
+dpl_unlock:
+    qemu_mutex_unlock(&migration->lock);
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index ac065b559f4e..414a5e69ec5e 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -160,3 +160,4 @@ vfio_save_complete_precopy(char *name) " (%s)"
 vfio_load_device_config_state(char *name) " (%s)"
 vfio_load_state(char *name, uint64_t data) " (%s) data 0x%"PRIx64
 vfio_load_state_device_data(char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
+vfio_get_dirty_page_list(char *name, uint64_t start, uint64_t pfn_count, uint64_t page_size) " (%s) start 0x%"PRIx64" pfn_count 0x%"PRIx64 " page size 0x%"PRIx64
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index dcab8a4ae0f9..41ff5ebba27d 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -219,5 +219,7 @@ int vfio_spapr_remove_window(VFIOContainer *container,
 
 int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
 void vfio_migration_finalize(VFIODevice *vbasedev);
+void vfio_get_dirty_page_list(VFIODevice *vbasedev, uint64_t start_pfn,
+                               uint64_t pfn_count, uint64_t page_size);
 
 #endif /* HW_VFIO_VFIO_COMMON_H */
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [Qemu-devel] [PATCH v8 12/13] vfio: Add vfio_listener_log_sync to mark dirty pages
  2019-08-26 18:55 [Qemu-devel] [PATCH v8 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (10 preceding siblings ...)
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 11/13] vfio: Add function to get dirty page list Kirti Wankhede
@ 2019-08-26 18:55 ` Kirti Wankhede
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 13/13] vfio: Make vfio-pci device migration capable Kirti Wankhede
  2019-08-26 19:43 ` [Qemu-devel] [PATCH v8 00/13] Add migration support for VFIO device no-reply
  13 siblings, 0 replies; 34+ messages in thread
From: Kirti Wankhede @ 2019-08-26 18:55 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

vfio_listener_log_sync gets the list of dirty pages from the vendor driver and
marks those pages dirty when in the _SAVING state.
It returns early for the RAM block section of a mapped MMIO region.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/common.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index c33c6684c06f..23f3d3c7c46a 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -38,6 +38,7 @@
 #include "sysemu/reset.h"
 #include "trace.h"
 #include "qapi/error.h"
+#include "migration/migration.h"
 
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -796,9 +797,43 @@ static void vfio_listener_region_del(MemoryListener *listener,
     }
 }
 
+static void vfio_listerner_log_sync(MemoryListener *listener,
+        MemoryRegionSection *section)
+{
+    uint64_t start_addr, size, pfn_count;
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+
+    if (memory_region_is_ram_device(section->mr)) {
+        return;
+    }
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            if (vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) {
+                continue;
+            } else {
+                return;
+            }
+        }
+    }
+
+    start_addr = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+    size = int128_get64(section->size);
+    pfn_count = size >> TARGET_PAGE_BITS;
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            vfio_get_dirty_page_list(vbasedev, start_addr >> TARGET_PAGE_BITS,
+                                     pfn_count, TARGET_PAGE_SIZE);
+        }
+    }
+}
+
 static const MemoryListener vfio_memory_listener = {
     .region_add = vfio_listener_region_add,
     .region_del = vfio_listener_region_del,
+    .log_sync = vfio_listerner_log_sync,
 };
 
 static void vfio_listener_release(VFIOContainer *container)
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [Qemu-devel] [PATCH v8 13/13] vfio: Make vfio-pci device migration capable.
  2019-08-26 18:55 [Qemu-devel] [PATCH v8 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (11 preceding siblings ...)
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 12/13] vfio: Add vfio_listener_log_sync to mark dirty pages Kirti Wankhede
@ 2019-08-26 18:55 ` Kirti Wankhede
  2019-08-26 19:43 ` [Qemu-devel] [PATCH v8 00/13] Add migration support for VFIO device no-reply
  13 siblings, 0 replies; 34+ messages in thread
From: Kirti Wankhede @ 2019-08-26 18:55 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Call the vfio_migration_probe() and vfio_migration_finalize() functions for the
vfio-pci device to enable migration for VFIO PCI devices.
Removed the vfio_pci_vmstate structure.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/pci.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 161068286592..514cf1b0ce16 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2911,6 +2911,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     vdev->vbasedev.ops = &vfio_pci_ops;
     vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
     vdev->vbasedev.dev = DEVICE(vdev);
+    vdev->vbasedev.device_state = 0;
 
     tmp = g_strdup_printf("%s/iommu_group", vdev->vbasedev.sysfsdev);
     len = readlink(tmp, group_path, sizeof(group_path));
@@ -3171,6 +3172,12 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         }
     }
 
+    ret = vfio_migration_probe(&vdev->vbasedev, errp);
+    if (ret) {
+            error_report("%s: Failed to setup for migration",
+                         vdev->vbasedev.name);
+    }
+
     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);
@@ -3190,6 +3197,7 @@ static void vfio_instance_finalize(Object *obj)
     VFIOPCIDevice *vdev = PCI_VFIO(obj);
     VFIOGroup *group = vdev->vbasedev.group;
 
+    vdev->vbasedev.device_state = 0;
     vfio_display_finalize(vdev);
     vfio_bars_finalize(vdev);
     g_free(vdev->emulated_config_bits);
@@ -3218,6 +3226,7 @@ static void vfio_exitfn(PCIDevice *pdev)
     }
     vfio_teardown_msi(vdev);
     vfio_bars_exit(vdev);
+    vfio_migration_finalize(&vdev->vbasedev);
 }
 
 static void vfio_pci_reset(DeviceState *dev)
@@ -3326,11 +3335,6 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
-static const VMStateDescription vfio_pci_vmstate = {
-    .name = "vfio-pci",
-    .unmigratable = 1,
-};
-
 static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
@@ -3338,7 +3342,6 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 
     dc->reset = vfio_pci_reset;
     dc->props = vfio_pci_dev_properties;
-    dc->vmsd = &vfio_pci_vmstate;
     dc->desc = "VFIO-based PCI device assignment";
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
     pdc->realize = vfio_realize;
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v8 00/13] Add migration support for VFIO device
  2019-08-26 18:55 [Qemu-devel] [PATCH v8 00/13] Add migration support for VFIO device Kirti Wankhede
                   ` (12 preceding siblings ...)
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 13/13] vfio: Make vfio-pci device migration capable Kirti Wankhede
@ 2019-08-26 19:43 ` no-reply
  13 siblings, 0 replies; 34+ messages in thread
From: no-reply @ 2019-08-26 19:43 UTC (permalink / raw)
  To: kwankhede
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, kwankhede,
	eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk, pasic, felipe,
	Ken.Xue, kevin.tian, yan.y.zhao, dgilbert, alex.williamson,
	changpeng.liu, cohuck, zhi.a.wang, jonathan.davies

Patchew URL: https://patchew.org/QEMU/1566845753-18993-1-git-send-email-kwankhede@nvidia.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 1566845753-18993-1-git-send-email-kwankhede@nvidia.com
Subject: [Qemu-devel] [PATCH v8 00/13] Add migration support for VFIO device

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
0aeba38 vfio: Make vfio-pci device migration capable.
acc0d2b vfio: Add vfio_listener_log_sync to mark dirty pages
cb11cd6 vfio: Add function to get dirty page list
6d46042 vfio: Add load state functions to SaveVMHandlers
1f88428 vfio: Add save state functions to SaveVMHandlers
d0fbf18 vfio: Register SaveVMHandlers for VFIO device
04097e1 vfio: Add migration state change notifier
c3b9857 vfio: Add VM state change handler to know state of VM
a712a3a vfio: Add migration region initialization and finalize function
78b6920 vfio: Add save and load functions for VFIO PCI devices
032d272 vfio: Add vfio_get_object callback to VFIODeviceOps
95817ed vfio: Add function to unmap VFIO region
eaf5be5 vfio: KABI for migration interface

=== OUTPUT BEGIN ===
1/13 Checking commit eaf5be5b94f3 (vfio: KABI for migration interface)
2/13 Checking commit 95817edc42f9 (vfio: Add function to unmap VFIO region)
3/13 Checking commit 032d272ca311 (vfio: Add vfio_get_object callback to VFIODeviceOps)
4/13 Checking commit 78b692082884 (vfio: Add save and load functions for VFIO PCI devices)
5/13 Checking commit a712a3a74713 (vfio: Add migration region initialization and finalize function)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#29: 
new file mode 100644

ERROR: g_free(NULL) is safe this check is probably not required
#171: FILE: hw/vfio/migration.c:138:
+    if (vbasedev->migration) {
+        g_free(vbasedev->migration);

total: 1 errors, 1 warnings, 178 lines checked

Patch 5/13 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

6/13 Checking commit c3b98575e39b (vfio: Add VM state change handler to know state of VM)
7/13 Checking commit 04097e167c8b (vfio: Add migration state change notifier)
8/13 Checking commit d0fbf181b9db (vfio: Register SaveVMHandlers for VFIO device)
9/13 Checking commit 1f88428a8340 (vfio: Add save state functions to SaveVMHandlers)
10/13 Checking commit 6d46042143b9 (vfio: Add load state functions to SaveVMHandlers)
11/13 Checking commit cb11cd6229f8 (vfio: Add function to get dirty page list)
12/13 Checking commit acc0d2baac7d (vfio: Add vfio_listener_log_sync to mark dirty pages)
13/13 Checking commit 0aeba384447b (vfio: Make vfio-pci device migration capable.)
=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/1566845753-18993-1-git-send-email-kwankhede@nvidia.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v8 01/13] vfio: KABI for migration interface
  2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 01/13] vfio: KABI for migration interface Kirti Wankhede
@ 2019-08-28 20:50   ` Alex Williamson
  2019-08-30  7:25     ` Tian, Kevin
       [not found]     ` <AADFC41AFE54684AB9EE6CBC0274A5D19D553133@SHSMSX104.ccr.corp.intel.com>
  0 siblings, 2 replies; 34+ messages in thread
From: Alex Williamson @ 2019-08-28 20:50 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Tue, 27 Aug 2019 00:25:41 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> - Defined MIGRATION region type and sub-type.
> - Used 3 bits to define VFIO device states.
>     Bit 0 => _RUNNING
>     Bit 1 => _SAVING
>     Bit 2 => _RESUMING
>     Combination of these bits defines VFIO device's state during migration
>     _STOPPED => All bits 0 indicates VFIO device stopped.
>     _RUNNING => Normal VFIO device running state.
>     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but start
>                           saving state of device i.e. pre-copy state
>     _SAVING  => vCPUs are stoppped, VFIO device should be stopped, and
>                           save device state,i.e. stop-n-copy state
>     _RESUMING => VFIO device resuming state.
>     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING bits are set
>     Bits 3 - 31 are reserved for future use. User should perform
>     read-modify-write operation on this field.
> - Defined vfio_device_migration_info structure which will be placed at 0th
>   offset of migration region to get/set VFIO device related information.
>   Defined members of structure and usage on read/write access:
>     * device_state: (read/write)
>         To convey VFIO device state to be transitioned to. Only 3 bits are used
>         as of now, Bits 3 - 31 are reserved for future use.
>     * pending bytes: (read only)
>         To get pending bytes yet to be migrated for VFIO device.
>     * data_offset: (read only)
>         To get data offset in migration region from where data exist during
>         _SAVING, from where data should be written by user space application
>         during _RESUMING state and while read dirty pages bitmap.
>     * data_size: (read/write)
>         To get and set size of data copied in migration region during _SAVING
>         and _RESUMING state.
>     * start_pfn, page_size, total_pfns: (write only)
>         To get bitmap of dirty pages from vendor driver from given
>         start address for total_pfns.
>     * copied_pfns: (read only)
>         To get number of pfns bitmap copied in migration region.
>         Vendor driver should copy the bitmap with bits set only for
>         pages to be marked dirty in migration region. Vendor driver
>         should return VFIO_DEVICE_DIRTY_PFNS_NONE if there are 0 pages dirty in
>         requested range. Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL
>         to mark all pages in the section as dirty.
> 
> Migration region looks like:
>  ------------------------------------------------------------------
> |vfio_device_migration_info|    data section                      |
> |                          |     ///////////////////////////////  |
>  ------------------------------------------------------------------
>  ^                              ^                              ^
>  offset 0-trapped part        data_offset                 data_size
> 
> Data section is always followed by vfio_device_migration_info
> structure in the region, so data_offset will always be non-0.
> Offset from where data is copied is decided by kernel driver, data
> section can be trapped or mapped depending on how kernel driver
> defines data section. If mmapped, then data_offset should be page
> aligned, where as initial section which contain vfio_device_migration_info
> structure might not end at offset which is page aligned.
> Data_offset can be same or different for device data and dirty pages bitmap.
> Vendor driver should decide whether to partition data section and how to
> partition the data section. Vendor driver should return data_offset
> accordingly.
> 
> For user application, data is opaque. User should write data in the same
> order as received.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  linux-headers/linux/vfio.h | 148 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 148 insertions(+)
> 
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 24f505199f83..4bc0236b0898 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -372,6 +372,154 @@ struct vfio_region_gfx_edid {
>   */
>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
>  
> +/* Migration region type and sub-type */
> +#define VFIO_REGION_TYPE_MIGRATION	        (3)
> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> +
> +/**
> + * Structure vfio_device_migration_info is placed at 0th offset of
> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> + * information. Field accesses from this structure are only supported at their
> + * native width and alignment, otherwise the result is undefined and vendor
> + * drivers should return an error.
> + *
> + * device_state: (read/write)
> + *      To indicate vendor driver the state VFIO device should be transitioned
> + *      to. If device state transition fails, write on this field return error.
> + *      It consists of 3 bits:
> + *      - If bit 0 set, indicates _RUNNING state. When its reset, that indicates
> + *        _STOPPED state. When device is changed to _STOPPED, driver should stop
> + *        device before write() returns.
> + *      - If bit 1 set, indicates _SAVING state.
> + *      - If bit 2 set, indicates _RESUMING state.
> + *      Bits 3 - 31 are reserved for future use. User should perform
> + *      read-modify-write operation on this field.
> + *      _SAVING and _RESUMING bits set at the same time is invalid state.
> + *
> + * pending bytes: (read only)
> + *      Number of pending bytes yet to be migrated from vendor driver
> + *
> + * data_offset: (read only)
> + *      User application should read data_offset in migration region from where
> + *      user application should read device data during _SAVING state or write
> + *      device data during _RESUMING state or read dirty pages bitmap. See below
> + *      for detail of sequence to be followed.
> + *
> + * data_size: (read/write)
> + *      User application should read data_size to get size of data copied in
> + *      migration region during _SAVING state and write size of data copied in
> + *      migration region during _RESUMING state.
> + *
> + * start_pfn: (write only)
> + *      Start address pfn to get bitmap of dirty pages from vendor driver duing
> + *      _SAVING state.
> + *
> + * page_size: (write only)
> + *      User application should write the page_size of pfn.
> + *
> + * total_pfns: (write only)
> + *      Total pfn count from start_pfn for which dirty bitmap is requested.
> + *
> + * copied_pfns: (read only)
> + *      pfn count for which dirty bitmap is copied to migration region.
> + *      Vendor driver should copy the bitmap with bits set only for pages to be
> + *      marked dirty in migration region.
> + *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if none of the
> + *        pages are dirty in requested range or rest of the range.
> + *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to mark all
> + *        pages dirty in the given range or rest of the range.
> + *      - Vendor driver should return pfn count for which bitmap is written in
> + *        the region.
> + *
> + * Migration region looks like:
> + *  ------------------------------------------------------------------
> + * |vfio_device_migration_info|    data section                      |
> + * |                          |     ///////////////////////////////  |
> + * ------------------------------------------------------------------
> + *   ^                              ^                             ^
> + *  offset 0-trapped part        data_offset                 data_size
> + *
> + * Data section is always followed by vfio_device_migration_info structure
> + * in the region, so data_offset will always be non-0. Offset from where data
> + * is copied is decided by kernel driver, data section can be trapped or
> + * mapped or partitioned, depending on how kernel driver defines data section.
> + * Data section partition can be defined as mapped by sparse mmap capability.
> + * If mmapped, then data_offset should be page aligned, where as initial section
> + * which contain vfio_device_migration_info structure might not end at offset
> + * which is page aligned.
> + * Data_offset can be same or different for device data and dirty pages bitmap.
> + * Vendor driver should decide whether to partition data section and how to
> + * partition the data section. Vendor driver should return data_offset
> + * accordingly.
> + *
> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> + * and for _SAVING device state or stop-and-copy phase:
> + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
> + * b. read data_offset, indicates kernel driver to write data to staging buffer.
> + * c. read data_size, amount of data in bytes written by vendor driver in
> + *    migration region.
> + * d. read data_size bytes of data from data_offset in the migration region.
> + * e. process data.
> + * f. Loop through a to e.

Something needs to be said here about the availability of the data, for
example, what indicates to the vendor driver that the above operation is
complete?  Is the data immutable?

> + *
> + * To copy system memory content during migration, vendor driver should be able
> + * to report system memory pages which are dirtied by that driver. For such
> + * dirty page reporting, user application should query for a range of GFNs
> + * relative to device address space (IOVA), then vendor driver should provide
> + * the bitmap of pages from this range which are dirtied by him through
> + * migration region where each bit represents a page and bit set to 1 represents
> + * that the page is dirty.
> + * User space application should take care of copying content of system memory
> + * for those pages.

Can we say that device state and dirty pfn operations on the data
area may be intermixed in any order the user chooses?

Should we specify that bits accumulate since EITHER a) _SAVING state is
enabled or b) a given pfn was last reported via the below sequence (ie.
dirty bits are cleared once reported)?

How does QEMU handle the fact that IOVAs are potentially dynamic while
performing the live portion of a migration?  For example, each time a
guest driver calls dma_map_page() or dma_unmap_page(), a
MemoryRegionSection pops in or out of the AddressSpace for the device
(I'm assuming a vIOMMU where the device AddressSpace is not
system_memory).  I don't see any QEMU code that intercepts that change
in the AddressSpace such that the IOVA dirty pfns could be recorded and
translated to GFNs.  The vendor driver can't track these beyond getting
an unmap notification since it only knows the IOVA pfns, which can be
re-used with different GFN backing.  Once the DMA mapping is torn down,
it seems those dirty pfns are lost in the ether.  If this works in QEMU,
please help me find the code that handles it.

> + *
> + * Steps to get dirty page bitmap:
> + * a. write start_pfn, page_size and total_pfns.

This is not well specified.  Is it intended that the user write all
three of these on every iteration, or could they write start_pfn=0,
page_size=4K, total_pfns=1, complete the steps below, then write
start_pfn=1 and immediately begin the next iteration?  They've written
all three, though not all on the current iteration, does that count?
Furthermore, could the user simply re-read copied_pfns to determine if
anything in the previously set up range has been re-dirtied?

IOW, are these three "registers" sticky or do the below operations
invalidate them?  If they're invalidated, then there needs to be a
mechanism to generate an error, such as below.

> + * b. read copied_pfns. Vendor driver should take one of the below action:
> + *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if driver
> + *       doesn't have any page to report dirty in given range or rest of the
> + *       range. Exit the loop.
> + *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to mark all
> + *       pages dirty for given range or rest of the range. User space
> + *       application mark all pages in the range as dirty and exit the loop.
> + *     - Vendor driver should return copied_pfns and provide bitmap for
> + *       copied_pfn in migration region.

Read returns errno if the pre-requisite registers are not valid?

> + * c. read data_offset, where vendor driver has written bitmap.
> + * d. read bitmap from the migration region from data_offset.
> + * e. Iterate through steps a to d while (total copied_pfns < total_pfns)

It seems like the intent here is that the user effectively does:

start_pfn += copied_pfns
total_pfns -= copied_pfns
page_size = page_size?

But are they under any obligation to do so?

Also same question above regarding data availability/life-cycle.  Is
the vendor driver responsible for making the data available
indefinitely?  Seems it's only released at the next iteration, or
re-use of the data area for another operation, or clearing of the
_SAVING state bit.

> + * Sequence to be followed while _RESUMING device state:
> + * While data for this device is available, repeat below steps:
> + * a. read data_offset from where user application should write data.
> + * b. write data of data_size to migration region from data_offset.
> + * c. write data_size which indicates vendor driver that data is written in
> + *    staging buffer.
> + *
> + * For user application, data is opaque. User should write data in the same
> + * order as received.

Additionally, implicit synchronization between _SAVING and _RESUMING
ends within the vendor driver is assumed.

Are there any assumptions we haven't covered with respect to mmaps?
For instance, can the user setup mmaps at any time or only during
certain device states?  Are there recommended best practices for users
to only setup mmaps during _SAVING or _RESUMING?  If we had a revoke
mechanism, it might be nice to use it when either of these bits are
cleared.  Thanks,

Alex

> + */
> +
> +struct vfio_device_migration_info {
> +        __u32 device_state;         /* VFIO device state */
> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> +                                     VFIO_DEVICE_STATE_SAVING | \
> +                                     VFIO_DEVICE_STATE_RESUMING)
> +#define VFIO_DEVICE_STATE_INVALID   (VFIO_DEVICE_STATE_SAVING | \
> +                                     VFIO_DEVICE_STATE_RESUMING)
> +        __u32 reserved;
> +        __u64 pending_bytes;
> +        __u64 data_offset;
> +        __u64 data_size;
> +        __u64 start_pfn;
> +        __u64 page_size;
> +        __u64 total_pfns;
> +        __u64 copied_pfns;
> +#define VFIO_DEVICE_DIRTY_PFNS_NONE     (0)
> +#define VFIO_DEVICE_DIRTY_PFNS_ALL      (~0ULL)
> +} __attribute__((packed));
> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v8 01/13] vfio: KABI for migration interface
  2019-08-28 20:50   ` Alex Williamson
@ 2019-08-30  7:25     ` Tian, Kevin
  2019-08-30 16:15       ` Alex Williamson
       [not found]     ` <AADFC41AFE54684AB9EE6CBC0274A5D19D553133@SHSMSX104.ccr.corp.intel.com>
  1 sibling, 1 reply; 34+ messages in thread
From: Tian, Kevin @ 2019-08-30  7:25 UTC (permalink / raw)
  To: Alex Williamson, Kirti Wankhede
  Cc: Zhengxiao.zx, qemu-devel, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	cohuck, shuangtai.tst, dgilbert, Wang,  Zhi A, mlevitsk, pasic,
	aik, eauger, felipe, jonathan.davies, Zhao, Yan Y, Liu,
	Changpeng, Ken.Xue

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Thursday, August 29, 2019 4:51 AM
> 
> On Tue, 27 Aug 2019 00:25:41 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > - Defined MIGRATION region type and sub-type.
> > - Used 3 bits to define VFIO device states.
> >     Bit 0 => _RUNNING
> >     Bit 1 => _SAVING
> >     Bit 2 => _RESUMING
> >     Combination of these bits defines VFIO device's state during migration
> >     _STOPPED => All bits 0 indicates VFIO device stopped.
> >     _RUNNING => Normal VFIO device running state.
> >     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
> start
> >                           saving state of device i.e. pre-copy state
> >     _SAVING  => vCPUs are stoppped, VFIO device should be stopped, and
> >                           save device state,i.e. stop-n-copy state
> >     _RESUMING => VFIO device resuming state.
> >     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING bits
> are set
> >     Bits 3 - 31 are reserved for future use. User should perform
> >     read-modify-write operation on this field.
> > - Defined vfio_device_migration_info structure which will be placed at 0th
> >   offset of migration region to get/set VFIO device related information.
> >   Defined members of structure and usage on read/write access:
> >     * device_state: (read/write)
> >         To convey VFIO device state to be transitioned to. Only 3 bits are used
> >         as of now, Bits 3 - 31 are reserved for future use.
> >     * pending bytes: (read only)
> >         To get pending bytes yet to be migrated for VFIO device.
> >     * data_offset: (read only)
> >         To get data offset in migration region from where data exist during
> >         _SAVING, from where data should be written by user space application
> >         during _RESUMING state and while read dirty pages bitmap.
> >     * data_size: (read/write)
> >         To get and set size of data copied in migration region during _SAVING
> >         and _RESUMING state.
> >     * start_pfn, page_size, total_pfns: (write only)
> >         To get bitmap of dirty pages from vendor driver from given
> >         start address for total_pfns.
> >     * copied_pfns: (read only)
> >         To get number of pfns bitmap copied in migration region.
> >         Vendor driver should copy the bitmap with bits set only for
> >         pages to be marked dirty in migration region. Vendor driver
> >         should return VFIO_DEVICE_DIRTY_PFNS_NONE if there are 0 pages
> dirty in
> >         requested range. Vendor driver should return
> VFIO_DEVICE_DIRTY_PFNS_ALL
> >         to mark all pages in the section as dirty.
> >
> > Migration region looks like:
> >  ------------------------------------------------------------------
> > |vfio_device_migration_info|    data section                      |
> > |                          |     ///////////////////////////////  |
> >  ------------------------------------------------------------------
> >  ^                              ^                              ^
> >  offset 0-trapped part        data_offset                 data_size
> >
> > Data section is always followed by vfio_device_migration_info
> > structure in the region, so data_offset will always be non-0.
> > Offset from where data is copied is decided by kernel driver, data
> > section can be trapped or mapped depending on how kernel driver
> > defines data section. If mmapped, then data_offset should be page
> > aligned, where as initial section which contain vfio_device_migration_info
> > structure might not end at offset which is page aligned.
> > Data_offset can be same or different for device data and dirty pages bitmap.
> > Vendor driver should decide whether to partition data section and how to
> > partition the data section. Vendor driver should return data_offset
> > accordingly.
> >
> > For user application, data is opaque. User should write data in the same
> > order as received.
> >
> > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > ---
> >  linux-headers/linux/vfio.h | 148
> +++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 148 insertions(+)
> >
> > diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> > index 24f505199f83..4bc0236b0898 100644
> > --- a/linux-headers/linux/vfio.h
> > +++ b/linux-headers/linux/vfio.h
> > @@ -372,6 +372,154 @@ struct vfio_region_gfx_edid {
> >   */
> >  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
> >
> > +/* Migration region type and sub-type */
> > +#define VFIO_REGION_TYPE_MIGRATION	        (3)
> > +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> > +
> > +/**
> > + * Structure vfio_device_migration_info is placed at 0th offset of
> > + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device
> related migration
> > + * information. Field accesses from this structure are only supported at
> their
> > + * native width and alignment, otherwise the result is undefined and
> vendor
> > + * drivers should return an error.
> > + *
> > + * device_state: (read/write)
> > + *      To indicate vendor driver the state VFIO device should be
> transitioned
> > + *      to. If device state transition fails, write on this field return error.
> > + *      It consists of 3 bits:
> > + *      - If bit 0 set, indicates _RUNNING state. When its reset, that indicates
> > + *        _STOPPED state. When device is changed to _STOPPED, driver should
> stop
> > + *        device before write() returns.
> > + *      - If bit 1 set, indicates _SAVING state.
> > + *      - If bit 2 set, indicates _RESUMING state.

please add a few words to explain _SAVING and _RESUMING, similar to 
what you did for _RUNNING.

> > + *      Bits 3 - 31 are reserved for future use. User should perform
> > + *      read-modify-write operation on this field.
> > + *      _SAVING and _RESUMING bits set at the same time is invalid state.

what about _RUNNING | _RESUMING? Is it allowed?

> > + *
> > + * pending bytes: (read only)
> > + *      Number of pending bytes yet to be migrated from vendor driver
> > + *
> > + * data_offset: (read only)
> > + *      User application should read data_offset in migration region from
> where
> > + *      user application should read device data during _SAVING state or
> write
> > + *      device data during _RESUMING state or read dirty pages bitmap. See
> below
> > + *      for detail of sequence to be followed.
> > + *
> > + * data_size: (read/write)
> > + *      User application should read data_size to get size of data copied in
> > + *      migration region during _SAVING state and write size of data copied
> in
> > + *      migration region during _RESUMING state.
> > + *
> > + * start_pfn: (write only)
> > + *      Start address pfn to get bitmap of dirty pages from vendor driver
> duing
> > + *      _SAVING state.
> > + *
> > + * page_size: (write only)
> > + *      User application should write the page_size of pfn.
> > + *
> > + * total_pfns: (write only)
> > + *      Total pfn count from start_pfn for which dirty bitmap is requested.
> > + *
> > + * copied_pfns: (read only)

'copied' gives the impression that the page content is copied. What about
'dirty_pfns'? Btw, can this field be merged with total_pfns?

> > + *      pfn count for which dirty bitmap is copied to migration region.
> > + *      Vendor driver should copy the bitmap with bits set only for pages to
> be
> > + *      marked dirty in migration region.
> > + *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if
> none of the
> > + *        pages are dirty in requested range or rest of the range.
> > + *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to mark
> all
> > + *        pages dirty in the given range or rest of the range.
> > + *      - Vendor driver should return pfn count for which bitmap is written in
> > + *        the region.
> > + *
> > + * Migration region looks like:
> > + *  ------------------------------------------------------------------
> > + * |vfio_device_migration_info|    data section                      |
> > + * |                          |     ///////////////////////////////  |
> > + * ------------------------------------------------------------------
> > + *   ^                              ^                             ^
> > + *  offset 0-trapped part        data_offset                 data_size

'data_size' -> "data_offset + data_size"

> > + *
> > + * Data section is always followed by vfio_device_migration_info structure

Data section is always following ..., or ... structure is always followed by
data section.

> > + * in the region, so data_offset will always be non-0. Offset from where
> data
> > + * is copied is decided by kernel driver, data section can be trapped or
> > + * mapped or partitioned, depending on how kernel driver defines data
> section.
> > + * Data section partition can be defined as mapped by sparse mmap
> capability.
> > + * If mmapped, then data_offset should be page aligned, where as initial
> section
> > + * which contain vfio_device_migration_info structure might not end at
> offset
> > + * which is page aligned.
> > + * Data_offset can be same or different for device data and dirty pages
> bitmap.
> > + * Vendor driver should decide whether to partition data section and how
> to
> > + * partition the data section. Vendor driver should return data_offset
> > + * accordingly.

A high-level summary is missing here about how to differentiate reading
device state from reading the dirty page bitmap, when both use the same
interface (data_offset) to convey information to user space.

From the sequence examples below, it looks like reading device state is
initiated by reading pending_bytes, while reading the dirty bitmap is
initiated by writing start_pfn. If the data region is shared between the
two operations, they have to be mutually exclusive, i.e. one must wait for
the other to complete. Even when the region is partitioned, data_offset
itself could be raced if pending_bytes and start_pfn are accessed at the
same time. How do we expect the application to cope with that? Isn't such
a design too limiting?

Since you invent different sets of fields for the two operations anyway,
why not force the partitioned flavor and introduce a separate data_offset
field for each? That way the application is free to intermix device state
and dirty page collection however it needs, as illustrated below.
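
Something along these lines, purely to illustrate the idea (the split and
the dirty_bitmap_offset name are made up, not part of the patch):

    struct vfio_device_migration_info {
            __u32 device_state;
            __u32 reserved;
            /* device state channel */
            __u64 pending_bytes;
            __u64 data_offset;           /* device data only */
            __u64 data_size;
            /* dirty bitmap channel */
            __u64 dirty_bitmap_offset;   /* bitmap only; made-up name */
            __u64 start_pfn;
            __u64 page_size;
            __u64 total_pfns;
            __u64 copied_pfns;
    };

With separate offsets the two flows never race on data_offset, even if
they run concurrently.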

> > + *
> > + * Sequence to be followed for _SAVING|_RUNNING device state or pre-
> copy phase
> > + * and for _SAVING device state or stop-and-copy phase:
> > + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
> > + * b. read data_offset, indicates kernel driver to write data to staging
> buffer.
> > + * c. read data_size, amount of data in bytes written by vendor driver in
> > + *    migration region.
> > + * d. read data_size bytes of data from data_offset in the migration region.
> > + * e. process data.
> > + * f. Loop through a to e.
> 
> Something needs to be said here about the availability of the data, for
> example, what indicates to the vendor driver that the above operation is
> complete?  Is the data immutable?

I guess the vendor driver just continues to track pending_bytes for the
dirtied device state until exiting the _SAVING state. Data is copied only
when pending_bytes is read by userspace. Copied data is immutable if the
data region is mmapped. But yes, this part needs clarification.
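
To make the lifecycle question concrete, my understanding of the read side
in steps a-f is roughly the sketch below; mig_read64()/region_read() are
hypothetical wrappers around pread() on the migration region, 'buf' is a
local staging buffer, and process() stands for pushing the opaque blob into
the migration stream:

    /* One pre-copy/stop-and-copy iteration, steps a-f; not from the patch. */
    for (;;) {
        uint64_t pending = mig_read64(pending_bytes);    /* a. */

        if (!pending)
            break;

        uint64_t off = mig_read64(data_offset);    /* b. driver stages data */
        uint64_t len = mig_read64(data_size);      /* c. */

        region_read(buf, len, off);                /* d. */
        process(buf, len);                         /* e. */
        /* f. loop; if data is only staged when pending_bytes is read, then
         * the next read of pending_bytes is also what tells the vendor
         * driver that the previous chunk has been consumed. */
    }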

> 
> > + *
> > + * To copy system memory content during migration, vendor driver should
> be able
> > + * to report system memory pages which are dirtied by that driver. For such
> > + * dirty page reporting, user application should query for a range of GFNs
> > + * relative to device address space (IOVA), then vendor driver should
> provide
> > + * the bitmap of pages from this range which are dirtied by him through
> > + * migration region where each bit represents a page and bit set to 1
> represents
> > + * that the page is dirty.
> > + * User space application should take care of copying content of system
> memory
> > + * for those pages.
> 
> Can we say that device state and dirty pfn operations on the data
> area may be intermixed in any order the user chooses?

this part is very opaque in the previous description. Now I'm inclined to
vote for a no-intermix design. There is really no good reason to force a
dependency between retrieving device state and retrieving the dirty page
bitmap.

> 
> Should we specify that bits accumulate since EITHER a) _SAVING state is
> enabled or b) a given pfn was last reported via the below sequence (ie.
> dirty bits are cleared once reported)?
> 
> How does QEMU handle the fact that IOVAs are potentially dynamic while
> performing the live portion of a migration?  For example, each time a
> guest driver calls dma_map_page() or dma_unmap_page(), a
> MemoryRegionSection pops in or out of the AddressSpace for the device
> (I'm assuming a vIOMMU where the device AddressSpace is not
> system_memory).  I don't see any QEMU code that intercepts that change
> in the AddressSpace such that the IOVA dirty pfns could be recorded and
> translated to GFNs.  The vendor driver can't track these beyond getting
> an unmap notification since it only knows the IOVA pfns, which can be
> re-used with different GFN backing.  Once the DMA mapping is torn down,
> it seems those dirty pfns are lost in the ether.  If this works in QEMU,
> please help me find the code that handles it.

I'm curious about this part too. Interestingly, I didn't find any log_sync
callback registered by emulated devices in Qemu. It looks like dirty pages
from emulated DMAs are recorded in some implicit way. But KVM always
reports dirty pages in GFN instead of IOVA, regardless of the presence of
a vIOMMU. If Qemu also tracks dirty pages in GFN for emulated DMAs
(translation can be done when the DMA happens), then we don't need to
worry about the transient mapping from IOVA to GFN. Along the same lines
we would also want a GFN-based dirty bitmap reported through VFIO,
similar to what KVM does. Vendor drivers would then need to translate
from IOVA to HVA to GFN when tracking DMA activities on VFIO
devices. IOVA->HVA is provided by VFIO. For HVA->GFN, it can be
provided by KVM, but I'm not sure whether it's exposed now.

> 
> > + *
> > + * Steps to get dirty page bitmap:
> > + * a. write start_pfn, page_size and total_pfns.
> 
> This is not well specified.  Is it intended that the user write all
> three of these on every iteration, or could they write start_pfn=0,
> page_size=4K, total_pfns=1, complete the steps below, then write
> start_pfn=1 and immediately begin the next iteration?  They've written
> all three, though not all on the current iteration, does that count?
> Furthermore, could the user simple re-read copied_pfns to determine if
> anything in the previously setup range has been re-dirtied?
> 
> IOW, are these three "registers" sticky or do the below operations
> invalidate them?  If they're invalidated, then there needs to be a
> mechanism to generate an error, such as below.
> 
> > + * b. read copied_pfns. Vendor driver should take one of the below action:
> > + *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if
> driver
> > + *       doesn't have any page to report dirty in given range or rest of the
> > + *       range. Exit the loop.
> > + *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to mark
> all
> > + *       pages dirty for given range or rest of the range. User space
> > + *       application mark all pages in the range as dirty and exit the loop.
> > + *     - Vendor driver should return copied_pfns and provide bitmap for
> > + *       copied_pfn in migration region.
> 
> Read returns errno if the pre-requisite registers are not valid?
> 
> > + * c. read data_offset, where vendor driver has written bitmap.
> > + * d. read bitmap from the migration region from data_offset.
> > + * e. Iterate through steps a to d while (total copied_pfns < total_pfns)
> 
> It seems like the intent here is that the user effectively does:
> 
> start_pfn += copied_pfns
> total_pfns -= copied_pfns
> page_size = page_size?
> 
> But are they under any obligation to do so?
> 
> Also same question above regarding data availability/life-cycle.  Is
> the vendor driver responsible for making the data available
> indefinitely?  Seems it's only released at the next iteration, or
> re-use of the data area for another operation, or clearing of the
> _SAVING state bit.
> 
> > + * Sequence to be followed while _RESUMING device state:
> > + * While data for this device is available, repeat below steps:
> > + * a. read data_offset from where user application should write data.
> > + * b. write data of data_size to migration region from data_offset.
> > + * c. write data_size which indicates vendor driver that data is written in
> > + *    staging buffer.
> > + *
> > + * For user application, data is opaque. User should write data in the same
> > + * order as received.
> 
> Additionally, implicit synchronization between _SAVING and _RESUMING
> ends within the vendor driver is assumed.
> 
> Are there any assumptions we haven't covered with respect to mmaps?
> For instance, can the user setup mmaps at any time or only during
> certain device states?  Are there recommended best practices for users
> to only setup mmaps during _SAVING or _RESUMING?  If we had a revoke
> mechanism, it might be nice to use it when either of these bits are
> cleared.  Thanks,

Another open question for mmaps is how many pages to map. copied_pfns
carries only the number of copied pages. They are scattered in the bitmap,
while the application doesn't know how long the bitmap is. Do we expect
the application to map pages one by one, until the number of scanned
dirty pages equals copied_pfns?

> 
> Alex
> 
> > + */
> > +
> > +struct vfio_device_migration_info {
> > +        __u32 device_state;         /* VFIO device state */
> > +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> > +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> > +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> > +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> > +                                     VFIO_DEVICE_STATE_SAVING | \
> > +                                     VFIO_DEVICE_STATE_RESUMING)
> > +#define VFIO_DEVICE_STATE_INVALID   (VFIO_DEVICE_STATE_SAVING | \
> > +                                     VFIO_DEVICE_STATE_RESUMING)
> > +        __u32 reserved;
> > +        __u64 pending_bytes;
> > +        __u64 data_offset;
> > +        __u64 data_size;
> > +        __u64 start_pfn;
> > +        __u64 page_size;
> > +        __u64 total_pfns;
> > +        __u64 copied_pfns;
> > +#define VFIO_DEVICE_DIRTY_PFNS_NONE     (0)
> > +#define VFIO_DEVICE_DIRTY_PFNS_ALL      (~0ULL)
> > +} __attribute__((packed));
> > +
> >  /*
> >   * The MSIX mappable capability informs that MSIX data of a BAR can be
> mmapped
> >   * which allows direct access to non-MSIX registers which happened to be
> within



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v8 01/13] vfio: KABI for migration interface
       [not found]     ` <AADFC41AFE54684AB9EE6CBC0274A5D19D553133@SHSMSX104.ccr.corp.intel.com>
@ 2019-08-30  8:06       ` Tian, Kevin
  2019-08-30 16:32         ` Alex Williamson
  0 siblings, 1 reply; 34+ messages in thread
From: Tian, Kevin @ 2019-08-30  8:06 UTC (permalink / raw)
  To: 'Alex Williamson', Kirti Wankhede
  Cc: Zhengxiao.zx, qemu-devel, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	cohuck, shuangtai.tst, dgilbert, Wang,  Zhi A, mlevitsk, pasic,
	aik, eauger, felipe, jonathan.davies, Zhao, Yan Y, Liu,
	Changpeng, Ken.Xue

> From: Tian, Kevin
> Sent: Friday, August 30, 2019 3:26 PM
> 
[...]
> > How does QEMU handle the fact that IOVAs are potentially dynamic while
> > performing the live portion of a migration?  For example, each time a
> > guest driver calls dma_map_page() or dma_unmap_page(), a
> > MemoryRegionSection pops in or out of the AddressSpace for the device
> > (I'm assuming a vIOMMU where the device AddressSpace is not
> > system_memory).  I don't see any QEMU code that intercepts that change
> > in the AddressSpace such that the IOVA dirty pfns could be recorded and
> > translated to GFNs.  The vendor driver can't track these beyond getting
> > an unmap notification since it only knows the IOVA pfns, which can be
> > re-used with different GFN backing.  Once the DMA mapping is torn down,
> > it seems those dirty pfns are lost in the ether.  If this works in QEMU,
> > please help me find the code that handles it.
> 
> I'm curious about this part too. Interestingly, I didn't find any log_sync
> callback registered by emulated devices in Qemu. Looks dirty pages
> by emulated DMAs are recorded in some implicit way. But KVM always
> reports dirty page in GFN instead of IOVA, regardless of the presence of
> vIOMMU. If Qemu also tracks dirty pages in GFN for emulated DMAs
>  (translation can be done when DMA happens), then we don't need
> worry about transient mapping from IOVA to GFN. Along this way we
> also want GFN-based dirty bitmap being reported through VFIO,
> similar to what KVM does. For vendor drivers, it needs to translate
> from IOVA to HVA to GFN when tracking DMA activities on VFIO
> devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can be
> provided by KVM but I'm not sure whether it's exposed now.
> 

HVA->GFN can be done through hva_to_gfn_memslot in kvm_host.h.

The above flow works for a software-tracked dirty mechanism, e.g. in
KVMGT, where a GFN-based 'dirty' bit is marked when a guest page is
mapped into the device MMU. The IOVA->HPA->GFN translation is done
at that time, so it is immune to further IOVA->GFN changes.

When the hardware IOMMU supports the D-bit in 2nd-level translation
(e.g. VT-d rev3.0), there are two scenarios:

1) nested translation: the guest manages the 1st-level translation
(IOVA->GPA) and the host manages the 2nd-level translation (GPA->HPA).
The 2nd level is not affected by guest mapping operations, so it's fine
for the IOMMU driver to retrieve GFN-based dirty pages by directly
scanning the 2nd-level structure upon request from user space.

2) shadowed translation (IOVA->HPA) in the 2nd level: in this case the
dirty information is tied to the IOVA, and the IOMMU driver is expected
to maintain an internal dirty bitmap. Upon any IOVA->GPA change
notification from VFIO, the IOMMU driver should flush the dirty status of
the affected 2nd-level entries to the internal GFN-based bitmap. At that
point the IOVA->HVA->GPA translation is again required for GFN-based
recording. When userspace queries the dirty bitmap, the IOMMU driver
needs to flush the latest 2nd-level dirty status to the internal bitmap,
which is then copied to user space.

Given the trickiness of 2), we aim to enable 1) in the intel-iommu driver.

Thanks
Kevin


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v8 01/13] vfio: KABI for migration interface
  2019-08-30  7:25     ` Tian, Kevin
@ 2019-08-30 16:15       ` Alex Williamson
  2019-09-03  6:05         ` Tian, Kevin
  0 siblings, 1 reply; 34+ messages in thread
From: Alex Williamson @ 2019-08-30 16:15 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Zhengxiao.zx, qemu-devel, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	cohuck, shuangtai.tst, dgilbert, Wang, Zhi A, mlevitsk, pasic,
	aik, Kirti Wankhede, eauger, felipe, jonathan.davies, Zhao,
	Yan Y, Liu, Changpeng, Ken.Xue

On Fri, 30 Aug 2019 07:25:59 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Thursday, August 29, 2019 4:51 AM
> > 
> > On Tue, 27 Aug 2019 00:25:41 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> > > - Defined MIGRATION region type and sub-type.
> > > - Used 3 bits to define VFIO device states.
> > >     Bit 0 => _RUNNING
> > >     Bit 1 => _SAVING
> > >     Bit 2 => _RESUMING
> > >     Combination of these bits defines VFIO device's state during migration
> > >     _STOPPED => All bits 0 indicates VFIO device stopped.
> > >     _RUNNING => Normal VFIO device running state.
> > >     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but  
> > start  
> > >                           saving state of device i.e. pre-copy state
> > >     _SAVING  => vCPUs are stoppped, VFIO device should be stopped, and
> > >                           save device state,i.e. stop-n-copy state
> > >     _RESUMING => VFIO device resuming state.
> > >     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING bits  
> > are set  
> > >     Bits 3 - 31 are reserved for future use. User should perform
> > >     read-modify-write operation on this field.
> > > - Defined vfio_device_migration_info structure which will be placed at 0th
> > >   offset of migration region to get/set VFIO device related information.
> > >   Defined members of structure and usage on read/write access:
> > >     * device_state: (read/write)
> > >         To convey VFIO device state to be transitioned to. Only 3 bits are used
> > >         as of now, Bits 3 - 31 are reserved for future use.
> > >     * pending bytes: (read only)
> > >         To get pending bytes yet to be migrated for VFIO device.
> > >     * data_offset: (read only)
> > >         To get data offset in migration region from where data exist during
> > >         _SAVING, from where data should be written by user space application
> > >         during _RESUMING state and while read dirty pages bitmap.
> > >     * data_size: (read/write)
> > >         To get and set size of data copied in migration region during _SAVING
> > >         and _RESUMING state.
> > >     * start_pfn, page_size, total_pfns: (write only)
> > >         To get bitmap of dirty pages from vendor driver from given
> > >         start address for total_pfns.
> > >     * copied_pfns: (read only)
> > >         To get number of pfns bitmap copied in migration region.
> > >         Vendor driver should copy the bitmap with bits set only for
> > >         pages to be marked dirty in migration region. Vendor driver
> > >         should return VFIO_DEVICE_DIRTY_PFNS_NONE if there are 0 pages  
> > dirty in  
> > >         requested range. Vendor driver should return  
> > VFIO_DEVICE_DIRTY_PFNS_ALL  
> > >         to mark all pages in the section as dirty.
> > >
> > > Migration region looks like:
> > >  ------------------------------------------------------------------
> > > |vfio_device_migration_info|    data section                      |
> > > |                          |     ///////////////////////////////  |
> > >  ------------------------------------------------------------------
> > >  ^                              ^                              ^
> > >  offset 0-trapped part        data_offset                 data_size
> > >
> > > Data section is always followed by vfio_device_migration_info
> > > structure in the region, so data_offset will always be non-0.
> > > Offset from where data is copied is decided by kernel driver, data
> > > section can be trapped or mapped depending on how kernel driver
> > > defines data section. If mmapped, then data_offset should be page
> > > aligned, where as initial section which contain vfio_device_migration_info
> > > structure might not end at offset which is page aligned.
> > > Data_offset can be same or different for device data and dirty pages bitmap.
> > > Vendor driver should decide whether to partition data section and how to
> > > partition the data section. Vendor driver should return data_offset
> > > accordingly.
> > >
> > > For user application, data is opaque. User should write data in the same
> > > order as received.
> > >
> > > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > > ---
> > >  linux-headers/linux/vfio.h | 148  
> > +++++++++++++++++++++++++++++++++++++++++++++  
> > >  1 file changed, 148 insertions(+)
> > >
> > > diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> > > index 24f505199f83..4bc0236b0898 100644
> > > --- a/linux-headers/linux/vfio.h
> > > +++ b/linux-headers/linux/vfio.h
> > > @@ -372,6 +372,154 @@ struct vfio_region_gfx_edid {
> > >   */
> > >  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
> > >
> > > +/* Migration region type and sub-type */
> > > +#define VFIO_REGION_TYPE_MIGRATION	        (3)
> > > +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> > > +
> > > +/**
> > > + * Structure vfio_device_migration_info is placed at 0th offset of
> > > + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device  
> > related migration  
> > > + * information. Field accesses from this structure are only supported at  
> > their  
> > > + * native width and alignment, otherwise the result is undefined and  
> > vendor  
> > > + * drivers should return an error.
> > > + *
> > > + * device_state: (read/write)
> > > + *      To indicate vendor driver the state VFIO device should be  
> > transitioned  
> > > + *      to. If device state transition fails, write on this field return error.
> > > + *      It consists of 3 bits:
> > > + *      - If bit 0 set, indicates _RUNNING state. When its reset, that indicates
> > > + *        _STOPPED state. When device is changed to _STOPPED, driver should  
> > stop  
> > > + *        device before write() returns.
> > > + *      - If bit 1 set, indicates _SAVING state.
> > > + *      - If bit 2 set, indicates _RESUMING state.  
> 
> please add a few words to explain _SAVING and _RESUMING, similar to 
> what you did for _RUNNING.
> 
> > > + *      Bits 3 - 31 are reserved for future use. User should perform
> > > + *      read-modify-write operation on this field.
> > > + *      _SAVING and _RESUMING bits set at the same time is invalid state.  
> 
> what about _RUNNING | _RESUMING? Is it allowed?

I think this would be post-copy migration, which I assume is
theoretically supportable, but not necessarily (and not currently)
supported by vendor drivers.  I'm not sure how a vendor driver reports
the lack of this support though.
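
Short of adding a capability bit, the only mechanism the proposal gives us
is the documented error on a failed state transition, so userspace would
presumably have to probe, something like the sketch below (assuming the
proposed layout and a device_fd/region_offset describing the migration
region):

    __u32 state;

    pread(device_fd, &state, sizeof(state),
          region_offset + offsetof(struct vfio_device_migration_info,
                                   device_state));
    state = (state & ~VFIO_DEVICE_STATE_MASK) |
            VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_RESUMING;

    /* Relies solely on "if device state transition fails, write on this
     * field return error" from the proposed documentation. */
    if (pwrite(device_fd, &state, sizeof(state),
               region_offset + offsetof(struct vfio_device_migration_info,
                                        device_state)) != sizeof(state)) {
        /* vendor driver rejects _RUNNING | _RESUMING: no post-copy */
    }

That's awkward though, since a successful probe actually puts the device
into that state.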

> > > + *
> > > + * pending bytes: (read only)
> > > + *      Number of pending bytes yet to be migrated from vendor driver
> > > + *
> > > + * data_offset: (read only)
> > > + *      User application should read data_offset in migration region from  
> > where  
> > > + *      user application should read device data during _SAVING state or  
> > write  
> > > + *      device data during _RESUMING state or read dirty pages bitmap. See  
> > below  
> > > + *      for detail of sequence to be followed.
> > > + *
> > > + * data_size: (read/write)
> > > + *      User application should read data_size to get size of data copied in
> > > + *      migration region during _SAVING state and write size of data copied  
> > in  
> > > + *      migration region during _RESUMING state.
> > > + *
> > > + * start_pfn: (write only)
> > > + *      Start address pfn to get bitmap of dirty pages from vendor driver  
> > duing  
> > > + *      _SAVING state.
> > > + *
> > > + * page_size: (write only)
> > > + *      User application should write the page_size of pfn.
> > > + *
> > > + * total_pfns: (write only)
> > > + *      Total pfn count from start_pfn for which dirty bitmap is requested.
> > > + *
> > > + * copied_pfns: (read only)  
> 
> 'copied' gives the impression as if the page content is copied. what about
> 'dirty_pfns'? btw can this field merge with total_pfns?

I don't agree with the implication that copied implies page content;
we're working with a bitmap, so this is easily disproved.  I think the
intent of 'copied' is to differentiate that 'total' is how many the
user asked for, and 'copied' is how many they got.  It's not obvious to
me whether there's anything that would prevent duplicate use of one
register, ie. write total, read copied.  I'm not sure I see a huge
advantage to it either though.
 
> > > + *      pfn count for which dirty bitmap is copied to migration region.
> > > + *      Vendor driver should copy the bitmap with bits set only for pages to  
> > be  
> > > + *      marked dirty in migration region.
> > > + *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if  
> > none of the  
> > > + *        pages are dirty in requested range or rest of the range.
> > > + *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to mark  
> > all  
> > > + *        pages dirty in the given range or rest of the range.
> > > + *      - Vendor driver should return pfn count for which bitmap is written in
> > > + *        the region.
> > > + *
> > > + * Migration region looks like:
> > > + *  ------------------------------------------------------------------
> > > + * |vfio_device_migration_info|    data section                      |
> > > + * |                          |     ///////////////////////////////  |
> > > + * ------------------------------------------------------------------
> > > + *   ^                              ^                             ^
> > > + *  offset 0-trapped part        data_offset                 data_size  
> 
> 'data_size' -> "data_offset + data_size"

The diagram sort of implies that, but I don't think this is correct.  I
believe it's more like:

data_offset = data_start_offset
data_size = data_end_offset - data_start_offset
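
i.e. with the markers placed where I think they belong, the diagram would
read:

     ------------------------------------------------------------------
    |vfio_device_migration_info|    data section                      |
    |                          |     ///////////////////////////////  |
     ------------------------------------------------------------------
    ^                          ^                            ^
    offset 0 (trapped part)    data_offset     data_offset + data_size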

> > > + *
> > > + * Data section is always followed by vfio_device_migration_info structure  
> 
> Data section is always following ..., or ... structure is always followed by
> data section.

The vfio_device_migration_info struct is always at the zeroth offset of
the region; the data section is theoretically the remainder, but the used
range of the data section is defined by the vendor driver via
data_offset.

> > > + * in the region, so data_offset will always be non-0. Offset from where  
> > data  
> > > + * is copied is decided by kernel driver, data section can be trapped or
> > > + * mapped or partitioned, depending on how kernel driver defines data  
> > section.  
> > > + * Data section partition can be defined as mapped by sparse mmap  
> > capability.  
> > > + * If mmapped, then data_offset should be page aligned, where as initial  
> > section  
> > > + * which contain vfio_device_migration_info structure might not end at  
> > offset  
> > > + * which is page aligned.
> > > + * Data_offset can be same or different for device data and dirty pages  
> > bitmap.  
> > > + * Vendor driver should decide whether to partition data section and how  
> > to  
> > > + * partition the data section. Vendor driver should return data_offset
> > > + * accordingly.  
> 
> Here lacks of high-level summary about how to differentiate reading device
> state from reading dirty page bitmap, when both use the same interface 
> (data_offset) to convey information to user space.
> 
> From below sequence example, looks reading device state is initiated by
> reading pending_bytes, while reading dirty bitmap is marked by writing
> start_pfn. In case of shared data region between two operations, they have
> to be mutually-exclusive i.e. one must wait for the other to complete. Even
> when the region is partitioned, data_offset itself could be raced if pending_
> bytes and start_pfn are accessed at the same time. How do we expect the
> application to cope with it? Isn't it too limiting with such design?
> 
> Since you anyway invent different sets of fields for two operations, why not
> forcing partitioned flavor and then introduce two data_offset fields for each
> other? This way the application is free to intermix device state and dirty
> page collection for whatever needs.

AIUI, it's the user's responsibility to consume the data they've
asked to be provided before they perform the next operation, but the
user can alternate between device state and dirty pages at will.  I
agree though that the lifecycle of the data with regard to the vendor
driver is lacking.  Nothing seems to indicate to the vendor driver that
the data is consumed other than starting the next operation or turning
off _SAVING.

> > > + *
> > > + * Sequence to be followed for _SAVING|_RUNNING device state or pre-  
> > copy phase  
> > > + * and for _SAVING device state or stop-and-copy phase:
> > > + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
> > > + * b. read data_offset, indicates kernel driver to write data to staging  
> > buffer.  
> > > + * c. read data_size, amount of data in bytes written by vendor driver in
> > > + *    migration region.
> > > + * d. read data_size bytes of data from data_offset in the migration region.
> > > + * e. process data.
> > > + * f. Loop through a to e.  
> > 
> > Something needs to be said here about the availability of the data, for
> > example, what indicates to the vendor driver that the above operation is
> > complete?  Is the data immutable?  
> 
> I guess the vendor driver just continues to track pending_bytes for dirtied 
> device state, until exiting _SAVING state. Data is copied only when
> pending_bytes is read by userspace. Copied data is immutable if data region
> is mmapped. But yes, this part needs clarification.
> 
> >   
> > > + *
> > > + * To copy system memory content during migration, vendor driver should  
> > be able  
> > > + * to report system memory pages which are dirtied by that driver. For such
> > > + * dirty page reporting, user application should query for a range of GFNs
> > > + * relative to device address space (IOVA), then vendor driver should  
> > provide  
> > > + * the bitmap of pages from this range which are dirtied by him through
> > > + * migration region where each bit represents a page and bit set to 1  
> > represents  
> > > + * that the page is dirty.
> > > + * User space application should take care of copying content of system  
> > memory  
> > > + * for those pages.  
> > 
> > Can we say that device state and dirty pfn operations on the data
> > area may be intermixed in any order the user chooses?  
> 
> this part is very opaque from previous description. Now I'm inclined to
> vote for no-intermix design. There is really no good to force some
> dependency between retrieving device state and dirty page bitmap.

I'm confused; the state of the device continues to change so long as it
is _RUNNING.  The device may also continue to dirty pages in the
_RUNNING state.  So how could we not have both occurring at the same
time, and therefore have to support the user deciding which to sample in
each iteration?

> > Should we specify that bits accumulate since EITHER a) _SAVING state is
> > enabled or b) a given pfn was last reported via the below sequence (ie.
> > dirty bits are cleared once reported)?
> > 
> > How does QEMU handle the fact that IOVAs are potentially dynamic while
> > performing the live portion of a migration?  For example, each time a
> > guest driver calls dma_map_page() or dma_unmap_page(), a
> > MemoryRegionSection pops in or out of the AddressSpace for the device
> > (I'm assuming a vIOMMU where the device AddressSpace is not
> > system_memory).  I don't see any QEMU code that intercepts that change
> > in the AddressSpace such that the IOVA dirty pfns could be recorded and
> > translated to GFNs.  The vendor driver can't track these beyond getting
> > an unmap notification since it only knows the IOVA pfns, which can be
> > re-used with different GFN backing.  Once the DMA mapping is torn down,
> > it seems those dirty pfns are lost in the ether.  If this works in QEMU,
> > please help me find the code that handles it.  
> 
> I'm curious about this part too. Interestingly, I didn't find any log_sync
> callback registered by emulated devices in Qemu. Looks dirty pages
> by emulated DMAs are recorded in some implicit way. But KVM always
> reports dirty page in GFN instead of IOVA, regardless of the presence of
> vIOMMU. If Qemu also tracks dirty pages in GFN for emulated DMAs
>  (translation can be done when DMA happens), then we don't need 
> worry about transient mapping from IOVA to GFN. Along this way we
> also want GFN-based dirty bitmap being reported through VFIO, 
> similar to what KVM does. For vendor drivers, it needs to translate
> from IOVA to HVA to GFN when tracking DMA activities on VFIO 
> devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can be
> provided by KVM but I'm not sure whether it's exposed now.
> 
> >   
> > > + *
> > > + * Steps to get dirty page bitmap:
> > > + * a. write start_pfn, page_size and total_pfns.  
> > 
> > This is not well specified.  Is it intended that the user write all
> > three of these on every iteration, or could they write start_pfn=0,
> > page_size=4K, total_pfns=1, complete the steps below, then write
> > start_pfn=1 and immediately begin the next iteration?  They've written
> > all three, though not all on the current iteration, does that count?
> > Furthermore, could the user simple re-read copied_pfns to determine if
> > anything in the previously setup range has been re-dirtied?
> > 
> > IOW, are these three "registers" sticky or do the below operations
> > invalidate them?  If they're invalidated, then there needs to be a
> > mechanism to generate an error, such as below.
> >   
> > > + * b. read copied_pfns. Vendor driver should take one of the below action:
> > > + *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if  
> > driver  
> > > + *       doesn't have any page to report dirty in given range or rest of the
> > > + *       range. Exit the loop.
> > > + *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to mark  
> > all  
> > > + *       pages dirty for given range or rest of the range. User space
> > > + *       application mark all pages in the range as dirty and exit the loop.
> > > + *     - Vendor driver should return copied_pfns and provide bitmap for
> > > + *       copied_pfn in migration region.  
> > 
> > Read returns errno if the pre-requisite registers are not valid?
> >   
> > > + * c. read data_offset, where vendor driver has written bitmap.
> > > + * d. read bitmap from the migration region from data_offset.
> > > + * e. Iterate through steps a to d while (total copied_pfns < total_pfns)  
> > 
> > It seems like the intent here is that the user effectively does:
> > 
> > start_pfn += copied_pfns
> > total_pfns -= copied_pfns
> > page_size = page_size?
> > 
> > But are they under any obligation to do so?
> > 
> > Also same question above regarding data availability/life-cycle.  Is
> > the vendor driver responsible for making the data available
> > indefinitely?  Seems it's only released at the next iteration, or
> > re-use of the data area for another operation, or clearing of the
> > _SAVING state bit.
> >   
> > > + * Sequence to be followed while _RESUMING device state:
> > > + * While data for this device is available, repeat below steps:
> > > + * a. read data_offset from where user application should write data.
> > > + * b. write data of data_size to migration region from data_offset.
> > > + * c. write data_size which indicates vendor driver that data is written in
> > > + *    staging buffer.
> > > + *
> > > + * For user application, data is opaque. User should write data in the same
> > > + * order as received.  
> > 
> > Additionally, implicit synchronization between _SAVING and _RESUMING
> > ends within the vendor driver is assumed.
> > 
> > Are there any assumptions we haven't covered with respect to mmaps?
> > For instance, can the user setup mmaps at any time or only during
> > certain device states?  Are there recommended best practices for users
> > to only setup mmaps during _SAVING or _RESUMING?  If we had a revoke
> > mechanism, it might be nice to use it when either of these bits are
> > cleared.  Thanks,  
> 
> another open for mmaps is how many pages to map. copied_pfns 
> carries only the number of copied pages. They scatter in the bitmap
> while application doesn't know how long the bitmap is. Do we
> expect the application to map pages one-by-one, until scanned 
> dirty pages is equal to copied_pfns?

I think we expect userspace to mmap according to the sparse mmap
capability of the region.  The vendor driver then chooses whether to
expose the current data set within the mmap region, outside the mmap
region, or some mix of both, depending on their requirements.  Thanks,

Alex
 
> > > + */
> > > +
> > > +struct vfio_device_migration_info {
> > > +        __u32 device_state;         /* VFIO device state */
> > > +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> > > +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> > > +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> > > +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> > > +                                     VFIO_DEVICE_STATE_SAVING | \
> > > +                                     VFIO_DEVICE_STATE_RESUMING)
> > > +#define VFIO_DEVICE_STATE_INVALID   (VFIO_DEVICE_STATE_SAVING | \
> > > +                                     VFIO_DEVICE_STATE_RESUMING)
> > > +        __u32 reserved;
> > > +        __u64 pending_bytes;
> > > +        __u64 data_offset;
> > > +        __u64 data_size;
> > > +        __u64 start_pfn;
> > > +        __u64 page_size;
> > > +        __u64 total_pfns;
> > > +        __u64 copied_pfns;
> > > +#define VFIO_DEVICE_DIRTY_PFNS_NONE     (0)
> > > +#define VFIO_DEVICE_DIRTY_PFNS_ALL      (~0ULL)
> > > +} __attribute__((packed));
> > > +
> > >  /*
> > >   * The MSIX mappable capability informs that MSIX data of a BAR can be  
> > mmapped  
> > >   * which allows direct access to non-MSIX registers which happened to be  
> > within  
> 



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v8 01/13] vfio: KABI for migration interface
  2019-08-30  8:06       ` Tian, Kevin
@ 2019-08-30 16:32         ` Alex Williamson
  2019-09-03  6:57           ` Tian, Kevin
  0 siblings, 1 reply; 34+ messages in thread
From: Alex Williamson @ 2019-08-30 16:32 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Zhengxiao.zx, qemu-devel, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	cohuck, shuangtai.tst, dgilbert, Wang, Zhi A, mlevitsk, pasic,
	aik, Kirti Wankhede, eauger, felipe, jonathan.davies, Zhao,
	Yan Y, Liu, Changpeng, Ken.Xue

On Fri, 30 Aug 2019 08:06:32 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Tian, Kevin
> > Sent: Friday, August 30, 2019 3:26 PM
> >   
> [...]
> > > How does QEMU handle the fact that IOVAs are potentially dynamic while
> > > performing the live portion of a migration?  For example, each time a
> > > guest driver calls dma_map_page() or dma_unmap_page(), a
> > > MemoryRegionSection pops in or out of the AddressSpace for the device
> > > (I'm assuming a vIOMMU where the device AddressSpace is not
> > > system_memory).  I don't see any QEMU code that intercepts that change
> > > in the AddressSpace such that the IOVA dirty pfns could be recorded and
> > > translated to GFNs.  The vendor driver can't track these beyond getting
> > > an unmap notification since it only knows the IOVA pfns, which can be
> > > re-used with different GFN backing.  Once the DMA mapping is torn down,
> > > it seems those dirty pfns are lost in the ether.  If this works in QEMU,
> > > please help me find the code that handles it.  
> > 
> > I'm curious about this part too. Interestingly, I didn't find any log_sync
> > callback registered by emulated devices in Qemu. Looks dirty pages
> > by emulated DMAs are recorded in some implicit way. But KVM always
> > reports dirty page in GFN instead of IOVA, regardless of the presence of
> > vIOMMU. If Qemu also tracks dirty pages in GFN for emulated DMAs
> >  (translation can be done when DMA happens), then we don't need
> > worry about transient mapping from IOVA to GFN. Along this way we
> > also want GFN-based dirty bitmap being reported through VFIO,
> > similar to what KVM does. For vendor drivers, it needs to translate
> > from IOVA to HVA to GFN when tracking DMA activities on VFIO
> > devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can be
> > provided by KVM but I'm not sure whether it's exposed now.
> >   
> 
> HVA->GFN can be done through hva_to_gfn_memslot in kvm_host.h.

I thought it was bad enough that we have vendor drivers that depend on
KVM, but designing a vfio interface that only supports a KVM interface
is more undesirable.  I also note without comment that gfn_to_memslot()
is a GPL symbol.  Thanks,

Alex

> Above flow works for software-tracked dirty mechanism, e.g. in
> KVMGT, where GFN-based 'dirty' is marked when a guest page is 
> mapped into device mmu. IOVA->HPA->GFN translation is done 
> at that time, thus immune from further IOVA->GFN changes.
> 
> When hardware IOMMU supports D-bit in 2nd level translation (e.g.
> VT-d rev3.0), there are two scenarios:
> 
> 1) nested translation: guest manages 1st-level translation (IOVA->GPA)
> and host manages 2nd-level translation (GPA->HPA). The 2nd-level
> is not affected by guest mapping operations. So it's OK for IOMMU
> driver to retrieve GFN-based dirty pages by directly scanning the 2nd-
> level structure, upon request from user space. 
> 
> 2) shadowed translation (IOVA->HPA) in 2nd level: in such case the dirty
> information is tied to IOVA. the IOMMU driver is expected to maintain
> an internal dirty bitmap. Upon any change of IOVA->GPA notification
> from VFIO, the IOMMU driver should flush dirty status of affected 2nd-level
> entries to the internal GFN-based bitmap. At this time, again IOVA->HVA
> ->GPA translation required for GFN-based recording. When userspace   
> queries dirty bitmap, the IOMMU driver needs to flush latest 2nd-level 
> dirty status to internal bitmap, which is then copied to user space.
> 
> Given the trickiness of 2), we aim to enable 1) on intel-iommu driver.
> 
> Thanks
> Kevin



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v8 01/13] vfio: KABI for migration interface
  2019-08-30 16:15       ` Alex Williamson
@ 2019-09-03  6:05         ` Tian, Kevin
  2019-09-04  8:28           ` Yan Zhao
  0 siblings, 1 reply; 34+ messages in thread
From: Tian, Kevin @ 2019-09-03  6:05 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, qemu-devel, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	cohuck, shuangtai.tst, dgilbert, Wang,  Zhi A, mlevitsk, pasic,
	aik, Kirti Wankhede, eauger, felipe, jonathan.davies, Zhao,
	Yan Y, Liu, Changpeng, Ken.Xue

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Saturday, August 31, 2019 12:15 AM
> 
> On Fri, 30 Aug 2019 07:25:59 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Thursday, August 29, 2019 4:51 AM
> > >
> > > On Tue, 27 Aug 2019 00:25:41 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >
> > > > - Defined MIGRATION region type and sub-type.
> > > > - Used 3 bits to define VFIO device states.
> > > >     Bit 0 => _RUNNING
> > > >     Bit 1 => _SAVING
> > > >     Bit 2 => _RESUMING
> > > >     Combination of these bits defines VFIO device's state during migration
> > > >     _STOPPED => All bits 0 indicates VFIO device stopped.
> > > >     _RUNNING => Normal VFIO device running state.
> > > >     _SAVING | _RUNNING => vCPUs are running, VFIO device is running
> but
> > > start
> > > >                           saving state of device i.e. pre-copy state
> > > >     _SAVING  => vCPUs are stoppped, VFIO device should be stopped, and
> > > >                           save device state,i.e. stop-n-copy state
> > > >     _RESUMING => VFIO device resuming state.
> > > >     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING
> bits
> > > are set
> > > >     Bits 3 - 31 are reserved for future use. User should perform
> > > >     read-modify-write operation on this field.
> > > > - Defined vfio_device_migration_info structure which will be placed at
> 0th
> > > >   offset of migration region to get/set VFIO device related information.
> > > >   Defined members of structure and usage on read/write access:
> > > >     * device_state: (read/write)
> > > >         To convey VFIO device state to be transitioned to. Only 3 bits are
> used
> > > >         as of now, Bits 3 - 31 are reserved for future use.
> > > >     * pending bytes: (read only)
> > > >         To get pending bytes yet to be migrated for VFIO device.
> > > >     * data_offset: (read only)
> > > >         To get data offset in migration region from where data exist during
> > > >         _SAVING, from where data should be written by user space
> application
> > > >         during _RESUMING state and while read dirty pages bitmap.
> > > >     * data_size: (read/write)
> > > >         To get and set size of data copied in migration region during
> _SAVING
> > > >         and _RESUMING state.
> > > >     * start_pfn, page_size, total_pfns: (write only)
> > > >         To get bitmap of dirty pages from vendor driver from given
> > > >         start address for total_pfns.
> > > >     * copied_pfns: (read only)
> > > >         To get number of pfns bitmap copied in migration region.
> > > >         Vendor driver should copy the bitmap with bits set only for
> > > >         pages to be marked dirty in migration region. Vendor driver
> > > >         should return VFIO_DEVICE_DIRTY_PFNS_NONE if there are 0 pages
> > > dirty in
> > > >         requested range. Vendor driver should return
> > > VFIO_DEVICE_DIRTY_PFNS_ALL
> > > >         to mark all pages in the section as dirty.
> > > >
> > > > Migration region looks like:
> > > >  ------------------------------------------------------------------
> > > > |vfio_device_migration_info|    data section                      |
> > > > |                          |     ///////////////////////////////  |
> > > >  ------------------------------------------------------------------
> > > >  ^                              ^                              ^
> > > >  offset 0-trapped part        data_offset                 data_size
> > > >
> > > > Data section is always followed by vfio_device_migration_info
> > > > structure in the region, so data_offset will always be non-0.
> > > > Offset from where data is copied is decided by kernel driver, data
> > > > section can be trapped or mapped depending on how kernel driver
> > > > defines data section. If mmapped, then data_offset should be page
> > > > aligned, where as initial section which contain
> vfio_device_migration_info
> > > > structure might not end at offset which is page aligned.
> > > > Data_offset can be same or different for device data and dirty pages
> bitmap.
> > > > Vendor driver should decide whether to partition data section and how
> to
> > > > partition the data section. Vendor driver should return data_offset
> > > > accordingly.
> > > >
> > > > For user application, data is opaque. User should write data in the same
> > > > order as received.
> > > >
> > > > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > > > ---
> > > >  linux-headers/linux/vfio.h | 148
> > > +++++++++++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 148 insertions(+)
> > > >
> > > > diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> > > > index 24f505199f83..4bc0236b0898 100644
> > > > --- a/linux-headers/linux/vfio.h
> > > > +++ b/linux-headers/linux/vfio.h
> > > > @@ -372,6 +372,154 @@ struct vfio_region_gfx_edid {
> > > >   */
> > > >  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
> > > >
> > > > +/* Migration region type and sub-type */
> > > > +#define VFIO_REGION_TYPE_MIGRATION	        (3)
> > > > +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> > > > +
> > > > +/**
> > > > + * Structure vfio_device_migration_info is placed at 0th offset of
> > > > + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device
> > > related migration
> > > > + * information. Field accesses from this structure are only supported at
> > > their
> > > > + * native width and alignment, otherwise the result is undefined and
> > > vendor
> > > > + * drivers should return an error.
> > > > + *
> > > > + * device_state: (read/write)
> > > > + *      To indicate vendor driver the state VFIO device should be
> > > transitioned
> > > > + *      to. If device state transition fails, write on this field return error.
> > > > + *      It consists of 3 bits:
> > > > + *      - If bit 0 set, indicates _RUNNING state. When its reset, that
> indicates
> > > > + *        _STOPPED state. When device is changed to _STOPPED, driver
> should
> > > stop
> > > > + *        device before write() returns.
> > > > + *      - If bit 1 set, indicates _SAVING state.
> > > > + *      - If bit 2 set, indicates _RESUMING state.
> >
> > please add a few words to explain _SAVING and _RESUMING, similar to
> > what you did for _RUNNING.
> >
> > > > + *      Bits 3 - 31 are reserved for future use. User should perform
> > > > + *      read-modify-write operation on this field.
> > > > + *      _SAVING and _RESUMING bits set at the same time is invalid state.
> >
> > what about _RUNNING | _RESUMING? Is it allowed?
> 
> I think this would be post-copy migration, which I assume is
> theoretically supportable, but not necessarily (and not currently)
> supported by vendor drivers.  I'm not sure how a vendor driver reports
> the lack of this support though.

Yan is working on post-copy now. I talked to her about this open question.
She will respond later after some thinking. Ideally we need a way
that user space can use to verify which combinations are available
before starting the post-copy process. It's not good to blindly start
post-copy and then fail because the desired state cannot be reached on
the dest machine.
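
For concreteness, one way user space could probe this is to attempt the
device_state write and check whether the vendor driver rejects it. Below is a
minimal sketch, assuming the vfio_device_migration_info layout proposed in
this patch (available from the patched linux/vfio.h) and that device_fd /
region_off identify the migration region; note a write that succeeds has real
side effects, so this only illustrates the idea, not a final mechanism:

#include <stddef.h>
#include <stdbool.h>
#include <unistd.h>
#include <linux/types.h>
#include <linux/vfio.h>

/* Returns true if the vendor driver accepted the requested state bits,
 * e.g. VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_RESUMING. */
static bool state_write_accepted(int device_fd, off_t region_off, __u32 state)
{
        off_t pos = region_off +
                    offsetof(struct vfio_device_migration_info, device_state);

        return pwrite(device_fd, &state, sizeof(state), pos) == sizeof(state);
}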

> 
> > > > + *
> > > > + * pending bytes: (read only)
> > > > + *      Number of pending bytes yet to be migrated from vendor driver
> > > > + *
> > > > + * data_offset: (read only)
> > > > + *      User application should read data_offset in migration region from
> > > where
> > > > + *      user application should read device data during _SAVING state or
> > > write
> > > > + *      device data during _RESUMING state or read dirty pages bitmap.
> See
> > > below
> > > > + *      for detail of sequence to be followed.
> > > > + *
> > > > + * data_size: (read/write)
> > > > + *      User application should read data_size to get size of data copied
> in
> > > > + *      migration region during _SAVING state and write size of data
> copied
> > > in
> > > > + *      migration region during _RESUMING state.
> > > > + *
> > > > + * start_pfn: (write only)
> > > > + *      Start address pfn to get bitmap of dirty pages from vendor driver
> > > duing
> > > > + *      _SAVING state.
> > > > + *
> > > > + * page_size: (write only)
> > > > + *      User application should write the page_size of pfn.
> > > > + *
> > > > + * total_pfns: (write only)
> > > > + *      Total pfn count from start_pfn for which dirty bitmap is requested.
> > > > + *
> > > > + * copied_pfns: (read only)
> >
> > 'copied' gives the impression as if the page content is copied. what about
> > 'dirty_pfns'? btw can this field merge with total_pfns?
> 
> I don't agree with the implication that copied implies page content,
> we're working with a bitmap, so this is easily disproved.  I think the
> intent of 'copied' is to differentiate that 'total' is how many the
> user asked for, and 'copied' is how many they got.  It's not obvious to
> me whether there's anything that would prevent duplicate use of one
> register, ie. write total, read copied.  I'm not sure I see a huge
> advantage to it either though.

It's fine. I just had the impression that many kernel interfaces are
designed in such a way, i.e., having the same field carry in the
user-requested number and then carry out the actually-handled number.

> 
> > > > + *      pfn count for which dirty bitmap is copied to migration region.
> > > > + *      Vendor driver should copy the bitmap with bits set only for pages
> to
> > > be
> > > > + *      marked dirty in migration region.
> > > > + *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if
> > > none of the
> > > > + *        pages are dirty in requested range or rest of the range.
> > > > + *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to
> mark
> > > all
> > > > + *        pages dirty in the given range or rest of the range.
> > > > + *      - Vendor driver should return pfn count for which bitmap is
> written in
> > > > + *        the region.
> > > > + *
> > > > + * Migration region looks like:
> > > > + *  ------------------------------------------------------------------
> > > > + * |vfio_device_migration_info|    data section                      |
> > > > + * |                          |     ///////////////////////////////  |
> > > > + * ------------------------------------------------------------------
> > > > + *   ^                              ^                             ^
> > > > + *  offset 0-trapped part        data_offset                 data_size
> >
> > 'data_size' -> "data_offset + data_size"
> 
> The diagram sort of implies that, but I don't think this is correct.  I
> believe it's more like:
> 
> data_offset = data_start_offset
> data_size = data_end_offset - data_start_offset

I'm fine with replacing data_offset & data_size with
data_start_offset & data_end_offset.
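
In other words the two namings are equivalent; a tiny sketch of the relation
(not part of the proposed header, just to make the diagram unambiguous):

#include <linux/types.h>

/* data_start_offset == data_offset in the current proposal */
static inline __u64 data_end_offset(__u64 data_offset, __u64 data_size)
{
        return data_offset + data_size;
}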

> 
> > > > + *
> > > > + * Data section is always followed by vfio_device_migration_info
> structure
> >
> > Data section is always following ..., or ... structure is always followed by
> > data section.
> 
> The vfio_device_migration struct is always at the zero'th offset of the
> region, the data section is theoretically the remainder, but the used
> range of the data section is defined by the vendor driver via
> data_offset.

Isn't the original description incorrect here? It implies the data section
sits at offset zero, followed by the migration info structure...

> 
> > > > + * in the region, so data_offset will always be non-0. Offset from where
> > > data
> > > > + * is copied is decided by kernel driver, data section can be trapped or
> > > > + * mapped or partitioned, depending on how kernel driver defines data
> > > section.
> > > > + * Data section partition can be defined as mapped by sparse mmap
> > > capability.
> > > > + * If mmapped, then data_offset should be page aligned, where as
> initial
> > > section
> > > > + * which contain vfio_device_migration_info structure might not end at
> > > offset
> > > > + * which is page aligned.
> > > > + * Data_offset can be same or different for device data and dirty pages
> > > bitmap.
> > > > + * Vendor driver should decide whether to partition data section and
> how
> > > to
> > > > + * partition the data section. Vendor driver should return data_offset
> > > > + * accordingly.
> >
> > Here lacks of high-level summary about how to differentiate reading device
> > state from reading dirty page bitmap, when both use the same interface
> > (data_offset) to convey information to user space.
> >
> > From below sequence example, looks reading device state is initiated by
> > reading pending_bytes, while reading dirty bitmap is marked by writing
> > start_pfn. In case of shared data region between two operations, they have
> > to be mutually-exclusive i.e. one must wait for the other to complete. Even
> > when the region is partitioned, data_offset itself could be raced if pending_
> > bytes and start_pfn are accessed at the same time. How do we expect the
> > application to cope with it? Isn't it too limiting with such design?
> >
> > Since you anyway invent different sets of fields for two operations, why not
> > forcing partitioned flavor and then introduce two data_offset fields for each
> > other? This way the application is free to intermix device state and dirty
> > page collection for whatever needs.
> 
> AIUI, it's the user's responsibility to consume the data they've
> asked to be provided before they perform the next operation, but the
> user can alternate between device state and dirty pages at will.  I
> agree though that the lifecycle of the data with regard to the vendor
> driver is lacking.  Nothing seems to indicate to the vendor driver that
> the data is consumed other than starting the next operation or turning
> off _SAVING.
> 
> > > > + *
> > > > + * Sequence to be followed for _SAVING|_RUNNING device state or
> pre-
> > > copy phase
> > > > + * and for _SAVING device state or stop-and-copy phase:
> > > > + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
> > > > + * b. read data_offset, indicates kernel driver to write data to staging
> > > buffer.
> > > > + * c. read data_size, amount of data in bytes written by vendor driver in
> > > > + *    migration region.
> > > > + * d. read data_size bytes of data from data_offset in the migration
> region.
> > > > + * e. process data.
> > > > + * f. Loop through a to e.
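
For concreteness, the sequence above amounts to a userspace loop roughly like
the following. This is only a sketch: it assumes the vfio_device_migration_info
layout proposed in this patch, that device_fd / region_off identify the
migration region (from VFIO_DEVICE_GET_REGION_INFO), that the data section is
read with pread() rather than mmap(), and that buf is large enough; error
handling is omitted.

#include <stddef.h>
#include <unistd.h>
#include <linux/types.h>
#include <linux/vfio.h>

#define MIG_FIELD(off, f) \
        ((off) + offsetof(struct vfio_device_migration_info, f))

static void save_device_data(int device_fd, off_t region_off, void *buf)
{
        __u64 pending, data_offset, data_size;

        for (;;) {
                /* a. read pending_bytes; stop once the driver has nothing left */
                pread(device_fd, &pending, sizeof(pending),
                      MIG_FIELD(region_off, pending_bytes));
                if (!pending)
                        break;
                /* b. read data_offset: asks the driver to stage the next chunk */
                pread(device_fd, &data_offset, sizeof(data_offset),
                      MIG_FIELD(region_off, data_offset));
                /* c. read data_size: how much the driver staged */
                pread(device_fd, &data_size, sizeof(data_size),
                      MIG_FIELD(region_off, data_size));
                /* d. read the opaque device data from the staging area */
                pread(device_fd, buf, data_size, region_off + data_offset);
                /* e. process/forward the data to the migration stream ... */
        }
}
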
> > >
> > > Something needs to be said here about the availability of the data, for
> > > example, what indicates to the vendor driver that the above operation is
> > > complete?  Is the data immutable?
> >
> > I guess the vendor driver just continues to track pending_bytes for dirtied
> > device state, until exiting _SAVING state. Data is copied only when
> > pending_bytes is read by userspace. Copied data is immutable if data
> region
> > is mmapped. But yes, this part needs clarification.
> >
> > >
> > > > + *
> > > > + * To copy system memory content during migration, vendor driver
> should
> > > be able
> > > > + * to report system memory pages which are dirtied by that driver. For
> such
> > > > + * dirty page reporting, user application should query for a range of
> GFNs
> > > > + * relative to device address space (IOVA), then vendor driver should
> > > provide
> > > > + * the bitmap of pages from this range which are dirtied by him through
> > > > + * migration region where each bit represents a page and bit set to 1
> > > represents
> > > > + * that the page is dirty.
> > > > + * User space application should take care of copying content of system
> > > memory
> > > > + * for those pages.
> > >
> > > Can we say that device state and dirty pfn operations on the data
> > > area may be intermixed in any order the user chooses?
> >
> > this part is very opaque from previous description. Now I'm inclined to
> > vote for no-intermix design. There is really no good to force some
> > dependency between retrieving device state and dirty page bitmap.
> 
> I'm confused, the state of the device continues to change so long as it
> is _RUNNING.  The device may also continue to dirty pages in the
> _RUNNING state.  So how could we not have both occurring at the same
> time and therefore must support the user deciding which to sample in
> each iteration?

I may not have made it clear. Actually I'm with you that device state and
dirty pages change in parallel in the _RUNNING state, and the two are
independent by definition. Then why design an interface, as this patch
suggests, where the two share the same region, forcing the application to
manage unnecessary contention between them? For example, if the application
creates two threads, one to collect dirty page bitmaps and the other to dump
device state, such a design requires the application to protect the data
region between the two threads - one thread has to wait until the other
drains the new state reported by the kernel driver, otherwise the data
region gets corrupted.

On the other hand, if we explicitly design two sub-regions, one for device
state and one for the dirty bitmap, and split the control interface, the two
can be queried simultaneously. As discussed earlier, the two operations
already have most of their control fields separate except data_offset. Just
move one step further and provide two data_offset fields (see the sketch
below); then the two are completely decoupled and the application is
lock-free to access either sub-region at any time.
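
A hypothetical shape of that split, purely to illustrate the suggestion
(the structure and field names here are invented, not a proposed uapi):

#include <linux/types.h>

struct vfio_device_migration_info_split_sketch {
        __u32 device_state;
        __u32 reserved;

        /* device-state sub-region: its own staging window */
        __u64 pending_bytes;
        __u64 state_data_offset;
        __u64 state_data_size;

        /* dirty-bitmap sub-region: its own staging window */
        __u64 start_pfn;
        __u64 page_size;
        __u64 total_pfns;
        __u64 copied_pfns;
        __u64 bitmap_data_offset;
} __attribute__((packed));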

> 
> > > Should we specify that bits accumulate since EITHER a) _SAVING state is
> > > enabled or b) a given pfn was last reported via the below sequence (ie.
> > > dirty bits are cleared once reported)?
> > >
> > > How does QEMU handle the fact that IOVAs are potentially dynamic while
> > > performing the live portion of a migration?  For example, each time a
> > > guest driver calls dma_map_page() or dma_unmap_page(), a
> > > MemoryRegionSection pops in or out of the AddressSpace for the device
> > > (I'm assuming a vIOMMU where the device AddressSpace is not
> > > system_memory).  I don't see any QEMU code that intercepts that change
> > > in the AddressSpace such that the IOVA dirty pfns could be recorded and
> > > translated to GFNs.  The vendor driver can't track these beyond getting
> > > an unmap notification since it only knows the IOVA pfns, which can be
> > > re-used with different GFN backing.  Once the DMA mapping is torn down,
> > > it seems those dirty pfns are lost in the ether.  If this works in QEMU,
> > > please help me find the code that handles it.
> >
> > I'm curious about this part too. Interestingly, I didn't find any log_sync
> > callback registered by emulated devices in Qemu. Looks dirty pages
> > by emulated DMAs are recorded in some implicit way. But KVM always
> > reports dirty page in GFN instead of IOVA, regardless of the presence of
> > vIOMMU. If Qemu also tracks dirty pages in GFN for emulated DMAs
> >  (translation can be done when DMA happens), then we don't need
> > worry about transient mapping from IOVA to GFN. Along this way we
> > also want GFN-based dirty bitmap being reported through VFIO,
> > similar to what KVM does. For vendor drivers, it needs to translate
> > from IOVA to HVA to GFN when tracking DMA activities on VFIO
> > devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can be
> > provided by KVM but I'm not sure whether it's exposed now.
> >
> > >
> > > > + *
> > > > + * Steps to get dirty page bitmap:
> > > > + * a. write start_pfn, page_size and total_pfns.
> > >
> > > This is not well specified.  Is it intended that the user write all
> > > three of these on every iteration, or could they write start_pfn=0,
> > > page_size=4K, total_pfns=1, complete the steps below, then write
> > > start_pfn=1 and immediately begin the next iteration?  They've written
> > > all three, though not all on the current iteration, does that count?
> > > Furthermore, could the user simple re-read copied_pfns to determine if
> > > anything in the previously setup range has been re-dirtied?
> > >
> > > IOW, are these three "registers" sticky or do the below operations
> > > invalidate them?  If they're invalidated, then there needs to be a
> > > mechanism to generate an error, such as below.
> > >
> > > > + * b. read copied_pfns. Vendor driver should take one of the below
> action:
> > > > + *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if
> > > driver
> > > > + *       doesn't have any page to report dirty in given range or rest of the
> > > > + *       range. Exit the loop.
> > > > + *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to
> mark
> > > all
> > > > + *       pages dirty for given range or rest of the range. User space
> > > > + *       application mark all pages in the range as dirty and exit the loop.
> > > > + *     - Vendor driver should return copied_pfns and provide bitmap for
> > > > + *       copied_pfn in migration region.
> > >
> > > Read returns errno if the pre-requisite registers are not valid?
> > >
> > > > + * c. read data_offset, where vendor driver has written bitmap.
> > > > + * d. read bitmap from the migration region from data_offset.
> > > > + * e. Iterate through steps a to d while (total copied_pfns < total_pfns)
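
Concretely, steps a-e above could look roughly like the loop below in user
space, under the same assumptions as the earlier save-loop sketch (proposed
header layout, pread()/pwrite() access, MIG_FIELD() as defined there);
mark_range_dirty() and read_and_mark_bitmap() stand in for the application's
own bookkeeping and are not part of the proposed API.

/* application bookkeeping helpers (hypothetical, not part of the uapi) */
extern void mark_range_dirty(__u64 start_pfn, __u64 npfns);
extern void read_and_mark_bitmap(int fd, off_t off, __u64 start_pfn, __u64 npfns);

static void sync_dirty_bitmap(int device_fd, off_t region_off,
                              __u64 start_pfn, __u64 total_pfns, __u64 page_size)
{
        __u64 copied, data_offset;

        while (total_pfns) {
                /* a. write start_pfn, page_size and total_pfns for this iteration */
                pwrite(device_fd, &start_pfn, sizeof(start_pfn),
                       MIG_FIELD(region_off, start_pfn));
                pwrite(device_fd, &page_size, sizeof(page_size),
                       MIG_FIELD(region_off, page_size));
                pwrite(device_fd, &total_pfns, sizeof(total_pfns),
                       MIG_FIELD(region_off, total_pfns));

                /* b. read copied_pfns and handle the two special values */
                pread(device_fd, &copied, sizeof(copied),
                      MIG_FIELD(region_off, copied_pfns));
                if (copied == VFIO_DEVICE_DIRTY_PFNS_NONE)
                        break;                          /* nothing dirty in the rest */
                if (copied == VFIO_DEVICE_DIRTY_PFNS_ALL) {
                        mark_range_dirty(start_pfn, total_pfns); /* all dirty */
                        break;
                }

                /* c+d. read the bitmap covering 'copied' pfns from data_offset */
                pread(device_fd, &data_offset, sizeof(data_offset),
                      MIG_FIELD(region_off, data_offset));
                read_and_mark_bitmap(device_fd, region_off + data_offset,
                                     start_pfn, copied);

                /* e. advance; whether this exact bookkeeping is mandated is one
                 *    of the opens raised above */
                start_pfn += copied;
                total_pfns -= copied;
        }
}
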
> > >
> > > It seems like the intent here is that the user effectively does:
> > >
> > > start_pfn += copied_pfns
> > > total_pfns -= copied_pfns
> > > page_size = page_size?
> > >
> > > But are they under any obligation to do so?
> > >
> > > Also same question above regarding data availability/life-cycle.  Is
> > > the vendor driver responsible for making the data available
> > > indefinitely?  Seems it's only released at the next iteration, or
> > > re-use of the data area for another operation, or clearing of the
> > > _SAVING state bit.
> > >
> > > > + * Sequence to be followed while _RESUMING device state:
> > > > + * While data for this device is available, repeat below steps:
> > > > + * a. read data_offset from where user application should write data.
> > > > + * b. write data of data_size to migration region from data_offset.
> > > > + * c. write data_size which indicates vendor driver that data is written
> in
> > > > + *    staging buffer.
> > > > + *
> > > > + * For user application, data is opaque. User should write data in the
> same
> > > > + * order as received.
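
The _RESUMING side above is the mirror image; a sketch of loading one chunk
received from the migration stream, under the same assumptions as the earlier
sketches (buf/len come from the incoming stream, error handling omitted):

static void load_device_chunk(int device_fd, off_t region_off,
                              const void *buf, __u64 len)
{
        __u64 data_offset;

        /* a. ask the driver where the next chunk should be placed */
        pread(device_fd, &data_offset, sizeof(data_offset),
              MIG_FIELD(region_off, data_offset));
        /* b. write the opaque device data into the staging area */
        pwrite(device_fd, buf, len, region_off + data_offset);
        /* c. writing data_size tells the driver the staged chunk is complete */
        pwrite(device_fd, &len, sizeof(len), MIG_FIELD(region_off, data_size));
}
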
> > >
> > > Additionally, implicit synchronization between _SAVING and _RESUMING
> > > ends within the vendor driver is assumed.
> > >
> > > Are there any assumptions we haven't covered with respect to mmaps?
> > > For instance, can the user setup mmaps at any time or only during
> > > certain device states?  Are there recommended best practices for users
> > > to only setup mmaps during _SAVING or _RESUMING?  If we had a revoke
> > > mechanism, it might be nice to use it when either of these bits are
> > > cleared.  Thanks,
> >
> > another open for mmaps is how many pages to map. copied_pfns
> > carries only the number of copied pages. They scatter in the bitmap
> > while application doesn't know how long the bitmap is. Do we
> > expect the application to map pages one-by-one, until scanned
> > dirty pages is equal to copied_pfns?
> 
> I think we expect userspace to mmap according to the sparse mmap
> capability of the region.  The vendor driver then chooses whether to
> expose the current data set within the mmap region, outside the mmap
> region, or some mix of both, depending on their requirements.  Thanks,

Thanks, it makes sense.
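
For reference, mapping according to the sparse mmap capability would look
something like the sketch below; it assumes the caller has already walked the
capability chain returned by VFIO_DEVICE_GET_REGION_INFO and located the
VFIO_REGION_INFO_CAP_SPARSE_MMAP entry, and error handling is omitted.

#include <sys/mman.h>
#include <linux/vfio.h>

static void mmap_sparse_areas(int device_fd, __u64 region_offset,
                              struct vfio_region_info_cap_sparse_mmap *sparse,
                              void **maps)
{
        __u32 i;

        for (i = 0; i < sparse->nr_areas; i++) {
                /* only the advertised areas are mmap'able; the rest is trapped */
                maps[i] = mmap(NULL, sparse->areas[i].size,
                               PROT_READ | PROT_WRITE, MAP_SHARED, device_fd,
                               region_offset + sparse->areas[i].offset);
        }
}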

> 
> Alex
> 
> > > > + */
> > > > +
> > > > +struct vfio_device_migration_info {
> > > > +        __u32 device_state;         /* VFIO device state */
> > > > +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> > > > +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> > > > +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> > > > +#define VFIO_DEVICE_STATE_MASK
> (VFIO_DEVICE_STATE_RUNNING | \
> > > > +                                     VFIO_DEVICE_STATE_SAVING | \
> > > > +                                     VFIO_DEVICE_STATE_RESUMING)
> > > > +#define VFIO_DEVICE_STATE_INVALID   (VFIO_DEVICE_STATE_SAVING
> | \
> > > > +                                     VFIO_DEVICE_STATE_RESUMING)
> > > > +        __u32 reserved;
> > > > +        __u64 pending_bytes;
> > > > +        __u64 data_offset;
> > > > +        __u64 data_size;
> > > > +        __u64 start_pfn;
> > > > +        __u64 page_size;
> > > > +        __u64 total_pfns;
> > > > +        __u64 copied_pfns;
> > > > +#define VFIO_DEVICE_DIRTY_PFNS_NONE     (0)
> > > > +#define VFIO_DEVICE_DIRTY_PFNS_ALL      (~0ULL)
> > > > +} __attribute__((packed));
> > > > +
> > > >  /*
> > > >   * The MSIX mappable capability informs that MSIX data of a BAR can
> be
> > > mmapped
> > > >   * which allows direct access to non-MSIX registers which happened to
> be
> > > within
> >



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v8 01/13] vfio: KABI for migration interface
  2019-08-30 16:32         ` Alex Williamson
@ 2019-09-03  6:57           ` Tian, Kevin
  2019-09-12 14:41             ` Alex Williamson
  0 siblings, 1 reply; 34+ messages in thread
From: Tian, Kevin @ 2019-09-03  6:57 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, qemu-devel, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	cohuck, shuangtai.tst, dgilbert, Wang,  Zhi A, mlevitsk, pasic,
	aik, Kirti Wankhede, eauger, felipe, jonathan.davies, Zhao,
	Yan Y, Liu, Changpeng, Ken.Xue

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Saturday, August 31, 2019 12:33 AM
> 
> On Fri, 30 Aug 2019 08:06:32 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Tian, Kevin
> > > Sent: Friday, August 30, 2019 3:26 PM
> > >
> > [...]
> > > > How does QEMU handle the fact that IOVAs are potentially dynamic
> while
> > > > performing the live portion of a migration?  For example, each time a
> > > > guest driver calls dma_map_page() or dma_unmap_page(), a
> > > > MemoryRegionSection pops in or out of the AddressSpace for the device
> > > > (I'm assuming a vIOMMU where the device AddressSpace is not
> > > > system_memory).  I don't see any QEMU code that intercepts that
> change
> > > > in the AddressSpace such that the IOVA dirty pfns could be recorded and
> > > > translated to GFNs.  The vendor driver can't track these beyond getting
> > > > an unmap notification since it only knows the IOVA pfns, which can be
> > > > re-used with different GFN backing.  Once the DMA mapping is torn
> down,
> > > > it seems those dirty pfns are lost in the ether.  If this works in QEMU,
> > > > please help me find the code that handles it.
> > >
> > > I'm curious about this part too. Interestingly, I didn't find any log_sync
> > > callback registered by emulated devices in Qemu. Looks dirty pages
> > > by emulated DMAs are recorded in some implicit way. But KVM always
> > > reports dirty page in GFN instead of IOVA, regardless of the presence of
> > > vIOMMU. If Qemu also tracks dirty pages in GFN for emulated DMAs
> > >  (translation can be done when DMA happens), then we don't need
> > > worry about transient mapping from IOVA to GFN. Along this way we
> > > also want GFN-based dirty bitmap being reported through VFIO,
> > > similar to what KVM does. For vendor drivers, it needs to translate
> > > from IOVA to HVA to GFN when tracking DMA activities on VFIO
> > > devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can be
> > > provided by KVM but I'm not sure whether it's exposed now.
> > >
> >
> > HVA->GFN can be done through hva_to_gfn_memslot in kvm_host.h.
> 
> I thought it was bad enough that we have vendor drivers that depend on
> KVM, but designing a vfio interface that only supports a KVM interface
> is more undesirable.  I also note without comment that gfn_to_memslot()
> is a GPL symbol.  Thanks,

Yes, it is bad, but sometimes inevitable. If you recall our discussions
from 3 years back (when discussing the 1st mdev framework), there were
similar hypervisor dependencies in GVT-g, e.g. querying gpa->hpa when
creating some shadow structures. gpa->hpa is definitely hypervisor-specific
knowledge, which is easy in KVM (gpa->hva->hpa) but needs a hypercall in
Xen. But VFIO already makes a KVM-only assumption when implementing
vfio_{un}pin_page_external. So GVT-g has to maintain an internal
abstraction layer to support both Xen and KVM. Maybe someday we will
re-consider introducing some hypervisor abstraction layer in VFIO, if this
issue starts to hurt other devices and the Xen guys are willing to support
VFIO.

Back to this IOVA issue, I discussed it with Yan and we found another
hypervisor-agnostic alternative, by learning from vhost. vhost is very
similar to VFIO - DMA also happens in the kernel - and it already
supports vIOMMU.

Generally speaking, there are three paths of dirty page collection
in Qemu so far (as previously noted, Qemu always tracks the dirty
bitmap in GFN):

1) Qemu-tracked memory writes (e.g. emulated DMAs). Dirty bitmaps
are updated directly when the guest memory is being updated. For
example, PCI writes are completed through pci_dma_write, which
goes through the vIOMMU to translate IOVA into GPA and then updates
the bitmap through cpu_physical_memory_set_dirty_range.

2) Memory writes that are not tracked by Qemu are collected by
registering a .log_sync() callback, which is invoked in the dirty logging
process. Now there are two users: kvm and vhost.

  2.1) KVM tracks CPU-side memory writes, through write-protection
or EPT A/D bits (+PML). This part is always based on GFN and returned
to Qemu when kvm_log_sync is invoked;

  2.2) vhost tracks kernel-side DMA writes, by interpreting the vring
data structures. It maintains an internal iotlb which is synced with
the Qemu vIOMMU through a specific interface:
	- new vhost message types (VHOST_IOTLB_UPDATE/INVALIDATE)
for Qemu to keep the vhost iotlb in sync
	- a new VHOST_IOTLB_MISS message to notify Qemu in case of
a miss in the vhost iotlb.
	- Qemu registers a log buffer with the kernel vhost driver. The latter
updates the buffer (using the internal iotlb to get the GFN) when serving
vring descriptors.

VFIO could also implement an internal iotlb, so vendor drivers can
use that iotlb to update the GFN-based dirty bitmap. Ideally we
don't need to re-invent another iotlb protocol as vhost did. The vIOMMU
already sends map/unmap ioctl cmds upon any change of IOVA
mapping. We may introduce a v2 map/unmap interface, allowing
Qemu to pass {iova, gpa, hva} together to keep the internal iotlb
in sync (see the sketch below). But we may also need an iotlb_miss_upcall
interface, if VFIO doesn't want to cache the full set of vIOMMU mappings.
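
To make the idea concrete, the v2 map request might carry something like the
fields below. The structure and names are invented purely for illustration;
nothing like this exists in the current uapi.

#include <linux/types.h>

struct vfio_iommu_type1_dma_map_v2_sketch {
        __u32 argsz;
        __u32 flags;
        __u64 vaddr;    /* HVA, as in VFIO_IOMMU_MAP_DMA today */
        __u64 iova;     /* IOVA programmed through the vIOMMU */
        __u64 gpa;      /* GPA backing this IOVA, enabling IOVA->GFN lookup */
        __u64 size;
};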

Definitely this alternative needs more work and is possibly less
performant (if maintaining a small iotlb) than calling straight into
the KVM interface. But the gain is also obvious, since it is fully
contained within VFIO.

Thoughts? :-)

Thanks
Kevin

> 
> Alex
> 
> > Above flow works for software-tracked dirty mechanism, e.g. in
> > KVMGT, where GFN-based 'dirty' is marked when a guest page is
> > mapped into device mmu. IOVA->HPA->GFN translation is done
> > at that time, thus immune from further IOVA->GFN changes.
> >
> > When hardware IOMMU supports D-bit in 2nd level translation (e.g.
> > VT-d rev3.0), there are two scenarios:
> >
> > 1) nested translation: guest manages 1st-level translation (IOVA->GPA)
> > and host manages 2nd-level translation (GPA->HPA). The 2nd-level
> > is not affected by guest mapping operations. So it's OK for IOMMU
> > driver to retrieve GFN-based dirty pages by directly scanning the 2nd-
> > level structure, upon request from user space.
> >
> > 2) shadowed translation (IOVA->HPA) in 2nd level: in such case the dirty
> > information is tied to IOVA. the IOMMU driver is expected to maintain
> > an internal dirty bitmap. Upon any change of IOVA->GPA notification
> > from VFIO, the IOMMU driver should flush dirty status of affected 2nd-level
> > entries to the internal GFN-based bitmap. At this time, again IOVA->HVA
> > ->GPA translation required for GFN-based recording. When userspace
> > queries dirty bitmap, the IOMMU driver needs to flush latest 2nd-level
> > dirty status to internal bitmap, which is then copied to user space.
> >
> > Given the trickiness of 2), we aim to enable 1) on intel-iommu driver.
> >
> > Thanks
> > Kevin



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v8 01/13] vfio: KABI for migration interface
  2019-09-03  6:05         ` Tian, Kevin
@ 2019-09-04  8:28           ` Yan Zhao
  0 siblings, 0 replies; 34+ messages in thread
From: Yan Zhao @ 2019-09-04  8:28 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Zhengxiao.zx, qemu-devel, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	Ken.Xue, cohuck, shuangtai.tst, Kirti Wankhede, Wang, Zhi A,
	mlevitsk, pasic, aik, Alex Williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, dgilbert

On Tue, Sep 03, 2019 at 02:05:46PM +0800, Tian, Kevin wrote:
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Saturday, August 31, 2019 12:15 AM
> > 
> > On Fri, 30 Aug 2019 07:25:59 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > 
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Thursday, August 29, 2019 4:51 AM
> > > >
> > > > On Tue, 27 Aug 2019 00:25:41 +0530
> > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > >
> > > > > - Defined MIGRATION region type and sub-type.
> > > > > - Used 3 bits to define VFIO device states.
> > > > >     Bit 0 => _RUNNING
> > > > >     Bit 1 => _SAVING
> > > > >     Bit 2 => _RESUMING
> > > > >     Combination of these bits defines VFIO device's state during migration
> > > > >     _STOPPED => All bits 0 indicates VFIO device stopped.
> > > > >     _RUNNING => Normal VFIO device running state.
> > > > >     _SAVING | _RUNNING => vCPUs are running, VFIO device is running
> > but
> > > > start
> > > > >                           saving state of device i.e. pre-copy state
> > > > >     _SAVING  => vCPUs are stoppped, VFIO device should be stopped, and
> > > > >                           save device state,i.e. stop-n-copy state
> > > > >     _RESUMING => VFIO device resuming state.
> > > > >     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING
> > bits
> > > > are set
> > > > >     Bits 3 - 31 are reserved for future use. User should perform
> > > > >     read-modify-write operation on this field.
> > > > > - Defined vfio_device_migration_info structure which will be placed at
> > 0th
> > > > >   offset of migration region to get/set VFIO device related information.
> > > > >   Defined members of structure and usage on read/write access:
> > > > >     * device_state: (read/write)
> > > > >         To convey VFIO device state to be transitioned to. Only 3 bits are
> > used
> > > > >         as of now, Bits 3 - 31 are reserved for future use.
> > > > >     * pending bytes: (read only)
> > > > >         To get pending bytes yet to be migrated for VFIO device.
> > > > >     * data_offset: (read only)
> > > > >         To get data offset in migration region from where data exist during
> > > > >         _SAVING, from where data should be written by user space
> > application
> > > > >         during _RESUMING state and while read dirty pages bitmap.
> > > > >     * data_size: (read/write)
> > > > >         To get and set size of data copied in migration region during
> > _SAVING
> > > > >         and _RESUMING state.
> > > > >     * start_pfn, page_size, total_pfns: (write only)
> > > > >         To get bitmap of dirty pages from vendor driver from given
> > > > >         start address for total_pfns.
> > > > >     * copied_pfns: (read only)
> > > > >         To get number of pfns bitmap copied in migration region.
> > > > >         Vendor driver should copy the bitmap with bits set only for
> > > > >         pages to be marked dirty in migration region. Vendor driver
> > > > >         should return VFIO_DEVICE_DIRTY_PFNS_NONE if there are 0 pages
> > > > dirty in
> > > > >         requested range. Vendor driver should return
> > > > VFIO_DEVICE_DIRTY_PFNS_ALL
> > > > >         to mark all pages in the section as dirty.
> > > > >
> > > > > Migration region looks like:
> > > > >  ------------------------------------------------------------------
> > > > > |vfio_device_migration_info|    data section                      |
> > > > > |                          |     ///////////////////////////////  |
> > > > >  ------------------------------------------------------------------
> > > > >  ^                              ^                              ^
> > > > >  offset 0-trapped part        data_offset                 data_size
> > > > >
> > > > > Data section is always followed by vfio_device_migration_info
> > > > > structure in the region, so data_offset will always be non-0.
> > > > > Offset from where data is copied is decided by kernel driver, data
> > > > > section can be trapped or mapped depending on how kernel driver
> > > > > defines data section. If mmapped, then data_offset should be page
> > > > > aligned, where as initial section which contain
> > vfio_device_migration_info
> > > > > structure might not end at offset which is page aligned.
> > > > > Data_offset can be same or different for device data and dirty pages
> > bitmap.
> > > > > Vendor driver should decide whether to partition data section and how
> > to
> > > > > partition the data section. Vendor driver should return data_offset
> > > > > accordingly.
> > > > >
> > > > > For user application, data is opaque. User should write data in the same
> > > > > order as received.
> > > > >
> > > > > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > > > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > > > > ---
> > > > >  linux-headers/linux/vfio.h | 148
> > > > +++++++++++++++++++++++++++++++++++++++++++++
> > > > >  1 file changed, 148 insertions(+)
> > > > >
> > > > > diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> > > > > index 24f505199f83..4bc0236b0898 100644
> > > > > --- a/linux-headers/linux/vfio.h
> > > > > +++ b/linux-headers/linux/vfio.h
> > > > > @@ -372,6 +372,154 @@ struct vfio_region_gfx_edid {
> > > > >   */
> > > > >  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
> > > > >
> > > > > +/* Migration region type and sub-type */
> > > > > +#define VFIO_REGION_TYPE_MIGRATION	        (3)
> > > > > +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> > > > > +
> > > > > +/**
> > > > > + * Structure vfio_device_migration_info is placed at 0th offset of
> > > > > + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device
> > > > related migration
> > > > > + * information. Field accesses from this structure are only supported at
> > > > their
> > > > > + * native width and alignment, otherwise the result is undefined and
> > > > vendor
> > > > > + * drivers should return an error.
> > > > > + *
> > > > > + * device_state: (read/write)
> > > > > + *      To indicate vendor driver the state VFIO device should be
> > > > transitioned
> > > > > + *      to. If device state transition fails, write on this field return error.
> > > > > + *      It consists of 3 bits:
> > > > > + *      - If bit 0 set, indicates _RUNNING state. When its reset, that
> > indicates
> > > > > + *        _STOPPED state. When device is changed to _STOPPED, driver
> > should
> > > > stop
> > > > > + *        device before write() returns.
> > > > > + *      - If bit 1 set, indicates _SAVING state.
> > > > > + *      - If bit 2 set, indicates _RESUMING state.
> > >
> > > please add a few words to explain _SAVING and _RESUMING, similar to
> > > what you did for _RUNNING.
> > >
> > > > > + *      Bits 3 - 31 are reserved for future use. User should perform
> > > > > + *      read-modify-write operation on this field.
> > > > > + *      _SAVING and _RESUMING bits set at the same time is invalid state.
> > >
> > > what about _RUNNING | _RESUMING? Is it allowed?
> > 
> > I think this would be post-copy migration, which I assume is
> > theoretically supportable, but not necessarily (and not currently)
> > supported by vendor drivers.  I'm not sure how a vendor driver reports
> > the lack of this support though.
> 
> Yan is working on post-copy now. I talked to her about this open.
> She will respond later after some thinking. Ideally we need a way
> that user space can use to verify which combinations are available 
> before starting the post-copy process. It's not good to blindly start
> post-copy and then fail due to failure to switch to desired state on
> the dest machine.
>
hi,
May I know what the vendor driver is supposed to do in the _RESUMING state currently?
Could I understand it as: loading device data is only permitted in the _RESUMING state,
and rejected otherwise?
If this understanding is right, is _LOADING a better name for it?
Then _SAVING and _LOADING are obviously mutually exclusive,
and we can clear the _LOADING state in .load_cleanup in qemu.
Then, the state _LOADING | _RUNNING would only appear when the VM has started up and
migration is still not completed.

And for postcopy, I guess an extra POSTCOPY state needs to be introduced anyway,
because in the POSTCOPY state the vendor driver on the target side needs to listen
for faulted pages and request them from the source VM. It has to be set when
migration is in the postcopy stage (before the VM is running and after .load_setup) and
it is not necessarily equal to _LOADING or _LOADING|_RUNNING.

Thanks
Yan



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v8 01/13] vfio: KABI for migration interface
  2019-09-03  6:57           ` Tian, Kevin
@ 2019-09-12 14:41             ` Alex Williamson
  2019-09-12 23:00               ` Tian, Kevin
       [not found]               ` <AADFC41AFE54684AB9EE6CBC0274A5D19D572142@SHSMSX104.ccr.corp.intel.com>
  0 siblings, 2 replies; 34+ messages in thread
From: Alex Williamson @ 2019-09-12 14:41 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Zhengxiao.zx, qemu-devel, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	cohuck, shuangtai.tst, dgilbert, Wang, Zhi A, mlevitsk, pasic,
	aik, Kirti Wankhede, eauger, felipe, jonathan.davies, Zhao,
	Yan Y, Liu, Changpeng, Ken.Xue

On Tue, 3 Sep 2019 06:57:27 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Saturday, August 31, 2019 12:33 AM
> > 
> > On Fri, 30 Aug 2019 08:06:32 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Tian, Kevin
> > > > Sent: Friday, August 30, 2019 3:26 PM
> > > >  
> > > [...]  
> > > > > How does QEMU handle the fact that IOVAs are potentially dynamic  
> > while  
> > > > > performing the live portion of a migration?  For example, each time a
> > > > > guest driver calls dma_map_page() or dma_unmap_page(), a
> > > > > MemoryRegionSection pops in or out of the AddressSpace for the device
> > > > > (I'm assuming a vIOMMU where the device AddressSpace is not
> > > > > system_memory).  I don't see any QEMU code that intercepts that  
> > change  
> > > > > in the AddressSpace such that the IOVA dirty pfns could be recorded and
> > > > > translated to GFNs.  The vendor driver can't track these beyond getting
> > > > > an unmap notification since it only knows the IOVA pfns, which can be
> > > > > re-used with different GFN backing.  Once the DMA mapping is torn  
> > down,  
> > > > > it seems those dirty pfns are lost in the ether.  If this works in QEMU,
> > > > > please help me find the code that handles it.  
> > > >
> > > > I'm curious about this part too. Interestingly, I didn't find any log_sync
> > > > callback registered by emulated devices in Qemu. Looks dirty pages
> > > > by emulated DMAs are recorded in some implicit way. But KVM always
> > > > reports dirty page in GFN instead of IOVA, regardless of the presence of
> > > > vIOMMU. If Qemu also tracks dirty pages in GFN for emulated DMAs
> > > >  (translation can be done when DMA happens), then we don't need
> > > > worry about transient mapping from IOVA to GFN. Along this way we
> > > > also want GFN-based dirty bitmap being reported through VFIO,
> > > > similar to what KVM does. For vendor drivers, it needs to translate
> > > > from IOVA to HVA to GFN when tracking DMA activities on VFIO
> > > > devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can be
> > > > provided by KVM but I'm not sure whether it's exposed now.
> > > >  
> > >
> > > HVA->GFN can be done through hva_to_gfn_memslot in kvm_host.h.  
> > 
> > I thought it was bad enough that we have vendor drivers that depend on
> > KVM, but designing a vfio interface that only supports a KVM interface
> > is more undesirable.  I also note without comment that gfn_to_memslot()
> > is a GPL symbol.  Thanks,  
> 
> yes it is bad, but sometimes inevitable. If you recall our discussions
> back to 3yrs (when discussing the 1st mdev framework), there were similar
> hypervisor dependencies in GVT-g, e.g. querying gpa->hpa when
> creating some shadow structures. gpa->hpa is definitely hypervisor
> specific knowledge, which is easy in KVM (gpa->hva->hpa), but needs
> hypercall in Xen. but VFIO already makes assumption based on KVM-
> only flavor when implementing vfio_{un}pin_page_external.

Where's the KVM assumption there?  The MAP_DMA ioctl takes an IOVA and
HVA.  When an mdev vendor driver calls vfio_pin_pages(), we GUP the HVA
to get an HPA and provide an array of HPA pfns back to the caller.  The
other vGPU mdev vendor manages to make use of this without KVM... the
KVM interface used by GVT-g is GPL-only.
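
For readers following along, a minimal sketch of that call path from the mdev
vendor driver side, using the vfio_pin_pages()/vfio_unpin_pages() exports of
this era; the mdev device pointer and pfn value here are placeholders:

#include <linux/errno.h>
#include <linux/iommu.h>
#include <linux/vfio.h>

/* Pin one user (IOVA) pfn and get back the host physical pfn. */
static int example_pin_one(struct device *mdev_dev, unsigned long user_pfn)
{
        unsigned long hpa_pfn;
        int ret;

        /* vfio GUPs the HVA backing user_pfn and returns the host pfn. */
        ret = vfio_pin_pages(mdev_dev, &user_pfn, 1,
                             IOMMU_READ | IOMMU_WRITE, &hpa_pfn);
        if (ret != 1)
                return ret < 0 ? ret : -EFAULT;

        /* ... program the device with hpa_pfn ... */

        vfio_unpin_pages(mdev_dev, &user_pfn, 1);
        return 0;
}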

> So GVT-g
> has to maintain an internal abstraction layer to support both Xen and
> KVM. Maybe someday we will re-consider introducing some hypervisor
> abstraction layer in VFIO, if this issue starts to hurt other devices and
> Xen guys are willing to support VFIO.

Once upon a time, we had a KVM specific device assignment interface,
i.e. legacy KVM device assignment.  We developed VFIO specifically to get
KVM out of the business of being a (bad) device driver.  We do have
some awareness and interaction between VFIO and KVM in the vfio-kvm
pseudo device, but we still try to keep those interfaces generic.  In
some cases we're not very successful at that, see vfio_group_set_kvm(),
but that's largely just a mechanism to associate a cookie with a group
to be consumed by the mdev vendor driver such that it can work with kvm
external to vfio.  I don't intend to add further hypervisor awareness
to vfio.

> Back to this IOVA issue, I discussed with Yan and we found another 
> hypervisor-agnostic alternative, by learning from vhost. vhost is very
> similar to VFIO - DMA also happens in the kernel, while it already 
> supports vIOMMU.
> 
> Generally speaking, there are three paths of dirty page collection
> in Qemu so far (as previously noted, Qemu always tracks the dirty
> bitmap in GFN):

GFNs or simply PFNs within an AddressSpace?
 
> 1) Qemu-tracked memory writes (e.g. emulated DMAs). Dirty bitmaps 
> are updated directly when the guest memory is being updated. For 
> example, PCI writes are completed through pci_dma_write, which 
> goes through vIOMMU to translate IOVA into GPA and then update 
> the bitmap through cpu_physical_memory_set_dirty_range.

Right, so the IOVA to GPA (GFN) occurs through an explicit translation
on the IOMMU AddressSpace.
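
(A hypothetical helper, just to make that explicit translation concrete; this
is not existing QEMU code, and error handling is simplified:)

/* Resolve one IOVA to a GPA through the vIOMMU's IOMMUMemoryRegion. */
static bool iova_to_gpa(IOMMUMemoryRegion *iommu_mr, hwaddr iova, hwaddr *gpa)
{
    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr);
    IOMMUTLBEntry entry = imrc->translate(iommu_mr, iova, IOMMU_NONE, 0);

    if (entry.perm == IOMMU_NONE) {
        return false;   /* IOVA not currently mapped by the vIOMMU */
    }
    *gpa = (entry.translated_addr & ~entry.addr_mask) |
           (iova & entry.addr_mask);
    return true;
}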
 
> 2) Memory writes that are not tracked by Qemu are collected by
> registering .log_sync() callback, which is invoked in the dirty logging
> process. Now there are two users: kvm and vhost.
> 
>   2.1) KVM tracks CPU-side memory writes, through write-protection
> or EPT A/D bits (+PML). This part is always based on GFN and returned
> to Qemu when kvm_log_sync is invoked;
> 
>   2.2) vhost tracks kernel-side DMA writes, by interpreting vring
> data structure. It maintains an internal iotlb which is synced with
> Qemu vIOMMU through a specific interface:
> 	- new vhost message type (VHOST_IOTLB_UPDATE/INVALIDATE)
> for Qemu to keep vhost iotlb in sync
> 	- new VHOST_IOTLB_MISS message to notify Qemu in case of
> a miss in vhost iotlb.
> 	- Qemu registers a log buffer to kernel vhost driver. The latter
> update the buffer (using internal iotlb to get GFN) when serving vring
> descriptor.
> 
> VFIO could also implement an internal iotlb, so vendor drivers can
> utilize the iotlb to update the GFN-based dirty bitmap. Ideally we
> don't need re-invent another iotlb protocol as vhost does. vIOMMU
> already sends map/unmap ioctl cmds upon any change of IOVA
> mapping. We may introduce a v2 map/unmap interface, allowing
> Qemu to pass both {iova, gpa, hva} together to keep internal iotlb
> in-sync. But we may also need a iotlb_miss_upcall interface, if VFIO
> doesn't want to cache full-size vIOMMU mappings. 
> 
> Definitely this alternative needs more work and possibly less 
> performant (if maintaining a small size iotlb) than straightforward
> calling into KVM interface. But the gain is also obvious, since it
> is fully constrained with VFIO.
> 
> Thoughts? :-)

So vhost must then be configuring a listener across system memory
rather than only against the device AddressSpace like we do in vfio,
such that it gets log_sync() callbacks for the actual GPA space rather
than only the IOVA space.  OTOH, QEMU could understand that the device
AddressSpace has a translate function and apply the IOVA dirty bits to
the system memory AddressSpace.  Wouldn't it make more sense for QEMU
to perform a log_sync() prior to removing a MemoryRegionSection within
an AddressSpace and update the GPA rather than pushing GPA awareness
and potentially large tracking structures into the host kernel?  Thanks,

Alex
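
To make the suggestion above concrete, here is a hypothetical sketch of what
that could look like in QEMU's vfio listener; vfio_sync_dirty_bitmap() stands
in for whatever dirty-log query the migration series ends up providing, and
is an assumption, not existing code:

static void vfio_listener_region_del(MemoryListener *listener,
                                     MemoryRegionSection *section)
{
    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
    hwaddr iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
    hwaddr size = int128_get64(section->size);

    if (memory_region_is_iommu(section->mr)) {
        /*
         * Pull the dirty IOVA pfns for this range from the device while the
         * IOVA -> GPA translation is still valid, and mark the corresponding
         * GPAs dirty in the system memory AddressSpace (assumed helper).
         */
        vfio_sync_dirty_bitmap(container, iova, size);
    }

    /* ... existing unmap/teardown of the section continues here ... */
}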


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v8 01/13] vfio: KABI for migration interface
  2019-09-12 14:41             ` Alex Williamson
@ 2019-09-12 23:00               ` Tian, Kevin
  2019-09-13 15:47                 ` Alex Williamson
       [not found]               ` <AADFC41AFE54684AB9EE6CBC0274A5D19D572142@SHSMSX104.ccr.corp.intel.com>
  1 sibling, 1 reply; 34+ messages in thread
From: Tian, Kevin @ 2019-09-12 23:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, qemu-devel, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	cohuck, shuangtai.tst, dgilbert, Wang,  Zhi A, mlevitsk, pasic,
	aik, Kirti Wankhede, eauger, felipe, jonathan.davies, Zhao,
	Yan Y, Liu, Changpeng, Ken.Xue

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Thursday, September 12, 2019 10:41 PM
> 
> On Tue, 3 Sep 2019 06:57:27 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Saturday, August 31, 2019 12:33 AM
> > >
> > > On Fri, 30 Aug 2019 08:06:32 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >
> > > > > From: Tian, Kevin
> > > > > Sent: Friday, August 30, 2019 3:26 PM
> > > > >
> > > > [...]
> > > > > > How does QEMU handle the fact that IOVAs are potentially
> dynamic
> > > while
> > > > > > performing the live portion of a migration?  For example, each
> time a
> > > > > > guest driver calls dma_map_page() or dma_unmap_page(), a
> > > > > > MemoryRegionSection pops in or out of the AddressSpace for the
> device
> > > > > > (I'm assuming a vIOMMU where the device AddressSpace is not
> > > > > > system_memory).  I don't see any QEMU code that intercepts that
> > > change
> > > > > > in the AddressSpace such that the IOVA dirty pfns could be
> recorded and
> > > > > > translated to GFNs.  The vendor driver can't track these beyond
> getting
> > > > > > an unmap notification since it only knows the IOVA pfns, which
> can be
> > > > > > re-used with different GFN backing.  Once the DMA mapping is
> torn
> > > down,
> > > > > > it seems those dirty pfns are lost in the ether.  If this works in
> QEMU,
> > > > > > please help me find the code that handles it.
> > > > >
> > > > > I'm curious about this part too. Interestingly, I didn't find any
> log_sync
> > > > > callback registered by emulated devices in Qemu. Looks dirty pages
> > > > > by emulated DMAs are recorded in some implicit way. But KVM
> always
> > > > > reports dirty page in GFN instead of IOVA, regardless of the
> presence of
> > > > > vIOMMU. If Qemu also tracks dirty pages in GFN for emulated DMAs
> > > > >  (translation can be done when DMA happens), then we don't need
> > > > > worry about transient mapping from IOVA to GFN. Along this way
> we
> > > > > also want GFN-based dirty bitmap being reported through VFIO,
> > > > > similar to what KVM does. For vendor drivers, it needs to translate
> > > > > from IOVA to HVA to GFN when tracking DMA activities on VFIO
> > > > > devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can be
> > > > > provided by KVM but I'm not sure whether it's exposed now.
> > > > >
> > > >
> > > > HVA->GFN can be done through hva_to_gfn_memslot in kvm_host.h.
> > >
> > > I thought it was bad enough that we have vendor drivers that depend
> on
> > > KVM, but designing a vfio interface that only supports a KVM interface
> > > is more undesirable.  I also note without comment that
> gfn_to_memslot()
> > > is a GPL symbol.  Thanks,
> >
> > yes it is bad, but sometimes inevitable. If you recall our discussions
> > back to 3yrs (when discussing the 1st mdev framework), there were
> similar
> > hypervisor dependencies in GVT-g, e.g. querying gpa->hpa when
> > creating some shadow structures. gpa->hpa is definitely hypervisor
> > specific knowledge, which is easy in KVM (gpa->hva->hpa), but needs
> > hypercall in Xen. but VFIO already makes assumption based on KVM-
> > only flavor when implementing vfio_{un}pin_page_external.
> 
> Where's the KVM assumption there?  The MAP_DMA ioctl takes an IOVA
> and
> HVA.  When an mdev vendor driver calls vfio_pin_pages(), we GUP the HVA
> to get an HPA and provide an array of HPA pfns back to the caller.  The
> other vGPU mdev vendor manages to make use of this without KVM... the
> KVM interface used by GVT-g is GPL-only.

To be clear, it's an assumption about host-based hypervisors, e.g. KVM.
GUP is a perfect example: it doesn't work for Xen since DomU's
memory doesn't belong to Dom0, so VFIO in Dom0 has to find the HPA
through Xen-specific hypercalls.

> 
> > So GVT-g
> > has to maintain an internal abstraction layer to support both Xen and
> > KVM. Maybe someday we will re-consider introducing some hypervisor
> > abstraction layer in VFIO, if this issue starts to hurt other devices and
> > Xen guys are willing to support VFIO.
> 
> Once upon a time, we had a KVM specific device assignment interface,
> ie. legacy KVM devie assignment.  We developed VFIO specifically to get
> KVM out of the business of being a (bad) device driver.  We do have
> some awareness and interaction between VFIO and KVM in the vfio-kvm
> pseudo device, but we still try to keep those interfaces generic.  In
> some cases we're not very successful at that, see vfio_group_set_kvm(),
> but that's largely just a mechanism to associate a cookie with a group
> to be consumed by the mdev vendor driver such that it can work with kvm
> external to vfio.  I don't intend to add further hypervisor awareness
> to vfio.
> 
> > Back to this IOVA issue, I discussed with Yan and we found another
> > hypervisor-agnostic alternative, by learning from vhost. vhost is very
> > similar to VFIO - DMA also happens in the kernel, while it already
> > supports vIOMMU.
> >
> > Generally speaking, there are three paths of dirty page collection
> > in Qemu so far (as previously noted, Qemu always tracks the dirty
> > bitmap in GFN):
> 
> GFNs or simply PFNs within an AddressSpace?
> 
> > 1) Qemu-tracked memory writes (e.g. emulated DMAs). Dirty bitmaps
> > are updated directly when the guest memory is being updated. For
> > example, PCI writes are completed through pci_dma_write, which
> > goes through vIOMMU to translate IOVA into GPA and then update
> > the bitmap through cpu_physical_memory_set_dirty_range.
> 
> Right, so the IOVA to GPA (GFN) occurs through an explicit translation
> on the IOMMU AddressSpace.
> 
> > 2) Memory writes that are not tracked by Qemu are collected by
> > registering .log_sync() callback, which is invoked in the dirty logging
> > process. Now there are two users: kvm and vhost.
> >
> >   2.1) KVM tracks CPU-side memory writes, through write-protection
> > or EPT A/D bits (+PML). This part is always based on GFN and returned
> > to Qemu when kvm_log_sync is invoked;
> >
> >   2.2) vhost tracks kernel-side DMA writes, by interpreting vring
> > data structure. It maintains an internal iotlb which is synced with
> > Qemu vIOMMU through a specific interface:
> > 	- new vhost message type (VHOST_IOTLB_UPDATE/INVALIDATE)
> > for Qemu to keep vhost iotlb in sync
> > 	- new VHOST_IOTLB_MISS message to notify Qemu in case of
> > a miss in vhost iotlb.
> > 	- Qemu registers a log buffer to kernel vhost driver. The latter
> > update the buffer (using internal iotlb to get GFN) when serving vring
> > descriptor.
> >
> > VFIO could also implement an internal iotlb, so vendor drivers can
> > utilize the iotlb to update the GFN-based dirty bitmap. Ideally we
> > don't need re-invent another iotlb protocol as vhost does. vIOMMU
> > already sends map/unmap ioctl cmds upon any change of IOVA
> > mapping. We may introduce a v2 map/unmap interface, allowing
> > Qemu to pass both {iova, gpa, hva} together to keep internal iotlb
> > in-sync. But we may also need a iotlb_miss_upcall interface, if VFIO
> > doesn't want to cache full-size vIOMMU mappings.
> >
> > Definitely this alternative needs more work and possibly less
> > performant (if maintaining a small size iotlb) than straightforward
> > calling into KVM interface. But the gain is also obvious, since it
> > is fully constrained with VFIO.
> >
> > Thoughts? :-)
> 
> So vhost must then be configuring a listener across system memory
> rather than only against the device AddressSpace like we do in vfio,
> such that it get's log_sync() callbacks for the actual GPA space rather
> than only the IOVA space.  OTOH, QEMU could understand that the device
> AddressSpace has a translate function and apply the IOVA dirty bits to
> the system memory AddressSpace.  Wouldn't it make more sense for
> QEMU
> to perform a log_sync() prior to removing a MemoryRegionSection within
> an AddressSpace and update the GPA rather than pushing GPA awareness
> and potentially large tracking structures into the host kernel?  Thanks,
> 

It is an interesting idea.  One drawback is that log_sync might be
invoked frequently in the IOVA case, but I guess the overhead is small
compared to the total overhead of emulating the IOTLB invalidation.
Maybe other folks can better comment on why this model was not
considered before, e.g. when the vhost iotlb was introduced.

Thanks
Kevin


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v8 01/13] vfio: KABI for migration interface
  2019-09-12 23:00               ` Tian, Kevin
@ 2019-09-13 15:47                 ` Alex Williamson
  2019-09-16  1:53                   ` Tian, Kevin
  0 siblings, 1 reply; 34+ messages in thread
From: Alex Williamson @ 2019-09-13 15:47 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Zhengxiao.zx, qemu-devel, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	cohuck, shuangtai.tst, dgilbert, Wang, Zhi A, mlevitsk, pasic,
	aik, Kirti Wankhede, eauger, felipe, jonathan.davies, Zhao,
	Yan Y, Liu, Changpeng, Ken.Xue

On Thu, 12 Sep 2019 23:00:03 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Thursday, September 12, 2019 10:41 PM
> > 
> > On Tue, 3 Sep 2019 06:57:27 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Saturday, August 31, 2019 12:33 AM
> > > >
> > > > On Fri, 30 Aug 2019 08:06:32 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > >  
> > > > > > From: Tian, Kevin
> > > > > > Sent: Friday, August 30, 2019 3:26 PM
> > > > > >  
> > > > > [...]  
> > > > > > > How does QEMU handle the fact that IOVAs are potentially  
> > dynamic  
> > > > while  
> > > > > > > performing the live portion of a migration?  For example, each  
> > time a  
> > > > > > > guest driver calls dma_map_page() or dma_unmap_page(), a
> > > > > > > MemoryRegionSection pops in or out of the AddressSpace for the  
> > device  
> > > > > > > (I'm assuming a vIOMMU where the device AddressSpace is not
> > > > > > > system_memory).  I don't see any QEMU code that intercepts that  
> > > > change  
> > > > > > > in the AddressSpace such that the IOVA dirty pfns could be  
> > recorded and  
> > > > > > > translated to GFNs.  The vendor driver can't track these beyond  
> > getting  
> > > > > > > an unmap notification since it only knows the IOVA pfns, which  
> > can be  
> > > > > > > re-used with different GFN backing.  Once the DMA mapping is  
> > torn  
> > > > down,  
> > > > > > > it seems those dirty pfns are lost in the ether.  If this works in  
> > QEMU,  
> > > > > > > please help me find the code that handles it.  
> > > > > >
> > > > > > I'm curious about this part too. Interestingly, I didn't find any  
> > log_sync  
> > > > > > callback registered by emulated devices in Qemu. Looks dirty pages
> > > > > > by emulated DMAs are recorded in some implicit way. But KVM  
> > always  
> > > > > > reports dirty page in GFN instead of IOVA, regardless of the  
> > presence of  
> > > > > > vIOMMU. If Qemu also tracks dirty pages in GFN for emulated DMAs
> > > > > >  (translation can be done when DMA happens), then we don't need
> > > > > > worry about transient mapping from IOVA to GFN. Along this way  
> > we  
> > > > > > also want GFN-based dirty bitmap being reported through VFIO,
> > > > > > similar to what KVM does. For vendor drivers, it needs to translate
> > > > > > from IOVA to HVA to GFN when tracking DMA activities on VFIO
> > > > > > devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can be
> > > > > > provided by KVM but I'm not sure whether it's exposed now.
> > > > > >  
> > > > >
> > > > > HVA->GFN can be done through hva_to_gfn_memslot in kvm_host.h.  
> > > >
> > > > I thought it was bad enough that we have vendor drivers that depend  
> > on  
> > > > KVM, but designing a vfio interface that only supports a KVM interface
> > > > is more undesirable.  I also note without comment that  
> > gfn_to_memslot()  
> > > > is a GPL symbol.  Thanks,  
> > >
> > > yes it is bad, but sometimes inevitable. If you recall our discussions
> > > back to 3yrs (when discussing the 1st mdev framework), there were  
> > similar  
> > > hypervisor dependencies in GVT-g, e.g. querying gpa->hpa when
> > > creating some shadow structures. gpa->hpa is definitely hypervisor
> > > specific knowledge, which is easy in KVM (gpa->hva->hpa), but needs
> > > hypercall in Xen. but VFIO already makes assumption based on KVM-
> > > only flavor when implementing vfio_{un}pin_page_external.  
> > 
> > Where's the KVM assumption there?  The MAP_DMA ioctl takes an IOVA
> > and
> > HVA.  When an mdev vendor driver calls vfio_pin_pages(), we GUP the HVA
> > to get an HPA and provide an array of HPA pfns back to the caller.  The
> > other vGPU mdev vendor manages to make use of this without KVM... the
> > KVM interface used by GVT-g is GPL-only.  
> 
> To be clear it's the assumption on the host-based hypervisors e.g. KVM.
> GUP is a perfect example, which doesn't work for Xen since DomU's
> memory doesn't belong to Dom0. VFIO in Dom0 has to find the HPA
> through Xen specific hypercalls.

VFIO does not assume a hypervisor at all.  Yes, it happens to work well
with a host-based hypervisor like KVM where we can simply use GUP, but
I'd hardly call using the standard mechanism to pin a user page and get
the pfn within the Linux kernel a KVM assumption.  The fact that Dom0
Xen requires work here while KVM does not is not equivalent to
VFIO assuming KVM.  Thanks,

Alex
 
> > > So GVT-g
> > > has to maintain an internal abstraction layer to support both Xen and
> > > KVM. Maybe someday we will re-consider introducing some hypervisor
> > > abstraction layer in VFIO, if this issue starts to hurt other devices and
> > > Xen guys are willing to support VFIO.  
> > 
> > Once upon a time, we had a KVM specific device assignment interface,
> > ie. legacy KVM devie assignment.  We developed VFIO specifically to get
> > KVM out of the business of being a (bad) device driver.  We do have
> > some awareness and interaction between VFIO and KVM in the vfio-kvm
> > pseudo device, but we still try to keep those interfaces generic.  In
> > some cases we're not very successful at that, see vfio_group_set_kvm(),
> > but that's largely just a mechanism to associate a cookie with a group
> > to be consumed by the mdev vendor driver such that it can work with kvm
> > external to vfio.  I don't intend to add further hypervisor awareness
> > to vfio.
> >   
> > > Back to this IOVA issue, I discussed with Yan and we found another
> > > hypervisor-agnostic alternative, by learning from vhost. vhost is very
> > > similar to VFIO - DMA also happens in the kernel, while it already
> > > supports vIOMMU.
> > >
> > > Generally speaking, there are three paths of dirty page collection
> > > in Qemu so far (as previously noted, Qemu always tracks the dirty
> > > bitmap in GFN):  
> > 
> > GFNs or simply PFNs within an AddressSpace?
> >   
> > > 1) Qemu-tracked memory writes (e.g. emulated DMAs). Dirty bitmaps
> > > are updated directly when the guest memory is being updated. For
> > > example, PCI writes are completed through pci_dma_write, which
> > > goes through vIOMMU to translate IOVA into GPA and then update
> > > the bitmap through cpu_physical_memory_set_dirty_range.  
> > 
> > Right, so the IOVA to GPA (GFN) occurs through an explicit translation
> > on the IOMMU AddressSpace.
> >   
> > > 2) Memory writes that are not tracked by Qemu are collected by
> > > registering .log_sync() callback, which is invoked in the dirty logging
> > > process. Now there are two users: kvm and vhost.
> > >
> > >   2.1) KVM tracks CPU-side memory writes, through write-protection
> > > or EPT A/D bits (+PML). This part is always based on GFN and returned
> > > to Qemu when kvm_log_sync is invoked;
> > >
> > >   2.2) vhost tracks kernel-side DMA writes, by interpreting vring
> > > data structure. It maintains an internal iotlb which is synced with
> > > Qemu vIOMMU through a specific interface:
> > > 	- new vhost message type (VHOST_IOTLB_UPDATE/INVALIDATE)
> > > for Qemu to keep vhost iotlb in sync
> > > 	- new VHOST_IOTLB_MISS message to notify Qemu in case of
> > > a miss in vhost iotlb.
> > > 	- Qemu registers a log buffer to kernel vhost driver. The latter
> > > update the buffer (using internal iotlb to get GFN) when serving vring
> > > descriptor.
> > >
> > > VFIO could also implement an internal iotlb, so vendor drivers can
> > > utilize the iotlb to update the GFN-based dirty bitmap. Ideally we
> > > don't need re-invent another iotlb protocol as vhost does. vIOMMU
> > > already sends map/unmap ioctl cmds upon any change of IOVA
> > > mapping. We may introduce a v2 map/unmap interface, allowing
> > > Qemu to pass both {iova, gpa, hva} together to keep internal iotlb
> > > in-sync. But we may also need a iotlb_miss_upcall interface, if VFIO
> > > doesn't want to cache full-size vIOMMU mappings.
> > >
> > > Definitely this alternative needs more work and possibly less
> > > performant (if maintaining a small size iotlb) than straightforward
> > > calling into KVM interface. But the gain is also obvious, since it
> > > is fully constrained with VFIO.
> > >
> > > Thoughts? :-)  
> > 
> > So vhost must then be configuring a listener across system memory
> > rather than only against the device AddressSpace like we do in vfio,
> > such that it get's log_sync() callbacks for the actual GPA space rather
> > than only the IOVA space.  OTOH, QEMU could understand that the device
> > AddressSpace has a translate function and apply the IOVA dirty bits to
> > the system memory AddressSpace.  Wouldn't it make more sense for
> > QEMU
> > to perform a log_sync() prior to removing a MemoryRegionSection within
> > an AddressSpace and update the GPA rather than pushing GPA awareness
> > and potentially large tracking structures into the host kernel?  Thanks,
> >   
> 
> It is an interesting idea.  One drawback is that log_sync might be
> frequently invoked in IOVA case, but I guess the overhead is not much 
> compared to the total overhead of emulating the IOTLB invalidation. 
> Maybe other folks can better comment why this model was not 
> considered before, e.g. when vhost iotlb was introduced.
> 
> Thanks
> Kevin



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v8 01/13] vfio: KABI for migration interface
  2019-09-13 15:47                 ` Alex Williamson
@ 2019-09-16  1:53                   ` Tian, Kevin
  0 siblings, 0 replies; 34+ messages in thread
From: Tian, Kevin @ 2019-09-16  1:53 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, qemu-devel, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	cohuck, shuangtai.tst, dgilbert, Wang,  Zhi A, mlevitsk, pasic,
	aik, Kirti Wankhede, eauger, felipe, jonathan.davies, Zhao,
	Yan Y, Liu, Changpeng, Ken.Xue

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, September 13, 2019 11:48 PM
> 
> On Thu, 12 Sep 2019 23:00:03 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Thursday, September 12, 2019 10:41 PM
> > >
> > > On Tue, 3 Sep 2019 06:57:27 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Saturday, August 31, 2019 12:33 AM
> > > > >
> > > > > On Fri, 30 Aug 2019 08:06:32 +0000
> > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > >
> > > > > > > From: Tian, Kevin
> > > > > > > Sent: Friday, August 30, 2019 3:26 PM
> > > > > > >
> > > > > > [...]
> > > > > > > > How does QEMU handle the fact that IOVAs are potentially
> > > dynamic
> > > > > while
> > > > > > > > performing the live portion of a migration?  For example, each
> > > time a
> > > > > > > > guest driver calls dma_map_page() or dma_unmap_page(), a
> > > > > > > > MemoryRegionSection pops in or out of the AddressSpace for
> the
> > > device
> > > > > > > > (I'm assuming a vIOMMU where the device AddressSpace is
> not
> > > > > > > > system_memory).  I don't see any QEMU code that intercepts
> that
> > > > > change
> > > > > > > > in the AddressSpace such that the IOVA dirty pfns could be
> > > recorded and
> > > > > > > > translated to GFNs.  The vendor driver can't track these beyond
> > > getting
> > > > > > > > an unmap notification since it only knows the IOVA pfns, which
> > > can be
> > > > > > > > re-used with different GFN backing.  Once the DMA mapping is
> > > torn
> > > > > down,
> > > > > > > > it seems those dirty pfns are lost in the ether.  If this works in
> > > QEMU,
> > > > > > > > please help me find the code that handles it.
> > > > > > >
> > > > > > > I'm curious about this part too. Interestingly, I didn't find any
> > > log_sync
> > > > > > > callback registered by emulated devices in Qemu. Looks dirty
> pages
> > > > > > > by emulated DMAs are recorded in some implicit way. But KVM
> > > always
> > > > > > > reports dirty page in GFN instead of IOVA, regardless of the
> > > presence of
> > > > > > > vIOMMU. If Qemu also tracks dirty pages in GFN for emulated
> DMAs
> > > > > > >  (translation can be done when DMA happens), then we don't
> need
> > > > > > > worry about transient mapping from IOVA to GFN. Along this
> way
> > > we
> > > > > > > also want GFN-based dirty bitmap being reported through VFIO,
> > > > > > > similar to what KVM does. For vendor drivers, it needs to
> translate
> > > > > > > from IOVA to HVA to GFN when tracking DMA activities on VFIO
> > > > > > > devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can be
> > > > > > > provided by KVM but I'm not sure whether it's exposed now.
> > > > > > >
> > > > > >
> > > > > > HVA->GFN can be done through hva_to_gfn_memslot in
> kvm_host.h.
> > > > >
> > > > > I thought it was bad enough that we have vendor drivers that
> depend
> > > on
> > > > > KVM, but designing a vfio interface that only supports a KVM
> interface
> > > > > is more undesirable.  I also note without comment that
> > > gfn_to_memslot()
> > > > > is a GPL symbol.  Thanks,
> > > >
> > > > yes it is bad, but sometimes inevitable. If you recall our discussions
> > > > back to 3yrs (when discussing the 1st mdev framework), there were
> > > similar
> > > > hypervisor dependencies in GVT-g, e.g. querying gpa->hpa when
> > > > creating some shadow structures. gpa->hpa is definitely hypervisor
> > > > specific knowledge, which is easy in KVM (gpa->hva->hpa), but needs
> > > > hypercall in Xen. but VFIO already makes assumption based on KVM-
> > > > only flavor when implementing vfio_{un}pin_page_external.
> > >
> > > Where's the KVM assumption there?  The MAP_DMA ioctl takes an
> IOVA
> > > and
> > > HVA.  When an mdev vendor driver calls vfio_pin_pages(), we GUP the
> HVA
> > > to get an HPA and provide an array of HPA pfns back to the caller.  The
> > > other vGPU mdev vendor manages to make use of this without KVM...
> the
> > > KVM interface used by GVT-g is GPL-only.
> >
> > To be clear it's the assumption on the host-based hypervisors e.g. KVM.
> > GUP is a perfect example, which doesn't work for Xen since DomU's
> > memory doesn't belong to Dom0. VFIO in Dom0 has to find the HPA
> > through Xen specific hypercalls.
> 
> VFIO does not assume a hypervisor at all.  Yes, it happens to work well
> with a host-based hypervisor like KVM were we can simply use GUP, but
> I'd hardly call using the standard mechanism to pin a user page and get
> the pfn within the Linux kernel a KVM assumption.  The fact that Dom0
> Xen requires work here while KVM does not does is not an equivalency to
> VFIO assuming KVM.  Thanks,
> 

Agreed, thanks for the clarification.

Thanks
Kevin


^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v8 01/13] vfio: KABI for migration interface
       [not found]               ` <AADFC41AFE54684AB9EE6CBC0274A5D19D572142@SHSMSX104.ccr.corp.intel.com>
@ 2019-09-24  2:19                 ` Tian, Kevin
  2019-09-24 18:03                   ` Alex Williamson
  0 siblings, 1 reply; 34+ messages in thread
From: Tian, Kevin @ 2019-09-24  2:19 UTC (permalink / raw)
  To: 'Alex Williamson'
  Cc: Zhengxiao.zx, qemu-devel, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	cohuck, shuangtai.tst, dgilbert, Wang,  Zhi A, mlevitsk, pasic,
	aik, Kirti Wankhede, eauger, felipe, jonathan.davies, Zhao,
	Yan Y, Liu, Changpeng, Ken.Xue

> From: Tian, Kevin
> Sent: Friday, September 13, 2019 7:00 AM
> 
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Thursday, September 12, 2019 10:41 PM
> >
> > On Tue, 3 Sep 2019 06:57:27 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Saturday, August 31, 2019 12:33 AM
> > > >
> > > > On Fri, 30 Aug 2019 08:06:32 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > >
> > > > > > From: Tian, Kevin
> > > > > > Sent: Friday, August 30, 2019 3:26 PM
> > > > > >
> > > > > [...]
> > > > > > > How does QEMU handle the fact that IOVAs are potentially
> > dynamic
> > > > while
> > > > > > > performing the live portion of a migration?  For example, each
> > time a
> > > > > > > guest driver calls dma_map_page() or dma_unmap_page(), a
> > > > > > > MemoryRegionSection pops in or out of the AddressSpace for
> the
> > device
> > > > > > > (I'm assuming a vIOMMU where the device AddressSpace is not
> > > > > > > system_memory).  I don't see any QEMU code that intercepts
> that
> > > > change
> > > > > > > in the AddressSpace such that the IOVA dirty pfns could be
> > recorded and
> > > > > > > translated to GFNs.  The vendor driver can't track these beyond
> > getting
> > > > > > > an unmap notification since it only knows the IOVA pfns, which
> > can be
> > > > > > > re-used with different GFN backing.  Once the DMA mapping is
> > torn
> > > > down,
> > > > > > > it seems those dirty pfns are lost in the ether.  If this works in
> > QEMU,
> > > > > > > please help me find the code that handles it.
> > > > > >
> > > > > > I'm curious about this part too. Interestingly, I didn't find any
> > log_sync
> > > > > > callback registered by emulated devices in Qemu. Looks dirty
> pages
> > > > > > by emulated DMAs are recorded in some implicit way. But KVM
> > always
> > > > > > reports dirty page in GFN instead of IOVA, regardless of the
> > presence of
> > > > > > vIOMMU. If Qemu also tracks dirty pages in GFN for emulated
> DMAs
> > > > > >  (translation can be done when DMA happens), then we don't
> need
> > > > > > worry about transient mapping from IOVA to GFN. Along this way
> > we
> > > > > > also want GFN-based dirty bitmap being reported through VFIO,
> > > > > > similar to what KVM does. For vendor drivers, it needs to translate
> > > > > > from IOVA to HVA to GFN when tracking DMA activities on VFIO
> > > > > > devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can be
> > > > > > provided by KVM but I'm not sure whether it's exposed now.
> > > > > >
> > > > >
> > > > > HVA->GFN can be done through hva_to_gfn_memslot in kvm_host.h.
> > > >
> > > > I thought it was bad enough that we have vendor drivers that depend
> > on
> > > > KVM, but designing a vfio interface that only supports a KVM interface
> > > > is more undesirable.  I also note without comment that
> > gfn_to_memslot()
> > > > is a GPL symbol.  Thanks,
> > >
> > > yes it is bad, but sometimes inevitable. If you recall our discussions
> > > back to 3yrs (when discussing the 1st mdev framework), there were
> > similar
> > > hypervisor dependencies in GVT-g, e.g. querying gpa->hpa when
> > > creating some shadow structures. gpa->hpa is definitely hypervisor
> > > specific knowledge, which is easy in KVM (gpa->hva->hpa), but needs
> > > hypercall in Xen. but VFIO already makes assumption based on KVM-
> > > only flavor when implementing vfio_{un}pin_page_external.
> >
> > Where's the KVM assumption there?  The MAP_DMA ioctl takes an IOVA
> > and
> > HVA.  When an mdev vendor driver calls vfio_pin_pages(), we GUP the
> HVA
> > to get an HPA and provide an array of HPA pfns back to the caller.  The
> > other vGPU mdev vendor manages to make use of this without KVM... the
> > KVM interface used by GVT-g is GPL-only.
> 
> To be clear it's the assumption on the host-based hypervisors e.g. KVM.
> GUP is a perfect example, which doesn't work for Xen since DomU's
> memory doesn't belong to Dom0. VFIO in Dom0 has to find the HPA
> through Xen specific hypercalls.
> 
> >
> > > So GVT-g
> > > has to maintain an internal abstraction layer to support both Xen and
> > > KVM. Maybe someday we will re-consider introducing some hypervisor
> > > abstraction layer in VFIO, if this issue starts to hurt other devices and
> > > Xen guys are willing to support VFIO.
> >
> > Once upon a time, we had a KVM specific device assignment interface,
> > ie. legacy KVM devie assignment.  We developed VFIO specifically to get
> > KVM out of the business of being a (bad) device driver.  We do have
> > some awareness and interaction between VFIO and KVM in the vfio-kvm
> > pseudo device, but we still try to keep those interfaces generic.  In
> > some cases we're not very successful at that, see vfio_group_set_kvm(),
> > but that's largely just a mechanism to associate a cookie with a group
> > to be consumed by the mdev vendor driver such that it can work with
> kvm
> > external to vfio.  I don't intend to add further hypervisor awareness
> > to vfio.
> >
> > > Back to this IOVA issue, I discussed with Yan and we found another
> > > hypervisor-agnostic alternative, by learning from vhost. vhost is very
> > > similar to VFIO - DMA also happens in the kernel, while it already
> > > supports vIOMMU.
> > >
> > > Generally speaking, there are three paths of dirty page collection
> > > in Qemu so far (as previously noted, Qemu always tracks the dirty
> > > bitmap in GFN):
> >
> > GFNs or simply PFNs within an AddressSpace?
> >
> > > 1) Qemu-tracked memory writes (e.g. emulated DMAs). Dirty bitmaps
> > > are updated directly when the guest memory is being updated. For
> > > example, PCI writes are completed through pci_dma_write, which
> > > goes through vIOMMU to translate IOVA into GPA and then update
> > > the bitmap through cpu_physical_memory_set_dirty_range.
> >
> > Right, so the IOVA to GPA (GFN) occurs through an explicit translation
> > on the IOMMU AddressSpace.
> >
> > > 2) Memory writes that are not tracked by Qemu are collected by
> > > registering .log_sync() callback, which is invoked in the dirty logging
> > > process. Now there are two users: kvm and vhost.
> > >
> > >   2.1) KVM tracks CPU-side memory writes, through write-protection
> > > or EPT A/D bits (+PML). This part is always based on GFN and returned
> > > to Qemu when kvm_log_sync is invoked;
> > >
> > >   2.2) vhost tracks kernel-side DMA writes, by interpreting vring
> > > data structure. It maintains an internal iotlb which is synced with
> > > Qemu vIOMMU through a specific interface:
> > > 	- new vhost message type (VHOST_IOTLB_UPDATE/INVALIDATE)
> > > for Qemu to keep vhost iotlb in sync
> > > 	- new VHOST_IOTLB_MISS message to notify Qemu in case of
> > > a miss in vhost iotlb.
> > > 	- Qemu registers a log buffer to kernel vhost driver. The latter
> > > update the buffer (using internal iotlb to get GFN) when serving vring
> > > descriptor.
> > >
> > > VFIO could also implement an internal iotlb, so vendor drivers can
> > > utilize the iotlb to update the GFN-based dirty bitmap. Ideally we
> > > don't need re-invent another iotlb protocol as vhost does. vIOMMU
> > > already sends map/unmap ioctl cmds upon any change of IOVA
> > > mapping. We may introduce a v2 map/unmap interface, allowing
> > > Qemu to pass both {iova, gpa, hva} together to keep internal iotlb
> > > in-sync. But we may also need a iotlb_miss_upcall interface, if VFIO
> > > doesn't want to cache full-size vIOMMU mappings.
> > >
> > > Definitely this alternative needs more work and possibly less
> > > performant (if maintaining a small size iotlb) than straightforward
> > > calling into KVM interface. But the gain is also obvious, since it
> > > is fully constrained with VFIO.
> > >
> > > Thoughts? :-)
> >
> > So vhost must then be configuring a listener across system memory
> > rather than only against the device AddressSpace like we do in vfio,
> > such that it get's log_sync() callbacks for the actual GPA space rather
> > than only the IOVA space.  OTOH, QEMU could understand that the
> device
> > AddressSpace has a translate function and apply the IOVA dirty bits to
> > the system memory AddressSpace.  Wouldn't it make more sense for
> > QEMU
> > to perform a log_sync() prior to removing a MemoryRegionSection within
> > an AddressSpace and update the GPA rather than pushing GPA
> awareness
> > and potentially large tracking structures into the host kernel?  Thanks,
> >
> 

Hi, Alex,

I moved the VFIO-related discussion back to this thread, so as not to
mix it with the vhost-related discussions here.

https://lists.nongnu.org/archive/html/qemu-devel/2019-09/msg03126.html

Your latest reply still prefers the userspace approach:

> > Same as last time, you're asking VFIO to be aware of an entirely new
> > address space and implement tracking structures of that address space
> > to make life easier for QEMU.  Don't we typically push such complexity
> > to userspace rather than into the kernel?  I'm not convinced.  Thanks,
> >

I answered two points but didn't hear your further thoughts. Can you
take a look and respond?

The first point is about complexity and performance:
> 
> Is it really complex? No need of a new tracking structure. Just allowing
> the MAP interface to carry a new parameter and then record it in the
> existing vfio_dma objects.
> 
> Note the frequency of guest DMA map/unmap could be very high. We
> saw >100K invocations per second with a 40G NIC. To do the right
> translation Qemu requires log_sync for every unmap, before the
> mapping for logged dirty IOVA becomes stale. In current Kirti's patch,
> each log_sync requires several system_calls through the migration
> info, e.g. setting start_pfn/page_size/total_pfns and then reading
> data_offset/data_size. That design is fine for doing log_sync in every
> pre-copy round, but too costly if doing so for every IOVA unmap. If
> small extension in kernel can lead to great overhead reduction,
> why not?
> 
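
As a concrete illustration of this first point, the MAP ioctl could carry the
GPA alongside {iova, vaddr} and the kernel would record it in the existing
vfio_dma.  A hypothetical uAPI extension (not part of any posted patch):

struct vfio_iommu_type1_dma_map_v2 {
        __u32   argsz;
        __u32   flags;
#define VFIO_DMA_MAP_FLAG_READ  (1 << 0)
#define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)
#define VFIO_DMA_MAP_FLAG_GPA   (1 << 2)        /* hypothetical: gpa is valid */
        __u64   vaddr;                          /* HVA */
        __u64   iova;                           /* device (IO virtual) address */
        __u64   size;
        __u64   gpa;                            /* hypothetical: guest physical */
};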

The second point is about write-protection:

> There is another value of recording GPA in VFIO. Vendor drivers (e.g.
> GVT-g) may need to selectively write-protect guest memory pages
> when interpreting certain workload descriptors. Those pages are
> recorded in IOVA when vIOMMU is enabled, however the KVM 
> write-protection API only knows GPA. So currently vIOMMU must
> be disabled on Intel vGPUs when GVT-g is enabled. To make it working
> we need a way to translate IOVA into GPA in the vendor drivers. There
> are two options. One is having KVM export a new API for such 
> translation purpose. But as you explained earlier it's not good to
> have vendor drivers depend on KVM. The other is having VFIO
> maintaining such knowledge through extended MAP interface, 
> then providing a uniform API for all vendor drivers to use.
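
For the second point, the uniform API could be something like the following
hypothetical export, which is only meaningful if VFIO recorded the GPA at
MAP time as sketched above:

/* Hypothetical VFIO export: look up the GPA recorded for an IOVA. */
extern int vfio_iova_to_gpa(struct device *dev, unsigned long iova,
                            unsigned long *gpa);

/* How a vendor driver (e.g. GVT-g) might use it before asking KVM to
 * write-protect a guest page referenced by a workload descriptor. */
static int shadow_and_protect(struct device *mdev_dev, unsigned long iova)
{
        unsigned long gpa;
        int ret;

        ret = vfio_iova_to_gpa(mdev_dev, iova, &gpa);   /* hypothetical */
        if (ret)
                return ret;

        /* ... hand gpa to the existing write-protect machinery ... */
        return 0;
}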

Thanks
Kevin


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v8 01/13] vfio: KABI for migration interface
  2019-09-24  2:19                 ` Tian, Kevin
@ 2019-09-24 18:03                   ` Alex Williamson
  2019-09-24 23:04                     ` Tian, Kevin
  0 siblings, 1 reply; 34+ messages in thread
From: Alex Williamson @ 2019-09-24 18:03 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Zhengxiao.zx, qemu-devel, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	cohuck, shuangtai.tst, dgilbert, Wang, Zhi A, mlevitsk, pasic,
	aik, Kirti Wankhede, eauger, felipe, jonathan.davies, Zhao,
	Yan Y, Liu, Changpeng, Ken.Xue

On Tue, 24 Sep 2019 02:19:15 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Tian, Kevin
> > Sent: Friday, September 13, 2019 7:00 AM
> >   
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Thursday, September 12, 2019 10:41 PM
> > >
> > > On Tue, 3 Sep 2019 06:57:27 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >  
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Saturday, August 31, 2019 12:33 AM
> > > > >
> > > > > On Fri, 30 Aug 2019 08:06:32 +0000
> > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > >  
> > > > > > > From: Tian, Kevin
> > > > > > > Sent: Friday, August 30, 2019 3:26 PM
> > > > > > >  
> > > > > > [...]  
> > > > > > > > How does QEMU handle the fact that IOVAs are potentially  
> > > dynamic  
> > > > > while  
> > > > > > > > performing the live portion of a migration?  For example, each  
> > > time a  
> > > > > > > > guest driver calls dma_map_page() or dma_unmap_page(), a
> > > > > > > > MemoryRegionSection pops in or out of the AddressSpace for  
> > the  
> > > device  
> > > > > > > > (I'm assuming a vIOMMU where the device AddressSpace is not
> > > > > > > > system_memory).  I don't see any QEMU code that intercepts  
> > that  
> > > > > change  
> > > > > > > > in the AddressSpace such that the IOVA dirty pfns could be  
> > > recorded and  
> > > > > > > > translated to GFNs.  The vendor driver can't track these beyond  
> > > getting  
> > > > > > > > an unmap notification since it only knows the IOVA pfns, which  
> > > can be  
> > > > > > > > re-used with different GFN backing.  Once the DMA mapping is  
> > > torn  
> > > > > down,  
> > > > > > > > it seems those dirty pfns are lost in the ether.  If this works in  
> > > QEMU,  
> > > > > > > > please help me find the code that handles it.  
> > > > > > >
> > > > > > > I'm curious about this part too. Interestingly, I didn't find any  
> > > log_sync  
> > > > > > > callback registered by emulated devices in Qemu. Looks dirty  
> > pages  
> > > > > > > by emulated DMAs are recorded in some implicit way. But KVM  
> > > always  
> > > > > > > reports dirty page in GFN instead of IOVA, regardless of the  
> > > presence of  
> > > > > > > vIOMMU. If Qemu also tracks dirty pages in GFN for emulated  
> > DMAs  
> > > > > > >  (translation can be done when DMA happens), then we don't  
> > need  
> > > > > > > worry about transient mapping from IOVA to GFN. Along this way  
> > > we  
> > > > > > > also want GFN-based dirty bitmap being reported through VFIO,
> > > > > > > similar to what KVM does. For vendor drivers, it needs to translate
> > > > > > > from IOVA to HVA to GFN when tracking DMA activities on VFIO
> > > > > > > devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can be
> > > > > > > provided by KVM but I'm not sure whether it's exposed now.
> > > > > > >  
> > > > > >
> > > > > > HVA->GFN can be done through hva_to_gfn_memslot in kvm_host.h.  
> > > > >
> > > > > I thought it was bad enough that we have vendor drivers that depend  
> > > on  
> > > > > KVM, but designing a vfio interface that only supports a KVM interface
> > > > > is more undesirable.  I also note without comment that  
> > > gfn_to_memslot()  
> > > > > is a GPL symbol.  Thanks,  
> > > >
> > > > yes it is bad, but sometimes inevitable. If you recall our discussions
> > > > back to 3yrs (when discussing the 1st mdev framework), there were  
> > > similar  
> > > > hypervisor dependencies in GVT-g, e.g. querying gpa->hpa when
> > > > creating some shadow structures. gpa->hpa is definitely hypervisor
> > > > specific knowledge, which is easy in KVM (gpa->hva->hpa), but needs
> > > > hypercall in Xen. but VFIO already makes assumption based on KVM-
> > > > only flavor when implementing vfio_{un}pin_page_external.  
> > >
> > > Where's the KVM assumption there?  The MAP_DMA ioctl takes an IOVA
> > > and
> > > HVA.  When an mdev vendor driver calls vfio_pin_pages(), we GUP the  
> > HVA  
> > > to get an HPA and provide an array of HPA pfns back to the caller.  The
> > > other vGPU mdev vendor manages to make use of this without KVM... the
> > > KVM interface used by GVT-g is GPL-only.  
> > 
> > To be clear it's the assumption on the host-based hypervisors e.g. KVM.
> > GUP is a perfect example, which doesn't work for Xen since DomU's
> > memory doesn't belong to Dom0. VFIO in Dom0 has to find the HPA
> > through Xen specific hypercalls.
> >   
> > >  
> > > > So GVT-g
> > > > has to maintain an internal abstraction layer to support both Xen and
> > > > KVM. Maybe someday we will re-consider introducing some hypervisor
> > > > abstraction layer in VFIO, if this issue starts to hurt other devices and
> > > > Xen guys are willing to support VFIO.  
> > >
> > > Once upon a time, we had a KVM specific device assignment interface,
> > > ie. legacy KVM devie assignment.  We developed VFIO specifically to get
> > > KVM out of the business of being a (bad) device driver.  We do have
> > > some awareness and interaction between VFIO and KVM in the vfio-kvm
> > > pseudo device, but we still try to keep those interfaces generic.  In
> > > some cases we're not very successful at that, see vfio_group_set_kvm(),
> > > but that's largely just a mechanism to associate a cookie with a group
> > > to be consumed by the mdev vendor driver such that it can work with  
> > kvm  
> > > external to vfio.  I don't intend to add further hypervisor awareness
> > > to vfio.
> > >  
> > > > Back to this IOVA issue, I discussed with Yan and we found another
> > > > hypervisor-agnostic alternative, by learning from vhost. vhost is very
> > > > similar to VFIO - DMA also happens in the kernel, while it already
> > > > supports vIOMMU.
> > > >
> > > > Generally speaking, there are three paths of dirty page collection
> > > > in Qemu so far (as previously noted, Qemu always tracks the dirty
> > > > bitmap in GFN):  
> > >
> > > GFNs or simply PFNs within an AddressSpace?
> > >  
> > > > 1) Qemu-tracked memory writes (e.g. emulated DMAs). Dirty bitmaps
> > > > are updated directly when the guest memory is being updated. For
> > > > example, PCI writes are completed through pci_dma_write, which
> > > > goes through vIOMMU to translate IOVA into GPA and then update
> > > > the bitmap through cpu_physical_memory_set_dirty_range.  
> > >
> > > Right, so the IOVA to GPA (GFN) occurs through an explicit translation
> > > on the IOMMU AddressSpace.
> > >  
> > > > 2) Memory writes that are not tracked by Qemu are collected by
> > > > registering .log_sync() callback, which is invoked in the dirty logging
> > > > process. Now there are two users: kvm and vhost.
> > > >
> > > >   2.1) KVM tracks CPU-side memory writes, through write-protection
> > > > or EPT A/D bits (+PML). This part is always based on GFN and returned
> > > > to Qemu when kvm_log_sync is invoked;
> > > >
> > > >   2.2) vhost tracks kernel-side DMA writes, by interpreting vring
> > > > data structure. It maintains an internal iotlb which is synced with
> > > > Qemu vIOMMU through a specific interface:
> > > > 	- new vhost message type (VHOST_IOTLB_UPDATE/INVALIDATE)
> > > > for Qemu to keep vhost iotlb in sync
> > > > 	- new VHOST_IOTLB_MISS message to notify Qemu in case of
> > > > a miss in vhost iotlb.
> > > > 	- Qemu registers a log buffer to kernel vhost driver. The latter
> > > > update the buffer (using internal iotlb to get GFN) when serving vring
> > > > descriptor.
> > > >
> > > > VFIO could also implement an internal iotlb, so vendor drivers can
> > > > utilize the iotlb to update the GFN-based dirty bitmap. Ideally we
> > > > don't need re-invent another iotlb protocol as vhost does. vIOMMU
> > > > already sends map/unmap ioctl cmds upon any change of IOVA
> > > > mapping. We may introduce a v2 map/unmap interface, allowing
> > > > Qemu to pass both {iova, gpa, hva} together to keep internal iotlb
> > > > in-sync. But we may also need a iotlb_miss_upcall interface, if VFIO
> > > > doesn't want to cache full-size vIOMMU mappings.
> > > >
> > > > Definitely this alternative needs more work and possibly less
> > > > performant (if maintaining a small size iotlb) than straightforward
> > > > calling into KVM interface. But the gain is also obvious, since it
> > > > is fully constrained with VFIO.
> > > >
> > > > Thoughts? :-)  
> > >
> > > So vhost must then be configuring a listener across system memory
> > > rather than only against the device AddressSpace like we do in vfio,
> > > such that it get's log_sync() callbacks for the actual GPA space rather
> > > than only the IOVA space.  OTOH, QEMU could understand that the  
> > device  
> > > AddressSpace has a translate function and apply the IOVA dirty bits to
> > > the system memory AddressSpace.  Wouldn't it make more sense for
> > > QEMU
> > > to perform a log_sync() prior to removing a MemoryRegionSection within
> > > an AddressSpace and update the GPA rather than pushing GPA  
> > awareness  
> > > and potentially large tracking structures into the host kernel?  Thanks,
> > >  
> >   
> 
> Hi, Alex,
> 
> I moved back the VFIO related discussion to this thread, to not mix with
> vhost related discussions here.
> 
> https://lists.nongnu.org/archive/html/qemu-devel/2019-09/msg03126.html
> 
> Your latest reply still prefers to the userspace approach:
> 
> > > Same as last time, you're asking VFIO to be aware of an entirely new
> > > address space and implement tracking structures of that address space
> > > to make life easier for QEMU.  Don't we typically push such complexity
> > > to userspace rather than into the kernel?  I'm not convinced.  Thanks,
> > >  
> 
> I answered two points but didn't hear your further thoughts. Can you
> take a look and respond?
> 
> The first point is about complexity and performance:
> > 
> > Is it really complex? No need of a new tracking structure. Just allowing
> > the MAP interface to carry a new parameter and then record it in the
> > existing vfio_dma objects.
> > 
> > Note the frequency of guest DMA map/unmap could be very high. We
> > saw >100K invocations per second with a 40G NIC. To do the right
> > translation Qemu requires log_sync for every unmap, before the
> > mapping for logged dirty IOVA becomes stale. In current Kirti's patch,
> > each log_sync requires several system_calls through the migration
> > info, e.g. setting start_pfn/page_size/total_pfns and then reading
> > data_offset/data_size. That design is fine for doing log_sync in every
> > pre-copy round, but too costly if doing so for every IOVA unmap. If
> > small extension in kernel can lead to great overhead reduction,
> > why not?

You're citing a workload that already performs abysmally with vfio and
vIOMMU; we cannot handle those rates efficiently with the current vfio
DMA API.  The current use cases of vIOMMU and vfio are predominantly
for nesting vfio, e.g. DPDK/SPDK, where we assume the mappings are
relatively static or else performance problems are already very
apparent.  In that sort of model, I don't see that QEMU doing a
log_sync on unmap is really an issue; unmaps should be relatively
rare.  Of course I don't want to compound the issue, but the current
vfio DMA mapping interface needs to be scrapped to make this remotely
performant even before we look at migration performance, so does it
really make sense to introduce GPAs for a workload the ioctls are
unsuited for?
 
> The second point is about write-protection:
> 
> > There is another value of recording GPA in VFIO. Vendor drivers
> > (e.g. GVT-g) may need to selectively write-protect guest memory
> > pages when interpreting certain workload descriptors. Those pages
> > are recorded in IOVA when vIOMMU is enabled, however the KVM 
> > write-protection API only knows GPA. So currently vIOMMU must
> > be disabled on Intel vGPUs when GVT-g is enabled. To make it working
> > we need a way to translate IOVA into GPA in the vendor drivers.
> > There are two options. One is having KVM export a new API for such 
> > translation purpose. But as you explained earlier it's not good to
> > have vendor drivers depend on KVM. The other is having VFIO
> > maintaining such knowledge through extended MAP interface, 
> > then providing a uniform API for all vendor drivers to use.  

So the argument is that in order to interact with KVM (write protecting
guest memory) there's a missing feature (IOVA to GPA translation), but
we don't want to add an API to KVM for this feature because that would
create a dependency on KVM (for interacting with KVM), so let's add an
API to vfio instead.  That makes no sense to me.  What am I missing?
Thanks,

Alex


^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v8 01/13] vfio: KABI for migration interface
  2019-09-24 18:03                   ` Alex Williamson
@ 2019-09-24 23:04                     ` Tian, Kevin
  2019-09-25 19:06                       ` Alex Williamson
  0 siblings, 1 reply; 34+ messages in thread
From: Tian, Kevin @ 2019-09-24 23:04 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, qemu-devel, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	cohuck, shuangtai.tst, dgilbert, Wang,  Zhi A, mlevitsk, pasic,
	aik, Kirti Wankhede, eauger, felipe, jonathan.davies, Zhao,
	Yan Y, Liu, Changpeng, Ken.Xue

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Wednesday, September 25, 2019 2:03 AM
> 
> On Tue, 24 Sep 2019 02:19:15 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Tian, Kevin
> > > Sent: Friday, September 13, 2019 7:00 AM
> > >
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Thursday, September 12, 2019 10:41 PM
> > > >
> > > > On Tue, 3 Sep 2019 06:57:27 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > >
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Saturday, August 31, 2019 12:33 AM
> > > > > >
> > > > > > On Fri, 30 Aug 2019 08:06:32 +0000
> > > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > > >
> > > > > > > > From: Tian, Kevin
> > > > > > > > Sent: Friday, August 30, 2019 3:26 PM
> > > > > > > >
> > > > > > > [...]
> > > > > > > > > How does QEMU handle the fact that IOVAs are potentially
> > > > dynamic
> > > > > > while
> > > > > > > > > performing the live portion of a migration?  For example,
> each
> > > > time a
> > > > > > > > > guest driver calls dma_map_page() or dma_unmap_page(), a
> > > > > > > > > MemoryRegionSection pops in or out of the AddressSpace for
> > > the
> > > > device
> > > > > > > > > (I'm assuming a vIOMMU where the device AddressSpace is
> not
> > > > > > > > > system_memory).  I don't see any QEMU code that intercepts
> > > that
> > > > > > change
> > > > > > > > > in the AddressSpace such that the IOVA dirty pfns could be
> > > > recorded and
> > > > > > > > > translated to GFNs.  The vendor driver can't track these
> beyond
> > > > getting
> > > > > > > > > an unmap notification since it only knows the IOVA pfns,
> which
> > > > can be
> > > > > > > > > re-used with different GFN backing.  Once the DMA mapping
> is
> > > > torn
> > > > > > down,
> > > > > > > > > it seems those dirty pfns are lost in the ether.  If this works in
> > > > QEMU,
> > > > > > > > > please help me find the code that handles it.
> > > > > > > >
> > > > > > > > I'm curious about this part too. Interestingly, I didn't find any
> > > > log_sync
> > > > > > > > callback registered by emulated devices in Qemu. Looks dirty
> > > pages
> > > > > > > > by emulated DMAs are recorded in some implicit way. But KVM
> > > > always
> > > > > > > > reports dirty page in GFN instead of IOVA, regardless of the
> > > > presence of
> > > > > > > > vIOMMU. If Qemu also tracks dirty pages in GFN for emulated
> > > DMAs
> > > > > > > >  (translation can be done when DMA happens), then we don't
> > > need
> > > > > > > > worry about transient mapping from IOVA to GFN. Along this
> way
> > > > we
> > > > > > > > also want GFN-based dirty bitmap being reported through VFIO,
> > > > > > > > similar to what KVM does. For vendor drivers, it needs to
> translate
> > > > > > > > from IOVA to HVA to GFN when tracking DMA activities on
> VFIO
> > > > > > > > devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can
> be
> > > > > > > > provided by KVM but I'm not sure whether it's exposed now.
> > > > > > > >
> > > > > > >
> > > > > > > HVA->GFN can be done through hva_to_gfn_memslot in
> kvm_host.h.
> > > > > >
> > > > > > I thought it was bad enough that we have vendor drivers that
> depend
> > > > on
> > > > > > KVM, but designing a vfio interface that only supports a KVM
> interface
> > > > > > is more undesirable.  I also note without comment that
> > > > gfn_to_memslot()
> > > > > > is a GPL symbol.  Thanks,
> > > > >
> > > > > yes it is bad, but sometimes inevitable. If you recall our discussions
> > > > > back to 3yrs (when discussing the 1st mdev framework), there were
> > > > similar
> > > > > hypervisor dependencies in GVT-g, e.g. querying gpa->hpa when
> > > > > creating some shadow structures. gpa->hpa is definitely hypervisor
> > > > > specific knowledge, which is easy in KVM (gpa->hva->hpa), but
> needs
> > > > > hypercall in Xen. but VFIO already makes assumption based on
> KVM-
> > > > > only flavor when implementing vfio_{un}pin_page_external.
> > > >
> > > > Where's the KVM assumption there?  The MAP_DMA ioctl takes an
> IOVA
> > > > and
> > > > HVA.  When an mdev vendor driver calls vfio_pin_pages(), we GUP the
> > > HVA
> > > > to get an HPA and provide an array of HPA pfns back to the caller.  The
> > > > other vGPU mdev vendor manages to make use of this without KVM...
> the
> > > > KVM interface used by GVT-g is GPL-only.
> > >
> > > To be clear it's the assumption on the host-based hypervisors e.g. KVM.
> > > GUP is a perfect example, which doesn't work for Xen since DomU's
> > > memory doesn't belong to Dom0. VFIO in Dom0 has to find the HPA
> > > through Xen specific hypercalls.
> > >
> > > >
> > > > > So GVT-g
> > > > > has to maintain an internal abstraction layer to support both Xen
> and
> > > > > KVM. Maybe someday we will re-consider introducing some
> hypervisor
> > > > > abstraction layer in VFIO, if this issue starts to hurt other devices
> and
> > > > > Xen guys are willing to support VFIO.
> > > >
> > > > Once upon a time, we had a KVM specific device assignment interface,
> > > > ie. legacy KVM device assignment.  We developed VFIO specifically to
> get
> > > > KVM out of the business of being a (bad) device driver.  We do have
> > > > some awareness and interaction between VFIO and KVM in the vfio-
> kvm
> > > > pseudo device, but we still try to keep those interfaces generic.  In
> > > > some cases we're not very successful at that, see
> vfio_group_set_kvm(),
> > > > but that's largely just a mechanism to associate a cookie with a group
> > > > to be consumed by the mdev vendor driver such that it can work with
> > > kvm
> > > > external to vfio.  I don't intend to add further hypervisor awareness
> > > > to vfio.
> > > >
> > > > > Back to this IOVA issue, I discussed with Yan and we found another
> > > > > hypervisor-agnostic alternative, by learning from vhost. vhost is very
> > > > > similar to VFIO - DMA also happens in the kernel, while it already
> > > > > supports vIOMMU.
> > > > >
> > > > > Generally speaking, there are three paths of dirty page collection
> > > > > in Qemu so far (as previously noted, Qemu always tracks the dirty
> > > > > bitmap in GFN):
> > > >
> > > > GFNs or simply PFNs within an AddressSpace?
> > > >
> > > > > 1) Qemu-tracked memory writes (e.g. emulated DMAs). Dirty
> bitmaps
> > > > > are updated directly when the guest memory is being updated. For
> > > > > example, PCI writes are completed through pci_dma_write, which
> > > > > goes through vIOMMU to translate IOVA into GPA and then update
> > > > > the bitmap through cpu_physical_memory_set_dirty_range.
> > > >
> > > > Right, so the IOVA to GPA (GFN) occurs through an explicit translation
> > > > on the IOMMU AddressSpace.
> > > >
> > > > > 2) Memory writes that are not tracked by Qemu are collected by
> > > > > registering .log_sync() callback, which is invoked in the dirty logging
> > > > > process. Now there are two users: kvm and vhost.
> > > > >
> > > > >   2.1) KVM tracks CPU-side memory writes, through write-protection
> > > > > or EPT A/D bits (+PML). This part is always based on GFN and
> returned
> > > > > to Qemu when kvm_log_sync is invoked;
> > > > >
> > > > >   2.2) vhost tracks kernel-side DMA writes, by interpreting vring
> > > > > data structure. It maintains an internal iotlb which is synced with
> > > > > Qemu vIOMMU through a specific interface:
> > > > > 	- new vhost message type (VHOST_IOTLB_UPDATE/INVALIDATE)
> > > > > for Qemu to keep vhost iotlb in sync
> > > > > 	- new VHOST_IOTLB_MISS message to notify Qemu in case of
> > > > > a miss in vhost iotlb.
> > > > > 	- Qemu registers a log buffer to kernel vhost driver. The latter
> > > > > update the buffer (using internal iotlb to get GFN) when serving
> vring
> > > > > descriptor.
> > > > >
> > > > > VFIO could also implement an internal iotlb, so vendor drivers can
> > > > > utilize the iotlb to update the GFN-based dirty bitmap. Ideally we
> > > > > don't need re-invent another iotlb protocol as vhost does. vIOMMU
> > > > > already sends map/unmap ioctl cmds upon any change of IOVA
> > > > > mapping. We may introduce a v2 map/unmap interface, allowing
> > > > > Qemu to pass both {iova, gpa, hva} together to keep internal iotlb
> > > > > in-sync. But we may also need a iotlb_miss_upcall interface, if VFIO
> > > > > doesn't want to cache full-size vIOMMU mappings.
> > > > >
> > > > > Definitely this alternative needs more work and possibly less
> > > > > performant (if maintaining a small size iotlb) than straightforward
> > > > > calling into KVM interface. But the gain is also obvious, since it
> > > > > is fully constrained with VFIO.
> > > > >
> > > > > Thoughts? :-)
> > > >
> > > > So vhost must then be configuring a listener across system memory
> > > > rather than only against the device AddressSpace like we do in vfio,
> > > > such that it gets log_sync() callbacks for the actual GPA space rather
> > > > than only the IOVA space.  OTOH, QEMU could understand that the
> > > device
> > > > AddressSpace has a translate function and apply the IOVA dirty bits to
> > > > the system memory AddressSpace.  Wouldn't it make more sense for
> > > > QEMU
> > > > to perform a log_sync() prior to removing a MemoryRegionSection
> within
> > > > an AddressSpace and update the GPA rather than pushing GPA
> > > awareness
> > > > and potentially large tracking structures into the host kernel?  Thanks,
> > > >
> > >
> >
> > Hi, Alex,
> >
> > I moved back the VFIO related discussion to this thread, to not mix with
> > vhost related discussions here.
> >
> > https://lists.nongnu.org/archive/html/qemu-devel/2019-
> 09/msg03126.html
> >
> > Your latest reply still prefers to the userspace approach:
> >
> > > > Same as last time, you're asking VFIO to be aware of an entirely new
> > > > address space and implement tracking structures of that address
> space
> > > > to make life easier for QEMU.  Don't we typically push such complexity
> > > > to userspace rather than into the kernel?  I'm not convinced.  Thanks,
> > > >
> >
> > I answered two points but didn't hear your further thoughts. Can you
> > take a look and respond?
> >
> > The first point is about complexity and performance:
> > >
> > > Is it really complex? No need of a new tracking structure. Just allowing
> > > the MAP interface to carry a new parameter and then record it in the
> > > existing vfio_dma objects.
> > >
> > > Note the frequency of guest DMA map/unmap could be very high. We
> > > saw >100K invocations per second with a 40G NIC. To do the right
> > > translation Qemu requires log_sync for every unmap, before the
> > > mapping for logged dirty IOVA becomes stale. In current Kirti's patch,
> > > each log_sync requires several system_calls through the migration
> > > info, e.g. setting start_pfn/page_size/total_pfns and then reading
> > > data_offset/data_size. That design is fine for doing log_sync in every
> > > pre-copy round, but too costly if doing so for every IOVA unmap. If
> > > small extension in kernel can lead to great overhead reduction,
> > > why not?
> 
> You're citing a workload that already performs abysmally with vfio and
> vIOMMU, we cannot handle those rates efficiently with the current vfio
> DMA API.  The current use cases of vIOMMU and vfio are predominantly
> for nesting vfio, ex. DPDK/SPDK, where we assume the mappings are
> relatively static or else performance problems are already very
> apparent.  In that sort of model, I don't see that QEMU doing a
> log_sync on unmap is really an issue, unmaps should be relatively
> rare.  Of course I don't want to compound the issue, but the current
> vfio DMA mapping interfaces need to be scrapped to make this remotely
> performant even before we look at migration performance, so does it
> really make sense to introduce GPAs for a workload the ioctls are
> unsuited for?
> 
> > The second point is about write-protection:
> >
> > > There is another value of recording GPA in VFIO. Vendor drivers
> > > (e.g. GVT-g) may need to selectively write-protect guest memory
> > > pages when interpreting certain workload descriptors. Those pages
> > > are recorded in IOVA when vIOMMU is enabled, however the KVM
> > > write-protection API only knows GPA. So currently vIOMMU must
> > > be disabled on Intel vGPUs when GVT-g is enabled. To make it working
> > > we need a way to translate IOVA into GPA in the vendor drivers.
> > > There are two options. One is having KVM export a new API for such
> > > translation purpose. But as you explained earlier it's not good to
> > > have vendor drivers depend on KVM. The other is having VFIO
> > > maintaining such knowledge through extended MAP interface,
> > > then providing a uniform API for all vendor drivers to use.
> 
> So the argument is that in order to interact with KVM (write protecting
> guest memory) there's a missing feature (IOVA to GPA translation), but
> we don't want to add an API to KVM for this feature because that would
> create a dependency on KVM (for interacting with KVM), so lets add an
> API to vfio instead.  That makes no sense to me.  What am I missing?
> Thanks,
> 

Then do you have a recommendation how such feature can be 
implemented cleanly in vendor driver, without introducing direct
dependency on KVM? 

Thanks
Kevin


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v8 01/13] vfio: KABI for migration interface
  2019-09-24 23:04                     ` Tian, Kevin
@ 2019-09-25 19:06                       ` Alex Williamson
  2019-09-26  3:07                         ` Tian, Kevin
  0 siblings, 1 reply; 34+ messages in thread
From: Alex Williamson @ 2019-09-25 19:06 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Zhengxiao.zx, qemu-devel, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	cohuck, shuangtai.tst, dgilbert, Wang, Zhi A, mlevitsk, pasic,
	aik, Kirti Wankhede, eauger, felipe, jonathan.davies, Zhao,
	Yan Y, Liu, Changpeng, Ken.Xue

On Tue, 24 Sep 2019 23:04:22 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Wednesday, September 25, 2019 2:03 AM
> > 
> > On Tue, 24 Sep 2019 02:19:15 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Tian, Kevin
> > > > Sent: Friday, September 13, 2019 7:00 AM
> > > >  
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Thursday, September 12, 2019 10:41 PM
> > > > >
> > > > > On Tue, 3 Sep 2019 06:57:27 +0000
> > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > >  
> > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > > Sent: Saturday, August 31, 2019 12:33 AM
> > > > > > >
> > > > > > > On Fri, 30 Aug 2019 08:06:32 +0000
> > > > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > > > >  
> > > > > > > > > From: Tian, Kevin
> > > > > > > > > Sent: Friday, August 30, 2019 3:26 PM
> > > > > > > > >  
> > > > > > > > [...]  
> > > > > > > > > > How does QEMU handle the fact that IOVAs are potentially  
> > > > > dynamic  
> > > > > > > while  
> > > > > > > > > > performing the live portion of a migration?  For example,  
> > each  
> > > > > time a  
> > > > > > > > > > guest driver calls dma_map_page() or dma_unmap_page(), a
> > > > > > > > > > MemoryRegionSection pops in or out of the AddressSpace for  
> > > > the  
> > > > > device  
> > > > > > > > > > (I'm assuming a vIOMMU where the device AddressSpace is  
> > not  
> > > > > > > > > > system_memory).  I don't see any QEMU code that intercepts  
> > > > that  
> > > > > > > change  
> > > > > > > > > > in the AddressSpace such that the IOVA dirty pfns could be  
> > > > > recorded and  
> > > > > > > > > > translated to GFNs.  The vendor driver can't track these  
> > beyond  
> > > > > getting  
> > > > > > > > > > an unmap notification since it only knows the IOVA pfns,  
> > which  
> > > > > can be  
> > > > > > > > > > re-used with different GFN backing.  Once the DMA mapping  
> > is  
> > > > > torn  
> > > > > > > down,  
> > > > > > > > > > it seems those dirty pfns are lost in the ether.  If this works in  
> > > > > QEMU,  
> > > > > > > > > > please help me find the code that handles it.  
> > > > > > > > >
> > > > > > > > > I'm curious about this part too. Interestingly, I didn't find any  
> > > > > log_sync  
> > > > > > > > > callback registered by emulated devices in Qemu. Looks dirty  
> > > > pages  
> > > > > > > > > by emulated DMAs are recorded in some implicit way. But KVM  
> > > > > always  
> > > > > > > > > reports dirty page in GFN instead of IOVA, regardless of the  
> > > > > presence of  
> > > > > > > > > vIOMMU. If Qemu also tracks dirty pages in GFN for emulated  
> > > > DMAs  
> > > > > > > > >  (translation can be done when DMA happens), then we don't  
> > > > need  
> > > > > > > > > worry about transient mapping from IOVA to GFN. Along this  
> > way  
> > > > > we  
> > > > > > > > > also want GFN-based dirty bitmap being reported through VFIO,
> > > > > > > > > similar to what KVM does. For vendor drivers, it needs to  
> > translate  
> > > > > > > > > from IOVA to HVA to GFN when tracking DMA activities on  
> > VFIO  
> > > > > > > > > devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can  
> > be  
> > > > > > > > > provided by KVM but I'm not sure whether it's exposed now.
> > > > > > > > >  
> > > > > > > >
> > > > > > > > HVA->GFN can be done through hva_to_gfn_memslot in  
> > kvm_host.h.  
> > > > > > >
> > > > > > > I thought it was bad enough that we have vendor drivers that  
> > depend  
> > > > > on  
> > > > > > > KVM, but designing a vfio interface that only supports a KVM  
> > interface  
> > > > > > > is more undesirable.  I also note without comment that  
> > > > > gfn_to_memslot()  
> > > > > > > is a GPL symbol.  Thanks,  
> > > > > >
> > > > > > yes it is bad, but sometimes inevitable. If you recall our discussions
> > > > > > back to 3yrs (when discussing the 1st mdev framework), there were  
> > > > > similar  
> > > > > > hypervisor dependencies in GVT-g, e.g. querying gpa->hpa when
> > > > > > creating some shadow structures. gpa->hpa is definitely hypervisor
> > > > > > specific knowledge, which is easy in KVM (gpa->hva->hpa), but  
> > needs  
> > > > > > hypercall in Xen. but VFIO already makes assumption based on  
> > KVM-  
> > > > > > only flavor when implementing vfio_{un}pin_page_external.  
> > > > >
> > > > > Where's the KVM assumption there?  The MAP_DMA ioctl takes an  
> > IOVA  
> > > > > and
> > > > > HVA.  When an mdev vendor driver calls vfio_pin_pages(), we GUP the  
> > > > HVA  
> > > > > to get an HPA and provide an array of HPA pfns back to the caller.  The
> > > > > other vGPU mdev vendor manages to make use of this without KVM...  
> > the  
> > > > > KVM interface used by GVT-g is GPL-only.  
> > > >
> > > > To be clear it's the assumption on the host-based hypervisors e.g. KVM.
> > > > GUP is a perfect example, which doesn't work for Xen since DomU's
> > > > memory doesn't belong to Dom0. VFIO in Dom0 has to find the HPA
> > > > through Xen specific hypercalls.
> > > >  
> > > > >  
> > > > > > So GVT-g
> > > > > > has to maintain an internal abstraction layer to support both Xen  
> > and  
> > > > > > KVM. Maybe someday we will re-consider introducing some  
> > hypervisor  
> > > > > > abstraction layer in VFIO, if this issue starts to hurt other devices  
> > and  
> > > > > > Xen guys are willing to support VFIO.  
> > > > >
> > > > > Once upon a time, we had a KVM specific device assignment interface,
> > > > > ie. legacy KVM device assignment.  We developed VFIO specifically to
> > get  
> > > > > KVM out of the business of being a (bad) device driver.  We do have
> > > > > some awareness and interaction between VFIO and KVM in the vfio-  
> > kvm  
> > > > > pseudo device, but we still try to keep those interfaces generic.  In
> > > > > some cases we're not very successful at that, see  
> > vfio_group_set_kvm(),  
> > > > > but that's largely just a mechanism to associate a cookie with a group
> > > > > to be consumed by the mdev vendor driver such that it can work with  
> > > > kvm  
> > > > > external to vfio.  I don't intend to add further hypervisor awareness
> > > > > to vfio.
> > > > >  
> > > > > > Back to this IOVA issue, I discussed with Yan and we found another
> > > > > > hypervisor-agnostic alternative, by learning from vhost. vhost is very
> > > > > > similar to VFIO - DMA also happens in the kernel, while it already
> > > > > > supports vIOMMU.
> > > > > >
> > > > > > Generally speaking, there are three paths of dirty page collection
> > > > > > in Qemu so far (as previously noted, Qemu always tracks the dirty
> > > > > > bitmap in GFN):  
> > > > >
> > > > > GFNs or simply PFNs within an AddressSpace?
> > > > >  
> > > > > > 1) Qemu-tracked memory writes (e.g. emulated DMAs). Dirty  
> > bitmaps  
> > > > > > are updated directly when the guest memory is being updated. For
> > > > > > example, PCI writes are completed through pci_dma_write, which
> > > > > > goes through vIOMMU to translate IOVA into GPA and then update
> > > > > > the bitmap through cpu_physical_memory_set_dirty_range.  
> > > > >
> > > > > Right, so the IOVA to GPA (GFN) occurs through an explicit translation
> > > > > on the IOMMU AddressSpace.
> > > > >  
> > > > > > 2) Memory writes that are not tracked by Qemu are collected by
> > > > > > registering .log_sync() callback, which is invoked in the dirty logging
> > > > > > process. Now there are two users: kvm and vhost.
> > > > > >
> > > > > >   2.1) KVM tracks CPU-side memory writes, through write-protection
> > > > > > or EPT A/D bits (+PML). This part is always based on GFN and  
> > returned  
> > > > > > to Qemu when kvm_log_sync is invoked;
> > > > > >
> > > > > >   2.2) vhost tracks kernel-side DMA writes, by interpreting vring
> > > > > > data structure. It maintains an internal iotlb which is synced with
> > > > > > Qemu vIOMMU through a specific interface:
> > > > > > 	- new vhost message type (VHOST_IOTLB_UPDATE/INVALIDATE)
> > > > > > for Qemu to keep vhost iotlb in sync
> > > > > > 	- new VHOST_IOTLB_MISS message to notify Qemu in case of
> > > > > > a miss in vhost iotlb.
> > > > > > 	- Qemu registers a log buffer to kernel vhost driver. The latter
> > > > > > update the buffer (using internal iotlb to get GFN) when serving  
> > vring  
> > > > > > descriptor.
> > > > > >
> > > > > > VFIO could also implement an internal iotlb, so vendor drivers can
> > > > > > utilize the iotlb to update the GFN-based dirty bitmap. Ideally we
> > > > > > don't need re-invent another iotlb protocol as vhost does. vIOMMU
> > > > > > already sends map/unmap ioctl cmds upon any change of IOVA
> > > > > > mapping. We may introduce a v2 map/unmap interface, allowing
> > > > > > Qemu to pass both {iova, gpa, hva} together to keep internal iotlb
> > > > > > in-sync. But we may also need a iotlb_miss_upcall interface, if VFIO
> > > > > > doesn't want to cache full-size vIOMMU mappings.
> > > > > >
> > > > > > Definitely this alternative needs more work and possibly less
> > > > > > performant (if maintaining a small size iotlb) than straightforward
> > > > > > calling into KVM interface. But the gain is also obvious, since it
> > > > > > is fully constrained with VFIO.
> > > > > >
> > > > > > Thoughts? :-)  
> > > > >
> > > > > So vhost must then be configuring a listener across system memory
> > > > > rather than only against the device AddressSpace like we do in vfio,
> > > > > such that it gets log_sync() callbacks for the actual GPA space rather
> > > > > than only the IOVA space.  OTOH, QEMU could understand that the  
> > > > device  
> > > > > AddressSpace has a translate function and apply the IOVA dirty bits to
> > > > > the system memory AddressSpace.  Wouldn't it make more sense for
> > > > > QEMU
> > > > > to perform a log_sync() prior to removing a MemoryRegionSection  
> > within  
> > > > > an AddressSpace and update the GPA rather than pushing GPA  
> > > > awareness  
> > > > > and potentially large tracking structures into the host kernel?  Thanks,
> > > > >  
> > > >  
> > >
> > > Hi, Alex,
> > >
> > > I moved back the VFIO related discussion to this thread, to not mix with
> > > vhost related discussions here.
> > >
> > > https://lists.nongnu.org/archive/html/qemu-devel/2019-  
> > 09/msg03126.html  
> > >
> > > Your latest reply still prefers to the userspace approach:
> > >  
> > > > > Same as last time, you're asking VFIO to be aware of an entirely new
> > > > > address space and implement tracking structures of that address  
> > space  
> > > > > to make life easier for QEMU.  Don't we typically push such complexity
> > > > > to userspace rather than into the kernel?  I'm not convinced.  Thanks,
> > > > >  
> > >
> > > I answered two points but didn't hear your further thoughts. Can you
> > > take a look and respond?
> > >
> > > The first point is about complexity and performance:  
> > > >
> > > > Is it really complex? No need of a new tracking structure. Just allowing
> > > > the MAP interface to carry a new parameter and then record it in the
> > > > existing vfio_dma objects.
> > > >
> > > > Note the frequency of guest DMA map/unmap could be very high. We
> > > > saw >100K invocations per second with a 40G NIC. To do the right
> > > > translation Qemu requires log_sync for every unmap, before the
> > > > mapping for logged dirty IOVA becomes stale. In current Kirti's patch,
> > > > each log_sync requires several system_calls through the migration
> > > > info, e.g. setting start_pfn/page_size/total_pfns and then reading
> > > > data_offset/data_size. That design is fine for doing log_sync in every
> > > > pre-copy round, but too costly if doing so for every IOVA unmap. If
> > > > small extension in kernel can lead to great overhead reduction,
> > > > why not?  
> > 
> > You're citing a workload that already performs abysmally with vfio and
> > vIOMMU, we cannot handle those rates efficiently with the current vfio
> > DMA API.  The current use cases of vIOMMU and vfio are predominantly
> > for nesting vfio, ex. DPDK/SPDK, where we assume the mappings are
> > relatively static or else performance problems are already very
> > apparent.  In that sort of model, I don't see that QEMU doing a
> > log_sync on unmap is really an issue, unmaps should be relatively
> > rare.  Of course I don't want to compound the issue, but the current
> > vfio DMA mapping interfaces need to be scrapped to make this remotely
> > performant even before we look at migration performance, so does it
> > really make sense to introduce GPAs for a workload the ioctls are
> > unsuited for?
> >   
> > > The second point is about write-protection:
> > >  
> > > > There is another value of recording GPA in VFIO. Vendor drivers
> > > > (e.g. GVT-g) may need to selectively write-protect guest memory
> > > > pages when interpreting certain workload descriptors. Those pages
> > > > are recorded in IOVA when vIOMMU is enabled, however the KVM
> > > > write-protection API only knows GPA. So currently vIOMMU must
> > > > be disabled on Intel vGPUs when GVT-g is enabled. To make it working
> > > > we need a way to translate IOVA into GPA in the vendor drivers.
> > > > There are two options. One is having KVM export a new API for such
> > > > translation purpose. But as you explained earlier it's not good to
> > > > have vendor drivers depend on KVM. The other is having VFIO
> > > > maintaining such knowledge through extended MAP interface,
> > > > then providing a uniform API for all vendor drivers to use.  
> > 
> > So the argument is that in order to interact with KVM (write protecting
> > guest memory) there's a missing feature (IOVA to GPA translation), but
> > we don't want to add an API to KVM for this feature because that would
> > create a dependency on KVM (for interacting with KVM), so lets add an
> > API to vfio instead.  That makes no sense to me.  What am I missing?
> > Thanks,
> >   
> 
> Then do you have a recommendation how such feature can be 
> implemented cleanly in vendor driver, without introducing direct
> dependency on KVM? 

I think the disconnect is that these sorts of extensions don't reflect
things that a physical device can actually do.  The idea of vfio is
that it's a userspace driver interface.  It provides a channel for the
user to interact with the device, map device resources, receive
interrupts, map system memory through the iommu, etc.  Mediated devices
augment this by replacing the physical device the user accesses with a
software virtualized device.  So then the question becomes why this
device virtualizing software, ie. the mdev vendor driver, needs to do
things that a physical device clearly cannot do.  For example, how can
a physical device write-protect portions of system memory?  Or even,
why would it need to?  It makes me suspect that mdev is being used to
bypass the hypervisor, or maybe fill in the gaps for hardware that
isn't as "mediation friendly" as it claims to be.

In the case of a physical device discovering an iova translation, this
is what device iotlbs are for, but as an acceleration and offload
mechanism for the system iommu rather than a lookup mechanism as seems
to be wanted here.  If we had a system iommu with dirty page tracking,
I believe that tracking would live in the iommu page tables and
therefore reflect dirty pages relative to iova.  We'd need to consume
those dirty page bits before we tear down the iova mappings, much like
we're suggesting QEMU do here.
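
From QEMU's side the ordering requirement boils down to something like
the sketch below (the dirty harvesting step is hypothetical, the unmap
ioctl is the existing one):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* hypothetical log_sync-style step, not an existing interface */
extern void harvest_dirty_bits(uint64_t iova, uint64_t size);

static void unmap_after_dirty_harvest(int container_fd, uint64_t iova,
                                      uint64_t size)
{
    struct vfio_iommu_type1_dma_unmap unmap = {
        .argsz = sizeof(unmap),
        .iova  = iova,
        .size  = size,
    };

    /* consume the dirty state while the iova->gpa translation exists */
    harvest_dirty_bits(iova, size);
    /* only then tear the mapping down */
    ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
}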

Unfortunately I also think that KVM and vhost are not really the best
examples of what we need to do for vfio.  KVM is intimately involved
with GPAs, so clearly dirty page tracking at that level is not an
issue.  Vhost tends to circumvent the viommu; it's trying to poke
directly into guest memory without the help of a physical iommu.  So
I can't say that I have much faith that QEMU is already properly wired
with respect to viommu and dirty page tracking, leaving open the
possibility that a log_sync on iommu region unmap is simply a gap in
the QEMU migration story.  The vfio migration interface we have on the
table seems like it could work, but QEMU needs an update and we need to
define the interface in terms of pfns relative to the address space.

If GPAs are still needed, what are they for?  The write-protect example
is clearly a hypervisor level interaction as I assume it's write
protection relative to the vCPU.  It's a hypervisor specific interface
to perform that write-protection, so why wouldn't we use a hypervisor
specific interface to collect the data to perform that operation?
IOW, if GVT-g already has a KVM dependency, why be concerned about
adding another GVT-g KVM dependency?  It seems like vfio is just a
potentially convenient channel, but as discussed above, vfio has no
business in GPAs because devices don't operate on GPAs and I've not
been sold that there's value in vfio getting involved in that address
space.  Convince me otherwise ;)  Thanks,

Alex


^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v8 01/13] vfio: KABI for migration interface
  2019-09-25 19:06                       ` Alex Williamson
@ 2019-09-26  3:07                         ` Tian, Kevin
  2019-09-26 21:33                           ` Alex Williamson
  0 siblings, 1 reply; 34+ messages in thread
From: Tian, Kevin @ 2019-09-26  3:07 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, qemu-devel, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	cohuck, shuangtai.tst, dgilbert, Wang,  Zhi A, mlevitsk, pasic,
	aik, Kirti Wankhede, eauger, felipe, jonathan.davies, Zhao,
	Yan Y, Liu, Changpeng, Ken.Xue

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Thursday, September 26, 2019 3:06 AM
[...]
> > > > The second point is about write-protection:
> > > >
> > > > > There is another value of recording GPA in VFIO. Vendor drivers
> > > > > (e.g. GVT-g) may need to selectively write-protect guest memory
> > > > > pages when interpreting certain workload descriptors. Those pages
> > > > > are recorded in IOVA when vIOMMU is enabled, however the KVM
> > > > > write-protection API only knows GPA. So currently vIOMMU must
> > > > > be disabled on Intel vGPUs when GVT-g is enabled. To make it
> working
> > > > > we need a way to translate IOVA into GPA in the vendor drivers.
> > > > > There are two options. One is having KVM export a new API for such
> > > > > translation purpose. But as you explained earlier it's not good to
> > > > > have vendor drivers depend on KVM. The other is having VFIO
> > > > > maintaining such knowledge through extended MAP interface,
> > > > > then providing a uniform API for all vendor drivers to use.
> > >
> > > So the argument is that in order to interact with KVM (write protecting
> > > guest memory) there's a missing feature (IOVA to GPA translation), but
> > > we don't want to add an API to KVM for this feature because that would
> > > create a dependency on KVM (for interacting with KVM), so lets add an
> > > API to vfio instead.  That makes no sense to me.  What am I missing?
> > > Thanks,
> > >
> >
> > Then do you have a recommendation how such feature can be
> > implemented cleanly in vendor driver, without introducing direct
> > dependency on KVM?
> 
> I think the disconnect is that these sorts of extensions don't reflect
> things that a physical device can actually do.  The idea of vfio is
> that it's a userspace driver interface.  It provides a channel for the
> user to interact with the device, map device resources, receive
> interrupts, map system memory through the iommu, etc.  Mediated
> devices
> augment this by replacing the physical device the user accesses with a
> software virtualized device.  So then the question becomes why this
> device virtualizing software, ie. the mdev vendor driver, needs to do
> things that a physical device clearly cannot do.  For example, how can
> a physical device write-protect portions of system memory?  Or even,
> why would it need to?  It makes me suspect that mdev is being used to
> bypass the hypervisor, or maybe fill in the gaps for hardware that
> isn't as "mediation friendly" as it claims to be.

We do have one such example on Intel GPU. To support direct cmd
submission from userspace (SVA), kernel driver allocates a doorbell
page (in system memory) for each application and then registers
the page to the GPU. Once the doorbell is armed, the GPU starts
to monitor CPU writes to that page. Then the application can ring the 
GPU by simply writing to the doorbell page to submit cmds. This
possibly makes sense only for integrated devices.
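
(Conceptually the application side is as trivial as the sketch below;
illustrative code only, not the actual i915 interface:)

#include <stdint.h>
#include <sys/types.h>
#include <sys/mman.h>

/* 'drm_fd', 'db_offset' and 'new_tail' are assumed to come from the
 * driver's (hypothetical) setup interface and the ring state. */
static void ring_doorbell(int drm_fd, off_t db_offset, uint32_t new_tail)
{
    volatile uint32_t *doorbell = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, drm_fd, db_offset);

    /* commands were already written to the shared ring buffer */
    __sync_synchronize();   /* make them visible before ringing */
    *doorbell = new_tail;   /* the GPU monitors CPU writes to this page */
}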

In case direct submission is not allowed in a mediated device
(some auditing work is required in GVT-g), we need to write-protect 
the doorbell page with hypervisor help to mimic the hardware 
behavior. We have prototype work internally, but haven't sent it out.

> 
> In the case of a physical device discovering an iova translation, this
> is what device iotlbs are for, but as an acceleration and offload
> mechanism for the system iommu rather than a lookup mechanism as
> seems
> to be wanted here.  If we had a system iommu with dirty page tracking,
> I believe that tracking would live in the iommu page tables and
> therefore reflect dirty pages relative to iova.  We'd need to consume
> those dirty page bits before we tear down the iova mappings, much like
> we're suggesting QEMU do here.

Yes. There are two cases:

1) iova shadowing. Say using only 2nd level as today. Here the dirty 
bits are associated to iova. When Qemu is revised to invoke log_sync 
before tearing down any iova mapping, vfio can get the dirty info 
from iommu driver for affected range.

2) iova nesting, where iova->gpa is in 1st level and gpa->hpa is in
2nd level. In that case the iova carried in the map/unmap ioctl is
actually gpa, thus the dirty bits are associated to gpa. In such case,
Qemu should continue to consume gpa-based dirty bitmap, as if
viommu is disabled.

> 
> Unfortunately I also think that KVM and vhost are not really the best
> examples of what we need to do for vfio.  KVM is intimately involved
> with GPAs, so clearly dirty page tracking at that level is not an
> issue.  Vhost tends to circumvent the viommu; it's trying to poke
> directly into guest memory without the help of a physical iommu.  So
> I can't say that I have much faith that QEMU is already properly wired
> with respect to viommu and dirty page tracking, leaving open the
> possibility that a log_sync on iommu region unmap is simply a gap in
> the QEMU migration story.  The vfio migration interface we have on the
> table seems like it could work, but QEMU needs an update and we need to
> define the interface in terms of pfns relative to the address space.

Yan and I had a brief discussion on this. Besides the basic change of
doing log_sync for every iova unmap, there are two other gaps to
be fixed:

1) Today the iova->gpa mapping is maintained in two places: viommu 
page table in guest memory and viotlb in Qemu. viotlb is filled when 
a walk on the viommu page table happens, due to emulation of a virtual
DMA operation from emulated devices or request from vhost devices. 
It's not affected by passthrough device activities though, since the latter 
goes through physical iommu. Per iommu spec, guest iommu driver 
first clears the viommu page table, followed by viotlb invalidation 
request. It's the latter being trapped by Qemu, then vfio is notified 
at that point, where iova->gpa translation will simply fail since no 
valid mapping in viommu page table and very likely no hit in viotlb. 
To fix this gap, we need extend Qemu to cache all the valid iova 
mappings in viommu page table, similar to how vfio does (rough sketch below).

2) Then there will be parallel log_sync requests on each vfio device. 
One is from the vcpu thread, when iotlb invalidation request is being 
emulated. The other is from the migration thread, where log_sync is 
requested for the entire guest memory in iterative copies. The 
contention among multiple vCPU threads is already protected through 
iommu lock, but we didn't find such thing between migration thread 
and vcpu threads. Maybe we overlooked something, but ideally the 
whole iova address space should be locked when the migration thread 
is doing mega-sync/translation.

+Yi and Peter for their opinions.
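
To illustrate what 1) above means in practice, the caching could look
roughly like this (made-up code, not existing Qemu, just to show the
shape of it):

#include "qemu/osdep.h"
#include "exec/memory.h"

/* Remember iova->gpa when the mapping is created, so the translation is
 * still available when the invalidation arrives after the guest has
 * already cleared its vIOMMU page table. */
typedef struct IOVAMapEntry {
    guint64 iova;
    guint64 gpa;
    guint64 size;
} IOVAMapEntry;

static GHashTable *iova_map;    /* guint64 iova -> IOVAMapEntry */

static void iova_cache_init(void)
{
    iova_map = g_hash_table_new_full(g_int64_hash, g_int64_equal,
                                     NULL, g_free);
}

static void viommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
{
    if (iotlb->perm != IOMMU_NONE) {                    /* MAP */
        IOVAMapEntry *e = g_new0(IOVAMapEntry, 1);

        e->iova = iotlb->iova;
        e->gpa  = iotlb->translated_addr;
        e->size = iotlb->addr_mask + 1;
        g_hash_table_insert(iova_map, &e->iova, e);
    } else {                                            /* UNMAP */
        IOVAMapEntry *e = g_hash_table_lookup(iova_map, &iotlb->iova);

        if (e) {
            /* log_sync the iova range, translate the dirty pfns to gpa
             * using e->gpa, then forget the mapping */
            g_hash_table_remove(iova_map, &e->iova);
        }
    }
}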

> 
> If GPAs are still needed, what are they for?  The write-protect example
> is clearly a hypervisor level interaction as I assume it's write
> protection relative to the vCPU.  It's a hypervisor specific interface
> to perform that write-protection, so why wouldn't we use a hypervisor
> specific interface to collect the data to perform that operation?
> IOW, if GVT-g already has a KVM dependency, why be concerned about
> adding another GVT-g KVM dependency?  It seems like vfio is just a

This is possibly the way that we have to go, based on discussions
so far. Earlier I just held the same argument as you emphasized
for vfio - although there are existing KVM dependencies, we want
to minimize them. :-) Another worry is that other vendor drivers may
have similar requirements; if so, can we invent some generic way
to avoid pushing each of them to do the same tricky thing again?
Of course, we may revisit it later if this issue does become a common 
requirement.

> potentially convenient channel, but as discussed above, vfio has no
> business in GPAs because devices don't operate on GPAs and I've not
> been sold that there's value in vfio getting involved in that address
> space.  Convince me otherwise ;)  Thanks,
> 

Looks like none of my arguments is convincing to you :-), so we'll move
on to investigating what should be changed in qemu to support your
proposal (as discussed above). While this part is on-going, let me
have a last try on my original idea. ;) Just curious what your
further thoughts are regarding the earlier doorbell monitoring
example, i.e. a device operating on GPA. If it's an Intel-GPU-only thing,
yes we can still fix it in GVT-g itself as you suggested. But I'm just not
sure about other integrated devices, and also new accelerators
with a coherent bus connected to the cpu package. Also we don't need
to call it GPA - it could be named user_target_address: the address the
iova is mapped to, and the address space that userspace expects
the device to operate in for purposes (logging, monitoring, etc.) other
than dma (using iova) and accessing userspace/guest
memory (hva).

Thanks
Kevin


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v8 01/13] vfio: KABI for migration interface
  2019-09-26  3:07                         ` Tian, Kevin
@ 2019-09-26 21:33                           ` Alex Williamson
  2019-10-24 11:41                             ` Tian, Kevin
  0 siblings, 1 reply; 34+ messages in thread
From: Alex Williamson @ 2019-09-26 21:33 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Zhengxiao.zx, qemu-devel, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	cohuck, shuangtai.tst, dgilbert, Wang, Zhi A, mlevitsk, pasic,
	aik, Kirti Wankhede, eauger, felipe, jonathan.davies, Zhao,
	Yan Y, Liu, Changpeng, Ken.Xue

On Thu, 26 Sep 2019 03:07:08 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Thursday, September 26, 2019 3:06 AM  
> [...]
> > > > > The second point is about write-protection:
> > > > >  
> > > > > > There is another value of recording GPA in VFIO. Vendor drivers
> > > > > > (e.g. GVT-g) may need to selectively write-protect guest memory
> > > > > > pages when interpreting certain workload descriptors. Those pages
> > > > > > are recorded in IOVA when vIOMMU is enabled, however the KVM
> > > > > > write-protection API only knows GPA. So currently vIOMMU must
> > > > > > be disabled on Intel vGPUs when GVT-g is enabled. To make it  
> > working  
> > > > > > we need a way to translate IOVA into GPA in the vendor drivers.
> > > > > > There are two options. One is having KVM export a new API for such
> > > > > > translation purpose. But as you explained earlier it's not good to
> > > > > > have vendor drivers depend on KVM. The other is having VFIO
> > > > > > maintaining such knowledge through extended MAP interface,
> > > > > > then providing a uniform API for all vendor drivers to use.  
> > > >
> > > > So the argument is that in order to interact with KVM (write protecting
> > > > guest memory) there's a missing feature (IOVA to GPA translation), but
> > > > we don't want to add an API to KVM for this feature because that would
> > > > create a dependency on KVM (for interacting with KVM), so lets add an
> > > > API to vfio instead.  That makes no sense to me.  What am I missing?
> > > > Thanks,
> > > >  
> > >
> > > Then do you have a recommendation how such feature can be
> > > implemented cleanly in vendor driver, without introducing direct
> > > dependency on KVM?  
> > 
> > I think the disconnect is that these sorts of extensions don't reflect
> > things that a physical device can actually do.  The idea of vfio is
> > that it's a userspace driver interface.  It provides a channel for the
> > user to interact with the device, map device resources, receive
> > interrupts, map system memory through the iommu, etc.  Mediated
> > devices
> > augment this by replacing the physical device the user accesses with a
> > software virtualized device.  So then the question becomes why this
> > device virtualizing software, ie. the mdev vendor driver, needs to do
> > things that a physical device clearly cannot do.  For example, how can
> > a physical device write-protect portions of system memory?  Or even,
> > why would it need to?  It makes me suspect that mdev is being used to
> > bypass the hypervisor, or maybe fill in the gaps for hardware that
> > isn't as "mediation friendly" as it claims to be.  
> 
> We do have one such example on Intel GPU. To support direct cmd
> submission from userspace (SVA), kernel driver allocates a doorbell
> page (in system memory) for each application and then registers
> the page to the GPU. Once the doorbell is armed, the GPU starts
> to monitor CPU writes to that page. Then the application can ring the 
> GPU by simply writing to the doorbell page to submit cmds. This
> possibly makes sense only for integrated devices.
> 
> In case direct submission is not allowed in a mediated device
> (some auditing work is required in GVT-g), we need to write-protect 
> the doorbell page with hypervisor help to mimic the hardware 
> behavior. We have prototype work internally, but haven't sent it out.

What would it look like for QEMU to orchestrate this?  Maybe the mdev
device could expose a doorbell page as a device specific region.
Possibly a quirk in QEMU vfio-pci could detect the guest driver
registering a doorbell (or vendor driver could tell QEMU via a device
specific interrupt) and setup a MemoryRegion overlay of the doorbell
page in the guest.  If the physical GPU has the resources, maybe the
doorbell page is real, otherwise it could be emulated in the vendor
driver.  Trying to do this without QEMU seems to be where we're running
into trouble and is what I'd classify as bypassing the hypervisor.
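
Hand-waving a bit, the QEMU end might look roughly like below (the
memory API calls are real, everything else is invented for the sake of
example):

#include "qemu/osdep.h"
#include "exec/memory.h"
#include "exec/address-spaces.h"

/*
 * Invented quirk: 'db_ptr' is the mmap of a device specific doorbell
 * region exposed by the vendor driver, 'gpa' is where the guest driver
 * registered the doorbell, reported to QEMU by some vendor specific
 * means (e.g. a device specific interrupt).
 */
static void vfio_doorbell_quirk_enable(Object *owner, hwaddr gpa,
                                       void *db_ptr, uint64_t db_size)
{
    MemoryRegion *db_mr = g_new0(MemoryRegion, 1);

    memory_region_init_ram_device_ptr(db_mr, owner, "vfio-doorbell-quirk",
                                      db_size, db_ptr);
    /* the overlay wins over the guest RAM backing the same GPA range */
    memory_region_add_subregion_overlap(get_system_memory(), gpa, db_mr, 1);
}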

> > In the case of a physical device discovering an iova translation, this
> > is what device iotlbs are for, but as an acceleration and offload
> > mechanism for the system iommu rather than a lookup mechanism as
> > seems
> > to be wanted here.  If we had a system iommu with dirty page tracking,
> > I believe that tracking would live in the iommu page tables and
> > therefore reflect dirty pages relative to iova.  We'd need to consume
> > those dirty page bits before we tear down the iova mappings, much like
> > we're suggesting QEMU do here.  
> 
> Yes. There are two cases:
> 
> 1) iova shadowing. Say using only 2nd level as today. Here the dirty 
> bits are associated to iova. When Qemu is revised to invoke log_sync 
> before tearing down any iova mapping, vfio can get the dirty info 
> from iommu driver for affected range.

Maybe we need two mechanisms, log_sync for the "occasional" polling of
dirty bits and an UNMAP_DMA ioctl extension that allows the user to
provide a buffer into which the unmap ioctl would set dirty bits.
Userspace could potentially chunk MAP/UNMAP_DMA calls in order to bound
the size of the bitmap buffer (modulo the difficulties of assuming
physical page alignment).  The QEMU vfio-pci driver would then perform
the translation and mark the GPA pages dirty.  Seems that might relieve
your efficiency concerns.  The unpin path and unpin notifier would need
to relay dirty info for mdev support, where a generic implementation
might simply assume everything that has been or is currently pinned by
mdev is dirty.
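
Something along these lines, purely for illustration (only
argsz/flags/iova/size exist in the current uapi; the flag and the
trailing bitmap fields are made up here):

struct vfio_iommu_type1_dma_unmap {
	__u32	argsz;
	__u32	flags;
/* made-up flag: kernel fills the user supplied bitmap before unmapping */
#define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP	(1 << 0)
	__u64	iova;		/* IO virtual address */
	__u64	size;		/* Size of mapping (bytes) */
	/* new fields, only valid with the flag above */
	__u64	bitmap_pgsize;	/* page size each bit represents */
	__u64	bitmap_size;	/* bytes available at bitmap */
	__u64	bitmap;		/* userspace pointer, one bit per page */
};

QEMU would size the bitmap for the range being unmapped, let the kernel
fill it as part of the unmap, then translate the set bits through the
vIOMMU and mark the corresponding GPA pages dirty.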

Now I'm wondering if we should be consolidating vfio dirty page
tracking per container rather than per device if we do something like
this...

> 2) iova nesting, where iova->gpa is in 1st level and gpa->hpa is in
> 2nd level. In that case the iova carried in the map/unmap ioctl is
> actually gpa, thus the dirty bits are associated to gpa. In such case,
> Qemu should continue to consume gpa-based dirty bitmap, as if
> viommu is disabled.

Seems again that it's QEMU that really knows which AddressSpace that
the kernel vfio is operating in and how to apply it to a dirty bitmap.

> > Unfortunately I also think that KVM and vhost are not really the best
> > examples of what we need to do for vfio.  KVM is intimately involved
> > with GPAs, so clearly dirty page tracking at that level is not an
> > issue.  Vhost tends to circumvent the viommu; it's trying to poke
> > directly into guest memory without the help of a physical iommu.  So
> > I can't say that I have much faith that QEMU is already properly wired
> > with respect to viommu and dirty page tracking, leaving open the
> > possibility that a log_sync on iommu region unmap is simply a gap in
> > the QEMU migration story.  The vfio migration interface we have on the
> > table seems like it could work, but QEMU needs an update and we need to
> > define the interface in terms of pfns relative to the address space.  
> 
> Yan and I had a brief discussion on this. Besides the basic change of
> doing log_sync for every iova unmap, there are two other gaps to
> be fixed:
> 
> 1) Today the iova->gpa mapping is maintained in two places: viommu 
> page table in guest memory and viotlb in Qemu. viotlb is filled when 
> a walk on the viommu page table happens, due to emulation of a virtual
> DMA operation from emulated devices or request from vhost devices. 
> It's not affected by passthrough device activities though, since the latter 
> goes through physical iommu. Per iommu spec, guest iommu driver 
> first clears the viommu page table, followed by viotlb invalidation 
> request. It's the latter being trapped by Qemu, then vfio is notified 
> at that point, where iova->gpa translation will simply fail since no 
> valid mapping in viommu page table and very likely no hit in viotlb. 
> To fix this gap, we need extend Qemu to cache all the valid iova 
> mappings in viommu page table, similar to how vfio does.
> 
> 2) Then there will be parallel log_sync requests on each vfio device. 
> One is from the vcpu thread, when iotlb invalidation request is being 
> emulated. The other is from the migration thread, where log_sync is 
> requested for the entire guest memory in iterative copies. The 
> contention among multiple vCPU threads is already protected through 
> iommu lock, but we didn't find such thing between migration thread 
> and vcpu threads. Maybe we overlooked something, but ideally the 
> whole iova address space should be locked when the migration thread 
> is doing mega-sync/translation.
> 
> +Yi and Peter for their opinions.

Good points, not sure I have anything to add to that.  As above, we can
think about whether the device or the container is the right place to
do dirty page tracking.

> > If GPAs are still needed, what are they for?  The write-protect example
> > is clearly a hypervisor level interaction as I assume it's write
> > protection relative to the vCPU.  It's a hypervisor specific interface
> > to perform that write-protection, so why wouldn't we use a hypervisor
> > specific interface to collect the data to perform that operation?
> > IOW, if GVT-g already has a KVM dependency, why be concerned about
> > adding another GVT-g KVM dependency?  It seems like vfio is just a  
> 
> This is possibly the way that we have to go, based on discussions
> so far. Earlier I just held the same argument as you emphasized
> for vfio - although there are existing KVM dependencies, we want
> to minimize them. :-) Another worry is that other vendor drivers may
> have similar requirements; if so, can we invent some generic way
> to avoid pushing each of them to do the same tricky thing again?
> Of course, we may revisit it later if this issue does become a common 
> requirement.

I know we don't see quite the same on the degree to which vfio has KVM
dependencies, but it's possible, and regularly done, to use vfio without
KVM.  The same cannot be said of GVT-g.  Therefore it's not very
motivating for me to entertain a proposal for a new vfio interface
whose main purpose is to fill in a gap in the already existing GVT-g
dependency on KVM (and in no way reduce that existing KVM dependency).
I maintain vfio's lack of dependencies on KVM either way ;)
 
> > potentially convenient channel, but as discussed above, vfio has no
> > business in GPAs because devices don't operate on GPAs and I've not
> > been sold that there's value in vfio getting involved in that
> > address space.  Convince me otherwise ;)  Thanks,
> >   
> 
> Looks like none of my arguments is convincing to you :-), so we'll move
> on to investigating what should be changed in qemu to support your
> proposal (as discussed above). While this part is on-going, let me
> have a last try on my original idea. ;) Just curious what your
> further thoughts are regarding the earlier doorbell monitoring
> example, i.e. a device operating on GPA. If it's an Intel-GPU-only
> thing, yes we can still fix it in GVT-g itself as you suggested. But
> I'm just not sure about other integrated devices, and also new
> accelerators with a coherent bus connected to the cpu package. Also we
> don't need to call it GPA - it could be named user_target_address:
> the address the iova is mapped to, and the address space that userspace
> expects the device to operate in for purposes (logging, monitoring,
> etc.) other than dma (using iova) and accessing
> userspace/guest memory (hva).

My impression of the doorbell example is that we're taking the wrong
approach to solving how that interaction should work.  To setup an
interaction between a device and a vCPU, we should be going through
QEMU.  This is the model we've used from the start with vfio.  KVM is
an accelerator, but it's up to QEMU to connect KVM and vfio together.
I'm curious if there's any merit to my proposal above about how this
could be redesigned.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v8 01/13] vfio: KABI for migration interface
  2019-09-26 21:33                           ` Alex Williamson
@ 2019-10-24 11:41                             ` Tian, Kevin
  0 siblings, 0 replies; 34+ messages in thread
From: Tian, Kevin @ 2019-10-24 11:41 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, qemu-devel, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	cohuck, shuangtai.tst, dgilbert, Wang,  Zhi A, mlevitsk, pasic,
	aik, Kirti Wankhede, eauger, felipe, jonathan.davies, Zhao,
	Yan Y, Liu, Changpeng, Ken.Xue

Sorry for the late reply...

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, September 27, 2019 5:33 AM
> 
> On Thu, 26 Sep 2019 03:07:08 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Thursday, September 26, 2019 3:06 AM
> > [...]
> > > > > > The second point is about write-protection:
> > > > > >
> > > > > > > There is another value of recording GPA in VFIO. Vendor drivers
> > > > > > > (e.g. GVT-g) may need to selectively write-protect guest memory
> > > > > > > pages when interpreting certain workload descriptors. Those
> pages
> > > > > > > are recorded in IOVA when vIOMMU is enabled, however the
> KVM
> > > > > > > write-protection API only knows GPA. So currently vIOMMU must
> > > > > > > be disabled on Intel vGPUs when GVT-g is enabled. To make it
> > > working
> > > > > > > we need a way to translate IOVA into GPA in the vendor drivers.
> > > > > > > There are two options. One is having KVM export a new API for
> such
> > > > > > > translation purpose. But as you explained earlier it's not good to
> > > > > > > have vendor drivers depend on KVM. The other is having VFIO
> > > > > > > maintaining such knowledge through extended MAP interface,
> > > > > > > then providing a uniform API for all vendor drivers to use.
> > > > >
> > > > > So the argument is that in order to interact with KVM (write
> protecting
> > > > > guest memory) there's a missing feature (IOVA to GPA translation),
> but
> > > > > we don't want to add an API to KVM for this feature because that
> would
> > > > > create a dependency on KVM (for interacting with KVM), so lets add
> an
> > > > > API to vfio instead.  That makes no sense to me.  What am I missing?
> > > > > Thanks,
> > > > >
> > > >
> > > > Then do you have a recommendation how such feature can be
> > > > implemented cleanly in vendor driver, without introducing direct
> > > > dependency on KVM?
> > >
> > > I think the disconnect is that these sorts of extensions don't reflect
> > > things that a physical device can actually do.  The idea of vfio is
> > > that it's a userspace driver interface.  It provides a channel for the
> > > user to interact with the device, map device resources, receive
> > > interrupts, map system memory through the iommu, etc.  Mediated
> > > devices
> > > augment this by replacing the physical device the user accesses with a
> > > software virtualized device.  So then the question becomes why this
> > > device virtualizing software, ie. the mdev vendor driver, needs to do
> > > things that a physical device clearly cannot do.  For example, how can
> > > a physical device write-protect portions of system memory?  Or even,
> > > why would it need to?  It makes me suspect that mdev is being used to
> > > bypass the hypervisor, or maybe fill in the gaps for hardware that
> > > isn't as "mediation friendly" as it claims to be.
> >
> > We do have one such example on Intel GPU. To support direct cmd
> > submission from userspace (SVA), kernel driver allocates a doorbell
> > page (in system memory) for each application and then registers
> > the page to the GPU. Once the doorbell is armed, the GPU starts
> > to monitor CPU writes to that page. Then the application can ring the
> > GPU by simply writing to the doorbell page to submit cmds. This
> > possibly makes sense only for integrated devices.
> >
> > In case direct submission is not allowed in a mediated device
> > (some auditing work is required in GVT-g), we need to write-protect
> > the doorbell page with hypervisor help to mimic the hardware
> > behavior. We have prototype work internally, but haven't sent it out.
> 
> What would it look like for QEMU to orchestrate this?  Maybe the mdev
> device could expose a doorbell page as a device specific region.
> Possibly a quirk in QEMU vfio-pci could detect the guest driver
> registering a doorbell (or vendor driver could tell QEMU via a device
> specific interrupt) and setup a MemoryRegion overlay of the doorbell
> page in the guest.  If the physical GPU has the resources, maybe the
> doorbell page is real, otherwise it could be emulated in the vendor
> driver.  Trying to do this without QEMU seems to be where we're running
> into trouble and is what I'd classify as bypassing the hypervisor.

OK, let's think about this approach. We are used to solving all the
problems in the kernel (thus relying deeply on what VFIO affords), so
offloading some mediation knowledge to userspace would be a different
story. It is probably fine for doorbell virtualization, as the
registration interface is not complex, but it may not work well for
device page table shadowing: that knowledge is hard to split between
user and kernel space, and simply notifying QEMU to request
write-protection is likely slow. Still, as said, this is a good
direction for further study. :-)
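
Just to make sure I read your proposal correctly, below is a very rough
sketch of the QEMU side. It assumes the mdev vendor driver exposes the
doorbell page as a device specific region and that QEMU somehow learns
which guest page the driver registered (e.g. via a device specific
interrupt); the function and all names are made up for illustration:

#include "qemu/osdep.h"
#include "exec/address-spaces.h"
#include "exec/memory.h"

/*
 * Sketch only: overlay the vendor-provided doorbell page on top of the
 * guest page that the guest driver registered with the (virtual) GPU,
 * so guest writes land in the real or emulated doorbell page.
 */
static void gfx_doorbell_overlay(Object *owner, hwaddr doorbell_gpa,
                                 void *db_ptr)
{
    MemoryRegion *mr = g_new0(MemoryRegion, 1);

    /* db_ptr is the mmap of the device specific doorbell region;
     * assume a 4K doorbell page */
    memory_region_init_ram_device_ptr(mr, owner, "gfx-doorbell",
                                      4096, db_ptr);

    /* higher priority than plain RAM, so it shadows the guest page */
    memory_region_add_subregion_overlap(get_system_memory(),
                                        doorbell_gpa, mr, 1);
}

If the physical GPU cannot back the page, db_ptr could instead point
at memory that the vendor driver polls or emulates, as you said.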

> 
> > > In the case of a physical device discovering an iova translation, this
> > > is what device iotlbs are for, but as an acceleration and offload
> > > mechanism for the system iommu rather than a lookup mechanism as seems
> > > to be wanted here.  If we had a system iommu with dirty page tracking,
> > > I believe that tracking would live in the iommu page tables and
> > > therefore reflect dirty pages relative to iova.  We'd need to consume
> > > those dirty page bits before we tear down the iova mappings, much like
> > > we're suggesting QEMU do here.
> >
> > Yes. There are two cases:
> >
> > 1) iova shadowing. Say using only 2nd level as today. Here the dirty
> > bits are associated to iova. When Qemu is revised to invoke log_sync
> > before tearing down any iova mapping, vfio can get the dirty info
> > from iommu driver for affected range.
> 
> Maybe we need two mechanisms, log_sync for the "occasional" polling of
> dirty bits and an UNMAP_DMA ioctl extension that allows the user to
> provide a buffer into which the unmap ioctl would set dirty bits.
> Userspace could potentially chunk MAP/UNMAP_DMA calls in order to bound
> the size of the bitmap buffer (modulo the difficulties of assuming
> physical page alignment).  The QEMU vfio-pci driver would then perform
> the translation and mark the GPA pages dirty.  Seems that might relieve
> your efficiency concerns.  The unpin path and unpin notifier would need

Yes, this approach addresses my concern. It carries the dirty info in
existing syscalls and is thus much better than tying it to the
migration region interface.
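
Just to confirm my understanding of the extension you described,
something like below is what I picture (purely illustrative - the flag
and the bitmap field are made up, not a concrete uapi proposal):

#include <linux/types.h>

/*
 * Hypothetical extension of the existing UNMAP_DMA argument: with the
 * new flag set, the kernel fills a user-provided bitmap with one bit
 * per page dirtied in [iova, iova + size) before tearing down the
 * mapping.
 */
struct vfio_iommu_type1_dma_unmap {
	__u32	argsz;
	__u32	flags;
#define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP	(1 << 0)	/* made up */
	__u64	iova;		/* IO virtual address */
	__u64	size;		/* Size of mapping (bytes) */
	__u64	bitmap;		/* pointer to user bitmap buffer */
};

Qemu would then walk the returned bitmap, translate each dirty iova
back to GPA through its cached viommu mappings, and set the
corresponding bits in the migration dirty bitmap.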

> to relay dirty info for mdev support, where a generic implementation
> might simply assume everything that has been or is currently pinned by
> mdev is dirty.
> 
> Now I'm wondering if we should be consolidating vfio dirty page
> tracking per container rather than per device if we do something like
> this...

I sort of agree. UNMAP_DMA is per container by definition. If we
continue to keep the log_sync path per device (through the migration
region), it looks inconsistent. Possibly we should just keep the
migration region solely for poking the device state, while leaving
dirty page tracking to the container interface.
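
On the Qemu side that could end up looking roughly like below, assuming
a hypothetical per-container VFIO_IOMMU_GET_DIRTY_BITMAP ioctl (the
ioctl and its argument structure are made up; the caller is assumed to
have already resolved the guest range to its ram_addr_t offset):

#include "qemu/osdep.h"
#include <sys/ioctl.h>
#include <linux/vfio.h>
#include "exec/ram_addr.h"
#include "qemu/bitmap.h"
#include "hw/vfio/vfio-common.h"

/* made-up ioctl and argument structure, for illustration only */
#define VFIO_IOMMU_GET_DIRTY_BITMAP  _IO(VFIO_TYPE, VFIO_BASE + 100)
struct vfio_iommu_dirty_bitmap {
    __u32 argsz;
    __u32 flags;
    __u64 iova;
    __u64 size;
    __u64 bitmap;       /* pointer to user bitmap buffer */
};

/*
 * Sketch only: query the container for pages dirtied in
 * [iova, iova + size) and feed them into Qemu's dirty memory tracking.
 */
static int vfio_container_log_sync(VFIOContainer *container, hwaddr iova,
                                   ram_addr_t ram_addr, uint64_t size)
{
    unsigned long pages = size >> TARGET_PAGE_BITS;
    unsigned long *bitmap = bitmap_new(pages);
    struct vfio_iommu_dirty_bitmap req = {
        .argsz = sizeof(req),
        .iova = iova,
        .size = size,
        .bitmap = (uintptr_t)bitmap,
    };
    int ret = 0;

    if (ioctl(container->fd, VFIO_IOMMU_GET_DIRTY_BITMAP, &req)) {
        ret = -errno;
    } else {
        cpu_physical_memory_set_dirty_lebitmap(bitmap, ram_addr, pages);
    }
    g_free(bitmap);
    return ret;
}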

> 
> > 2) iova nesting, where iova->gpa is in 1st level and gpa->hpa is in
> > 2nd level. In that case the iova carried in the map/unmap ioctl is
> > actually gpa, thus the dirty bits are associated to gpa. In such case,
> > Qemu should continue to consume gpa-based dirty bitmap, as if
> > viommu is disabled.
> 
> Seems again that it's QEMU that really knows which AddressSpace that
> the kernel vfio is operating in and how to apply it to a dirty bitmap.
> 
> > > Unfortunately I also think that KVM and vhost are not really the best
> > > examples of what we need to do for vfio.  KVM is intimately involved
> > > with GPAs, so clearly dirty page tracking at that level is not an
> > > issue.  Vhost tends to circumvent the viommu; it's trying to poke
> > > directly into guest memory without the help of a physical iommu.  So
> > > I can't say that I have much faith that QEMU is already properly wired
> > > with respect to viommu and dirty page tracking, leaving open the
> > > possibility that a log_sync on iommu region unmap is simply a gap in
> > > the QEMU migration story.  The vfio migration interface we have on the
> > > table seems like it could work, but QEMU needs an update and we need to
> > > define the interface in terms of pfns relative to the address space.
> >
> > Yan and I had a brief discussion on this. Besides the basic change of
> > doing log_sync for every iova unmap, there are two other gaps to
> > be fixed:
> >
> > 1) Today the iova->gpa mapping is maintained in two places: viommu
> > page table in guest memory and viotlb in Qemu. viotlb is filled when
> > a walk on the viommu page table happens, due to emulation of a virtual
> > DMA operation from emulated devices or request from vhost devices.
> > It's not affected by passthrough device activities though, since the latter
> > goes through physical iommu. Per iommu spec, guest iommu driver
> > first clears the viommu page table, followed by viotlb invalidation
> > request. It's the latter being trapped by Qemu, then vfio is notified
> > at that point, where iova->gpa translation will simply fail since no
> > valid mapping in viommu page table and very likely no hit in viotlb.
> > To fix this gap, we need to extend Qemu to cache all the valid iova
> > mappings from the viommu page table, similar to how vfio does it.
> >
> > 2) Then there will be parallel log_sync requests on each vfio device.
> > One is from the vcpu thread, when iotlb invalidation request is being
> > emulated. The other is from the migration thread, where log_sync is
> > requested for the entire guest memory in iterative copies. The
> > contention among multiple vCPU threads is already protected through
> > iommu lock, but we didn't find such thing between migration thread
> > and vcpu threads. Maybe we overlooked something, but ideally the
> > whole iova address space should be locked when the migration thread
> > is doing mega-sync/translation.
> >
> > +Yi and Peter for their opinions.
> 
> Good points, not sure I have anything to add to that.  As above, we can
> think about whether the device or the container is the right place to
> do dirty page tracking.

Sure. And the above two points are orthogonal to whether it is done at
the container or the device level. We haven't had time to continue the
investigation since the last post; we will resume after KVM Forum.
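
For reference, roughly what we have in mind for gap 1) is a shadow
cache on the Qemu side of all valid iova->gpa mappings, kept in sync
with the guest viommu page table, so the translation is still
available at invalidation time even though the guest has already
cleared its page table. A minimal sketch (all names made up; the GTree
usage is similar in spirit to util/iova-tree.c):

#include "qemu/osdep.h"
#include "exec/hwaddr.h"

typedef struct VIOMMUMapping {
    hwaddr iova;
    hwaddr gpa;
    hwaddr size;
} VIOMMUMapping;

/* cached ranges never overlap, so "any overlap == equal" is safe here;
 * the tree is created with
 * g_tree_new_full(viommu_mapping_cmp, NULL, g_free, NULL) */
static gint viommu_mapping_cmp(gconstpointer a, gconstpointer b,
                               gpointer data)
{
    const VIOMMUMapping *m1 = a, *m2 = b;

    if (m1->iova + m1->size <= m2->iova) {
        return -1;
    }
    if (m2->iova + m2->size <= m1->iova) {
        return 1;
    }
    return 0;
}

/* called when the emulated viommu installs a new translation */
static void viommu_cache_insert(GTree *cache, hwaddr iova, hwaddr gpa,
                                hwaddr size)
{
    VIOMMUMapping *m = g_new0(VIOMMUMapping, 1);

    m->iova = iova;
    m->gpa = gpa;
    m->size = size;
    g_tree_insert(cache, m, m);
}

/* called from the invalidation/log_sync path, before the mapping and
 * the cache entry are dropped */
static bool viommu_cache_translate(GTree *cache, hwaddr iova, hwaddr *gpa)
{
    VIOMMUMapping key = { .iova = iova, .size = 1 };
    VIOMMUMapping *m = g_tree_lookup(cache, &key);

    if (!m) {
        return false;
    }
    *gpa = m->gpa + (iova - m->iova);
    return true;
}

Gap 2) would then largely come down to taking the same lock around this
cache in both the vcpu (invalidation) path and the migration thread's
full log_sync.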

> 
> > > If GPAs are still needed, what are they for?  The write-protect example
> > > is clearly a hypervisor level interaction as I assume it's write
> > > protection relative to the vCPU.  It's a hypervisor specific interface
> > > to perform that write-protection, so why wouldn't we use a hypervisor
> > > specific interface to collect the data to perform that operation?
> > > IOW, if GVT-g already has a KVM dependency, why be concerned about
> > > adding another GVT-g KVM dependency?  It seems like vfio is just a
> >
> > This is possibly the way that we have to go, based on the discussions
> > so far. Earlier I just held the same argument as you emphasized
> > for vfio - although there are existing KVM dependencies, we want to
> > minimize them. :-) Another worry is that other vendor drivers may
> > have similar requirements; then could we invent some generic way and
> > thus avoid pushing them to do the same tricky thing again? Of course,
> > we may revisit it later if this issue does become a common
> > requirement.
> 
> I know we don't see quite the same on the degree to which vfio has KVM
> dependencies, but it's possible, and regularly done, to use vfio without
> KVM.  The same cannot be said of GVT-g.  Therefore it's not very
> motivating for me to entertain a proposal for a new vfio interface
> whose main purpose is to fill in a gap in the already existing GVT-g
> dependency on KVM (and in no way reduce that existing KVM dependency).
> I maintain vfio's lack of dependencies on KVM either way ;)

Understood.

> 
> > > potentially convenient channel, but as discussed above, vfio has no
> > > business in GPAs because devices don't operate on GPAs and I've not
> > > been sold that there's value in vfio getting involved in that
> > > address space.  Convince me otherwise ;)  Thanks,
> > >
> >
> > It looks like none of my arguments convinces you :-), so we'll move
> > on to investigate what should be changed in qemu to support your
> > proposal (as discussed above). While that work is ongoing, let me
> > have a last try at my original idea. ;) I'm just curious about your
> > further thoughts regarding the earlier doorbell monitoring
> > example of operating on GPA in a device. If it's an Intel-GPU-only
> > thing, yes we can still fix it in GVT-g itself as you suggested. But
> > I'm just not sure about other integrated devices, and also new
> > accelerators with a coherent bus connected to the cpu package. Also
> > we don't need to call it GPA - it could be named user_target_address:
> > the address that the iova is mapped to, i.e. the address space in
> > which userspace expects the device to operate for purposes (logging,
> > monitoring, etc.) other than dma (using iova) and accessing
> > userspace/guest memory (hva).
> 
> My impression of the doorbell example is that we're taking the wrong
> approach to solving how that interaction should work.  To setup an
> interaction between a device and a vCPU, we should be going through
> QEMU.  This is the model we've used from the start with vfio.  KVM is
> an accelerator, but it's up to QEMU to connect KVM and vfio together.
> I'm curious if there's any merit to my proposal above about how this
> could be redesigned.  Thanks,
> 

There is merit for sure. We will explore that direction and see how it
works. Even if some extension is still required between QEMU and KVM
(in case of a functionality or efficiency concern), that might be a
better fit than extending VFIO.

Thanks
Kevin



end of thread  [~2019-10-24 12:45 UTC | newest]

Thread overview: 34+ messages
2019-08-26 18:55 [Qemu-devel] [PATCH v8 00/13] Add migration support for VFIO device Kirti Wankhede
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 01/13] vfio: KABI for migration interface Kirti Wankhede
2019-08-28 20:50   ` Alex Williamson
2019-08-30  7:25     ` Tian, Kevin
2019-08-30 16:15       ` Alex Williamson
2019-09-03  6:05         ` Tian, Kevin
2019-09-04  8:28           ` Yan Zhao
     [not found]     ` <AADFC41AFE54684AB9EE6CBC0274A5D19D553133@SHSMSX104.ccr.corp.intel.com>
2019-08-30  8:06       ` Tian, Kevin
2019-08-30 16:32         ` Alex Williamson
2019-09-03  6:57           ` Tian, Kevin
2019-09-12 14:41             ` Alex Williamson
2019-09-12 23:00               ` Tian, Kevin
2019-09-13 15:47                 ` Alex Williamson
2019-09-16  1:53                   ` Tian, Kevin
     [not found]               ` <AADFC41AFE54684AB9EE6CBC0274A5D19D572142@SHSMSX104.ccr.corp.intel.com>
2019-09-24  2:19                 ` Tian, Kevin
2019-09-24 18:03                   ` Alex Williamson
2019-09-24 23:04                     ` Tian, Kevin
2019-09-25 19:06                       ` Alex Williamson
2019-09-26  3:07                         ` Tian, Kevin
2019-09-26 21:33                           ` Alex Williamson
2019-10-24 11:41                             ` Tian, Kevin
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 02/13] vfio: Add function to unmap VFIO region Kirti Wankhede
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 03/13] vfio: Add vfio_get_object callback to VFIODeviceOps Kirti Wankhede
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 04/13] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 05/13] vfio: Add migration region initialization and finalize function Kirti Wankhede
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 06/13] vfio: Add VM state change handler to know state of VM Kirti Wankhede
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 07/13] vfio: Add migration state change notifier Kirti Wankhede
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 08/13] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 09/13] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 10/13] vfio: Add load " Kirti Wankhede
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 11/13] vfio: Add function to get dirty page list Kirti Wankhede
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 12/13] vfio: Add vfio_listener_log_sync to mark dirty pages Kirti Wankhede
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 13/13] vfio: Make vfio-pci device migration capable Kirti Wankhede
2019-08-26 19:43 ` [Qemu-devel] [PATCH v8 00/13] Add migration support for VFIO device no-reply
