QEMU-Devel Archive on lore.kernel.org
* [PATCH v16 QEMU 00/16] Add migration support for VFIO devices 
@ 2020-03-24 21:08 Kirti Wankhede
  2020-03-24 21:08 ` [PATCH v16 QEMU 01/16] vfio: KABI for migration interface - Kernel header placeholder Kirti Wankhede
                   ` (17 more replies)
  0 siblings, 18 replies; 74+ messages in thread
From: Kirti Wankhede @ 2020-03-24 21:08 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Hi,

This patch set adds migration support for VFIO devices in QEMU.

The patch set includes the following patches:
Patch 1:
- Define the KABI for VFIO device migration support: device state and the newly
  added ioctl definitions to get the dirty pages bitmap. This is a placeholder
  patch.

Patch 2-4:
- A few code refactors
- Added save and restore functions for PCI configuration space

Patch 5-10:
- Generic migration functionality for VFIO device.
  * This patch set adds functionality only for PCI devices, but can be
    extended to other VFIO devices.
  * Added all the basic functions required for pre-copy, stop-and-copy and
    resume phases of migration.
  * Added a state change notifier; from that notifier function, the VFIO
    device's state change is conveyed to the VFIO device driver.
  * During save setup phase and resume/load setup phase, migration region
    is queried and is used to read/write VFIO device data.
  * .save_live_pending and .save_live_iterate are implemented to use QEMU's
    functionality of iteration during pre-copy phase.
  * In .save_live_complete_precopy, that is, in the stop-and-copy phase,
    data is read iteratively from the VFIO device driver until the pending
    bytes returned by the driver reach zero.

Patch 11-12:
- Add a helper function for migration with vIOMMU enabled to get the address
  limit the IOMMU supports.
- Set the DIRTY_MEMORY_MIGRATION flag in the dirty log mask for migration with
  vIOMMU enabled.

Patch 13-14:
- Add function to start and stop dirty pages tracking.
- Add vfio_listener_log_sync to mark dirty pages. The dirty pages bitmap is
  queried per container. All pages pinned by the vendor driver through the
  vfio_pin_pages external API have to be marked as dirty during migration.
  When there are CPU writes, CPU dirty page tracking can identify dirtied
  pages, but any page pinned by the vendor driver can also be written by the
  device. As of now there is no device which has hardware support for
  dirty page tracking, so all pages which are pinned by the vendor driver
  should be considered dirty.
  In QEMU, pages are marked dirty only when the device is in the stop-and-copy
  phase. If pages were marked dirty during the pre-copy phase and their content
  transferred from source to destination, there would be no way to know which
  pages were dirtied again between that copy and the point the device stops.
  To avoid repeated copies of the same content, pinned pages are marked dirty
  only during the stop-and-copy phase.

Patch 15:
- With vIOMMU, IO virtual address range can get unmapped while in pre-copy
  phase of migration. In that case, unmap ioctl should return pages pinned
  in that range and QEMU should report corresponding guest physical pages
  dirty.

Patch 16:
- Make VFIO PCI device migration capable. If migration region is not provided by
  driver, migration is blocked.
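
As a rough illustration of the bitmap handling that patches 13-15 describe,
here is a minimal C sketch of a one-bit-per-page dirty bitmap. The sketch_*
helper names are hypothetical and are not part of this series. Rounding the
bitmap up to whole 64-bit words and numbering bits from the least significant
bit of word 0 are assumptions made for the example, not something stated here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bytes needed for a bitmap covering range_size bytes with one bit per page.
 * Assumption: the size is rounded up to whole 64-bit words.
 */
static uint64_t sketch_bitmap_bytes(uint64_t range_size, uint64_t pgsize)
{
    assert(pgsize != 0 && (pgsize & (pgsize - 1)) == 0); /* power of two */

    uint64_t pages = (range_size + pgsize - 1) / pgsize;
    uint64_t words = (pages + 63) / 64;

    return words * sizeof(uint64_t);
}

/*
 * Returns 1 if the page containing addr is marked dirty, else 0.
 * Assumption: bit 0 of word 0 is the page starting at iova.
 */
static int sketch_page_is_dirty(const uint64_t *bitmap, uint64_t iova,
                                uint64_t pgsize, uint64_t addr)
{
    uint64_t page = (addr - iova) / pgsize;

    return (int)((bitmap[page / 64] >> (page % 64)) & 1);
}

/* Count the dirty pages among the first `pages` bits of the bitmap. */
static uint64_t sketch_count_dirty(const uint64_t *bitmap, uint64_t pages)
{
    uint64_t n = 0;

    for (uint64_t p = 0; p < pages; p++) {
        n += (bitmap[p / 64] >> (p % 64)) & 1;
    }
    return n;
}
```

With 4 KiB pages, a 1 MiB IOVA range needs 256 bits, which this sketch rounds
to 32 bytes. Because every pinned page is reported dirty, the count returned
for a range is an upper bound on what the device actually wrote.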

Yet TODO:
Since there is no device which has hardware support for system memory
dirty bitmap tracking, right now there is no other API from the vendor driver
to the VFIO IOMMU module to report dirty pages. In the future, when such
hardware support is implemented, an API will be required in the kernel so that
the vendor driver can report dirty pages to the VFIO module during migration
phases.

Below is the flow of state change for live migration where states in brackets
represent VM state, migration state and VFIO device state as:
    (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)

Live migration save path:
        QEMU normal running state
        (RUNNING, _NONE, _RUNNING)
                        |
    migrate_init spawns migration_thread.
    (RUNNING, _SETUP, _RUNNING|_SAVING)
    Migration thread then calls each device's .save_setup()
                        |
    (RUNNING, _ACTIVE, _RUNNING|_SAVING)
    If device is active, get pending bytes by .save_live_pending()
    If pending bytes >= threshold_size, call .save_live_iterate()
    Data of VFIO device for pre-copy phase is copied.
    Iterate till pending bytes converge and are less than threshold
                        |
    On migration completion, vCPUs stop and .save_live_complete_precopy is
    called for each active device. The VFIO device is then transitioned to
    the _SAVING state.
    (FINISH_MIGRATE, _DEVICE, _SAVING)
    For VFIO device, iterate in .save_live_complete_precopy until
    pending data is 0.
    (FINISH_MIGRATE, _DEVICE, _STOPPED)
                        |
    (FINISH_MIGRATE, _COMPLETED, _STOPPED)
    Migration thread schedules the cleanup bottom half and exits
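
The VFIO device state column of the save path can be reproduced with the bit
definitions from patch 01/16. The sketch below is only illustrative:
sketch_next_state() and sketch_save_path_state() are hypothetical names, not
the series' vfio_migration_set_state(). They show the set/clear style of
update, which leaves the reserved bits 3-31 untouched.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values copied from the vfio.h hunk in patch 01/16. */
#define VFIO_DEVICE_STATE_STOP      (0)
#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
                                     VFIO_DEVICE_STATE_SAVING |  \
                                     VFIO_DEVICE_STATE_RESUMING)

/*
 * Hypothetical helper: read-modify-write of device_state that only touches
 * the three defined bits and preserves everything else.
 */
static uint32_t sketch_next_state(uint32_t cur, uint32_t set, uint32_t clear)
{
    assert((set & clear) == 0); /* a bit cannot be both set and cleared */
    assert(((set | clear) & ~(uint32_t)VFIO_DEVICE_STATE_MASK) == 0);

    return (cur & ~clear) | set;
}

/* Device state after `step` transitions along the live migration save path. */
static uint32_t sketch_save_path_state(int step)
{
    uint32_t s = VFIO_DEVICE_STATE_RUNNING;                       /* 001b */

    if (step >= 1) {  /* _SETUP/_ACTIVE: pre-copy, 011b */
        s = sketch_next_state(s, VFIO_DEVICE_STATE_SAVING, 0);
    }
    if (step >= 2) {  /* _DEVICE: stop-and-copy, 010b */
        s = sketch_next_state(s, 0, VFIO_DEVICE_STATE_RUNNING);
    }
    if (step >= 3) {  /* _COMPLETED: stopped, 000b */
        s = sketch_next_state(s, 0, VFIO_DEVICE_STATE_SAVING);
    }
    return s;
}
```

Steps 0 through 3 yield 001b, 011b, 010b and 000b, matching the _RUNNING,
_RUNNING|_SAVING, _SAVING and stopped entries in the flow above.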

Live migration resume path:
    Incoming migration calls .load_setup for each device
    (RESTORE_VM, _ACTIVE, _STOPPED)
                        |
    For each device, .load_state is called for that device section data
                        |
    At the end, .load_cleanup is called for each device and vCPUs are started.
                        |
        (RUNNING, _NONE, _RUNNING)

Note that:
- Migration post copy is not supported.

v9 -> v16:
- KABI almost finalised on kernel patches.
- Added support for migration with vIOMMU enabled.

v8 -> v9:
- Split patch set in 2 sets, Kernel and QEMU sets.
- Dirty pages bitmap is queried from the IOMMU container rather than from
  the vendor driver per device. Added 2 ioctls to achieve this.

v7 -> v8:
- Updated comments for KABI
- Added BAR address validation check during PCI device's config space load as
  suggested by Dr. David Alan Gilbert.
- Changed vfio_migration_set_state() to set or clear device state flags.
- Some nit fixes.

v6 -> v7:
- Fix build failures.

v5 -> v6:
- Fix build failure.

v4 -> v5:
- Added descriptive comment about the sequence of access of members of structure
  vfio_device_migration_info to be followed, based on Alex's suggestion.
- Updated get dirty pages sequence.
- As per Cornelia Huck's suggestion, added callbacks to VFIODeviceOps to
  get_object, save_config and load_config.
- Fixed multiple nit picks.
- Tested live migration with multiple vfio device assigned to a VM.

v3 -> v4:
- Added one more bit for _RESUMING flag to be set explicitly.
- data_offset field is read-only for user space application.
- data_size is read on every iteration before reading data from the migration
  region, which removes the assumption that data extends to the end of the
  migration region.
- If the vendor driver supports mappable sparse regions, map those regions
  during the setup state of save/load, and unmap them from the cleanup routines.
- Handle the race condition that causes data corruption in the migration region
  during save of device state by adding a mutex and serializing the save_buffer
  and get_dirty_pages routines.
- Skip calling the get_dirty_pages routine for mapped MMIO regions of the device.
- Added trace events.
- Split into multiple functional patches.

v2 -> v3:
- Removed enum of VFIO device states. Defined VFIO device state with 2 bits.
- Re-structured vfio_device_migration_info to keep it minimal and defined action
  on read and write access on its members.

v1 -> v2:
- Defined MIGRATION region type and sub-type which should be used with region
  type capability.
- Re-structured vfio_device_migration_info. This structure will be placed at 0th
  offset of migration region.
- Replaced ioctl with read/write for trapped part of migration region.
- Added support for both types of access, trapped or mmapped, for the data
  section of the region.
- Moved PCI device functions to pci file.
- Added iteration to get the dirty page bitmap until the bitmap for all
  requested pages is copied.

Thanks,
Kirti



Kirti Wankhede (16):
  vfio: KABI for migration interface - Kernel header placeholder
  vfio: Add function to unmap VFIO region
  vfio: Add vfio_get_object callback to VFIODeviceOps
  vfio: Add save and load functions for VFIO PCI devices
  vfio: Add migration region initialization and finalize function
  vfio: Add VM state change handler to know state of VM
  vfio: Add migration state change notifier
  vfio: Register SaveVMHandlers for VFIO device
  vfio: Add save state functions to SaveVMHandlers
  vfio: Add load state functions to SaveVMHandlers
  iommu: add callback to get address limit IOMMU supports
  memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled
  vfio: Add function to start and stop dirty pages tracking
  vfio: Add vfio_listener_log_sync to mark dirty pages
  vfio: Add ioctl to get dirty pages bitmap during dma unmap.
  vfio: Make vfio-pci device migration capable

 hw/i386/intel_iommu.c         |   9 +
 hw/vfio/Makefile.objs         |   2 +-
 hw/vfio/common.c              | 303 +++++++++++++++-
 hw/vfio/migration.c           | 788 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                 | 203 +++++++++--
 hw/vfio/pci.h                 |   1 -
 hw/vfio/trace-events          |  19 +
 include/exec/memory.h         |  19 +
 include/hw/vfio/vfio-common.h |  20 ++
 linux-headers/linux/vfio.h    | 297 +++++++++++++++-
 memory.c                      |  13 +-
 11 files changed, 1639 insertions(+), 35 deletions(-)
 create mode 100644 hw/vfio/migration.c

-- 
2.7.0




* [PATCH v16 QEMU 01/16] vfio: KABI for migration interface - Kernel header placeholder
  2020-03-24 21:08 [PATCH v16 QEMU 00/16] Add migration support for VFIO devices Kirti Wankhede
@ 2020-03-24 21:08 ` Kirti Wankhede
  2020-03-24 21:09 ` [PATCH v16 QEMU 02/16] vfio: Add function to unmap VFIO region Kirti Wankhede
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 74+ messages in thread
From: Kirti Wankhede @ 2020-03-24 21:08 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Kernel header patches are being reviewed along with the kernel side changes.
This patch is only a placeholder.
Link to Kernel patch set:
https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07429.html

This patch includes all changes in vfio.h from the above patch set.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 linux-headers/linux/vfio.h | 297 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 295 insertions(+), 2 deletions(-)

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index fb10370d2928..78cadee85ac6 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
 #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
 #define VFIO_REGION_TYPE_GFX                    (1)
 #define VFIO_REGION_TYPE_CCW			(2)
+#define VFIO_REGION_TYPE_MIGRATION              (3)
 
 /* sub-types for VFIO_REGION_TYPE_PCI_* */
 
@@ -379,6 +380,232 @@ struct vfio_region_gfx_edid {
 /* sub-types for VFIO_REGION_TYPE_CCW */
 #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
 
+/* sub-types for VFIO_REGION_TYPE_MIGRATION */
+#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
+
+/*
+ * The structure vfio_device_migration_info is placed at the 0th offset of
+ * the VFIO_REGION_SUBTYPE_MIGRATION region to get and set VFIO device related
+ * migration information. Field accesses from this structure are only supported
+ * at their native width and alignment. Otherwise, the result is undefined and
+ * vendor drivers should return an error.
+ *
+ * device_state: (read/write)
+ *      - The user application writes to this field to inform the vendor driver
+ *        about the device state to be transitioned to.
+ *      - The vendor driver should take the necessary actions to change the
+ *        device state. After successful transition to a given state, the
+ *        vendor driver should return success on write(device_state, state)
+ *        system call. If the device state transition fails, the vendor driver
+ *        should return an appropriate -errno for the fault condition.
+ *      - On the user application side, if the device state transition fails,
+ *	  that is, if write(device_state, state) returns an error, read
+ *	  device_state again to determine the current state of the device from
+ *	  the vendor driver.
+ *      - The vendor driver should return previous state of the device unless
+ *        the vendor driver has encountered an internal error, in which case
+ *        the vendor driver may report the device_state VFIO_DEVICE_STATE_ERROR.
+ *      - The user application must use the device reset ioctl to recover the
+ *        device from VFIO_DEVICE_STATE_ERROR state. If the device is
+ *        indicated to be in a valid device state by reading device_state, the
+ *        user application may attempt to transition the device to any valid
+ *        state reachable from the current state or terminate itself.
+ *
+ *      device_state consists of 3 bits:
+ *      - If bit 0 is set, it indicates the _RUNNING state. If bit 0 is clear,
+ *        it indicates the _STOP state. When the device state is changed to
+ *        _STOP, driver should stop the device before write() returns.
+ *      - If bit 1 is set, it indicates the _SAVING state, which means that the
+ *        driver should start gathering device state information that will be
+ *        provided to the VFIO user application to save the device's state.
+ *      - If bit 2 is set, it indicates the _RESUMING state, which means that
+ *        the driver should prepare to resume the device. Data provided through
+ *        the migration region should be used to resume the device.
+ *      Bits 3 - 31 are reserved for future use. To preserve them, the user
+ *      application should perform a read-modify-write operation on this
+ *      field when modifying the specified bits.
+ *
+ *  +------- _RESUMING
+ *  |+------ _SAVING
+ *  ||+----- _RUNNING
+ *  |||
+ *  000b => Device Stopped, not saving or resuming
+ *  001b => Device running, which is the default state
+ *  010b => Stop the device & save the device state, stop-and-copy state
+ *  011b => Device running and save the device state, pre-copy state
+ *  100b => Device stopped and the device state is resuming
+ *  101b => Invalid state
+ *  110b => Error state
+ *  111b => Invalid state
+ *
+ * State transitions:
+ *
+ *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
+ *                (100b)     (001b)     (011b)        (010b)       (000b)
+ * 0. Running or default state
+ *                             |
+ *
+ * 1. Normal Shutdown (optional)
+ *                             |------------------------------------->|
+ *
+ * 2. Save the state or suspend
+ *                             |------------------------->|---------->|
+ *
+ * 3. Save the state during live migration
+ *                             |----------->|------------>|---------->|
+ *
+ * 4. Resuming
+ *                  |<---------|
+ *
+ * 5. Resumed
+ *                  |--------->|
+ *
+ * 0. Default state of VFIO device is _RUNNING when the user application starts.
+ * 1. During normal shutdown of the user application, the user application may
+ *    optionally change the VFIO device state from _RUNNING to _STOP. This
+ *    transition is optional. The vendor driver must support this transition but
+ *    must not require it.
+ * 2. When the user application saves state or suspends the application, the
+ *    device state transitions from _RUNNING to stop-and-copy and then to _STOP.
+ *    On state transition from _RUNNING to stop-and-copy, driver must stop the
+ *    device, save the device state and send it to the application through the
+ *    migration region. The sequence to be followed for such transition is given
+ *    below.
+ * 3. In live migration of user application, the state transitions from _RUNNING
+ *    to pre-copy, to stop-and-copy, and to _STOP.
+ *    On state transition from _RUNNING to pre-copy, the driver should start
+ *    gathering the device state while the application is still running and send
+ *    the device state data to application through the migration region.
+ *    On state transition from pre-copy to stop-and-copy, the driver must stop
+ *    the device, save the device state and send it to the user application
+ *    through the migration region.
+ *    Vendor drivers must support the pre-copy state even for implementations
+ *    where no data is provided to the user before the stop-and-copy state. The
+ *    user must not be required to consume all migration data before the device
+ *    transitions to a new state, including the stop-and-copy state.
+ *    The sequence to be followed for above two transitions is given below.
+ * 4. To start the resuming phase, the device state should be transitioned from
+ *    the _RUNNING to the _RESUMING state.
+ *    In the _RESUMING state, the driver should use the device state data
+ *    received through the migration region to resume the device.
+ * 5. After providing saved device data to the driver, the application should
+ *    change the state from _RESUMING to _RUNNING.
+ *
+ * reserved:
+ *      Reads on this field return zero and writes are ignored.
+ *
+ * pending_bytes: (read only)
+ *      The number of pending bytes still to be migrated from the vendor driver.
+ *
+ * data_offset: (read only)
+ *      The user application should read data_offset in the migration region
+ *      from where the user application should read the device data during the
+ *      _SAVING state or write the device data during the _RESUMING state. See
+ *      below for details of sequence to be followed.
+ *
+ * data_size: (read/write)
+ *      The user application should read data_size to get the size in bytes of
+ *      the data copied in the migration region during the _SAVING state and
+ *      write the size in bytes of the data copied in the migration region
+ *      during the _RESUMING state.
+ *
+ * The format of the migration region is as follows:
+ *  ------------------------------------------------------------------
+ * |vfio_device_migration_info|    data section                      |
+ * |                          |     ///////////////////////////////  |
+ * ------------------------------------------------------------------
+ *   ^                              ^
+ *  offset 0-trapped part        data_offset
+ *
+ * The structure vfio_device_migration_info is always followed by the data
+ * section in the region, so data_offset will always be nonzero. The offset
+ * from where the data is copied is decided by the kernel driver. The data
+ * section can be trapped, mapped, or partitioned, depending on how the kernel
+ * driver defines the data section. The data section partition can be defined
+ * as mapped by the sparse mmap capability. If mmapped, data_offset should be
+ * page aligned, whereas initial section which contains the
+ * vfio_device_migration_info structure, might not end at the offset, which is
+ * page aligned. The user is not required to access through mmap regardless
+ * of the capabilities of the region mmap.
+ * The vendor driver should determine whether and how to partition the data
+ * section. The vendor driver should return data_offset accordingly.
+ *
+ * The sequence to be followed for the _SAVING|_RUNNING device state or
+ * pre-copy phase and for the _SAVING device state or stop-and-copy phase is as
+ * follows:
+ * a. Read pending_bytes, indicating the start of a new iteration to get device
+ *    data. Repeated read on pending_bytes at this stage should have no side
+ *    effects.
+ *    If pending_bytes == 0, the user application should not iterate to get data
+ *    for that device.
+ *    If pending_bytes > 0, perform the following steps.
+ * b. Read data_offset, indicating that the vendor driver should make data
+ *    available through the data section. The vendor driver should return this
+ *    read operation only after data is available from (region + data_offset)
+ *    to (region + data_offset + data_size).
+ * c. Read data_size, which is the amount of data in bytes available through
+ *    the migration region.
+ *    Read on data_offset and data_size should return the offset and size of
+ *    the current buffer if the user application reads data_offset and
+ *    data_size more than once here.
+ * d. Read data_size bytes of data from (region + data_offset) from the
+ *    migration region.
+ * e. Process the data.
+ * f. Read pending_bytes, which indicates that the data from the previous
+ *    iteration has been read. If pending_bytes > 0, go to step b.
+ *
+ * If an error occurs during the above sequence, the vendor driver can return
+ * an error code for next read() or write() operation, which will terminate the
+ * loop. The user application should then take the next necessary action, for
+ * example, failing migration or terminating the user application.
+ *
+ * The user application can transition from the _SAVING|_RUNNING
+ * (pre-copy state) to the _SAVING (stop-and-copy) state regardless of the
+ * number of pending bytes. The user application should iterate in _SAVING
+ * (stop-and-copy) until pending_bytes is 0.
+ *
+ * The sequence to be followed while _RESUMING device state is as follows:
+ * While data for this device is available, repeat the following steps:
+ * a. Read data_offset from where the user application should write data.
+ * b. Write migration data starting at the migration region + data_offset for
+ *    the length determined by data_size from the migration source.
+ * c. Write data_size, which indicates to the vendor driver that data is
+ *    written in the migration region. Vendor driver should apply the
+ *    user-provided migration region data to the device resume state.
+ *
+ * For the user application, data is opaque. The user application should write
+ * data in the same order as the data is received and the data should be of
+ * same transaction size at the source.
+ */
+
+struct vfio_device_migration_info {
+	__u32 device_state;         /* VFIO device state */
+#define VFIO_DEVICE_STATE_STOP      (0)
+#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
+#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
+#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
+#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
+				     VFIO_DEVICE_STATE_SAVING |  \
+				     VFIO_DEVICE_STATE_RESUMING)
+
+#define VFIO_DEVICE_STATE_VALID(state) \
+	(state & VFIO_DEVICE_STATE_RESUMING ? \
+	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
+
+#define VFIO_DEVICE_STATE_IS_ERROR(state) \
+	((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \
+					      VFIO_DEVICE_STATE_RESUMING))
+
+#define VFIO_DEVICE_STATE_SET_ERROR(state) \
+	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_STATE_SAVING | \
+					     VFIO_DEVICE_STATE_RESUMING)
+
+	__u32 reserved;
+	__u64 pending_bytes;
+	__u64 data_offset;
+	__u64 data_size;
+} __attribute__((packed));
+
 /*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within
@@ -720,8 +947,9 @@ struct vfio_device_ioeventfd {
 struct vfio_iommu_type1_info {
 	__u32	argsz;
 	__u32	flags;
-#define VFIO_IOMMU_INFO_PGSIZES (1 << 0)	/* supported page sizes info */
-#define VFIO_IOMMU_INFO_CAPS	(1 << 1)	/* Info supports caps */
+#define VFIO_IOMMU_INFO_PGSIZES   (1 << 0) /* supported page sizes info */
+#define VFIO_IOMMU_INFO_CAPS      (1 << 1) /* Info supports caps */
+#define VFIO_IOMMU_INFO_DIRTY_PGS (1 << 2) /* supports dirty page tracking */
 	__u64	iova_pgsizes;	/* Bitmap of supported page sizes */
 	__u32   cap_offset;	/* Offset within info struct of first cap */
 };
@@ -768,6 +996,12 @@ struct vfio_iommu_type1_dma_map {
 
 #define VFIO_IOMMU_MAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 13)
 
+struct vfio_bitmap {
+    __u64        pgsize;           /* page size for bitmap */
+    __u64        size;             /* in bytes */
+    __u64        *data;     /* one bit per page */
+};
+
 /**
  * VFIO_IOMMU_UNMAP_DMA - _IOWR(VFIO_TYPE, VFIO_BASE + 14,
  *							struct vfio_dma_unmap)
@@ -777,12 +1011,23 @@ struct vfio_iommu_type1_dma_map {
  * field.  No guarantee is made to the user that arbitrary unmaps of iova
  * or size different from those used in the original mapping call will
  * succeed.
+ * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get dirty bitmap
+ * before unmapping IO virtual addresses. When this flag is set, user must
+ * provide data[] as structure vfio_bitmap. User must allocate memory to get
+ * bitmap, clear the bitmap memory by setting zero and must set size of
+ * allocated memory in vfio_bitmap.size field. One bit in bitmap
+ * represents per page, page of user provided page size in 'pgsize',
+ * consecutively starting from iova offset. Bit set indicates page at that
+ * offset from iova is dirty. Bitmap of pages in the range of unmapped size is
+ * returned in vfio_bitmap.data
  */
 struct vfio_iommu_type1_dma_unmap {
 	__u32	argsz;
 	__u32	flags;
+#define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
 	__u64	iova;				/* IO virtual address */
 	__u64	size;				/* Size of mapping (bytes) */
+        __u8    data[];
 };
 
 #define VFIO_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
@@ -794,6 +1039,54 @@ struct vfio_iommu_type1_dma_unmap {
 #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
 #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
 
+/**
+ * VFIO_IOMMU_DIRTY_PAGES - _IOWR(VFIO_TYPE, VFIO_BASE + 17,
+ *                                     struct vfio_iommu_type1_dirty_bitmap)
+ * IOCTL is used for dirty pages tracking. Caller sets argsz, which is size of
+ * struct vfio_iommu_type1_dirty_bitmap. Caller set flag depend on which
+ * operation to perform, details as below:
+ *
+ * When IOCTL is called with VFIO_IOMMU_DIRTY_PAGES_FLAG_START set, indicates
+ * migration is active and IOMMU module should track pages which are pinned and
+ * could be dirtied by device.
+ * Dirty pages are tracked until tracking is stopped by user application by
+ * setting VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP flag.
+ *
+ * When IOCTL is called with VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP set, indicates
+ * IOMMU should stop tracking pinned pages.
+ *
+ * When IOCTL is called with VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP flag set,
+ * IOCTL returns dirty pages bitmap for IOMMU container during migration for
+ * given IOVA range. User must provide data[] as the structure
+ * vfio_iommu_type1_dirty_bitmap_get through which user provides IOVA range and
+ * pgsize. This interface supports to get bitmap of smallest supported pgsize
+ * only and can be modified in future to get bitmap of specified pgsize.
+ * User must allocate memory for bitmap, zero the bitmap memory and set size
+ * of allocated memory in bitmap_size field. One bit is used to represent one
+ * page consecutively starting from iova offset. User should provide page size
+ * in 'pgsize'. Bit set in bitmap indicates page at that offset from iova is
+ * dirty. Caller must set argsz including size of structure
+ * vfio_iommu_type1_dirty_bitmap_get.
+ *
+ * Only one flag should be set at a time.
+ */
+struct vfio_iommu_type1_dirty_bitmap {
+	__u32        argsz;
+	__u32        flags;
+#define VFIO_IOMMU_DIRTY_PAGES_FLAG_START	(1 << 0)
+#define VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP	(1 << 1)
+#define VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP	(1 << 2)
+	__u8         data[];
+};
+
+struct vfio_iommu_type1_dirty_bitmap_get {
+    __u64              iova;	/* IO virtual address */
+    __u64              size;	/* Size of iova range */
+    struct vfio_bitmap bitmap;
+};
+
+#define VFIO_IOMMU_DIRTY_PAGES             _IO(VFIO_TYPE, VFIO_BASE + 17)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.7.0
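
The save sequence (steps a to f) documented in the vfio.h comment above can be
sketched against an in-memory stand-in for the migration region. Everything
named mock_* or sketch_* is hypothetical and exists only for this example,
including the fixed data offset of 64 and the 128-byte region. A real user
would issue read() calls on the region's file descriptor instead, and the
vendor driver would choose data_offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MOCK_DATA_OFFSET 64u   /* hypothetical; a real driver picks this */
#define MOCK_REGION_SIZE 128u

/* Hypothetical in-memory stand-in for a vendor driver's migration region. */
struct mock_region {
    uint8_t region[MOCK_REGION_SIZE]; /* header area followed by data section */
    const uint8_t *src;               /* device state not yet handed out */
    uint64_t pending_bytes;
    uint64_t data_size;               /* size of the currently staged buffer */
    uint64_t chunk;                   /* max bytes staged per iteration */
    int staged;                       /* nonzero while a buffer is staged */
};

/* Steps a and f: reading pending_bytes retires the previously staged buffer. */
static uint64_t mock_read_pending_bytes(struct mock_region *r)
{
    if (r->staged) {
        r->src += r->data_size;
        r->pending_bytes -= r->data_size;
        r->data_size = 0;
        r->staged = 0;
    }
    return r->pending_bytes;
}

/* Step b: returns only once data is available in the data section. */
static uint64_t mock_read_data_offset(struct mock_region *r)
{
    if (!r->staged) {
        uint64_t n = r->pending_bytes < r->chunk ? r->pending_bytes : r->chunk;

        memcpy(r->region + MOCK_DATA_OFFSET, r->src, (size_t)n);
        r->data_size = n;
        r->staged = 1;
    }
    return MOCK_DATA_OFFSET;
}

/* Step c: repeated reads return the size of the current buffer. */
static uint64_t mock_read_data_size(const struct mock_region *r)
{
    return r->data_size;
}

/* User side of steps a-f; returns the number of bytes saved into out. */
static size_t sketch_save_device(struct mock_region *r, uint8_t *out, size_t cap)
{
    size_t total = 0;
    uint64_t pending = mock_read_pending_bytes(r);            /* step a */

    while (pending > 0) {
        uint64_t off = mock_read_data_offset(r);              /* step b */
        uint64_t size = mock_read_data_size(r);               /* step c */

        assert(total + size <= cap);
        memcpy(out + total, r->region + off, (size_t)size);   /* step d */
        total += (size_t)size;                                /* step e */
        pending = mock_read_pending_bytes(r);                 /* step f */
    }
    return total;
}

/* Self-check: save len bytes in chunk-sized pieces; 1 means an exact copy. */
static int sketch_roundtrip_ok(size_t len, uint64_t chunk)
{
    uint8_t state[256], out[256];
    struct mock_region r;

    assert(len <= sizeof(state));
    assert(chunk > 0 && chunk <= MOCK_REGION_SIZE - MOCK_DATA_OFFSET);
    for (size_t i = 0; i < sizeof(state); i++) {
        state[i] = (uint8_t)(i * 7 + 1);
    }
    memset(&r, 0, sizeof(r));
    r.src = state;
    r.pending_bytes = len;
    r.chunk = chunk;

    return sketch_save_device(&r, out, sizeof(out)) == len &&
           memcmp(state, out, len) == 0;
}
```

The mock advances its position only when pending_bytes is read again, which
is one way to satisfy the requirement that repeated reads of data_offset and
data_size return the current buffer. Error handling from read() is omitted.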




* [PATCH v16 QEMU 02/16] vfio: Add function to unmap VFIO region
  2020-03-24 21:08 [PATCH v16 QEMU 00/16] Add migration support for VFIO devices Kirti Wankhede
  2020-03-24 21:08 ` [PATCH v16 QEMU 01/16] vfio: KABI for migration interface - Kernel header placeholder Kirti Wankhede
@ 2020-03-24 21:09 ` Kirti Wankhede
  2020-03-24 21:09 ` [PATCH v16 QEMU 03/16] vfio: Add vfio_get_object callback to VFIODeviceOps Kirti Wankhede
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 74+ messages in thread
From: Kirti Wankhede @ 2020-03-24 21:09 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

This function will be used for the migration region.
The migration region is mmapped when migration starts and will be unmapped when
migration is complete.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
---
 hw/vfio/common.c              | 20 ++++++++++++++++++++
 hw/vfio/trace-events          |  1 +
 include/hw/vfio/vfio-common.h |  1 +
 3 files changed, 22 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0b3593b3c0c4..4a2f0d6a2233 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -983,6 +983,26 @@ int vfio_region_mmap(VFIORegion *region)
     return 0;
 }
 
+void vfio_region_unmap(VFIORegion *region)
+{
+    int i;
+
+    if (!region->mem) {
+        return;
+    }
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        trace_vfio_region_unmap(memory_region_name(&region->mmaps[i].mem),
+                                region->mmaps[i].offset,
+                                region->mmaps[i].offset +
+                                region->mmaps[i].size - 1);
+        memory_region_del_subregion(region->mem, &region->mmaps[i].mem);
+        munmap(region->mmaps[i].mmap, region->mmaps[i].size);
+        object_unparent(OBJECT(&region->mmaps[i].mem));
+        region->mmaps[i].mmap = NULL;
+    }
+}
+
 void vfio_region_exit(VFIORegion *region)
 {
     int i;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index b1ef55a33ffd..8cdc27946cb8 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -111,6 +111,7 @@ vfio_region_mmap(const char *name, unsigned long offset, unsigned long end) "Reg
 vfio_region_exit(const char *name, int index) "Device %s, region %d"
 vfio_region_finalize(const char *name, int index) "Device %s, region %d"
 vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
+vfio_region_unmap(const char *name, unsigned long offset, unsigned long end) "Region %s unmap [0x%lx - 0x%lx]"
 vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Device %s region %d: %d sparse mmap entries"
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index fd564209ac71..8d7a0fbb1046 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -171,6 +171,7 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
                       int index, const char *name);
 int vfio_region_mmap(VFIORegion *region);
 void vfio_region_mmaps_set_enabled(VFIORegion *region, bool enabled);
+void vfio_region_unmap(VFIORegion *region);
 void vfio_region_exit(VFIORegion *region);
 void vfio_region_finalize(VFIORegion *region);
 void vfio_reset_handler(void *opaque);
-- 
2.7.0




* [PATCH v16 QEMU 03/16] vfio: Add vfio_get_object callback to VFIODeviceOps
  2020-03-24 21:08 [PATCH v16 QEMU 00/16] Add migration support for VFIO devices Kirti Wankhede
  2020-03-24 21:08 ` [PATCH v16 QEMU 01/16] vfio: KABI for migration interface - Kernel header placeholder Kirti Wankhede
  2020-03-24 21:09 ` [PATCH v16 QEMU 02/16] vfio: Add function to unmap VFIO region Kirti Wankhede
@ 2020-03-24 21:09 ` Kirti Wankhede
  2020-03-24 21:09 ` [PATCH v16 QEMU 04/16] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 74+ messages in thread
From: Kirti Wankhede @ 2020-03-24 21:09 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Hook vfio_get_object callback for PCI devices.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Suggested-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
---
 hw/vfio/pci.c                 | 8 ++++++++
 include/hw/vfio/vfio-common.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5e75a95129ac..6c77c12e44b9 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2407,10 +2407,18 @@ static void vfio_pci_compute_needs_reset(VFIODevice *vbasedev)
     }
 }
 
+static Object *vfio_pci_get_object(VFIODevice *vbasedev)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+
+    return OBJECT(vdev);
+}
+
 static VFIODeviceOps vfio_pci_ops = {
     .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
     .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
     .vfio_eoi = vfio_intx_eoi,
+    .vfio_get_object = vfio_pci_get_object,
 };
 
 int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 8d7a0fbb1046..74261feaeac9 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -119,6 +119,7 @@ struct VFIODeviceOps {
     void (*vfio_compute_needs_reset)(VFIODevice *vdev);
     int (*vfio_hot_reset_multi)(VFIODevice *vdev);
     void (*vfio_eoi)(VFIODevice *vdev);
+    Object *(*vfio_get_object)(VFIODevice *vdev);
 };
 
 typedef struct VFIOGroup {
-- 
2.7.0




* [PATCH v16 QEMU 04/16] vfio: Add save and load functions for VFIO PCI devices
  2020-03-24 21:08 [PATCH v16 QEMU 00/16] Add migration support for VFIO devices Kirti Wankhede
                   ` (2 preceding siblings ...)
  2020-03-24 21:09 ` [PATCH v16 QEMU 03/16] vfio: Add vfio_get_object callback to VFIODeviceOps Kirti Wankhede
@ 2020-03-24 21:09 ` Kirti Wankhede
  2020-03-25 19:56   ` Alex Williamson
                     ` (2 more replies)
  2020-03-24 21:09 ` [PATCH v16 QEMU 05/16] vfio: Add migration region initialization and finalize function Kirti Wankhede
                   ` (13 subsequent siblings)
  17 siblings, 3 replies; 74+ messages in thread
From: Kirti Wankhede @ 2020-03-24 21:09 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

These functions save and restore PCI device-specific data, namely the config
space of the PCI device.
Tested save and restore with MSI and MSI-X interrupt types.
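
As a rough illustration (not part of the patch), the BAR check performed on
restore can be sketched standalone: the two 32-bit BAR registers are combined
into one address with the flag bits masked off, and the result must be a
multiple of the BAR size. The SKETCH_ names and helper functions below are
hypothetical; the mask values are assumed to match the usual PCI BAR layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed for illustration: the low bits of a BAR hold type flags rather
 * than address bits (2 bits for I/O BARs, 4 bits for memory BARs).
 */
#define SKETCH_BAR_IO_MASK   (~UINT32_C(0x3))
#define SKETCH_BAR_MEM_MASK  (~UINT32_C(0xf))

/* Combine the low and high BAR registers, dropping the flag bits. */
static uint64_t sketch_bar_addr(uint32_t lo, uint32_t hi, bool ioport)
{
    lo &= ioport ? SKETCH_BAR_IO_MASK : SKETCH_BAR_MEM_MASK;
    return ((uint64_t)hi << 32) | lo;
}

/*
 * A BAR with a power-of-two size must sit at a multiple of that size.
 * A size of zero means the BAR is unimplemented and is skipped.
 */
static bool sketch_bar_is_aligned(uint64_t addr, uint64_t size)
{
    return size == 0 || (addr & (size - 1)) == 0;
}
```

For a 64-bit memory BAR, hi is the value of the next BAR register; for a
32-bit BAR it is simply 0.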

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/pci.c                 | 163 ++++++++++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h |   2 +
 2 files changed, 165 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 6c77c12e44b9..8deb11e87ef7 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -41,6 +41,7 @@
 #include "trace.h"
 #include "qapi/error.h"
 #include "migration/blocker.h"
+#include "migration/qemu-file.h"
 
 #define TYPE_VFIO_PCI "vfio-pci"
 #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
@@ -1632,6 +1633,50 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
     }
 }
 
+static int vfio_bar_validate(VFIOPCIDevice *vdev, int nr)
+{
+    PCIDevice *pdev = &vdev->pdev;
+    VFIOBAR *bar = &vdev->bars[nr];
+    uint64_t addr;
+    uint32_t addr_lo, addr_hi = 0;
+
+    /* Skip unimplemented BARs and the upper half of 64bit BARS. */
+    if (!bar->size) {
+        return 0;
+    }
+
+    addr_lo = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + nr * 4, 4);
+
+    addr_lo = addr_lo & (bar->ioport ? PCI_BASE_ADDRESS_IO_MASK :
+                                       PCI_BASE_ADDRESS_MEM_MASK);
+    if (bar->type == PCI_BASE_ADDRESS_MEM_TYPE_64) {
+        addr_hi = pci_default_read_config(pdev,
+                                         PCI_BASE_ADDRESS_0 + (nr + 1) * 4, 4);
+    }
+
+    addr = ((uint64_t)addr_hi << 32) | addr_lo;
+
+    if (!QEMU_IS_ALIGNED(addr, bar->size)) {
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
+static int vfio_bars_validate(VFIOPCIDevice *vdev)
+{
+    int i, ret;
+
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        ret = vfio_bar_validate(vdev, i);
+        if (ret) {
+            error_report("vfio: BAR address %d validation failed", i);
+            return ret;
+        }
+    }
+    return 0;
+}
+
 static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
 {
     VFIOBAR *bar = &vdev->bars[nr];
@@ -2414,11 +2459,129 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
     return OBJECT(vdev);
 }
 
+static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+    PCIDevice *pdev = &vdev->pdev;
+    uint16_t pci_cmd;
+    int i;
+
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        uint32_t bar;
+
+        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
+        qemu_put_be32(f, bar);
+    }
+
+    qemu_put_be32(f, vdev->interrupt);
+    if (vdev->interrupt == VFIO_INT_MSI) {
+        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
+        bool msi_64bit;
+
+        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+                                            2);
+        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
+
+        msi_addr_lo = pci_default_read_config(pdev,
+                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
+        qemu_put_be32(f, msi_addr_lo);
+
+        if (msi_64bit) {
+            msi_addr_hi = pci_default_read_config(pdev,
+                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
+                                             4);
+        }
+        qemu_put_be32(f, msi_addr_hi);
+
+        msi_data = pci_default_read_config(pdev,
+                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
+                2);
+        qemu_put_be32(f, msi_data);
+    } else if (vdev->interrupt == VFIO_INT_MSIX) {
+        uint16_t offset;
+
+        /* save enable bit and maskall bit */
+        offset = pci_default_read_config(pdev,
+                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
+        qemu_put_be16(f, offset);
+        msix_save(pdev, f);
+    }
+    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
+    qemu_put_be16(f, pci_cmd);
+}
+
+static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+    PCIDevice *pdev = &vdev->pdev;
+    uint32_t interrupt_type;
+    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
+    uint16_t pci_cmd;
+    bool msi_64bit;
+    int i, ret;
+
+    /* restore pci bar configuration */
+    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
+    vfio_pci_write_config(pdev, PCI_COMMAND,
+                        pci_cmd & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        uint32_t bar = qemu_get_be32(f);
+
+        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
+    }
+
+    ret = vfio_bars_validate(vdev);
+    if (ret) {
+        return ret;
+    }
+
+    interrupt_type = qemu_get_be32(f);
+
+    if (interrupt_type == VFIO_INT_MSI) {
+        /* restore msi configuration */
+        msi_flags = pci_default_read_config(pdev,
+                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
+        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
+
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+                              msi_flags & ~PCI_MSI_FLAGS_ENABLE, 2);
+
+        msi_addr_lo = qemu_get_be32(f);
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
+                              msi_addr_lo, 4);
+
+        msi_addr_hi = qemu_get_be32(f);
+        if (msi_64bit) {
+            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
+                                  msi_addr_hi, 4);
+        }
+        msi_data = qemu_get_be32(f);
+        vfio_pci_write_config(pdev,
+                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
+                msi_data, 2);
+
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
+    } else if (interrupt_type == VFIO_INT_MSIX) {
+        uint16_t offset = qemu_get_be16(f);
+
+        /* load enable bit and maskall bit */
+        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
+                              offset, 2);
+        msix_load(pdev, f);
+    }
+    pci_cmd = qemu_get_be16(f);
+    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
+    return 0;
+}
+
 static VFIODeviceOps vfio_pci_ops = {
     .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
     .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
     .vfio_eoi = vfio_intx_eoi,
     .vfio_get_object = vfio_pci_get_object,
+    .vfio_save_config = vfio_pci_save_config,
+    .vfio_load_config = vfio_pci_load_config,
 };
 
 int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 74261feaeac9..d69a7f3ae31e 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -120,6 +120,8 @@ struct VFIODeviceOps {
     int (*vfio_hot_reset_multi)(VFIODevice *vdev);
     void (*vfio_eoi)(VFIODevice *vdev);
     Object *(*vfio_get_object)(VFIODevice *vdev);
+    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
+    int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
 };
 
 typedef struct VFIOGroup {
-- 
2.7.0




* [PATCH v16 QEMU 05/16] vfio: Add migration region initialization and finalize function
  2020-03-24 21:08 [PATCH v16 QEMU 00/16] Add migration support for VFIO devices Kirti Wankhede
                   ` (3 preceding siblings ...)
  2020-03-24 21:09 ` [PATCH v16 QEMU 04/16] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
@ 2020-03-24 21:09 ` Kirti Wankhede
  2020-03-26 17:52   ` Dr. David Alan Gilbert
  2020-03-24 21:09 ` [PATCH v16 QEMU 06/16] vfio: Add VM state change handler to know state of VM Kirti Wankhede
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 74+ messages in thread
From: Kirti Wankhede @ 2020-03-24 21:09 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

- Migration functions are implemented for VFIO_DEVICE_TYPE_PCI devices in this
  patch series.
- Whether a VFIO device supports migration is decided by the migration region
  query. If both the migration region query and the migration region
  initialization succeed, migration is supported; otherwise migration is
  blocked.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/Makefile.objs         |   2 +-
 hw/vfio/migration.c           | 138 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events          |   3 +
 include/hw/vfio/vfio-common.h |   9 +++
 4 files changed, 151 insertions(+), 1 deletion(-)
 create mode 100644 hw/vfio/migration.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index 9bb1c09e8477..8b296c889ed9 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,4 +1,4 @@
-obj-y += common.o spapr.o
+obj-y += common.o spapr.o migration.o
 obj-$(CONFIG_VFIO_PCI) += pci.o pci-quirks.o display.o
 obj-$(CONFIG_VFIO_CCW) += ccw.o
 obj-$(CONFIG_VFIO_PLATFORM) += platform.o
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
new file mode 100644
index 000000000000..a078dcf1dd8f
--- /dev/null
+++ b/hw/vfio/migration.c
@@ -0,0 +1,138 @@
+/*
+ * Migration support for VFIO devices
+ *
+ * Copyright NVIDIA, Inc. 2019
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <linux/vfio.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "cpu.h"
+#include "migration/migration.h"
+#include "migration/qemu-file.h"
+#include "migration/register.h"
+#include "migration/blocker.h"
+#include "migration/misc.h"
+#include "qapi/error.h"
+#include "exec/ramlist.h"
+#include "exec/ram_addr.h"
+#include "pci.h"
+#include "trace.h"
+
+static void vfio_migration_region_exit(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+
+    if (!migration) {
+        return;
+    }
+
+    if (migration->region.size) {
+        vfio_region_exit(&migration->region);
+        vfio_region_finalize(&migration->region);
+    }
+}
+
+static int vfio_migration_region_init(VFIODevice *vbasedev, int index)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    Object *obj = NULL;
+    int ret = -EINVAL;
+
+    if (!vbasedev->ops->vfio_get_object) {
+        return ret;
+    }
+
+    obj = vbasedev->ops->vfio_get_object(vbasedev);
+    if (!obj) {
+        return ret;
+    }
+
+    ret = vfio_region_setup(obj, vbasedev, &migration->region, index,
+                            "migration");
+    if (ret) {
+        error_report("%s: Failed to setup VFIO migration region %d: %s",
+                     vbasedev->name, index, strerror(-ret));
+        goto err;
+    }
+
+    if (!migration->region.size) {
+        ret = -EINVAL;
+        error_report("%s: Invalid region size of VFIO migration region %d: %s",
+                     vbasedev->name, index, strerror(-ret));
+        goto err;
+    }
+
+    return 0;
+
+err:
+    vfio_migration_region_exit(vbasedev);
+    return ret;
+}
+
+static int vfio_migration_init(VFIODevice *vbasedev,
+                               struct vfio_region_info *info)
+{
+    int ret;
+
+    vbasedev->migration = g_new0(VFIOMigration, 1);
+
+    ret = vfio_migration_region_init(vbasedev, info->index);
+    if (ret) {
+        error_report("%s: Failed to initialise migration region",
+                     vbasedev->name);
+        g_free(vbasedev->migration);
+        vbasedev->migration = NULL;
+        return ret;
+    }
+
+    return 0;
+}
+
+/* ---------------------------------------------------------------------- */
+
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
+{
+    struct vfio_region_info *info;
+    Error *local_err = NULL;
+    int ret;
+
+    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
+                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
+    if (ret) {
+        goto add_blocker;
+    }
+
+    ret = vfio_migration_init(vbasedev, info);
+    if (ret) {
+        goto add_blocker;
+    }
+
+    trace_vfio_migration_probe(vbasedev->name, info->index);
+    return 0;
+
+add_blocker:
+    error_setg(&vbasedev->migration_blocker,
+               "VFIO device doesn't support migration");
+    ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        error_free(vbasedev->migration_blocker);
+    }
+    return ret;
+}
+
+void vfio_migration_finalize(VFIODevice *vbasedev)
+{
+    if (vbasedev->migration_blocker) {
+        migrate_del_blocker(vbasedev->migration_blocker);
+        error_free(vbasedev->migration_blocker);
+    }
+
+    vfio_migration_region_exit(vbasedev);
+    g_free(vbasedev->migration);
+}
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 8cdc27946cb8..191a726a1312 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -143,3 +143,6 @@ vfio_display_edid_link_up(void) ""
 vfio_display_edid_link_down(void) ""
 vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
 vfio_display_edid_write_error(void) ""
+
+# migration.c
+vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index d69a7f3ae31e..d4b268641173 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -57,6 +57,10 @@ typedef struct VFIORegion {
     uint8_t nr; /* cache the region number for debug */
 } VFIORegion;
 
+typedef struct VFIOMigration {
+    VFIORegion region;
+} VFIOMigration;
+
 typedef struct VFIOAddressSpace {
     AddressSpace *as;
     QLIST_HEAD(, VFIOContainer) containers;
@@ -113,6 +117,8 @@ typedef struct VFIODevice {
     unsigned int num_irqs;
     unsigned int num_regions;
     unsigned int flags;
+    VFIOMigration *migration;
+    Error *migration_blocker;
 } VFIODevice;
 
 struct VFIODeviceOps {
@@ -204,4 +210,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
 int vfio_spapr_remove_window(VFIOContainer *container,
                              hwaddr offset_within_address_space);
 
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
+void vfio_migration_finalize(VFIODevice *vbasedev);
+
 #endif /* HW_VFIO_VFIO_COMMON_H */
-- 
2.7.0




* [PATCH v16 QEMU 06/16] vfio: Add VM state change handler to know state of VM
  2020-03-24 21:08 [PATCH v16 QEMU 00/16] Add migration support for VFIO devices Kirti Wankhede
                   ` (4 preceding siblings ...)
  2020-03-24 21:09 ` [PATCH v16 QEMU 05/16] vfio: Add migration region initialization and finalize function Kirti Wankhede
@ 2020-03-24 21:09 ` Kirti Wankhede
  2020-03-24 21:09 ` [PATCH v16 QEMU 07/16] vfio: Add migration state change notifier Kirti Wankhede
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 74+ messages in thread
From: Kirti Wankhede @ 2020-03-24 21:09 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

The VM state change handler is called whenever the VM's state changes. It is
used to set the _RUNNING bit of the VFIO device state when the VM starts
running and to clear it when the VM stops.
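
As a rough illustration (not part of the patch), the mask/value update used
for the device state can be sketched standalone. The SKETCH_ names and the bit
positions are assumptions for this example only; the real definitions come
from <linux/vfio.h>.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit layout for illustration; real values live in <linux/vfio.h>. */
#define SKETCH_STATE_RUNNING   (UINT32_C(1) << 0)
#define SKETCH_STATE_SAVING    (UINT32_C(1) << 1)
#define SKETCH_STATE_RESUMING  (UINT32_C(1) << 2)

/* Keep the bits selected by mask, then set the bits given in value. */
static uint32_t sketch_next_state(uint32_t cur, uint32_t mask, uint32_t value)
{
    return (cur & mask) | value;
}

/* Choose mask and value for a run-state change the way the handler does. */
static uint32_t sketch_on_vm_change(uint32_t cur, int running)
{
    uint32_t mask = 0, value = 0;

    if (running) {
        value = SKETCH_STATE_RUNNING;
        if (cur & SKETCH_STATE_RESUMING) {
            mask = ~SKETCH_STATE_RESUMING;
        }
    } else {
        mask = ~SKETCH_STATE_RUNNING;
    }
    return sketch_next_state(cur, mask, value);
}
```

Note that when the VM starts running and the device is not resuming, the mask
is 0, so every previously set bit is dropped and only _RUNNING remains.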

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c           | 87 +++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events          |  2 +
 include/hw/vfio/vfio-common.h |  4 ++
 3 files changed, 93 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index a078dcf1dd8f..af9443c275fb 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -10,6 +10,7 @@
 #include "qemu/osdep.h"
 #include <linux/vfio.h>
 
+#include "sysemu/runstate.h"
 #include "hw/vfio/vfio-common.h"
 #include "cpu.h"
 #include "migration/migration.h"
@@ -74,6 +75,85 @@ err:
     return ret;
 }
 
+static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
+                                    uint32_t value)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region;
+    uint32_t device_state;
+    int ret;
+
+    ret = pread(vbasedev->fd, &device_state, sizeof(device_state),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                              device_state));
+    if (ret < 0) {
+        error_report("%s: Failed to read device state %d %s",
+                     vbasedev->name, ret, strerror(errno));
+        return ret;
+    }
+
+    device_state = (device_state & mask) | value;
+
+    if (!VFIO_DEVICE_STATE_VALID(device_state)) {
+        return -EINVAL;
+    }
+
+    ret = pwrite(vbasedev->fd, &device_state, sizeof(device_state),
+                 region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                              device_state));
+    if (ret < 0) {
+        error_report("%s: Failed to set device state %d %s",
+                     vbasedev->name, ret, strerror(errno));
+
+        ret = pread(vbasedev->fd, &device_state, sizeof(device_state),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                device_state));
+        if (ret < 0) {
+            error_report("%s: On failure, failed to read device state %d %s",
+                    vbasedev->name, ret, strerror(errno));
+            return ret;
+        }
+
+        if (VFIO_DEVICE_STATE_IS_ERROR(device_state)) {
+            error_report("%s: Device is in error state 0x%x",
+                         vbasedev->name, device_state);
+            return -EFAULT;
+        }
+    }
+
+    vbasedev->device_state = device_state;
+    trace_vfio_migration_set_state(vbasedev->name, device_state);
+    return 0;
+}
+
+static void vfio_vmstate_change(void *opaque, int running, RunState state)
+{
+    VFIODevice *vbasedev = opaque;
+
+    if ((vbasedev->vm_running != running)) {
+        int ret;
+        uint32_t value = 0, mask = 0;
+
+        if (running) {
+            value = VFIO_DEVICE_STATE_RUNNING;
+            if (vbasedev->device_state & VFIO_DEVICE_STATE_RESUMING) {
+                mask = ~VFIO_DEVICE_STATE_RESUMING;
+            }
+        } else {
+            mask = ~VFIO_DEVICE_STATE_RUNNING;
+        }
+
+        ret = vfio_migration_set_state(vbasedev, mask, value);
+        if (ret) {
+            error_report("%s: Failed to set device state 0x%x",
+                         vbasedev->name, value & mask);
+        }
+        vbasedev->vm_running = running;
+        trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
+                                  value & mask);
+    }
+}
+
 static int vfio_migration_init(VFIODevice *vbasedev,
                                struct vfio_region_info *info)
 {
@@ -90,6 +170,9 @@ static int vfio_migration_init(VFIODevice *vbasedev,
         return ret;
     }
 
+    vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
+                                                          vbasedev);
+
     return 0;
 }
 
@@ -128,6 +211,10 @@ add_blocker:
 
 void vfio_migration_finalize(VFIODevice *vbasedev)
 {
+    if (vbasedev->vm_state) {
+        qemu_del_vm_change_state_handler(vbasedev->vm_state);
+    }
+
     if (vbasedev->migration_blocker) {
         migrate_del_blocker(vbasedev->migration_blocker);
         error_free(vbasedev->migration_blocker);
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 191a726a1312..3d15bacd031a 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -146,3 +146,5 @@ vfio_display_edid_write_error(void) ""
 
 # migration.c
 vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
+vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
+vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index d4b268641173..3d18eb146b33 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -29,6 +29,7 @@
 #ifdef CONFIG_LINUX
 #include <linux/vfio.h>
 #endif
+#include "sysemu/sysemu.h"
 
 #define VFIO_MSG_PREFIX "vfio %s: "
 
@@ -119,6 +120,9 @@ typedef struct VFIODevice {
     unsigned int flags;
     VFIOMigration *migration;
     Error *migration_blocker;
+    VMChangeStateEntry *vm_state;
+    uint32_t device_state;
+    int vm_running;
 } VFIODevice;
 
 struct VFIODeviceOps {
-- 
2.7.0




* [PATCH v16 QEMU 07/16] vfio: Add migration state change notifier
  2020-03-24 21:08 [PATCH v16 QEMU 00/16] Add migration support for VFIO devices Kirti Wankhede
                   ` (5 preceding siblings ...)
  2020-03-24 21:09 ` [PATCH v16 QEMU 06/16] vfio: Add VM state change handler to know state of VM Kirti Wankhede
@ 2020-03-24 21:09 ` Kirti Wankhede
  2020-04-01 11:27   ` Dr. David Alan Gilbert
  2020-03-24 21:09 ` [PATCH v16 QEMU 08/16] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 74+ messages in thread
From: Kirti Wankhede @ 2020-03-24 21:09 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Add a migration state change notifier to get notified of migration state
changes. These states are translated to the VFIO device state and conveyed to
the vendor driver.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c           | 29 +++++++++++++++++++++++++++++
 hw/vfio/trace-events          |  1 +
 include/hw/vfio/vfio-common.h |  1 +
 3 files changed, 31 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index af9443c275fb..22ded9d28cf3 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -154,6 +154,27 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state)
     }
 }
 
+static void vfio_migration_state_notifier(Notifier *notifier, void *data)
+{
+    MigrationState *s = data;
+    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
+    int ret;
+
+    trace_vfio_migration_state_notifier(vbasedev->name, s->state);
+
+    switch (s->state) {
+    case MIGRATION_STATUS_CANCELLING:
+    case MIGRATION_STATUS_CANCELLED:
+    case MIGRATION_STATUS_FAILED:
+        ret = vfio_migration_set_state(vbasedev,
+                      ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING),
+                      VFIO_DEVICE_STATE_RUNNING);
+        if (ret) {
+            error_report("%s: Failed to set state RUNNING", vbasedev->name);
+        }
+    }
+}
+
 static int vfio_migration_init(VFIODevice *vbasedev,
                                struct vfio_region_info *info)
 {
@@ -173,6 +194,9 @@ static int vfio_migration_init(VFIODevice *vbasedev,
     vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
                                                           vbasedev);
 
+    vbasedev->migration_state.notify = vfio_migration_state_notifier;
+    add_migration_state_change_notifier(&vbasedev->migration_state);
+
     return 0;
 }
 
@@ -211,6 +235,11 @@ add_blocker:
 
 void vfio_migration_finalize(VFIODevice *vbasedev)
 {
+
+    if (vbasedev->migration_state.notify) {
+        remove_migration_state_change_notifier(&vbasedev->migration_state);
+    }
+
     if (vbasedev->vm_state) {
         qemu_del_vm_change_state_handler(vbasedev->vm_state);
     }
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 3d15bacd031a..69503228f20e 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -148,3 +148,4 @@ vfio_display_edid_write_error(void) ""
 vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
 vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
 vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
+vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 3d18eb146b33..28f55f66d019 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -123,6 +123,7 @@ typedef struct VFIODevice {
     VMChangeStateEntry *vm_state;
     uint32_t device_state;
     int vm_running;
+    Notifier migration_state;
 } VFIODevice;
 
 struct VFIODeviceOps {
-- 
2.7.0




* [PATCH v16 QEMU 08/16] vfio: Register SaveVMHandlers for VFIO device
  2020-03-24 21:08 [PATCH v16 QEMU 00/16] Add migration support for VFIO devices Kirti Wankhede
                   ` (6 preceding siblings ...)
  2020-03-24 21:09 ` [PATCH v16 QEMU 07/16] vfio: Add migration state change notifier Kirti Wankhede
@ 2020-03-24 21:09 ` Kirti Wankhede
  2020-03-25 21:02   ` Alex Williamson
  2020-04-01 17:36   ` Dr. David Alan Gilbert
  2020-03-24 21:09 ` [PATCH v16 QEMU 09/16] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
                   ` (9 subsequent siblings)
  17 siblings, 2 replies; 74+ messages in thread
From: Kirti Wankhede @ 2020-03-24 21:09 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Define flags to be used as delimiters in the migration file stream.
Add .save_setup and .save_cleanup functions. These map and unmap the migration
region at the source during the saving or pre-copy phase.
Set the VFIO device state depending on the VM's state. During live migration,
the VM is running when .save_setup is called, so the _SAVING | _RUNNING state is
set for the VFIO device. During save-restore, the VM is paused, so the _SAVING
state is set for the VFIO device.
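
As a rough illustration (not part of the patch), the setup section written by
.save_setup can be sketched as three big-endian 64-bit words: the setup flag,
the migration region size, and the end-of-state flag. The flag values are the
ones defined in this patch; the SKETCH_ names and the buffer-based helpers are
hypothetical stand-ins for the QEMUFile API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Delimiter values as defined in this patch. */
#define SKETCH_FLAG_END_OF_STATE     UINT64_C(0xffffffffef100001)
#define SKETCH_FLAG_DEV_SETUP_STATE  UINT64_C(0xffffffffef100003)

/* Store v at p in big-endian byte order, the order used on the wire. */
static void sketch_put_be64(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++) {
        p[i] = (uint8_t)(v >> (56 - 8 * i));
    }
}

/* Read back a big-endian 64-bit value from p. */
static uint64_t sketch_get_be64(const uint8_t *p)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++) {
        v = (v << 8) | p[i];
    }
    return v;
}

/*
 * Emit the setup section into buf (at least 24 bytes): start flag,
 * region size, end flag. Returns the number of bytes written.
 */
static size_t sketch_write_setup(uint8_t *buf, uint64_t region_size)
{
    sketch_put_be64(buf, SKETCH_FLAG_DEV_SETUP_STATE);
    sketch_put_be64(buf + 8, region_size);
    sketch_put_be64(buf + 16, SKETCH_FLAG_END_OF_STATE);
    return 24;
}
```

The all-ones upper 32 bits make a delimiter unlikely to be confused with a
plain size or offset value when the destination parses the stream.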

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c  | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events |  2 ++
 2 files changed, 78 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 22ded9d28cf3..033f76526e49 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -8,6 +8,7 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/main-loop.h"
 #include <linux/vfio.h>
 
 #include "sysemu/runstate.h"
@@ -24,6 +25,17 @@
 #include "pci.h"
 #include "trace.h"
 
+/*
+ * Flags used as delimiter:
+ * 0xffffffff => MSB 32-bit all 1s
+ * 0xef10     => emulated (virtual) function IO
+ * 0x0000     => 16-bits reserved for flags
+ */
+#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
+#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
+#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
+#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
+
 static void vfio_migration_region_exit(VFIODevice *vbasedev)
 {
     VFIOMigration *migration = vbasedev->migration;
@@ -126,6 +138,69 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
     return 0;
 }
 
+/* ---------------------------------------------------------------------- */
+
+static int vfio_save_setup(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
+
+    if (migration->region.mmaps) {
+        qemu_mutex_lock_iothread();
+        ret = vfio_region_mmap(&migration->region);
+        qemu_mutex_unlock_iothread();
+        if (ret) {
+            error_report("%s: Failed to mmap VFIO migration region %d: %s",
+                         vbasedev->name, migration->region.index,
+                         strerror(-ret));
+            return ret;
+        }
+    }
+
+    ret = vfio_migration_set_state(vbasedev, ~0, VFIO_DEVICE_STATE_SAVING);
+    if (ret) {
+        error_report("%s: Failed to set state SAVING", vbasedev->name);
+        return ret;
+    }
+
+    /*
+     * Save the migration region size. The destination uses it to verify that
+     * its own migration region is at least as large as the source's.
+     */
+    qemu_put_be64(f, migration->region.size);
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    trace_vfio_save_setup(vbasedev->name);
+    return 0;
+}
+
+static void vfio_save_cleanup(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    if (migration->region.mmaps) {
+        vfio_region_unmap(&migration->region);
+    }
+    trace_vfio_save_cleanup(vbasedev->name);
+}
+
+static SaveVMHandlers savevm_vfio_handlers = {
+    .save_setup = vfio_save_setup,
+    .save_cleanup = vfio_save_cleanup,
+};
+
+/* ---------------------------------------------------------------------- */
+
 static void vfio_vmstate_change(void *opaque, int running, RunState state)
 {
     VFIODevice *vbasedev = opaque;
@@ -191,6 +266,7 @@ static int vfio_migration_init(VFIODevice *vbasedev,
         return ret;
     }
 
+    register_savevm_live("vfio", -1, 1, &savevm_vfio_handlers, vbasedev);
     vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
                                                           vbasedev);
 
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 69503228f20e..4bb43f18f315 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -149,3 +149,5 @@ vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
 vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
 vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
 vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
+vfio_save_setup(char *name) " (%s)"
+vfio_save_cleanup(char *name) " (%s)"
-- 
2.7.0




* [PATCH v16 QEMU 09/16] vfio: Add save state functions to SaveVMHandlers
  2020-03-24 21:08 [PATCH v16 QEMU 00/16] Add migration support for VFIO devices Kirti Wankhede
                   ` (7 preceding siblings ...)
  2020-03-24 21:09 ` [PATCH v16 QEMU 08/16] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
@ 2020-03-24 21:09 ` Kirti Wankhede
  2020-03-25 22:03   ` Alex Williamson
  2020-05-09  5:31   ` Yan Zhao
  2020-03-24 21:09 ` [PATCH v16 QEMU 10/16] vfio: Add load " Kirti Wankhede
                   ` (8 subsequent siblings)
  17 siblings, 2 replies; 74+ messages in thread
From: Kirti Wankhede @ 2020-03-24 21:09 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Add .save_live_pending, .save_live_iterate and .save_live_complete_precopy
functions. These functions handle the pre-copy and stop-and-copy phases.

In the _SAVING|_RUNNING device state (pre-copy phase):
- Read pending_bytes. If pending_bytes > 0, go through the steps below.
- Read data_offset. This signals the kernel driver to write data to the
  staging buffer.
- Read data_size, the amount of data in bytes written by the vendor driver
  in the migration region.
- Read data_size bytes of data from data_offset in the migration region.
- Write the data packet to the file stream as below:
{VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
VFIO_MIG_FLAG_END_OF_STATE }

In the _SAVING device state (stop-and-copy phase):
a. Read the config space of the device and save it to the migration file
   stream. This doesn't need to come from the vendor driver. Any other
   special config state from the driver can be saved as data in a following
   iteration.
b. Read pending_bytes. If pending_bytes > 0, go through the steps below.
c. Read data_offset. This signals the kernel driver to write data to the
   staging buffer.
d. Read data_size, the amount of data in bytes written by the vendor driver
   in the migration region.
e. Read data_size bytes of data from data_offset in the migration region.
f. Write the data packet as below:
   {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
g. Iterate through steps b to f while (pending_bytes > 0).
h. Write {VFIO_MIG_FLAG_END_OF_STATE}.

When the data region is mapped, it is the user's responsibility to read
data_size bytes of data from data_offset before moving to the next steps.
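
The stop-and-copy steps b to h can be sketched against a mock device as
follows. This is only an illustration under stated assumptions: struct
mock_dev, put_be64() and save_stop_and_copy() are hypothetical names, the
"driver" is a memory buffer handed out in fixed-size chunks, and the output
is a plain byte array rather than a QEMUFile. Unlike the patch, the sketch
checks for an empty chunk before writing the packet header.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Delimiter values copied from patch 08 of this series. */
#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)

/* Hypothetical mock of a vendor driver that stages state in chunks. */
struct mock_dev {
    const uint8_t *state;   /* full device state to migrate */
    uint64_t remaining;     /* plays the role of pending_bytes */
    uint64_t chunk;         /* maximum bytes staged per iteration */
};

/* Hypothetical stand-in for qemu_put_be64(): append a big-endian u64. */
static size_t put_be64(uint8_t *out, size_t pos, uint64_t v)
{
    for (int i = 0; i < 8; i++) {
        out[pos + i] = (uint8_t)(v >> (56 - 8 * i));
    }
    return pos + 8;
}

/* Steps b to h; returns the total number of bytes written to out. */
static size_t save_stop_and_copy(struct mock_dev *d, uint8_t *out)
{
    const uint8_t *src = d->state;
    size_t pos = 0;

    while (d->remaining > 0) {                          /* steps b and g */
        /* Steps c and d: the mock "stages" at most one chunk. */
        uint64_t data_size = d->remaining < d->chunk ? d->remaining
                                                     : d->chunk;
        if (data_size == 0) {
            break;                                      /* nothing staged */
        }

        /* Steps e and f: copy the staged bytes into a data packet. */
        pos = put_be64(out, pos, VFIO_MIG_FLAG_DEV_DATA_STATE);
        pos = put_be64(out, pos, data_size);
        memcpy(out + pos, src, (size_t)data_size);
        pos += (size_t)data_size;

        src += data_size;
        d->remaining -= data_size;
    }

    return put_be64(out, pos, VFIO_MIG_FLAG_END_OF_STATE);  /* step h */
}
```

With 5 bytes of state and a chunk size of 2, the sketch emits three data
packets (2, 2 and 1 bytes) followed by a single end-of-state word.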

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c           | 245 +++++++++++++++++++++++++++++++++++++++++-
 hw/vfio/trace-events          |   6 ++
 include/hw/vfio/vfio-common.h |   1 +
 3 files changed, 251 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 033f76526e49..ecbeed5182c2 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -138,6 +138,137 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
     return 0;
 }
 
+static void *find_data_region(VFIORegion *region,
+                              uint64_t data_offset,
+                              uint64_t data_size)
+{
+    void *ptr = NULL;
+    int i;
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        if ((data_offset >= region->mmaps[i].offset) &&
+            (data_offset < region->mmaps[i].offset + region->mmaps[i].size) &&
+            (data_size <= region->mmaps[i].size)) {
+            ptr = region->mmaps[i].mmap + (data_offset -
+                                           region->mmaps[i].offset);
+            break;
+        }
+    }
+    return ptr;
+}
+
+static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region;
+    uint64_t data_offset = 0, data_size = 0;
+    int ret;
+
+    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             data_offset));
+    if (ret != sizeof(data_offset)) {
+        error_report("%s: Failed to get migration buffer data offset %d",
+                     vbasedev->name, ret);
+        return -EINVAL;
+    }
+
+    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             data_size));
+    if (ret != sizeof(data_size)) {
+        error_report("%s: Failed to get migration buffer data size %d",
+                     vbasedev->name, ret);
+        return -EINVAL;
+    }
+
+    if (data_size > 0) {
+        void *buf = NULL;
+        bool buffer_mmaped;
+
+        if (region->mmaps) {
+            buf = find_data_region(region, data_offset, data_size);
+        }
+
+        buffer_mmaped = (buf != NULL) ? true : false;
+
+        if (!buffer_mmaped) {
+            buf = g_try_malloc0(data_size);
+            if (!buf) {
+                error_report("%s: Error allocating buffer ", __func__);
+                return -ENOMEM;
+            }
+
+            ret = pread(vbasedev->fd, buf, data_size,
+                        region->fd_offset + data_offset);
+            if (ret != data_size) {
+                error_report("%s: Failed to get migration data %d",
+                             vbasedev->name, ret);
+                g_free(buf);
+                return -EINVAL;
+            }
+        }
+
+        qemu_put_be64(f, data_size);
+        qemu_put_buffer(f, buf, data_size);
+
+        if (!buffer_mmaped) {
+            g_free(buf);
+        }
+    } else {
+        qemu_put_be64(f, data_size);
+    }
+
+    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
+                           migration->pending_bytes);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    return data_size;
+}
+
+static int vfio_update_pending(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region;
+    uint64_t pending_bytes = 0;
+    int ret;
+
+    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             pending_bytes));
+    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
+        error_report("%s: Failed to get pending bytes %d",
+                     vbasedev->name, ret);
+        migration->pending_bytes = 0;
+        return (ret < 0) ? ret : -EINVAL;
+    }
+
+    migration->pending_bytes = pending_bytes;
+    trace_vfio_update_pending(vbasedev->name, pending_bytes);
+    return 0;
+}
+
+static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
+
+    if (vbasedev->ops && vbasedev->ops->vfio_save_config) {
+        vbasedev->ops->vfio_save_config(vbasedev, f);
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    trace_vfio_save_device_config_state(vbasedev->name);
+
+    return qemu_file_get_error(f);
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -154,7 +285,7 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
         qemu_mutex_unlock_iothread();
         if (ret) {
             error_report("%s: Failed to mmap VFIO migration region %d: %s",
-                         vbasedev->name, migration->region.index,
+                         vbasedev->name, migration->region.nr,
                          strerror(-ret));
             return ret;
         }
@@ -194,9 +325,121 @@ static void vfio_save_cleanup(void *opaque)
     trace_vfio_save_cleanup(vbasedev->name);
 }
 
+static void vfio_save_pending(QEMUFile *f, void *opaque,
+                              uint64_t threshold_size,
+                              uint64_t *res_precopy_only,
+                              uint64_t *res_compatible,
+                              uint64_t *res_postcopy_only)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    ret = vfio_update_pending(vbasedev);
+    if (ret) {
+        return;
+    }
+
+    *res_precopy_only += migration->pending_bytes;
+
+    trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
+                            *res_postcopy_only, *res_compatible);
+}
+
+static int vfio_save_iterate(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    int ret, data_size;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
+
+    data_size = vfio_save_buffer(f, vbasedev);
+
+    if (data_size < 0) {
+        error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
+                     strerror(-data_size));
+        return data_size;
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    trace_vfio_save_iterate(vbasedev->name, data_size);
+    if (data_size == 0) {
+        /* indicates data finished, goto complete phase */
+        return 1;
+    }
+
+    return 0;
+}
+
+static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_RUNNING,
+                                   VFIO_DEVICE_STATE_SAVING);
+    if (ret) {
+        error_report("%s: Failed to set state STOP and SAVING",
+                     vbasedev->name);
+        return ret;
+    }
+
+    ret = vfio_save_device_config_state(f, opaque);
+    if (ret) {
+        return ret;
+    }
+
+    ret = vfio_update_pending(vbasedev);
+    if (ret) {
+        return ret;
+    }
+
+    while (migration->pending_bytes > 0) {
+        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
+        ret = vfio_save_buffer(f, vbasedev);
+        if (ret < 0) {
+            error_report("%s: Failed to save buffer", vbasedev->name);
+            return ret;
+        } else if (ret == 0) {
+            break;
+        }
+
+        ret = vfio_update_pending(vbasedev);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_SAVING, 0);
+    if (ret) {
+        error_report("%s: Failed to set state STOPPED", vbasedev->name);
+        return ret;
+    }
+
+    trace_vfio_save_complete_precopy(vbasedev->name);
+    return ret;
+}
+
 static SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
+    .save_live_pending = vfio_save_pending,
+    .save_live_iterate = vfio_save_iterate,
+    .save_live_complete_precopy = vfio_save_complete_precopy,
 };
 
 /* ---------------------------------------------------------------------- */
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 4bb43f18f315..bdf40ba368c7 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -151,3 +151,9 @@ vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_st
 vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
 vfio_save_setup(char *name) " (%s)"
 vfio_save_cleanup(char *name) " (%s)"
+vfio_save_buffer(char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
+vfio_update_pending(char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
+vfio_save_device_config_state(char *name) " (%s)"
+vfio_save_pending(char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
+vfio_save_iterate(char *name, int data_size) " (%s) data_size %d"
+vfio_save_complete_precopy(char *name) " (%s)"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 28f55f66d019..c78033e4149d 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -60,6 +60,7 @@ typedef struct VFIORegion {
 
 typedef struct VFIOMigration {
     VFIORegion region;
+    uint64_t pending_bytes;
 } VFIOMigration;
 
 typedef struct VFIOAddressSpace {
-- 
2.7.0




* [PATCH v16 QEMU 10/16] vfio: Add load state functions to SaveVMHandlers
  2020-03-24 21:08 [PATCH v16 QEMU 00/16] Add migration support for VFIO devices Kirti Wankhede
                   ` (8 preceding siblings ...)
  2020-03-24 21:09 ` [PATCH v16 QEMU 09/16] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
@ 2020-03-24 21:09 ` Kirti Wankhede
  2020-03-25 22:36   ` Alex Williamson
  2020-04-01 18:58   ` Dr. David Alan Gilbert
  2020-03-24 21:09 ` [PATCH v16 QEMU 11/16] iommu: add callback to get address limit IOMMU supports Kirti Wankhede
                   ` (7 subsequent siblings)
  17 siblings, 2 replies; 74+ messages in thread
From: Kirti Wankhede @ 2020-03-24 21:09 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Sequence during the _RESUMING device state:
While data for this device is available, repeat the steps below:
a. Read data_offset, the offset at which the user application should write
   data.
b. Write data_size bytes of data to the migration region at data_offset.
c. Write data_size, which tells the vendor driver that the data has been
   written to the staging buffer.

The data is opaque to the user. The user should write data in the same
order as it was received.
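
A minimal sketch of the resume sequence, assuming a mock device whose
data_offset always points at the start of a small staging buffer. The names
struct mock_dev, get_be64(), dev_write_data_size() and load_device_data()
are hypothetical, and the input is a plain byte array rather than a
QEMUFile. The stream layout used here is the stop-and-copy one: data packets
followed by one end-of-state word.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Delimiter values copied from patch 08 of this series. */
#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)

/* Hypothetical mock device: a staging buffer plus the restored state. */
struct mock_dev {
    uint8_t staging[16];    /* staging area inside the migration region */
    uint8_t restored[64];   /* state the mock "driver" has consumed */
    size_t restored_len;
};

/* Hypothetical stand-in for qemu_get_be64(); advances *pos by 8. */
static uint64_t get_be64(const uint8_t *in, size_t *pos)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++) {
        v = (v << 8) | in[*pos + i];
    }
    *pos += 8;
    return v;
}

/* Step c: writing data_size tells the mock driver to consume the buffer. */
static void dev_write_data_size(struct mock_dev *d, uint64_t data_size)
{
    memcpy(d->restored + d->restored_len, d->staging, (size_t)data_size);
    d->restored_len += (size_t)data_size;
}

/* Returns 0 once the end-of-state word is seen, -1 on a malformed stream. */
static int load_device_data(struct mock_dev *d, const uint8_t *in, size_t len)
{
    size_t pos = 0;

    while (pos + 8 <= len) {
        uint64_t flag = get_be64(in, &pos);

        if (flag == VFIO_MIG_FLAG_END_OF_STATE) {
            return 0;
        }
        if (flag != VFIO_MIG_FLAG_DEV_DATA_STATE || pos + 8 > len) {
            return -1;
        }

        uint64_t data_size = get_be64(in, &pos);
        if (data_size > sizeof(d->staging) || data_size > len - pos ||
            data_size > sizeof(d->restored) - d->restored_len) {
            return -1;
        }

        /* Steps a and b: data_offset is the start of staging in this mock. */
        memcpy(d->staging, in + pos, (size_t)data_size);
        pos += (size_t)data_size;

        dev_write_data_size(d, data_size);              /* step c */
    }
    return -1;                                          /* no end flag */
}
```

Because the packets are applied in stream order and the payload is never
interpreted, the sketch also reflects the requirement that the data stays
opaque and ordered.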

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c  | 179 +++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events |   3 +
 2 files changed, 182 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index ecbeed5182c2..ab295d25620e 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -269,6 +269,33 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
     return qemu_file_get_error(f);
 }
 
+static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    uint64_t data;
+
+    if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
+        int ret;
+
+        ret = vbasedev->ops->vfio_load_config(vbasedev, f);
+        if (ret) {
+            error_report("%s: Failed to load device config space",
+                         vbasedev->name);
+            return ret;
+        }
+    }
+
+    data = qemu_get_be64(f);
+    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
+        error_report("%s: Failed loading device config space, "
+                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
+        return -EINVAL;
+    }
+
+    trace_vfio_load_device_config_state(vbasedev->name);
+    return qemu_file_get_error(f);
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -434,12 +461,164 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     return ret;
 }
 
+static int vfio_load_setup(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret = 0;
+
+    if (migration->region.mmaps) {
+        ret = vfio_region_mmap(&migration->region);
+        if (ret) {
+            error_report("%s: Failed to mmap VFIO migration region %d: %s",
+                         vbasedev->name, migration->region.nr,
+                         strerror(-ret));
+            return ret;
+        }
+    }
+
+    ret = vfio_migration_set_state(vbasedev, ~0, VFIO_DEVICE_STATE_RESUMING);
+    if (ret) {
+        error_report("%s: Failed to set state RESUMING", vbasedev->name);
+    }
+    return ret;
+}
+
+static int vfio_load_cleanup(void *opaque)
+{
+    vfio_save_cleanup(opaque);
+    return 0;
+}
+
+static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret = 0;
+    uint64_t data, data_size;
+
+    data = qemu_get_be64(f);
+    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
+
+        trace_vfio_load_state(vbasedev->name, data);
+
+        switch (data) {
+        case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
+        {
+            ret = vfio_load_device_config_state(f, opaque);
+            if (ret) {
+                return ret;
+            }
+            break;
+        }
+        case VFIO_MIG_FLAG_DEV_SETUP_STATE:
+        {
+            uint64_t region_size = qemu_get_be64(f);
+
+            if (migration->region.size < region_size) {
+                error_report("%s: SETUP STATE: migration region too small, "
+                             "0x%"PRIx64 " < 0x%"PRIx64, vbasedev->name,
+                             migration->region.size, region_size);
+                return -EINVAL;
+            }
+
+            data = qemu_get_be64(f);
+            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
+                return ret;
+            } else {
+                error_report("%s: SETUP STATE: EOS not found 0x%"PRIx64,
+                             vbasedev->name, data);
+                return -EINVAL;
+            }
+            break;
+        }
+        case VFIO_MIG_FLAG_DEV_DATA_STATE:
+        {
+            VFIORegion *region = &migration->region;
+            void *buf = NULL;
+            bool buffer_mmaped = false;
+            uint64_t data_offset = 0;
+
+            data_size = qemu_get_be64(f);
+            if (data_size == 0) {
+                break;
+            }
+
+            ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
+                        region->fd_offset +
+                        offsetof(struct vfio_device_migration_info,
+                        data_offset));
+            if (ret != sizeof(data_offset)) {
+                error_report("%s:Failed to get migration buffer data offset %d",
+                             vbasedev->name, ret);
+                return -EINVAL;
+            }
+
+            if (region->mmaps) {
+                buf = find_data_region(region, data_offset, data_size);
+            }
+
+            buffer_mmaped = (buf != NULL) ? true : false;
+
+            if (!buffer_mmaped) {
+                buf = g_try_malloc0(data_size);
+                if (!buf) {
+                    error_report("%s: Error allocating buffer ", __func__);
+                    return -ENOMEM;
+                }
+            }
+
+            qemu_get_buffer(f, buf, data_size);
+
+            if (!buffer_mmaped) {
+                ret = pwrite(vbasedev->fd, buf, data_size,
+                             region->fd_offset + data_offset);
+                g_free(buf);
+
+                if (ret != data_size) {
+                    error_report("%s: Failed to set migration buffer %d",
+                                 vbasedev->name, ret);
+                    return -EINVAL;
+                }
+            }
+
+            ret = pwrite(vbasedev->fd, &data_size, sizeof(data_size),
+                         region->fd_offset +
+                       offsetof(struct vfio_device_migration_info, data_size));
+            if (ret != sizeof(data_size)) {
+                error_report("%s: Failed to set migration buffer data size %d",
+                             vbasedev->name, ret);
+                /*
+                 * buf, if allocated, was already freed after the pwrite above
+                 */
+                return -EINVAL;
+            }
+
+            trace_vfio_load_state_device_data(vbasedev->name, data_offset,
+                                              data_size);
+            break;
+        }
+        }
+
+        ret = qemu_file_get_error(f);
+        if (ret) {
+            return ret;
+        }
+        data = qemu_get_be64(f);
+    }
+
+    return ret;
+}
+
 static SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
     .save_live_pending = vfio_save_pending,
     .save_live_iterate = vfio_save_iterate,
     .save_live_complete_precopy = vfio_save_complete_precopy,
+    .load_setup = vfio_load_setup,
+    .load_cleanup = vfio_load_cleanup,
+    .load_state = vfio_load_state,
 };
 
 /* ---------------------------------------------------------------------- */
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index bdf40ba368c7..ac065b559f4e 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -157,3 +157,6 @@ vfio_save_device_config_state(char *name) " (%s)"
 vfio_save_pending(char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
 vfio_save_iterate(char *name, int data_size) " (%s) data_size %d"
 vfio_save_complete_precopy(char *name) " (%s)"
+vfio_load_device_config_state(char *name) " (%s)"
+vfio_load_state(char *name, uint64_t data) " (%s) data 0x%"PRIx64
+vfio_load_state_device_data(char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
-- 
2.7.0




* [PATCH v16 QEMU 11/16] iommu: add callback to get address limit IOMMU supports
  2020-03-24 21:08 [PATCH v16 QEMU 00/16] Add migration support for VFIO devices Kirti Wankhede
                   ` (9 preceding siblings ...)
  2020-03-24 21:09 ` [PATCH v16 QEMU 10/16] vfio: Add load " Kirti Wankhede
@ 2020-03-24 21:09 ` Kirti Wankhede
  2020-03-24 21:09 ` [PATCH v16 QEMU 12/16] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled Kirti Wankhede
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 74+ messages in thread
From: Kirti Wankhede @ 2020-03-24 21:09 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Add an optional method to get the address limit that the IOMMU supports.
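
The optional-method pattern used here can be sketched in isolation as
follows. The types struct iommu_ops and struct demo_iommu and both functions
are hypothetical and only mirror the shape of the callback; the limit
computation assumes an address width below 64 bits and that the limit is
the highest address representable in that width.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t hwaddr;    /* stand-in for QEMU's hwaddr type */

/* Hypothetical class with one optional method. */
struct iommu_ops {
    hwaddr (*get_address_limit)(const void *iommu);   /* may be NULL */
};

/* Hypothetical IOMMU state carrying only its address width in bits. */
struct demo_iommu {
    unsigned aw_bits;       /* assumed to be less than 64 */
};

/* Highest address representable with aw_bits address bits. */
static hwaddr demo_get_address_limit(const void *iommu)
{
    const struct demo_iommu *s = iommu;

    return (1ULL << s->aw_bits) - 1;
}

/* Use the method when present, else fall back to the caller's limit. */
static hwaddr iommu_get_address_limit(const struct iommu_ops *ops,
                                      const void *iommu, hwaddr limit)
{
    if (ops->get_address_limit) {
        return ops->get_address_limit(iommu);
    }
    return limit;
}
```

Passing the fallback as an argument lets callers that already know a
sensible bound keep using it for IOMMU models that do not implement the
method.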

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
---
 hw/i386/intel_iommu.c |  9 +++++++++
 include/exec/memory.h | 19 +++++++++++++++++++
 memory.c              | 11 +++++++++++
 3 files changed, 39 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index df7ad254ac15..d0b88c20c31e 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3577,6 +3577,14 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
     return;
 }
 
+static hwaddr vtd_iommu_get_address_limit(IOMMUMemoryRegion *iommu_mr)
+{
+    VTDAddressSpace *vtd_as = container_of(iommu_mr, VTDAddressSpace, iommu);
+    IntelIOMMUState *s = vtd_as->iommu_state;
+
+    return VTD_ADDRESS_SIZE(s->aw_bits) - 1;
+}
+
 /* Do the initialization. It will also be called when reset, so pay
  * attention when adding new initialization stuff.
  */
@@ -3878,6 +3886,7 @@ static void vtd_iommu_memory_region_class_init(ObjectClass *klass,
     imrc->translate = vtd_iommu_translate;
     imrc->notify_flag_changed = vtd_iommu_notify_flag_changed;
     imrc->replay = vtd_iommu_replay;
+    imrc->get_address_limit = vtd_iommu_get_address_limit;
 }
 
 static const TypeInfo vtd_iommu_memory_region_info = {
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 1614d9a02c0c..f7d92bf6e6a9 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -355,6 +355,17 @@ typedef struct IOMMUMemoryRegionClass {
      * @iommu: the IOMMUMemoryRegion
      */
     int (*num_indexes)(IOMMUMemoryRegion *iommu);
+
+    /*
+     * Return address limit this IOMMU supports.
+     *
+     * Optional method: if this method is not provided, then
+     * memory_region_iommu_get_address_limit() will return the limit passed
+     * to it as an input argument.
+     *
+     * @iommu: the IOMMUMemoryRegion
+     */
+    hwaddr (*get_address_limit)(IOMMUMemoryRegion *iommu);
 } IOMMUMemoryRegionClass;
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
@@ -1364,6 +1375,14 @@ int memory_region_iommu_attrs_to_index(IOMMUMemoryRegion *iommu_mr,
 int memory_region_iommu_num_indexes(IOMMUMemoryRegion *iommu_mr);
 
 /**
+ * memory_region_iommu_get_address_limit : return the maximum address limit
+ * that this IOMMU supports.
+ *
+ * @iommu_mr: the memory region
+ */
+hwaddr memory_region_iommu_get_address_limit(IOMMUMemoryRegion *iommu_mr,
+                                             hwaddr limit);
+/**
  * memory_region_name: get a memory region's name
  *
  * Returns the string that was used to initialize the memory region.
diff --git a/memory.c b/memory.c
index 601b74990620..acb7546971c3 100644
--- a/memory.c
+++ b/memory.c
@@ -1887,6 +1887,17 @@ void memory_region_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
     }
 }
 
+hwaddr memory_region_iommu_get_address_limit(IOMMUMemoryRegion *iommu_mr,
+                                             hwaddr limit)
+{
+    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr);
+
+    if (imrc->get_address_limit) {
+        return imrc->get_address_limit(iommu_mr);
+    }
+    return limit;
+}
+
 void memory_region_unregister_iommu_notifier(MemoryRegion *mr,
                                              IOMMUNotifier *n)
 {
-- 
2.7.0




* [PATCH v16 QEMU 12/16] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled
  2020-03-24 21:08 [PATCH v16 QEMU 00/16] Add migration support for VFIO devices Kirti Wankhede
                   ` (10 preceding siblings ...)
  2020-03-24 21:09 ` [PATCH v16 QEMU 11/16] iommu: add callback to get address limit IOMMU supports Kirti Wankhede
@ 2020-03-24 21:09 ` Kirti Wankhede
  2020-04-01 19:00   ` Dr. David Alan Gilbert
  2020-03-24 21:09 ` [PATCH v16 QEMU 13/16] vfio: Add function to start and stop dirty pages tracking Kirti Wankhede
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 74+ messages in thread
From: Kirti Wankhede @ 2020-03-24 21:09 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
---
 memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/memory.c b/memory.c
index acb7546971c3..285ca2ed6dd9 100644
--- a/memory.c
+++ b/memory.c
@@ -1788,7 +1788,7 @@ bool memory_region_is_ram_device(MemoryRegion *mr)
 uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
 {
     uint8_t mask = mr->dirty_log_mask;
-    if (global_dirty_log && mr->ram_block) {
+    if (global_dirty_log && (mr->ram_block || memory_region_is_iommu(mr))) {
         mask |= (1 << DIRTY_MEMORY_MIGRATION);
     }
     return mask;
-- 
2.7.0




* [PATCH v16 QEMU 13/16] vfio: Add function to start and stop dirty pages tracking
  2020-03-24 21:08 [PATCH v16 QEMU 00/16] Add migration support for VFIO devices Kirti Wankhede
                   ` (11 preceding siblings ...)
  2020-03-24 21:09 ` [PATCH v16 QEMU 12/16] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled Kirti Wankhede
@ 2020-03-24 21:09 ` Kirti Wankhede
  2020-03-26 19:10   ` Alex Williamson
  2020-04-01 19:03   ` Dr. David Alan Gilbert
  2020-03-24 21:09 ` [PATCH v16 QEMU 14/16] vfio: Add vfio_listener_log_sync to mark dirty pages Kirti Wankhede
                   ` (4 subsequent siblings)
  17 siblings, 2 replies; 74+ messages in thread
From: Kirti Wankhede @ 2020-03-24 21:09 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

Call VFIO_IOMMU_DIRTY_PAGES ioctl to start and stop dirty pages tracking
for VFIO devices.
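
The start/stop decision made before issuing the ioctl can be sketched as a
pure function. The DEMO_* constants and dirty_tracking_flag() are
hypothetical stand-ins chosen for illustration; they are not the
definitions from <linux/vfio.h>, and no ioctl is issued here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the device state bits. */
#define DEMO_STATE_RUNNING  (1u << 0)
#define DEMO_STATE_SAVING   (1u << 1)

/* Hypothetical stand-ins for the dirty pages ioctl flags. */
#define DEMO_DIRTY_START    (1u << 0)
#define DEMO_DIRTY_STOP     (1u << 1)

/*
 * Return the flag to pass to the ioctl, or 0 when no ioctl should be made.
 * Tracking is only started while the device is in a saving state; a stop
 * request is always forwarded.
 */
static uint32_t dirty_tracking_flag(uint32_t device_state, bool start)
{
    if (!start) {
        return DEMO_DIRTY_STOP;
    }
    if (device_state & DEMO_STATE_SAVING) {
        return DEMO_DIRTY_START;
    }
    return 0;
}
```

Treating a start request outside the saving state as a no-op matches the
early "return 0" in the function added by this patch.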

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
---
 hw/vfio/migration.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index ab295d25620e..1827b7cfb316 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -9,6 +9,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu/main-loop.h"
+#include <sys/ioctl.h>
 #include <linux/vfio.h>
 
 #include "sysemu/runstate.h"
@@ -296,6 +297,32 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
     return qemu_file_get_error(f);
 }
 
+static int vfio_start_dirty_page_tracking(VFIODevice *vbasedev, bool start)
+{
+    int ret;
+    VFIOContainer *container = vbasedev->group->container;
+    struct vfio_iommu_type1_dirty_bitmap dirty = {
+        .argsz = sizeof(dirty),
+    };
+
+    if (start) {
+        if (vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) {
+            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
+        } else {
+            return 0;
+        }
+    } else {
+        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
+    }
+
+    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
+    if (ret) {
+        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
+                     dirty.flags, errno);
+    }
+    return ret;
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -330,6 +357,11 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
      */
     qemu_put_be64(f, migration->region.size);
 
+    ret = vfio_start_dirty_page_tracking(vbasedev, true);
+    if (ret) {
+        return ret;
+    }
+
     qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
 
     ret = qemu_file_get_error(f);
@@ -346,6 +378,8 @@ static void vfio_save_cleanup(void *opaque)
     VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
 
+    vfio_start_dirty_page_tracking(vbasedev, false);
+
     if (migration->region.mmaps) {
         vfio_region_unmap(&migration->region);
     }
@@ -669,6 +703,8 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
         if (ret) {
             error_report("%s: Failed to set state RUNNING", vbasedev->name);
         }
+
+        vfio_start_dirty_page_tracking(vbasedev, false);
     }
 }
 
-- 
2.7.0
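
A minimal sketch, not part of the patch above, of the start/stop decision
made in vfio_start_dirty_page_tracking(), isolated as a pure function so the
three outcomes are easy to see. The DEMO_* constants and
demo_dirty_tracking_flag() are hypothetical stand-ins for illustration and
are not taken from <linux/vfio.h>; DEMO_FLAG_NONE models the path where the
function returns 0 without issuing the ioctl.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the device state bits (assumption). */
#define DEMO_STATE_RUNNING  (1u << 0)
#define DEMO_STATE_SAVING   (1u << 1)
#define DEMO_STATE_RESUMING (1u << 2)
#define DEMO_STATE_MASK \
    (DEMO_STATE_RUNNING | DEMO_STATE_SAVING | DEMO_STATE_RESUMING)

/* Hypothetical stand-ins for the dirty page tracking flags (assumption). */
#define DEMO_FLAG_NONE  0u
#define DEMO_FLAG_START (1u << 0)
#define DEMO_FLAG_STOP  (1u << 1)

/*
 * Pick the flag to pass to the dirty-pages ioctl:
 *  - stop requested            -> STOP, regardless of device state
 *  - start requested, saving   -> START
 *  - start requested otherwise -> NONE (caller skips the ioctl)
 */
static uint32_t demo_dirty_tracking_flag(bool start, uint32_t device_state)
{
    /* Sanity check for the sketch: reject bits outside the known set. */
    assert((device_state & ~DEMO_STATE_MASK) == 0);

    if (!start) {
        return DEMO_FLAG_STOP;
    }
    return (device_state & DEMO_STATE_SAVING) ? DEMO_FLAG_START
                                              : DEMO_FLAG_NONE;
}
```

In this sketch a start request is honoured only while the device is in a
saving state, whereas a stop request is always forwarded, which matches the
unconditional calls from vfio_save_cleanup() and the state notifier.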



^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v16 QEMU 14/16] vfio: Add vfio_listener_log_sync to mark dirty pages
  2020-03-24 21:08 [PATCH v16 QEMU 00/16] Add migration support for VFIO devices Kirti Wankhede
                   ` (12 preceding siblings ...)
  2020-03-24 21:09 ` [PATCH v16 QEMU 13/16] vfio: Add function to start and stop dirty pages tracking Kirti Wankhede
@ 2020-03-24 21:09 ` Kirti Wankhede
  2020-03-25  2:19   ` Yan Zhao
                     ` (2 more replies)
  2020-03-24 21:09 ` [PATCH v16 QEMU 15/16] vfio: Add ioctl to get dirty pages bitmap during dma unmap Kirti Wankhede
                   ` (3 subsequent siblings)
  17 siblings, 3 replies; 74+ messages in thread
From: Kirti Wankhede @ 2020-03-24 21:09 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

vfio_listener_log_sync gets the list of dirty pages from the container
using the VFIO_IOMMU_DIRTY_PAGES ioctl with the
VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP flag, and marks those pages dirty
when all devices are stopped and saving state.
Return early for the RAM block section of a mapped MMIO region.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/common.c     | 200 +++++++++++++++++++++++++++++++++++++++++++++++++--
 hw/vfio/trace-events |   1 +
 2 files changed, 196 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 4a2f0d6a2233..6d41e1ac5c2f 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -29,6 +29,7 @@
 #include "hw/vfio/vfio.h"
 #include "exec/address-spaces.h"
 #include "exec/memory.h"
+#include "exec/ram_addr.h"
 #include "hw/hw.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
@@ -38,6 +39,7 @@
 #include "sysemu/reset.h"
 #include "trace.h"
 #include "qapi/error.h"
+#include "migration/migration.h"
 
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -288,6 +290,28 @@ const MemoryRegionOps vfio_region_ops = {
 };
 
 /*
+ * Device state interfaces
+ */
+
+static bool vfio_devices_are_stopped_and_saving(void)
+{
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            if ((vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) &&
+                !(vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+                continue;
+            } else {
+                return false;
+            }
+        }
+    }
+    return true;
+}
+
+/*
  * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
  */
 static int vfio_dma_unmap(VFIOContainer *container,
@@ -408,8 +432,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
 }
 
 /* Called with rcu_read_lock held.  */
-static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
-                           bool *read_only)
+static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
+                               ram_addr_t *ram_addr, bool *read_only)
 {
     MemoryRegion *mr;
     hwaddr xlat;
@@ -440,9 +464,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
         return false;
     }
 
-    *vaddr = memory_region_get_ram_ptr(mr) + xlat;
-    *read_only = !writable || mr->readonly;
+    if (vaddr) {
+        *vaddr = memory_region_get_ram_ptr(mr) + xlat;
+    }
 
+    if (ram_addr) {
+        *ram_addr = memory_region_get_ram_addr(mr) + xlat;
+    }
+
+    if (read_only) {
+        *read_only = !writable || mr->readonly;
+    }
     return true;
 }
 
@@ -467,7 +499,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     rcu_read_lock();
 
     if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
-        if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
+        if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) {
             goto out;
         }
         /*
@@ -813,9 +845,167 @@ static void vfio_listener_region_del(MemoryListener *listener,
     }
 }
 
+static int vfio_get_dirty_bitmap(MemoryListener *listener,
+                                 MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+    VFIOGuestIOMMU *giommu;
+    IOMMUTLBEntry iotlb;
+    hwaddr granularity, address_limit, iova;
+    int ret;
+
+    if (memory_region_is_iommu(section->mr)) {
+        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
+            if (MEMORY_REGION(giommu->iommu) == section->mr &&
+                giommu->n.start == section->offset_within_region) {
+                break;
+            }
+        }
+
+        if (!giommu) {
+            return -EINVAL;
+        }
+    }
+
+    if (memory_region_is_iommu(section->mr)) {
+        granularity = memory_region_iommu_get_min_page_size(giommu->iommu);
+
+        address_limit = MIN(int128_get64(section->size),
+                            memory_region_iommu_get_address_limit(giommu->iommu,
+                                                 int128_get64(section->size)));
+    } else {
+        granularity = memory_region_size(section->mr);
+        address_limit = int128_get64(section->size);
+    }
+
+    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+
+    RCU_READ_LOCK_GUARD();
+
+    while (iova < address_limit) {
+        struct vfio_iommu_type1_dirty_bitmap *dbitmap;
+        struct vfio_iommu_type1_dirty_bitmap_get *range;
+        ram_addr_t start, pages;
+        uint64_t iova_xlat, size;
+
+        if (memory_region_is_iommu(section->mr)) {
+            iotlb = address_space_get_iotlb_entry(container->space->as, iova,
+                                                 true, MEMTXATTRS_UNSPECIFIED);
+            if ((iotlb.target_as == NULL) || (iotlb.addr_mask == 0)) {
+                if ((iova + granularity) < iova) {
+                    break;
+                }
+                iova += granularity;
+                continue;
+            }
+            iova_xlat = iotlb.iova + giommu->iommu_offset;
+            size = iotlb.addr_mask + 1;
+        } else {
+            iova_xlat = iova;
+            size = address_limit;
+        }
+
+        dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
+        if (!dbitmap) {
+            return -ENOMEM;
+        }
+
+        dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
+        dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
+        range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data;
+        range->iova = iova_xlat;
+        range->size = size;
+
+        /*
+         * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
+         * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap's pgsize to
+         * TARGET_PAGE_SIZE.
+         */
+        range->bitmap.pgsize = TARGET_PAGE_SIZE;
+
+        /*
+         * Comment from kvm_physical_sync_dirty_bitmap() since same applies here
+         * XXX bad kernel interface alert
+         * For dirty bitmap, kernel allocates array of size aligned to
+         * bits-per-long.  But for case when the kernel is 64bits and
+         * the userspace is 32bits, userspace can't align to the same
+         * bits-per-long, since sizeof(long) is different between kernel
+         * and user space.  This way, userspace will provide buffer which
+         * may be 4 bytes less than the kernel will use, resulting in
+         * userspace memory corruption (which is not detectable by valgrind
+         * too, in most cases).
+         * So for now, let's align to 64 instead of HOST_LONG_BITS here, in
+         * a hope that sizeof(long) won't become >8 any time soon.
+         */
+
+        pages = TARGET_PAGE_ALIGN(range->size) >> TARGET_PAGE_BITS;
+        range->bitmap.size = ROUND_UP(pages, 64) / 8;
+        range->bitmap.data = g_malloc0(range->bitmap.size);
+        if (range->bitmap.data == NULL) {
+            error_report("Error allocating bitmap of size 0x%llx",
+                         range->bitmap.size);
+            ret = -ENOMEM;
+            goto err_out;
+        }
+
+        ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
+        if (ret) {
+            error_report("Failed to get dirty bitmap for iova: 0x%llx "
+                         "size: 0x%llx err: %d",
+                         range->iova, range->size, errno);
+            goto err_out;
+        }
+
+        if (memory_region_is_iommu(section->mr)) {
+            if (!vfio_get_xlat_addr(&iotlb, NULL, &start, NULL)) {
+                ret = -EINVAL;
+                goto err_out;
+            }
+        } else {
+            start = memory_region_get_ram_addr(section->mr) +
+                    section->offset_within_region + iova -
+                    TARGET_PAGE_ALIGN(section->offset_within_address_space);
+        }
+
+        cpu_physical_memory_set_dirty_lebitmap((uint64_t *)range->bitmap.data,
+                                               start, pages);
+
+        trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
+                                    range->bitmap.size, start);
+err_out:
+        g_free(range->bitmap.data);
+        g_free(dbitmap);
+
+        if (ret) {
+            return ret;
+        }
+
+        if ((iova + size) < iova) {
+            break;
+        }
+
+        iova += size;
+    }
+
+    return 0;
+}
+
+static void vfio_listerner_log_sync(MemoryListener *listener,
+        MemoryRegionSection *section)
+{
+    if (vfio_listener_skipped_section(section)) {
+        return;
+    }
+
+    if (vfio_devices_are_stopped_and_saving()) {
+        vfio_get_dirty_bitmap(listener, section);
+    }
+}
+
 static const MemoryListener vfio_memory_listener = {
     .region_add = vfio_listener_region_add,
     .region_del = vfio_listener_region_del,
+    .log_sync = vfio_listerner_log_sync,
 };
 
 static void vfio_listener_release(VFIOContainer *container)
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index ac065b559f4e..bc8f35ee9356 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -160,3 +160,4 @@ vfio_save_complete_precopy(char *name) " (%s)"
 vfio_load_device_config_state(char *name) " (%s)"
 vfio_load_state(char *name, uint64_t data) " (%s) data 0x%"PRIx64
 vfio_load_state_device_data(char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
+vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
-- 
2.7.0
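
A minimal sketch, not part of the patch above, of the bitmap sizing done in
vfio_get_dirty_bitmap(): one bit per page, with the bit count rounded up to
a multiple of 64 so the buffer is a whole number of 64-bit words. The 4 KiB
page size (DEMO_PAGE_BITS = 12) is an assumption for illustration; in QEMU
the value comes from TARGET_PAGE_BITS and depends on the target. All demo_*
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page size for the sketch: 4 KiB. */
#define DEMO_PAGE_BITS 12u
#define DEMO_PAGE_SIZE (UINT64_C(1) << DEMO_PAGE_BITS)

/* Round value up to the next multiple of 'multiple' (multiple > 0). */
static uint64_t demo_round_up(uint64_t value, uint64_t multiple)
{
    return (value + multiple - 1) / multiple * multiple;
}

/*
 * Bytes needed for a dirty bitmap covering range_size bytes:
 * pages = page-aligned size / page size, then pad the bit count to a
 * multiple of 64 and convert bits to bytes.
 */
static uint64_t demo_bitmap_bytes(uint64_t range_size)
{
    /* Guard against wraparound when aligning up to a page boundary. */
    assert(range_size <= UINT64_MAX - (DEMO_PAGE_SIZE - 1));

    uint64_t pages = demo_round_up(range_size, DEMO_PAGE_SIZE) >> DEMO_PAGE_BITS;
    return demo_round_up(pages, 64) / 8;
}
```

Under these assumptions a single page still needs 8 bytes, 65 pages need 16
bytes, and a 1 GiB range (262144 pages) needs 32768 bytes.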



^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v16 QEMU 15/16] vfio: Add ioctl to get dirty pages bitmap during dma unmap.
  2020-03-24 21:08 [PATCH v16 QEMU 00/16] Add migration support for VFIO devices Kirti Wankhede
                   ` (13 preceding siblings ...)
  2020-03-24 21:09 ` [PATCH v16 QEMU 14/16] vfio: Add vfio_listener_log_sync to mark dirty pages Kirti Wankhede
@ 2020-03-24 21:09 ` Kirti Wankhede
  2020-03-24 21:09 ` [PATCH v16 QEMU 16/16] vfio: Make vfio-pci device migration capable Kirti Wankhede
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 74+ messages in thread
From: Kirti Wankhede @ 2020-03-24 21:09 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

With vIOMMU, an IO virtual address range can get unmapped during the pre-copy
phase of migration. In that case, the unmap ioctl should return the pages
pinned in that range, and QEMU should find their corresponding guest physical
addresses and report them dirty.

Note: This patch is not yet tested. I'm trying to see how I can test this code
path.

Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/common.c              | 83 ++++++++++++++++++++++++++++++++++++++++---
 include/hw/vfio/vfio-common.h |  1 +
 2 files changed, 80 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 6d41e1ac5c2f..e0f91841bc82 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -311,11 +311,77 @@ static bool vfio_devices_are_stopped_and_saving(void)
     return true;
 }
 
+static bool vfio_devices_are_running_and_saving(void)
+{
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            if ((vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) &&
+                (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+                continue;
+            } else {
+                return false;
+            }
+        }
+    }
+    return true;
+}
+
+static int vfio_dma_unmap_bitmap(VFIOContainer *container,
+                                 hwaddr iova, ram_addr_t size,
+                                 IOMMUTLBEntry *iotlb)
+{
+    struct vfio_iommu_type1_dma_unmap *unmap;
+    struct vfio_bitmap *bitmap;
+    uint64_t pages = TARGET_PAGE_ALIGN(size) >> TARGET_PAGE_BITS;
+    int ret;
+
+    unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
+    if (!unmap) {
+        return -ENOMEM;
+    }
+
+    unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
+    unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
+    bitmap = (struct vfio_bitmap *)&unmap->data;
+
+    /*
+     * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
+     * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap_pgsize to
+     * TARGET_PAGE_SIZE.
+     */
+
+    bitmap->pgsize = TARGET_PAGE_SIZE;
+    bitmap->size = ROUND_UP(pages, 64) / 8;
+    bitmap->data = g_malloc0(bitmap->size);
+    if (!bitmap->data) {
+        error_report("UNMAP: Error allocating bitmap of size 0x%llx",
+                     bitmap->size);
+        g_free(unmap);
+        return -ENOMEM;
+    }
+
+    ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
+    if (!ret) {
+        cpu_physical_memory_set_dirty_lebitmap((uint64_t *)bitmap->data,
+                iotlb->translated_addr, pages);
+    } else {
+        error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %d", -errno);
+    }
+
+    g_free(bitmap->data);
+    g_free(unmap);
+    return ret;
+}
+
 /*
  * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
  */
 static int vfio_dma_unmap(VFIOContainer *container,
-                          hwaddr iova, ram_addr_t size)
+                          hwaddr iova, ram_addr_t size,
+                          IOMMUTLBEntry *iotlb)
 {
     struct vfio_iommu_type1_dma_unmap unmap = {
         .argsz = sizeof(unmap),
@@ -324,6 +390,11 @@ static int vfio_dma_unmap(VFIOContainer *container,
         .size = size,
     };
 
+    if (iotlb && container->dirty_pages_supported &&
+        vfio_devices_are_running_and_saving()) {
+        return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
+    }
+
     while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
         /*
          * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
@@ -371,7 +442,7 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
      * the VGA ROM space.
      */
     if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
-        (errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
+        (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
          ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
         return 0;
     }
@@ -519,7 +590,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
                          iotlb->addr_mask + 1, vaddr, ret);
         }
     } else {
-        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
+        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
         if (ret) {
             error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx") = %d (%m)",
@@ -822,7 +893,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
     }
 
     if (try_unmap) {
-        ret = vfio_dma_unmap(container, iova, int128_get64(llsize));
+        ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
         if (ret) {
             error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx") = %d (%m)",
@@ -1479,6 +1550,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     container->space = space;
     container->fd = fd;
     container->error = NULL;
+    container->dirty_pages_supported = false;
     QLIST_INIT(&container->giommu_list);
     QLIST_INIT(&container->hostwin_list);
 
@@ -1509,6 +1581,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
         }
         vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
         container->pgsizes = info.iova_pgsizes;
+        if (info.flags & VFIO_IOMMU_INFO_DIRTY_PGS) {
+            container->dirty_pages_supported = true;
+        }
         break;
     }
     case VFIO_SPAPR_TCE_v2_IOMMU:
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index c78033e4149d..8ab741463d50 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -79,6 +79,7 @@ typedef struct VFIOContainer {
     unsigned iommu_type;
     Error *error;
     bool initialized;
+    bool dirty_pages_supported;
     unsigned long pgsizes;
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
     QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
-- 
2.7.0
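
A minimal sketch, not part of the patch above, of the two device-state
predicates the series relies on: "stopped and saving" (used by log_sync) and
"running and saving" (used on the unmap path). They differ only in the
expected value of the running bit, so the sketch folds them into one helper.
The DEMO_* bit values and demo_all_saving() are hypothetical stand-ins, and
a flat array replaces the group and device lists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the device state bits (assumption). */
#define DEMO_STATE_RUNNING (1u << 0)
#define DEMO_STATE_SAVING  (1u << 1)

/*
 * Return true only if every device has the saving bit set and its running
 * bit equals want_running. An empty list yields true, mirroring the loops
 * in the patch, which return true when no device fails the check.
 */
static bool demo_all_saving(const uint32_t *states, size_t n, bool want_running)
{
    assert(states != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        bool saving = (states[i] & DEMO_STATE_SAVING) != 0;
        bool running = (states[i] & DEMO_STATE_RUNNING) != 0;

        if (!saving || running != want_running) {
            return false;
        }
    }
    return true;
}
```

With want_running set to true the helper corresponds to the pre-copy check,
and with false to the stop-and-copy check; a mixed set of devices satisfies
neither.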



^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v16 QEMU 16/16] vfio: Make vfio-pci device migration capable
  2020-03-24 21:08 [PATCH v16 QEMU 00/16] Add migration support for VFIO devices Kirti Wankhede
                   ` (14 preceding siblings ...)
  2020-03-24 21:09 ` [PATCH v16 QEMU 15/16] vfio: Add ioctl to get dirty pages bitmap during dma unmap Kirti Wankhede
@ 2020-03-24 21:09 ` Kirti Wankhede
  2020-03-24 23:36 ` [PATCH v16 QEMU 00/16] Add migration support for VFIO devices no-reply
  2020-03-31 18:34 ` Alex Williamson
  17 siblings, 0 replies; 74+ messages in thread
From: Kirti Wankhede @ 2020-03-24 21:09 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

If the device is not a failover primary device, call vfio_migration_probe()
and vfio_migration_finalize() for the vfio-pci device to enable migration
for VFIO PCI devices that support it.
Removed the vfio_pci_vmstate structure.
Removed the migration blocker from the VFIO PCI device-specific structure;
the migration blocker from the generic VFIO device structure is used instead.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/pci.c | 32 +++++++++++---------------------
 hw/vfio/pci.h |  1 -
 2 files changed, 11 insertions(+), 22 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 8deb11e87ef7..c70f153d431a 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2916,22 +2916,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         return;
     }
 
-    if (!pdev->failover_pair_id) {
-        error_setg(&vdev->migration_blocker,
-                "VFIO device doesn't support migration");
-        ret = migrate_add_blocker(vdev->migration_blocker, &err);
-        if (ret) {
-            error_propagate(errp, err);
-            error_free(vdev->migration_blocker);
-            vdev->migration_blocker = NULL;
-            return;
-        }
-    }
-
     vdev->vbasedev.name = g_path_get_basename(vdev->vbasedev.sysfsdev);
     vdev->vbasedev.ops = &vfio_pci_ops;
     vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
     vdev->vbasedev.dev = DEVICE(vdev);
+    vdev->vbasedev.device_state = 0;
 
     tmp = g_strdup_printf("%s/iommu_group", vdev->vbasedev.sysfsdev);
     len = readlink(tmp, group_path, sizeof(group_path));
@@ -3195,6 +3184,14 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         }
     }
 
+    if (!pdev->failover_pair_id) {
+        ret = vfio_migration_probe(&vdev->vbasedev, errp);
+        if (ret) {
+            error_report("%s: Failed to setup for migration",
+                         vdev->vbasedev.name);
+        }
+    }
+
     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);
@@ -3209,11 +3206,6 @@ out_teardown:
     vfio_bars_exit(vdev);
 error:
     error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name);
-    if (vdev->migration_blocker) {
-        migrate_del_blocker(vdev->migration_blocker);
-        error_free(vdev->migration_blocker);
-        vdev->migration_blocker = NULL;
-    }
 }
 
 static void vfio_instance_finalize(Object *obj)
@@ -3225,10 +3217,7 @@ static void vfio_instance_finalize(Object *obj)
     vfio_bars_finalize(vdev);
     g_free(vdev->emulated_config_bits);
     g_free(vdev->rom);
-    if (vdev->migration_blocker) {
-        migrate_del_blocker(vdev->migration_blocker);
-        error_free(vdev->migration_blocker);
-    }
+
     /*
      * XXX Leaking igd_opregion is not an oversight, we can't remove the
      * fw_cfg entry therefore leaking this allocation seems like the safest
@@ -3256,6 +3245,7 @@ static void vfio_exitfn(PCIDevice *pdev)
     }
     vfio_teardown_msi(vdev);
     vfio_bars_exit(vdev);
+    vfio_migration_finalize(&vdev->vbasedev);
 }
 
 static void vfio_pci_reset(DeviceState *dev)
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 0da7a20a7ec2..b148c937ef72 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -168,7 +168,6 @@ typedef struct VFIOPCIDevice {
     bool no_vfio_ioeventfd;
     bool enable_ramfb;
     VFIODisplay *dpy;
-    Error *migration_blocker;
     Notifier irqchip_change_notifier;
 } VFIOPCIDevice;
 
-- 
2.7.0



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v16 QEMU 00/16] Add migration support for VFIO devices 
  2020-03-24 21:08 [PATCH v16 QEMU 00/16] Add migration support for VFIO devices Kirti Wankhede
                   ` (15 preceding siblings ...)
  2020-03-24 21:09 ` [PATCH v16 QEMU 16/16] vfio: Make vfio-pci device migration capable Kirti Wankhede
@ 2020-03-24 23:36 ` no-reply
  2020-03-31 18:34 ` Alex Williamson
  17 siblings, 0 replies; 74+ messages in thread
From: no-reply @ 2020-03-24 23:36 UTC (permalink / raw)
  To: kwankhede
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, kwankhede,
	eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk, pasic, felipe,
	Ken.Xue, kevin.tian, yan.y.zhao, dgilbert, alex.williamson,
	changpeng.liu, cohuck, zhi.a.wang, jonathan.davies

Patchew URL: https://patchew.org/QEMU/1585084154-29461-1-git-send-email-kwankhede@nvidia.com/



Hi,

This series failed the docker-quick@centos7 build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
make docker-image-centos7 V=1 NETWORK=1
time make docker-test-quick@centos7 SHOW_ENV=1 J=14 NETWORK=1
=== TEST SCRIPT END ===

  CC      x86_64-softmmu/hw/vfio/pci-quirks.o
  CC      aarch64-softmmu/hw/intc/exynos4210_combiner.o
/tmp/qemu-test/src/hw/vfio/common.c: In function 'vfio_listerner_log_sync':
/tmp/qemu-test/src/hw/vfio/common.c:945:66: error: 'giommu' may be used uninitialized in this function [-Werror=maybe-uninitialized]
                             memory_region_iommu_get_address_limit(giommu->iommu,
                                                                  ^
/tmp/qemu-test/src/hw/vfio/common.c:923:21: note: 'giommu' was declared here
     VFIOGuestIOMMU *giommu;
                     ^
cc1: all warnings being treated as errors
make[1]: *** [hw/vfio/common.o] Error 1
make[1]: *** Waiting for unfinished jobs....
  CC      aarch64-softmmu/hw/intc/omap_intc.o
  CC      aarch64-softmmu/hw/intc/bcm2835_ic.o
---
  CC      aarch64-softmmu/hw/vfio/amd-xgbe.o
  CC      aarch64-softmmu/hw/virtio/virtio.o
  CC      aarch64-softmmu/hw/virtio/vhost.o
make: *** [x86_64-softmmu/all] Error 2
make: *** Waiting for unfinished jobs....
  CC      aarch64-softmmu/hw/virtio/vhost-backend.o
  CC      aarch64-softmmu/hw/virtio/vhost-user.o
---
  CC      aarch64-softmmu/hw/virtio/virtio-iommu.o
  CC      aarch64-softmmu/hw/virtio/vhost-vsock.o
/tmp/qemu-test/src/hw/vfio/common.c: In function 'vfio_listerner_log_sync':
/tmp/qemu-test/src/hw/vfio/common.c:945:66: error: 'giommu' may be used uninitialized in this function [-Werror=maybe-uninitialized]
                             memory_region_iommu_get_address_limit(giommu->iommu,
                                                                  ^
/tmp/qemu-test/src/hw/vfio/common.c:923:21: note: 'giommu' was declared here
     VFIOGuestIOMMU *giommu;
                     ^
cc1: all warnings being treated as errors
make[1]: *** [hw/vfio/common.o] Error 1
make[1]: *** Waiting for unfinished jobs....
make: *** [aarch64-softmmu/all] Error 2
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 664, in <module>
    sys.exit(main())
---
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--label', 'com.qemu.instance.uuid=c9b01bcc7fc04e2d8f5e74bf460f0d7a', '-u', '1001', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/home/patchew/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-lne31pn7/src/docker-src.2020-03-24-19.33.46.14149:/var/tmp/qemu:z,ro', 'qemu:centos7', '/var/tmp/qemu/run', 'test-quick']' returned non-zero exit status 2.
filter=--filter=label=com.qemu.instance.uuid=c9b01bcc7fc04e2d8f5e74bf460f0d7a
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-lne31pn7/src'
make: *** [docker-run-test-quick@centos7] Error 2

real    3m5.634s
user    0m8.335s


The full log is available at
http://patchew.org/logs/1585084154-29461-1-git-send-email-kwankhede@nvidia.com/testing.docker-quick@centos7/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
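
The -Werror=maybe-uninitialized failure reported above involves a pointer
that is assigned inside one conditional block and dereferenced later under a
second, separate test of the same condition, which the compiler cannot
always prove to be equivalent. A minimal sketch of one common way to handle
this class of warning is to give the pointer an explicit initial value at
its declaration. The names below are hypothetical, and this is not
necessarily how the series itself addressed the report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_entry {
    int key;
    int value;   /* assumed non-negative so -1 can signal "not found" */
};

/*
 * Look up 'key' when use_table is true, otherwise return 'fallback'.
 * 'found' is written only in the first block and read only in the second.
 */
static int demo_lookup(const struct demo_entry *table, size_t n,
                       bool use_table, int key, int fallback)
{
    /* Explicit initial value: defined on every path, warning goes away. */
    const struct demo_entry *found = NULL;

    assert(!use_table || table != NULL);

    if (use_table) {
        for (size_t i = 0; i < n; i++) {
            if (table[i].key == key) {
                found = &table[i];
                break;
            }
        }
        if (!found) {
            return -1;
        }
    }

    /* Second test of the same condition, as in the flagged pattern. */
    if (use_table) {
        return found->value;
    }
    return fallback;
}
```

Merging the two conditional blocks into one is another option, since it
removes the need for the compiler to correlate the two tests at all.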

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v16 QEMU 14/16] vfio: Add vfio_listener_log_sync to mark dirty pages
  2020-03-24 21:09 ` [PATCH v16 QEMU 14/16] vfio: Add vfio_listener_log_sync to mark dirty pages Kirti Wankhede
@ 2020-03-25  2:19   ` Yan Zhao
  2020-03-26 19:46   ` Alex Williamson
  2020-04-01  5:50   ` Yan Zhao
  2 siblings, 0 replies; 74+ messages in thread
From: Yan Zhao @ 2020-03-25  2:19 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Wed, Mar 25, 2020 at 05:09:12AM +0800, Kirti Wankhede wrote:
> vfio_listener_log_sync gets the list of dirty pages from the container
> using the VFIO_IOMMU_DIRTY_PAGES ioctl with the
> VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP flag, and marks those pages dirty
> when all devices are stopped and saving state.
> Return early for the RAM block section of a mapped MMIO region.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/common.c     | 200 +++++++++++++++++++++++++++++++++++++++++++++++++--
>  hw/vfio/trace-events |   1 +
>  2 files changed, 196 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 4a2f0d6a2233..6d41e1ac5c2f 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -29,6 +29,7 @@
>  #include "hw/vfio/vfio.h"
>  #include "exec/address-spaces.h"
>  #include "exec/memory.h"
> +#include "exec/ram_addr.h"
>  #include "hw/hw.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
> @@ -38,6 +39,7 @@
>  #include "sysemu/reset.h"
>  #include "trace.h"
>  #include "qapi/error.h"
> +#include "migration/migration.h"
>  
>  VFIOGroupList vfio_group_list =
>      QLIST_HEAD_INITIALIZER(vfio_group_list);
> @@ -288,6 +290,28 @@ const MemoryRegionOps vfio_region_ops = {
>  };
>  
>  /*
> + * Device state interfaces
> + */
> +
> +static bool vfio_devices_are_stopped_and_saving(void)
> +{
> +    VFIOGroup *group;
> +    VFIODevice *vbasedev;
> +
> +    QLIST_FOREACH(group, &vfio_group_list, next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            if ((vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) &&
> +                !(vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
> +                continue;
> +            } else {
> +                return false;
> +            }
> +        }
> +    }
> +    return true;
> +}
> +
> +/*
>   * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
>   */
>  static int vfio_dma_unmap(VFIOContainer *container,
> @@ -408,8 +432,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>  }
>  
>  /* Called with rcu_read_lock held.  */
> -static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
> -                           bool *read_only)
> +static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> +                               ram_addr_t *ram_addr, bool *read_only)
>  {
>      MemoryRegion *mr;
>      hwaddr xlat;
> @@ -440,9 +464,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
>          return false;
>      }
>  
> -    *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> -    *read_only = !writable || mr->readonly;
> +    if (vaddr) {
> +        *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> +    }
>  
> +    if (ram_addr) {
> +        *ram_addr = memory_region_get_ram_addr(mr) + xlat;
> +    }
> +
> +    if (read_only) {
> +        *read_only = !writable || mr->readonly;
> +    }
>      return true;
>  }
>  
> @@ -467,7 +499,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>      rcu_read_lock();
>  
>      if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> -        if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
> +        if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) {
>              goto out;
>          }
>          /*
> @@ -813,9 +845,167 @@ static void vfio_listener_region_del(MemoryListener *listener,
>      }
>  }
>  
> +static int vfio_get_dirty_bitmap(MemoryListener *listener,
> +                                 MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> +    VFIOGuestIOMMU *giommu;
> +    IOMMUTLBEntry iotlb;
> +    hwaddr granularity, address_limit, iova;
> +    int ret;
> +
> +    if (memory_region_is_iommu(section->mr)) {
> +        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> +            if (MEMORY_REGION(giommu->iommu) == section->mr &&
> +                giommu->n.start == section->offset_within_region) {
> +                break;
> +            }
> +        }
> +
> +        if (!giommu) {
> +            return -EINVAL;
> +        }
> +    }
> +
> +    if (memory_region_is_iommu(section->mr)) {
> +        granularity = memory_region_iommu_get_min_page_size(giommu->iommu);
> +
> +        address_limit = MIN(int128_get64(section->size),
> +                            memory_region_iommu_get_address_limit(giommu->iommu,
> +                                                 int128_get64(section->size)));
> +    } else {
> +        granularity = memory_region_size(section->mr);
> +        address_limit = int128_get64(section->size);
> +    }
> +
> +    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> +
> +    RCU_READ_LOCK_GUARD();
> +
> +    while (iova < address_limit) {
> +        struct vfio_iommu_type1_dirty_bitmap *dbitmap;
> +        struct vfio_iommu_type1_dirty_bitmap_get *range;
> +        ram_addr_t start, pages;
> +        uint64_t iova_xlat, size;
> +
> +        if (memory_region_is_iommu(section->mr)) {
> +            iotlb = address_space_get_iotlb_entry(container->space->as, iova,
> +                                                 true, MEMTXATTRS_UNSPECIFIED);
> +            if ((iotlb.target_as == NULL) || (iotlb.addr_mask == 0)) {
> +                if ((iova + granularity) < iova) {
> +                    break;
> +                }
> +                iova += granularity;
> +                continue;
> +            }
> +            iova_xlat = iotlb.iova + giommu->iommu_offset;
> +            size = iotlb.addr_mask + 1;
> +        } else {
> +            iova_xlat = iova;
> +            size = address_limit;
> +        }
> +
> +        dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
> +        if (!dbitmap) {
> +            return -ENOMEM;
> +        }
> +
> +        dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
> +        dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> +        range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data;
> +        range->iova = iova_xlat;
> +        range->size = size;
> +
> +        /*
> +         * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
> +         * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap's pgsize to
> +         * TARGET_PAGE_SIZE.
> +         */
> +        range->bitmap.pgsize = TARGET_PAGE_SIZE;
> +
> +        /*
> +         * Comment from kvm_physical_sync_dirty_bitmap() since same applies here
> +         * XXX bad kernel interface alert
> +         * For dirty bitmap, kernel allocates array of size aligned to
> +         * bits-per-long.  But for case when the kernel is 64bits and
> +         * the userspace is 32bits, userspace can't align to the same
> +         * bits-per-long, since sizeof(long) is different between kernel
> +         * and user space.  This way, userspace will provide buffer which
> +         * may be 4 bytes less than the kernel will use, resulting in
> +         * userspace memory corruption (which is not detectable by valgrind
> +         * too, in most cases).
> +         * So for now, let's align to 64 instead of HOST_LONG_BITS here, in
> +         * a hope that sizeof(long) won't become >8 any time soon.
> +         */
> +
> +        pages = TARGET_PAGE_ALIGN(range->size) >> TARGET_PAGE_BITS;
> +        range->bitmap.size = ROUND_UP(pages, 64) / 8;
> +        range->bitmap.data = g_malloc0(range->bitmap.size);
> +        if (range->bitmap.data == NULL) {
> +            error_report("Error allocating bitmap of size 0x%llx",
> +                         range->bitmap.size);
> +            ret = -ENOMEM;
> +            goto err_out;
> +        }
> +
> +        ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
> +        if (ret) {
> +            error_report("Failed to get dirty bitmap for iova: 0x%llx "
> +                         "size: 0x%llx err: %d",
> +                         range->iova, range->size, errno);
> +            goto err_out;
> +        }
> +
> +        if (memory_region_is_iommu(section->mr)) {
> +            if (!vfio_get_xlat_addr(&iotlb, NULL, &start, NULL)) {
> +                ret = -EINVAL;
> +                goto err_out;
> +            }
> +        } else {
> +            start = memory_region_get_ram_addr(section->mr) +
> +                    section->offset_within_region + iova -
> +                    TARGET_PAGE_ALIGN(section->offset_within_address_space);
> +        }
> +
> +        cpu_physical_memory_set_dirty_lebitmap((uint64_t *)range->bitmap.data,
> +                                               start, pages);
> +
> +        trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
> +                                    range->bitmap.size, start);
> +err_out:
> +        g_free(range->bitmap.data);
> +        g_free(dbitmap);
> +
> +        if (ret) {
> +            return ret;
> +        }
> +
> +        if ((iova + size) < iova) {
> +            break;
> +        }
> +
> +        iova += size;
> +    }
> +
> +    return 0;
> +}
> +
> +static void vfio_listerner_log_sync(MemoryListener *listener,
> +        MemoryRegionSection *section)
> +{
> +    if (vfio_listener_skipped_section(section)) {
> +        return;
> +    }
> +
> +    if (vfio_devices_are_stopped_and_saving()) {
Only fetching the dirty bitmap when the devices are stopped and saving is not right.

> +        vfio_get_dirty_bitmap(listener, section);
> +    }
> +}
> +
>  static const MemoryListener vfio_memory_listener = {
>      .region_add = vfio_listener_region_add,
>      .region_del = vfio_listener_region_del,
> +    .log_sync = vfio_listerner_log_sync,
>  };
>  
>  static void vfio_listener_release(VFIOContainer *container)
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index ac065b559f4e..bc8f35ee9356 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -160,3 +160,4 @@ vfio_save_complete_precopy(char *name) " (%s)"
>  vfio_load_device_config_state(char *name) " (%s)"
>  vfio_load_state(char *name, uint64_t data) " (%s) data 0x%"PRIx64
>  vfio_load_state_device_data(char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
> +vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
> -- 
> 2.7.0
> 



* Re: [PATCH v16 QEMU 04/16] vfio: Add save and load functions for VFIO PCI devices
  2020-03-24 21:09 ` [PATCH v16 QEMU 04/16] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
@ 2020-03-25 19:56   ` Alex Williamson
  2020-03-26 17:29     ` Dr. David Alan Gilbert
  2020-05-04 23:18     ` Kirti Wankhede
  2020-03-26 17:46   ` Dr. David Alan Gilbert
  2020-04-07  4:10   ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
  2 siblings, 2 replies; 74+ messages in thread
From: Alex Williamson @ 2020-03-25 19:56 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Wed, 25 Mar 2020 02:39:02 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> These functions save and restore PCI device specific data - config
> space of PCI device.
> Tested save and restore with MSI and MSIX type.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/pci.c                 | 163 ++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |   2 +
>  2 files changed, 165 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 6c77c12e44b9..8deb11e87ef7 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -41,6 +41,7 @@
>  #include "trace.h"
>  #include "qapi/error.h"
>  #include "migration/blocker.h"
> +#include "migration/qemu-file.h"
>  
>  #define TYPE_VFIO_PCI "vfio-pci"
>  #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> @@ -1632,6 +1633,50 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
>      }
>  }
>  
> +static int vfio_bar_validate(VFIOPCIDevice *vdev, int nr)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +    VFIOBAR *bar = &vdev->bars[nr];
> +    uint64_t addr;
> +    uint32_t addr_lo, addr_hi = 0;
> +
> +    /* Skip unimplemented BARs and the upper half of 64bit BARS. */
> +    if (!bar->size) {
> +        return 0;
> +    }
> +
> +    addr_lo = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + nr * 4, 4);
> +
> +    addr_lo = addr_lo & (bar->ioport ? PCI_BASE_ADDRESS_IO_MASK :
> +                                       PCI_BASE_ADDRESS_MEM_MASK);

Nit, &= or combine with previous set.

> +    if (bar->type == PCI_BASE_ADDRESS_MEM_TYPE_64) {
> +        addr_hi = pci_default_read_config(pdev,
> +                                         PCI_BASE_ADDRESS_0 + (nr + 1) * 4, 4);
> +    }
> +
> +    addr = ((uint64_t)addr_hi << 32) | addr_lo;

Could we use a union?

> +
> +    if (!QEMU_IS_ALIGNED(addr, bar->size)) {
> +        return -EINVAL;
> +    }

What specifically are we validating here?  This should be true no
matter what we wrote to the BAR or else BAR emulation is broken.  The
bits that could make this unaligned are not implemented in the BAR.
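
To make the point above concrete, here is a minimal sketch of why the
alignment check can never fail if BAR emulation is correct.  ModelBar and
its helpers are hypothetical names used only for illustration, not QEMU
code, and the flag bits in the low nibble are ignored for simplicity.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of a memory BAR: only the address bits at or above
 * the BAR size are writable, the low bits always read back as zero.
 */
typedef struct {
    uint64_t size;   /* power of two, at least 16 */
    uint64_t value;  /* value currently latched in the BAR */
} ModelBar;

/* Emulate a guest write: bits below the size are not implemented. */
static void model_bar_write(ModelBar *bar, uint64_t val)
{
    assert(bar->size >= 16 && (bar->size & (bar->size - 1)) == 0);
    bar->value = val & ~(bar->size - 1);
}

/* The same kind of test the patch performs with QEMU_IS_ALIGNED(). */
static int model_bar_is_aligned(const ModelBar *bar)
{
    return (bar->value & (bar->size - 1)) == 0;
}
```

Whatever value is written, the masked bits are the only ones that could
make the address unaligned, so the check holds by construction.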

> +
> +    return 0;
> +}
> +
> +static int vfio_bars_validate(VFIOPCIDevice *vdev)
> +{
> +    int i, ret;
> +
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        ret = vfio_bar_validate(vdev, i);
> +        if (ret) {
> +            error_report("vfio: BAR address %d validation failed", i);
> +            return ret;
> +        }
> +    }
> +    return 0;
> +}
> +
>  static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
>  {
>      VFIOBAR *bar = &vdev->bars[nr];
> @@ -2414,11 +2459,129 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
>      return OBJECT(vdev);
>  }
>  
> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> +{
> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint16_t pci_cmd;
> +    int i;
> +
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        uint32_t bar;
> +
> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> +        qemu_put_be32(f, bar);
> +    }
> +
> +    qemu_put_be32(f, vdev->interrupt);
> +    if (vdev->interrupt == VFIO_INT_MSI) {
> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> +        bool msi_64bit;
> +
> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                                            2);
> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> +
> +        msi_addr_lo = pci_default_read_config(pdev,
> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> +        qemu_put_be32(f, msi_addr_lo);
> +
> +        if (msi_64bit) {
> +            msi_addr_hi = pci_default_read_config(pdev,
> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                                             4);
> +        }
> +        qemu_put_be32(f, msi_addr_hi);
> +
> +        msi_data = pci_default_read_config(pdev,
> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +                2);
> +        qemu_put_be32(f, msi_data);

Isn't the data field only a u16?

> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
> +        uint16_t offset;
> +
> +        /* save enable bit and maskall bit */
> +        offset = pci_default_read_config(pdev,
> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> +        qemu_put_be16(f, offset);
> +        msix_save(pdev, f);
> +    }
> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> +    qemu_put_be16(f, pci_cmd);
> +}
> +
> +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> +{
> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint32_t interrupt_type;
> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> +    uint16_t pci_cmd;
> +    bool msi_64bit;
> +    int i, ret;
> +
> +    /* restore pci bar configuration */
> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> +                        pci_cmd & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        uint32_t bar = qemu_get_be32(f);
> +
> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> +    }
> +
> +    ret = vfio_bars_validate(vdev);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    interrupt_type = qemu_get_be32(f);
> +
> +    if (interrupt_type == VFIO_INT_MSI) {
> +        /* restore msi configuration */
> +        msi_flags = pci_default_read_config(pdev,
> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> +
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                              msi_flags & ~PCI_MSI_FLAGS_ENABLE, 2);
> +
> +        msi_addr_lo = qemu_get_be32(f);
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
> +                              msi_addr_lo, 4);
> +
> +        msi_addr_hi = qemu_get_be32(f);
> +        if (msi_64bit) {
> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                                  msi_addr_hi, 4);
> +        }
> +        msi_data = qemu_get_be32(f);
> +        vfio_pci_write_config(pdev,
> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +                msi_data, 2);
> +
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
> +    } else if (interrupt_type == VFIO_INT_MSIX) {
> +        uint16_t offset = qemu_get_be16(f);
> +
> +        /* load enable bit and maskall bit */
> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> +                              offset, 2);
> +        msix_load(pdev, f);
> +    }
> +    pci_cmd = qemu_get_be16(f);
> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
> +    return 0;
> +}

It always seems like there should be a lot more state than this, and I
probably sound like a broken record because I ask every time, but maybe
that's a good indication that we (or at least I) need a comment
explaining why we only care about these.  For example, what if we
migrate a device in the D3 power state, don't we need to account for
the state stored in the PM capability or does the device wake up into
D0 auto-magically after migration?  I think we could repeat that
question for every capability that can be modified.  Even for the MSI/X
cases, the interrupt may not be active, but there could be state in
virtual config space that would be different on the target.  For
example, if we migrate with a device in INTx mode where the guest had
written vector fields on the source, but only writes the enable bit on
the target, can we seamlessly figure out the rest?  For other
capabilities, that state may represent config space changes written
through to the physical device and represent a functional difference on
the target.  Thanks,

Alex

> +
>  static VFIODeviceOps vfio_pci_ops = {
>      .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
>      .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
>      .vfio_eoi = vfio_intx_eoi,
>      .vfio_get_object = vfio_pci_get_object,
> +    .vfio_save_config = vfio_pci_save_config,
> +    .vfio_load_config = vfio_pci_load_config,
>  };
>  
>  int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 74261feaeac9..d69a7f3ae31e 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -120,6 +120,8 @@ struct VFIODeviceOps {
>      int (*vfio_hot_reset_multi)(VFIODevice *vdev);
>      void (*vfio_eoi)(VFIODevice *vdev);
>      Object *(*vfio_get_object)(VFIODevice *vdev);
> +    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
> +    int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
>  };
>  
>  typedef struct VFIOGroup {




* Re: [PATCH v16 QEMU 08/16] vfio: Register SaveVMHandlers for VFIO device
  2020-03-24 21:09 ` [PATCH v16 QEMU 08/16] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
@ 2020-03-25 21:02   ` Alex Williamson
  2020-05-04 23:19     ` Kirti Wankhede
  2020-04-01 17:36   ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 74+ messages in thread
From: Alex Williamson @ 2020-03-25 21:02 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Wed, 25 Mar 2020 02:39:06 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Define flags to be used as delimiter in migration file stream.
> Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
> region from these functions at source during saving or pre-copy phase.
> Set VFIO device state depending on VM's state. During live migration, VM is
> running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
> device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c  | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events |  2 ++
>  2 files changed, 78 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 22ded9d28cf3..033f76526e49 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -8,6 +8,7 @@
>   */
>  
>  #include "qemu/osdep.h"
> +#include "qemu/main-loop.h"
>  #include <linux/vfio.h>
>  
>  #include "sysemu/runstate.h"
> @@ -24,6 +25,17 @@
>  #include "pci.h"
>  #include "trace.h"
>  
> +/*
> + * Flags used as delimiter:
> + * 0xffffffff => MSB 32-bit all 1s
> + * 0xef10     => emulated (virtual) function IO
> + * 0x0000     => 16-bits reserved for flags
> + */
> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> +
>  static void vfio_migration_region_exit(VFIODevice *vbasedev)
>  {
>      VFIOMigration *migration = vbasedev->migration;
> @@ -126,6 +138,69 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>      return 0;
>  }
>  
> +/* ---------------------------------------------------------------------- */
> +
> +static int vfio_save_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> +
> +    if (migration->region.mmaps) {
> +        qemu_mutex_lock_iothread();
> +        ret = vfio_region_mmap(&migration->region);
> +        qemu_mutex_unlock_iothread();
> +        if (ret) {
> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> +                         vbasedev->name, migration->region.index,
> +                         strerror(-ret));
> +            return ret;
> +        }
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, ~0, VFIO_DEVICE_STATE_SAVING);
> +    if (ret) {
> +        error_report("%s: Failed to set state SAVING", vbasedev->name);
> +        return ret;
> +    }
> +
> +    /*
> +     * Save migration region size. This is used to verify migration region size
> +     * is greater than or equal to migration region size at destination
> +     */
> +    qemu_put_be64(f, migration->region.size);

Is this requirement supported by the uapi?  The vendor driver operates
within the migration region, but it has no requirement to use the full
extent of the region.  Shouldn't we instead insert the version string
from the versioning API Yan proposed?  Is this where we might choose to
use an interface via the vfio API rather than sysfs if we had one?

> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    trace_vfio_save_setup(vbasedev->name);
> +    return 0;
> +}
> +
> +static void vfio_save_cleanup(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    if (migration->region.mmaps) {
> +        vfio_region_unmap(&migration->region);
> +    }
> +    trace_vfio_save_cleanup(vbasedev->name);
> +}
> +
> +static SaveVMHandlers savevm_vfio_handlers = {
> +    .save_setup = vfio_save_setup,
> +    .save_cleanup = vfio_save_cleanup,
> +};
> +
> +/* ---------------------------------------------------------------------- */
> +
>  static void vfio_vmstate_change(void *opaque, int running, RunState state)
>  {
>      VFIODevice *vbasedev = opaque;
> @@ -191,6 +266,7 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>          return ret;
>      }
>  
> +    register_savevm_live("vfio", -1, 1, &savevm_vfio_handlers, vbasedev);
>      vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>                                                            vbasedev);
>  
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 69503228f20e..4bb43f18f315 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -149,3 +149,5 @@ vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
>  vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
>  vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>  vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
> +vfio_save_setup(char *name) " (%s)"
> +vfio_save_cleanup(char *name) " (%s)"




* Re: [PATCH v16 QEMU 09/16] vfio: Add save state functions to SaveVMHandlers
  2020-03-24 21:09 ` [PATCH v16 QEMU 09/16] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
@ 2020-03-25 22:03   ` Alex Williamson
  2020-05-04 23:18     ` Kirti Wankhede
  2020-05-09  5:31   ` Yan Zhao
  1 sibling, 1 reply; 74+ messages in thread
From: Alex Williamson @ 2020-03-25 22:03 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Wed, 25 Mar 2020 02:39:07 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
> functions. These functions handle the pre-copy and stop-and-copy phases.
> 
> In _SAVING|_RUNNING device state or pre-copy phase:
> - read pending_bytes. If pending_bytes > 0, go through below steps.
> - read data_offset - indicates kernel driver to write data to staging
>   buffer.
> - read data_size - amount of data in bytes written by vendor driver in
>   migration region.
> - read data_size bytes of data from data_offset in the migration region.
> - Write data packet to file stream as below:
> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> VFIO_MIG_FLAG_END_OF_STATE }
> 
> In _SAVING device state or stop-and-copy phase
> a. read config space of device and save to migration file stream. This
>    doesn't need to be from vendor driver. Any other special config state
>    from driver can be saved as data in following iteration.
> b. read pending_bytes. If pending_bytes > 0, go through below steps.
> c. read data_offset - indicates kernel driver to write data to staging
>    buffer.
> d. read data_size - amount of data in bytes written by vendor driver in
>    migration region.
> e. read data_size bytes of data from data_offset in the migration region.
> f. Write data packet as below:
>    {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> g. iterate through steps b to f while (pending_bytes > 0)
> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
> 
> When the data region is mapped, it is the user's responsibility to read
> data_size bytes of data from data_offset before moving to the next steps.
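
For reference, the stop-and-copy framing described in steps c to h can be
sketched as below.  SketchStream and the sketch_* helpers are hypothetical
stand-ins for QEMUFile, qemu_put_be64() and qemu_put_buffer(); only the two
flag values are taken from the series.  This is an illustration of the
packet layout under those assumptions, not the patch's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VFIO_MIG_FLAG_END_OF_STATE    (0xffffffffef100001ULL)
#define VFIO_MIG_FLAG_DEV_DATA_STATE  (0xffffffffef100004ULL)

/* Hypothetical stand-in for QEMUFile: a fixed buffer plus a cursor. */
typedef struct {
    uint8_t buf[256];
    size_t pos;
} SketchStream;

/* Append a 64-bit value in big-endian byte order. */
static void sketch_put_be64(SketchStream *s, uint64_t v)
{
    assert(s->pos + 8 <= sizeof(s->buf));
    for (int i = 7; i >= 0; i--) {
        s->buf[s->pos++] = (uint8_t)(v >> (i * 8));
    }
}

/* Append raw bytes unchanged. */
static void sketch_put_buffer(SketchStream *s, const void *data, size_t len)
{
    assert(s->pos + len <= sizeof(s->buf));
    memcpy(s->buf + s->pos, data, len);
    s->pos += len;
}

/* Steps c to f: one {flag, data_size, data} packet. */
static void sketch_put_data_packet(SketchStream *s, const void *data,
                                   uint64_t data_size)
{
    sketch_put_be64(s, VFIO_MIG_FLAG_DEV_DATA_STATE);
    sketch_put_be64(s, data_size);
    sketch_put_buffer(s, data, (size_t)data_size);
}

/* Step h: terminate the section once pending_bytes has reached zero. */
static void sketch_put_end(SketchStream *s)
{
    sketch_put_be64(s, VFIO_MIG_FLAG_END_OF_STATE);
}
```

A caller would loop over sketch_put_data_packet() while the driver reports
pending bytes, then call sketch_put_end() exactly once.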
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c           | 245 +++++++++++++++++++++++++++++++++++++++++-
>  hw/vfio/trace-events          |   6 ++
>  include/hw/vfio/vfio-common.h |   1 +
>  3 files changed, 251 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 033f76526e49..ecbeed5182c2 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -138,6 +138,137 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>      return 0;
>  }
>  
> +static void *find_data_region(VFIORegion *region,
> +                              uint64_t data_offset,
> +                              uint64_t data_size)
> +{
> +    void *ptr = NULL;
> +    int i;
> +
> +    for (i = 0; i < region->nr_mmaps; i++) {
> +        if ((data_offset >= region->mmaps[i].offset) &&
> +            (data_offset < region->mmaps[i].offset + region->mmaps[i].size) &&
> +            (data_size <= region->mmaps[i].size)) {

(data_offset - region->mmaps[i].offset) can be non-zero, so this test
is invalid.  Additionally the uapi does not require that a given data
chunk fits exclusively within an mmap'd area; it may overlap one or
more mmap'd sections of the region, possibly with non-mmap'd areas
included.
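
A minimal sketch of a containment test that accounts for the non-zero
offset follows.  SketchMmap and sketch_find_data() are hypothetical,
reduced versions of the QEMU structures, assumed here for illustration.
The sketch only handles the first point (a chunk fully inside one window);
a chunk that spans windows or non-mmap'd areas returns NULL, on the
assumption that the caller then falls back to pread().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced version of one mmap'd window of a region. */
typedef struct {
    uint64_t offset;  /* start of the window within the region */
    uint64_t size;    /* length of the window in bytes */
    uint8_t *mmap;    /* host pointer to the start of the window */
} SketchMmap;

/*
 * Return a pointer only if [data_offset, data_offset + data_size) lies
 * entirely inside one window; otherwise return NULL.
 */
static void *sketch_find_data(const SketchMmap *maps, int nr,
                              uint64_t data_offset, uint64_t data_size)
{
    assert(maps != NULL || nr == 0);

    for (int i = 0; i < nr; i++) {
        uint64_t start = maps[i].offset;
        uint64_t len = maps[i].size;

        /* Skip windows that do not contain the first byte. */
        if (data_offset < start || data_offset - start >= len) {
            continue;
        }
        /* Compare against the bytes left in the window, not its full size. */
        if (data_size <= len - (data_offset - start)) {
            return maps[i].mmap + (data_offset - start);
        }
        return NULL;
    }
    return NULL;
}
```

With a 0x100-byte window at offset 0x1000, a request for 0x100 bytes at
0x1080 passes the original test (data_size <= size) but is rejected here,
because only 0x80 bytes remain in the window.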

> +            ptr = region->mmaps[i].mmap + (data_offset -
> +                                           region->mmaps[i].offset);
> +            break;
> +        }
> +    }
> +    return ptr;
> +}
> +
> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region;
> +    uint64_t data_offset = 0, data_size = 0;
> +    int ret;
> +
> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data_offset));
> +    if (ret != sizeof(data_offset)) {
> +        error_report("%s: Failed to get migration buffer data offset %d",
> +                     vbasedev->name, ret);
> +        return -EINVAL;
> +    }
> +
> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data_size));
> +    if (ret != sizeof(data_size)) {
> +        error_report("%s: Failed to get migration buffer data size %d",
> +                     vbasedev->name, ret);
> +        return -EINVAL;
> +    }
> +
> +    if (data_size > 0) {
> +        void *buf = NULL;
> +        bool buffer_mmaped;
> +
> +        if (region->mmaps) {
> +            buf = find_data_region(region, data_offset, data_size);
> +        }
> +
> +        buffer_mmaped = (buf != NULL) ? true : false;

The ternary is unnecessary, "? true : false" is redundant.

> +
> +        if (!buffer_mmaped) {
> +            buf = g_try_malloc0(data_size);

Why do we need zero'd memory?

> +            if (!buf) {
> +                error_report("%s: Error allocating buffer ", __func__);
> +                return -ENOMEM;
> +            }
> +
> +            ret = pread(vbasedev->fd, buf, data_size,
> +                        region->fd_offset + data_offset);
> +            if (ret != data_size) {
> +                error_report("%s: Failed to get migration data %d",
> +                             vbasedev->name, ret);
> +                g_free(buf);
> +                return -EINVAL;
> +            }
> +        }
> +
> +        qemu_put_be64(f, data_size);
> +        qemu_put_buffer(f, buf, data_size);

This can segfault when mmap'd given the above assumptions about size
and layout.

> +
> +        if (!buffer_mmaped) {
> +            g_free(buf);
> +        }
> +    } else {
> +        qemu_put_be64(f, data_size);

We insert a zero?  Couldn't we add the section header and end here and
skip it entirely?

> +    }
> +
> +    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
> +                           migration->pending_bytes);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return data_size;
> +}
> +
> +static int vfio_update_pending(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region;
> +    uint64_t pending_bytes = 0;
> +    int ret;
> +
> +    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             pending_bytes));
> +    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
> +        error_report("%s: Failed to get pending bytes %d",
> +                     vbasedev->name, ret);
> +        migration->pending_bytes = 0;
> +        return (ret < 0) ? ret : -EINVAL;
> +    }
> +
> +    migration->pending_bytes = pending_bytes;
> +    trace_vfio_update_pending(vbasedev->name, pending_bytes);
> +    return 0;
> +}
> +
> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
> +
> +    if (vbasedev->ops && vbasedev->ops->vfio_save_config) {
> +        vbasedev->ops->vfio_save_config(vbasedev, f);
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    trace_vfio_save_device_config_state(vbasedev->name);
> +
> +    return qemu_file_get_error(f);
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -154,7 +285,7 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>          qemu_mutex_unlock_iothread();
>          if (ret) {
>              error_report("%s: Failed to mmap VFIO migration region %d: %s",
> -                         vbasedev->name, migration->region.index,
> +                         vbasedev->name, migration->region.nr,
>                           strerror(-ret));
>              return ret;
>          }
> @@ -194,9 +325,121 @@ static void vfio_save_cleanup(void *opaque)
>      trace_vfio_save_cleanup(vbasedev->name);
>  }
>  
> +static void vfio_save_pending(QEMUFile *f, void *opaque,
> +                              uint64_t threshold_size,
> +                              uint64_t *res_precopy_only,
> +                              uint64_t *res_compatible,
> +                              uint64_t *res_postcopy_only)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    ret = vfio_update_pending(vbasedev);
> +    if (ret) {
> +        return;
> +    }
> +
> +    *res_precopy_only += migration->pending_bytes;
> +
> +    trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
> +                            *res_postcopy_only, *res_compatible);
> +}
> +
> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    int ret, data_size;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +
> +    data_size = vfio_save_buffer(f, vbasedev);
> +
> +    if (data_size < 0) {
> +        error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
> +                     strerror(errno));
> +        return data_size;
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    trace_vfio_save_iterate(vbasedev->name, data_size);
> +    if (data_size == 0) {
> +        /* indicates data finished, goto complete phase */
> +        return 1;

But it's pending_bytes not data_size that indicates we're done.  How do
we get away with ignoring pending_bytes for the save_live_iterate phase?
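
To make the concern concrete, here is a minimal, self-contained sketch (not
the patch's code) of an iterate step that decides completion from the
device's pending_bytes rather than from the size of the last chunk.  The
mock_device struct and the mock_read_pending()/mock_save_chunk() helpers are
hypothetical stand-ins for the migration-region accesses; only the return
convention (1 = done, 0 = call again) follows what the patch does for
.save_live_iterate.

```c
#include <stdint.h>

/* Hypothetical stand-in for the vendor driver's view of the device. */
struct mock_device {
    uint64_t pending_bytes;   /* bytes the driver still wants to send */
    uint64_t chunk_size;      /* most the driver hands out per read   */
};

/* Stand-in for reading pending_bytes from the migration region. */
static uint64_t mock_read_pending(const struct mock_device *dev)
{
    return dev->pending_bytes;
}

/* Stand-in for vfio_save_buffer(): consumes one chunk, returns its size. */
static uint64_t mock_save_chunk(struct mock_device *dev)
{
    uint64_t n = dev->pending_bytes < dev->chunk_size ?
                 dev->pending_bytes : dev->chunk_size;

    dev->pending_bytes -= n;
    return n;
}

/*
 * One iteration of the pre-copy loop.  Returns 1 when the device reports
 * nothing left to send, 0 when the caller should iterate again.
 */
static int mock_save_iterate(struct mock_device *dev)
{
    if (mock_read_pending(dev) == 0) {
        return 1;   /* nothing pending: go to the complete phase */
    }

    (void)mock_save_chunk(dev);

    /* Re-read pending_bytes; a non-zero chunk says nothing about the rest. */
    return mock_read_pending(dev) == 0;
}
```

With this shape the decision to leave pre-copy never depends on data_size,
so a driver that returns a short chunk while still having data pending is
not mistaken for one that has finished.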

> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_RUNNING,
> +                                   VFIO_DEVICE_STATE_SAVING);
> +    if (ret) {
> +        error_report("%s: Failed to set state STOP and SAVING",
> +                     vbasedev->name);
> +        return ret;
> +    }
> +
> +    ret = vfio_save_device_config_state(f, opaque);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_update_pending(vbasedev);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    while (migration->pending_bytes > 0) {
> +        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +        ret = vfio_save_buffer(f, vbasedev);
> +        if (ret < 0) {
> +            error_report("%s: Failed to save buffer", vbasedev->name);
> +            return ret;
> +        } else if (ret == 0) {
> +            break;
> +        }
> +
> +        ret = vfio_update_pending(vbasedev);
> +        if (ret) {
> +            return ret;
> +        }
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_SAVING, 0);
> +    if (ret) {
> +        error_report("%s: Failed to set state STOPPED", vbasedev->name);
> +        return ret;
> +    }
> +
> +    trace_vfio_save_complete_precopy(vbasedev->name);
> +    return ret;
> +}
> +
>  static SaveVMHandlers savevm_vfio_handlers = {
>      .save_setup = vfio_save_setup,
>      .save_cleanup = vfio_save_cleanup,
> +    .save_live_pending = vfio_save_pending,
> +    .save_live_iterate = vfio_save_iterate,
> +    .save_live_complete_precopy = vfio_save_complete_precopy,
>  };
>  
>  /* ---------------------------------------------------------------------- */
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 4bb43f18f315..bdf40ba368c7 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -151,3 +151,9 @@ vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_st
>  vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
>  vfio_save_setup(char *name) " (%s)"
>  vfio_save_cleanup(char *name) " (%s)"
> +vfio_save_buffer(char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
> +vfio_update_pending(char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
> +vfio_save_device_config_state(char *name) " (%s)"
> +vfio_save_pending(char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
> +vfio_save_iterate(char *name, int data_size) " (%s) data_size %d"
> +vfio_save_complete_precopy(char *name) " (%s)"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 28f55f66d019..c78033e4149d 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -60,6 +60,7 @@ typedef struct VFIORegion {
>  
>  typedef struct VFIOMigration {
>      VFIORegion region;
> +    uint64_t pending_bytes;
>  } VFIOMigration;
>  
>  typedef struct VFIOAddressSpace {




* Re: [PATCH v16 QEMU 10/16] vfio: Add load state functions to SaveVMHandlers
  2020-03-24 21:09 ` [PATCH v16 QEMU 10/16] vfio: Add load " Kirti Wankhede
@ 2020-03-25 22:36   ` Alex Williamson
  2020-04-01 18:58   ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 74+ messages in thread
From: Alex Williamson @ 2020-03-25 22:36 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Wed, 25 Mar 2020 02:39:08 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Sequence during _RESUMING device state:
> While data for this device is available, repeat the steps below:
> a. Read data_offset, the offset at which the user application should write
>    data.
> b. Write data_size bytes of data to the migration region, starting at
>    data_offset.
> c. Write data_size, which tells the vendor driver that data has been written
>    to the staging buffer.
> 
> The data is opaque to the user, who should write it in the same order as
> received.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c  | 179 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events |   3 +
>  2 files changed, 182 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index ecbeed5182c2..ab295d25620e 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -269,6 +269,33 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
>      return qemu_file_get_error(f);
>  }
>  
> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    uint64_t data;
> +
> +    if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
> +        int ret;
> +
> +        ret = vbasedev->ops->vfio_load_config(vbasedev, f);
> +        if (ret) {
> +            error_report("%s: Failed to load device config space",
> +                         vbasedev->name);
> +            return ret;
> +        }
> +    }
> +
> +    data = qemu_get_be64(f);
> +    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
> +        error_report("%s: Failed loading device config space, "
> +                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
> +        return -EINVAL;
> +    }
> +
> +    trace_vfio_load_device_config_state(vbasedev->name);
> +    return qemu_file_get_error(f);
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -434,12 +461,164 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>      return ret;
>  }
>  
> +static int vfio_load_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret = 0;
> +
> +    if (migration->region.mmaps) {
> +        ret = vfio_region_mmap(&migration->region);
> +        if (ret) {
> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> +                         vbasedev->name, migration->region.nr,
> +                         strerror(-ret));
> +            return ret;
> +        }
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, ~0, VFIO_DEVICE_STATE_RESUMING);

This seems very prone to making an invalid state, like _RUNNING |
_RESUMING.  Why wouldn't we use ~VFIO_DEVICE_STATE_MASK here?
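
As an illustration, here is a small sketch of the mask/value arithmetic.  It
assumes the new state is computed as (current & mask) | value, and it uses
local DEV_STATE_* constants that are only assumed to mirror the bit layout
in the proposed header; state_is_valid() is likewise a hypothetical helper
encoding the rule that _RESUMING may not be combined with other state bits.

```c
#include <stdint.h>

/* Local copies for illustration; assumed to match the proposed header. */
#define DEV_STATE_RUNNING   (1u << 0)
#define DEV_STATE_SAVING    (1u << 1)
#define DEV_STATE_RESUMING  (1u << 2)
#define DEV_STATE_MASK      (DEV_STATE_RUNNING | DEV_STATE_SAVING | \
                             DEV_STATE_RESUMING)

/* Assumed update rule: keep the bits selected by mask, then OR in value. */
static uint32_t next_state(uint32_t cur, uint32_t mask, uint32_t value)
{
    return (cur & mask) | value;
}

/* Assumed validity rule: _RESUMING must be the only state bit set. */
static int state_is_valid(uint32_t state)
{
    if (state & DEV_STATE_RESUMING) {
        return (state & DEV_STATE_MASK) == DEV_STATE_RESUMING;
    }
    return 1;
}
```

Starting from _RUNNING, a mask of ~0 keeps the _RUNNING bit and yields
_RUNNING | _RESUMING, while a mask of ~DEV_STATE_MASK clears all three state
bits first and yields plain _RESUMING.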

> +    if (ret) {
> +        error_report("%s: Failed to set state RESUMING", vbasedev->name);
> +    }
> +    return ret;
> +}
> +
> +static int vfio_load_cleanup(void *opaque)
> +{
> +    vfio_save_cleanup(opaque);
> +    return 0;
> +}
> +
> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret = 0;
> +    uint64_t data, data_size;
> +
> +    data = qemu_get_be64(f);
> +    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
> +
> +        trace_vfio_load_state(vbasedev->name, data);
> +
> +        switch (data) {
> +        case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
> +        {
> +            ret = vfio_load_device_config_state(f, opaque);
> +            if (ret) {
> +                return ret;
> +            }
> +            break;
> +        }
> +        case VFIO_MIG_FLAG_DEV_SETUP_STATE:
> +        {
> +            uint64_t region_size = qemu_get_be64(f);
> +
> +            if (migration->region.size < region_size) {
> +                error_report("%s: SETUP STATE: migration region too small, "
> +                             "0x%"PRIx64 " < 0x%"PRIx64, vbasedev->name,
> +                             migration->region.size, region_size);
> +                return -EINVAL;
> +            }
> +
> +            data = qemu_get_be64(f);
> +            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
> +                return ret;
> +            } else {
> +                error_report("%s: SETUP STATE: EOS not found 0x%"PRIx64,
> +                             vbasedev->name, data);
> +                return -EINVAL;
> +            }
> +            break;
> +        }
> +        case VFIO_MIG_FLAG_DEV_DATA_STATE:
> +        {
> +            VFIORegion *region = &migration->region;
> +            void *buf = NULL;
> +            bool buffer_mmaped = false;
> +            uint64_t data_offset = 0;
> +
> +            data_size = qemu_get_be64(f);
> +            if (data_size == 0) {
> +                break;
> +            }
> +
> +            ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> +                        region->fd_offset +
> +                        offsetof(struct vfio_device_migration_info,
> +                        data_offset));
> +            if (ret != sizeof(data_offset)) {
> +                error_report("%s: Failed to get migration buffer data offset %d",
> +                             vbasedev->name, ret);
> +                return -EINVAL;
> +            }
> +
> +            if (region->mmaps) {
> +                buf = find_data_region(region, data_offset, data_size);
> +            }
> +
> +            buffer_mmaped = (buf != NULL) ? true : false;
> +
> +            if (!buffer_mmaped) {
> +                buf = g_try_malloc0(data_size);
> +                if (!buf) {
> +                    error_report("%s: Error allocating buffer ", __func__);
> +                    return -ENOMEM;
> +                }
> +            }
> +
> +            qemu_get_buffer(f, buf, data_size);
> +
> +            if (!buffer_mmaped) {
> +                ret = pwrite(vbasedev->fd, buf, data_size,
> +                             region->fd_offset + data_offset);
> +                g_free(buf);
> +
> +                if (ret != data_size) {
> +                    error_report("%s: Failed to set migration buffer %d",
> +                                 vbasedev->name, ret);
> +                    return -EINVAL;
> +                }
> +            }
> +
> +            ret = pwrite(vbasedev->fd, &data_size, sizeof(data_size),
> +                         region->fd_offset +
> +                       offsetof(struct vfio_device_migration_info, data_size));
> +            if (ret != sizeof(data_size)) {
> +                error_report("%s: Failed to set migration buffer data size %d",
> +                             vbasedev->name, ret);
> +                if (!buffer_mmaped) {
> +                    g_free(buf);
> +                }
> +                return -EINVAL;
> +            }
> +
> +            trace_vfio_load_state_device_data(vbasedev->name, data_offset,
> +                                              data_size);
> +            break;
> +        }
> +        }
> +
> +        ret = qemu_file_get_error(f);
> +        if (ret) {
> +            return ret;
> +        }
> +        data = qemu_get_be64(f);
> +    }
> +
> +    return ret;
> +}
> +
>  static SaveVMHandlers savevm_vfio_handlers = {
>      .save_setup = vfio_save_setup,
>      .save_cleanup = vfio_save_cleanup,
>      .save_live_pending = vfio_save_pending,
>      .save_live_iterate = vfio_save_iterate,
>      .save_live_complete_precopy = vfio_save_complete_precopy,
> +    .load_setup = vfio_load_setup,
> +    .load_cleanup = vfio_load_cleanup,
> +    .load_state = vfio_load_state,
>  };
>  
>  /* ---------------------------------------------------------------------- */
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index bdf40ba368c7..ac065b559f4e 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -157,3 +157,6 @@ vfio_save_device_config_state(char *name) " (%s)"
>  vfio_save_pending(char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
>  vfio_save_iterate(char *name, int data_size) " (%s) data_size %d"
>  vfio_save_complete_precopy(char *name) " (%s)"
> +vfio_load_device_config_state(char *name) " (%s)"
> +vfio_load_state(char *name, uint64_t data) " (%s) data 0x%"PRIx64
> +vfio_load_state_device_data(char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64




* Re: [PATCH v16 QEMU 04/16] vfio: Add save and load functions for VFIO PCI devices
  2020-03-25 19:56   ` Alex Williamson
@ 2020-03-26 17:29     ` Dr. David Alan Gilbert
  2020-03-26 17:38       ` Alex Williamson
  2020-05-04 23:18     ` Kirti Wankhede
  1 sibling, 1 reply; 74+ messages in thread
From: Dr. David Alan Gilbert @ 2020-03-26 17:29 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, Kirti Wankhede, eauger, felipe, jonathan.davies, yan.y.zhao,
	changpeng.liu, Ken.Xue

* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Wed, 25 Mar 2020 02:39:02 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > These functions save and restore PCI device specific data - config
> > space of PCI device.
> > Tested save and restore with MSI and MSIX type.
> > 
> > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > ---
> >  hw/vfio/pci.c                 | 163 ++++++++++++++++++++++++++++++++++++++++++
> >  include/hw/vfio/vfio-common.h |   2 +
> >  2 files changed, 165 insertions(+)
> > 
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index 6c77c12e44b9..8deb11e87ef7 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -41,6 +41,7 @@
> >  #include "trace.h"
> >  #include "qapi/error.h"
> >  #include "migration/blocker.h"
> > +#include "migration/qemu-file.h"
> >  
> >  #define TYPE_VFIO_PCI "vfio-pci"
> >  #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> > @@ -1632,6 +1633,50 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
> >      }
> >  }
> >  
> > +static int vfio_bar_validate(VFIOPCIDevice *vdev, int nr)
> > +{
> > +    PCIDevice *pdev = &vdev->pdev;
> > +    VFIOBAR *bar = &vdev->bars[nr];
> > +    uint64_t addr;
> > +    uint32_t addr_lo, addr_hi = 0;
> > +
> > +    /* Skip unimplemented BARs and the upper half of 64bit BARS. */
> > +    if (!bar->size) {
> > +        return 0;
> > +    }
> > +
> > +    addr_lo = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + nr * 4, 4);
> > +
> > +    addr_lo = addr_lo & (bar->ioport ? PCI_BASE_ADDRESS_IO_MASK :
> > +                                       PCI_BASE_ADDRESS_MEM_MASK);
> 
> Nit, &= or combine with previous set.
> 
> > +    if (bar->type == PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > +        addr_hi = pci_default_read_config(pdev,
> > +                                         PCI_BASE_ADDRESS_0 + (nr + 1) * 4, 4);
> > +    }
> > +
> > +    addr = ((uint64_t)addr_hi << 32) | addr_lo;
> 
> Could we use a union?
> 
> > +
> > +    if (!QEMU_IS_ALIGNED(addr, bar->size)) {
> > +        return -EINVAL;
> > +    }
> 
> What specifically are we validating here?  This should be true no
> matter what we wrote to the BAR or else BAR emulation is broken.  The
> bits that could make this unaligned are not implemented in the BAR.

That, I think, is based on a comment I made a few versions back.
Remember the value being checked here is a value loaded from the
migration stream; it could be garbage, so it's good to do whatever
checks you can.

> > +
> > +    return 0;
> > +}
> > +
> > +static int vfio_bars_validate(VFIOPCIDevice *vdev)
> > +{
> > +    int i, ret;
> > +
> > +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > +        ret = vfio_bar_validate(vdev, i);
> > +        if (ret) {
> > +            error_report("vfio: BAR address %d validation failed", i);
> > +            return ret;
> > +        }
> > +    }
> > +    return 0;
> > +}
> > +
> >  static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
> >  {
> >      VFIOBAR *bar = &vdev->bars[nr];
> > @@ -2414,11 +2459,129 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
> >      return OBJECT(vdev);
> >  }
> >  
> > +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> > +{
> > +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> > +    PCIDevice *pdev = &vdev->pdev;
> > +    uint16_t pci_cmd;
> > +    int i;
> > +
> > +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > +        uint32_t bar;
> > +
> > +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> > +        qemu_put_be32(f, bar);
> > +    }
> > +
> > +    qemu_put_be32(f, vdev->interrupt);
> > +    if (vdev->interrupt == VFIO_INT_MSI) {
> > +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> > +        bool msi_64bit;
> > +
> > +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > +                                            2);
> > +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> > +
> > +        msi_addr_lo = pci_default_read_config(pdev,
> > +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> > +        qemu_put_be32(f, msi_addr_lo);
> > +
> > +        if (msi_64bit) {
> > +            msi_addr_hi = pci_default_read_config(pdev,
> > +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > +                                             4);
> > +        }
> > +        qemu_put_be32(f, msi_addr_hi);
> > +
> > +        msi_data = pci_default_read_config(pdev,
> > +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > +                2);
> > +        qemu_put_be32(f, msi_data);
> 
> Isn't the data field only a u16?
> 
> > +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
> > +        uint16_t offset;
> > +
> > +        /* save enable bit and maskall bit */
> > +        offset = pci_default_read_config(pdev,
> > +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> > +        qemu_put_be16(f, offset);
> > +        msix_save(pdev, f);
> > +    }
> > +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > +    qemu_put_be16(f, pci_cmd);
> > +}
> > +
> > +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> > +{
> > +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> > +    PCIDevice *pdev = &vdev->pdev;
> > +    uint32_t interrupt_type;
> > +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> > +    uint16_t pci_cmd;
> > +    bool msi_64bit;
> > +    int i, ret;
> > +
> > +    /* restore pci bar configuration */
> > +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > +    vfio_pci_write_config(pdev, PCI_COMMAND,
> > +                        pci_cmd & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
> > +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > +        uint32_t bar = qemu_get_be32(f);
> > +
> > +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> > +    }
> > +
> > +    ret = vfio_bars_validate(vdev);
> > +    if (ret) {
> > +        return ret;
> > +    }
> > +
> > +    interrupt_type = qemu_get_be32(f);
> > +
> > +    if (interrupt_type == VFIO_INT_MSI) {
> > +        /* restore msi configuration */
> > +        msi_flags = pci_default_read_config(pdev,
> > +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> > +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> > +
> > +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > +                              msi_flags & ~PCI_MSI_FLAGS_ENABLE, 2);
> > +
> > +        msi_addr_lo = qemu_get_be32(f);
> > +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
> > +                              msi_addr_lo, 4);
> > +
> > +        msi_addr_hi = qemu_get_be32(f);
> > +        if (msi_64bit) {
> > +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > +                                  msi_addr_hi, 4);
> > +        }
> > +        msi_data = qemu_get_be32(f);
> > +        vfio_pci_write_config(pdev,
> > +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > +                msi_data, 2);
> > +
> > +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
> > +    } else if (interrupt_type == VFIO_INT_MSIX) {
> > +        uint16_t offset = qemu_get_be16(f);
> > +
> > +        /* load enable bit and maskall bit */
> > +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> > +                              offset, 2);
> > +        msix_load(pdev, f);
> > +    }
> > +    pci_cmd = qemu_get_be16(f);
> > +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
> > +    return 0;
> > +}
> 
> It always seems like there should be a lot more state than this, and I
> probably sound like a broken record because I ask every time, but maybe
> that's a good indication that we (or at least I) need a comment
> explaining why we only care about these.  For example, what if we
> migrate a device in the D3 power state, don't we need to account for
> the state stored in the PM capability or does the device wake up into
> D0 auto-magically after migration?  I think we could repeat that
> question for every capability that can be modified.  Even for the MSI/X
> cases, the interrupt may not be active, but there could be state in
> virtual config space that would be different on the target.  For
> example, if we migrate with a device in INTx mode where the guest had
> written vector fields on the source, but only writes the enable bit on
> the target, can we seamlessly figure out the rest?  For other
> capabilities, that state may represent config space changes written
> through to the physical device and represent a functional difference on
> the target.  Thanks,
> 
> Alex
> 
> > +
> >  static VFIODeviceOps vfio_pci_ops = {
> >      .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
> >      .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
> >      .vfio_eoi = vfio_intx_eoi,
> >      .vfio_get_object = vfio_pci_get_object,
> > +    .vfio_save_config = vfio_pci_save_config,
> > +    .vfio_load_config = vfio_pci_load_config,
> >  };
> >  
> >  int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
> > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> > index 74261feaeac9..d69a7f3ae31e 100644
> > --- a/include/hw/vfio/vfio-common.h
> > +++ b/include/hw/vfio/vfio-common.h
> > @@ -120,6 +120,8 @@ struct VFIODeviceOps {
> >      int (*vfio_hot_reset_multi)(VFIODevice *vdev);
> >      void (*vfio_eoi)(VFIODevice *vdev);
> >      Object *(*vfio_get_object)(VFIODevice *vdev);
> > +    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
> > +    int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
> >  };
> >  
> >  typedef struct VFIOGroup {
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v16 QEMU 04/16] vfio: Add save and load functions for VFIO PCI devices
  2020-03-26 17:29     ` Dr. David Alan Gilbert
@ 2020-03-26 17:38       ` Alex Williamson
  0 siblings, 0 replies; 74+ messages in thread
From: Alex Williamson @ 2020-03-26 17:38 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, Kirti Wankhede, eauger, felipe, jonathan.davies, yan.y.zhao,
	changpeng.liu, Ken.Xue

On Thu, 26 Mar 2020 17:29:26 +0000
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Alex Williamson (alex.williamson@redhat.com) wrote:
> > On Wed, 25 Mar 2020 02:39:02 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> > > These functions save and restore PCI device specific data - config
> > > space of PCI device.
> > > Tested save and restore with MSI and MSIX type.
> > > 
> > > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > > ---
> > >  hw/vfio/pci.c                 | 163 ++++++++++++++++++++++++++++++++++++++++++
> > >  include/hw/vfio/vfio-common.h |   2 +
> > >  2 files changed, 165 insertions(+)
> > > 
> > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > > index 6c77c12e44b9..8deb11e87ef7 100644
> > > --- a/hw/vfio/pci.c
> > > +++ b/hw/vfio/pci.c
> > > @@ -41,6 +41,7 @@
> > >  #include "trace.h"
> > >  #include "qapi/error.h"
> > >  #include "migration/blocker.h"
> > > +#include "migration/qemu-file.h"
> > >  
> > >  #define TYPE_VFIO_PCI "vfio-pci"
> > >  #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> > > @@ -1632,6 +1633,50 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
> > >      }
> > >  }
> > >  
> > > +static int vfio_bar_validate(VFIOPCIDevice *vdev, int nr)
> > > +{
> > > +    PCIDevice *pdev = &vdev->pdev;
> > > +    VFIOBAR *bar = &vdev->bars[nr];
> > > +    uint64_t addr;
> > > +    uint32_t addr_lo, addr_hi = 0;
> > > +
> > > +    /* Skip unimplemented BARs and the upper half of 64bit BARS. */
> > > +    if (!bar->size) {
> > > +        return 0;
> > > +    }
> > > +
> > > +    addr_lo = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + nr * 4, 4);
> > > +
> > > +    addr_lo = addr_lo & (bar->ioport ? PCI_BASE_ADDRESS_IO_MASK :
> > > +                                       PCI_BASE_ADDRESS_MEM_MASK);  
> > 
> > Nit, &= or combine with previous set.
> >   
> > > +    if (bar->type == PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > > +        addr_hi = pci_default_read_config(pdev,
> > > +                                         PCI_BASE_ADDRESS_0 + (nr + 1) * 4, 4);
> > > +    }
> > > +
> > > +    addr = ((uint64_t)addr_hi << 32) | addr_lo;  
> > 
> > Could we use a union?
> >   
> > > +
> > > +    if (!QEMU_IS_ALIGNED(addr, bar->size)) {
> > > +        return -EINVAL;
> > > +    }  
> > 
> > What specifically are we validating here?  This should be true no
> > matter what we wrote to the BAR or else BAR emulation is broken.  The
> > bits that could make this unaligned are not implemented in the BAR.  
> 
> > That, I think, is based on a comment I made a few versions back.
> Remember the value being checked here is a value loaded from the
> migration stream; it could be garbage, so it's good to do whatever
> checks you can.

It's not the migration stream though, we're reading it from config
space emulation.  The migration stream could have written absolutely
anything to the device BAR and this test should still be ok.  PCI BARs
are naturally aligned by definition.  The address bits that could make
the value unaligned are not implemented.  This is why we can determine
the size of the BAR by writing -1 to it.  Thanks,

Alex
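
For readers following along, the natural-alignment argument can be sketched
with a toy model of a 32-bit memory BAR.  Everything here (struct mock_bar,
MOCK_BAR_FLAG_BITS, the mock_bar_*() helpers) is hypothetical and is not
QEMU's config space emulation; it only assumes that the size is a power of
two of at least 16 bytes and that address bits below the size are
hardwired to zero.

```c
#include <stdint.h>

/* Hypothetical model of a 32-bit memory BAR; size must be a power of two. */
struct mock_bar {
    uint32_t size;    /* e.g. 0x1000 */
    uint32_t value;   /* what config space reads back */
};

#define MOCK_BAR_FLAG_BITS 0xfu   /* low type/prefetch bits, read-only */

static void mock_bar_write(struct mock_bar *bar, uint32_t val)
{
    /* Address bits below the BAR size are not implemented: they stay 0. */
    uint32_t addr_mask = ~(bar->size - 1);

    bar->value = (val & addr_mask) | (bar->value & MOCK_BAR_FLAG_BITS);
}

static uint32_t mock_bar_addr(const struct mock_bar *bar)
{
    return bar->value & ~MOCK_BAR_FLAG_BITS;
}

/* Classic sizing probe: write all ones, read back, invert, add one. */
static uint32_t mock_bar_probe_size(struct mock_bar *bar)
{
    uint32_t saved = bar->value;
    uint32_t size;

    mock_bar_write(bar, 0xffffffffu);
    size = ~mock_bar_addr(bar) + 1;
    mock_bar_write(bar, saved);
    return size;
}
```

Whatever value is written, the address read back is a multiple of the BAR
size, which is why an alignment check on the read-back value cannot fail in
this model.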

> > > +
> > > +    return 0;
> > > +}
> > > +
> > > +static int vfio_bars_validate(VFIOPCIDevice *vdev)
> > > +{
> > > +    int i, ret;
> > > +
> > > +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > > +        ret = vfio_bar_validate(vdev, i);
> > > +        if (ret) {
> > > +            error_report("vfio: BAR address %d validation failed", i);
> > > +            return ret;
> > > +        }
> > > +    }
> > > +    return 0;
> > > +}
> > > +
> > >  static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
> > >  {
> > >      VFIOBAR *bar = &vdev->bars[nr];
> > > @@ -2414,11 +2459,129 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
> > >      return OBJECT(vdev);
> > >  }
> > >  
> > > +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> > > +{
> > > +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> > > +    PCIDevice *pdev = &vdev->pdev;
> > > +    uint16_t pci_cmd;
> > > +    int i;
> > > +
> > > +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > > +        uint32_t bar;
> > > +
> > > +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> > > +        qemu_put_be32(f, bar);
> > > +    }
> > > +
> > > +    qemu_put_be32(f, vdev->interrupt);
> > > +    if (vdev->interrupt == VFIO_INT_MSI) {
> > > +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> > > +        bool msi_64bit;
> > > +
> > > +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > > +                                            2);
> > > +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> > > +
> > > +        msi_addr_lo = pci_default_read_config(pdev,
> > > +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> > > +        qemu_put_be32(f, msi_addr_lo);
> > > +
> > > +        if (msi_64bit) {
> > > +            msi_addr_hi = pci_default_read_config(pdev,
> > > +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > > +                                             4);
> > > +        }
> > > +        qemu_put_be32(f, msi_addr_hi);
> > > +
> > > +        msi_data = pci_default_read_config(pdev,
> > > +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > > +                2);
> > > +        qemu_put_be32(f, msi_data);  
> > 
> > Isn't the data field only a u16?
> >   
> > > +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
> > > +        uint16_t offset;
> > > +
> > > +        /* save enable bit and maskall bit */
> > > +        offset = pci_default_read_config(pdev,
> > > +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> > > +        qemu_put_be16(f, offset);
> > > +        msix_save(pdev, f);
> > > +    }
> > > +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > > +    qemu_put_be16(f, pci_cmd);
> > > +}
> > > +
> > > +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> > > +{
> > > +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> > > +    PCIDevice *pdev = &vdev->pdev;
> > > +    uint32_t interrupt_type;
> > > +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> > > +    uint16_t pci_cmd;
> > > +    bool msi_64bit;
> > > +    int i, ret;
> > > +
> > > +    /* restore pci bar configuration */
> > > +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > > +    vfio_pci_write_config(pdev, PCI_COMMAND,
> > > +                        pci_cmd & (~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
> > > +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > > +        uint32_t bar = qemu_get_be32(f);
> > > +
> > > +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> > > +    }
> > > +
> > > +    ret = vfio_bars_validate(vdev);
> > > +    if (ret) {
> > > +        return ret;
> > > +    }
> > > +
> > > +    interrupt_type = qemu_get_be32(f);
> > > +
> > > +    if (interrupt_type == VFIO_INT_MSI) {
> > > +        /* restore msi configuration */
> > > +        msi_flags = pci_default_read_config(pdev,
> > > +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> > > +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> > > +
> > > +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > > +                              msi_flags & (~PCI_MSI_FLAGS_ENABLE), 2);
> > > +
> > > +        msi_addr_lo = qemu_get_be32(f);
> > > +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
> > > +                              msi_addr_lo, 4);
> > > +
> > > +        msi_addr_hi = qemu_get_be32(f);
> > > +        if (msi_64bit) {
> > > +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > > +                                  msi_addr_hi, 4);
> > > +        }
> > > +        msi_data = qemu_get_be32(f);
> > > +        vfio_pci_write_config(pdev,
> > > +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > > +                msi_data, 2);
> > > +
> > > +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > > +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
> > > +    } else if (interrupt_type == VFIO_INT_MSIX) {
> > > +        uint16_t offset = qemu_get_be16(f);
> > > +
> > > +        /* load enable bit and maskall bit */
> > > +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> > > +                              offset, 2);
> > > +        msix_load(pdev, f);
> > > +    }
> > > +    pci_cmd = qemu_get_be16(f);
> > > +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
> > > +    return 0;
> > > +}  
> > 
> > It always seems like there should be a lot more state than this, and I
> > probably sound like a broken record because I ask every time, but maybe
> > that's a good indication that we (or at least I) need a comment
> > explaining why we only care about these.  For example, what if we
> > migrate a device in the D3 power state, don't we need to account for
> > the state stored in the PM capability or does the device wake up into
> > D0 auto-magically after migration?  I think we could repeat that
> > question for every capability that can be modified.  Even for the MSI/X
> > cases, the interrupt may not be active, but there could be state in
> > virtual config space that would be different on the target.  For
> > example, if we migrate with a device in INTx mode where the guest had
> > written vector fields on the source, but only writes the enable bit on
> > the target, can we seamlessly figure out the rest?  For other
> > capabilities, that state may represent config space changes written
> > through to the physical device and represent a functional difference on
> > the target.  Thanks,
> > 
> > Alex
> >   
> > > +
> > >  static VFIODeviceOps vfio_pci_ops = {
> > >      .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
> > >      .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
> > >      .vfio_eoi = vfio_intx_eoi,
> > >      .vfio_get_object = vfio_pci_get_object,
> > > +    .vfio_save_config = vfio_pci_save_config,
> > > +    .vfio_load_config = vfio_pci_load_config,
> > >  };
> > >  
> > >  int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
> > > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> > > index 74261feaeac9..d69a7f3ae31e 100644
> > > --- a/include/hw/vfio/vfio-common.h
> > > +++ b/include/hw/vfio/vfio-common.h
> > > @@ -120,6 +120,8 @@ struct VFIODeviceOps {
> > >      int (*vfio_hot_reset_multi)(VFIODevice *vdev);
> > >      void (*vfio_eoi)(VFIODevice *vdev);
> > >      Object *(*vfio_get_object)(VFIODevice *vdev);
> > > +    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
> > > +    int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
> > >  };
> > >  
> > >  typedef struct VFIOGroup {  
> >   
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v16 QEMU 04/16] vfio: Add save and load functions for VFIO PCI devices
  2020-03-24 21:09 ` [PATCH v16 QEMU 04/16] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
  2020-03-25 19:56   ` Alex Williamson
@ 2020-03-26 17:46   ` Dr. David Alan Gilbert
  2020-05-04 23:19     ` Kirti Wankhede
  2020-04-07  4:10   ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
  2 siblings, 1 reply; 74+ messages in thread
From: Dr. David Alan Gilbert @ 2020-03-26 17:46 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> These functions save and restore PCI device specific data - config
> space of PCI device.
> Tested save and restore with MSI and MSIX type.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/pci.c                 | 163 ++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |   2 +
>  2 files changed, 165 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 6c77c12e44b9..8deb11e87ef7 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -41,6 +41,7 @@
>  #include "trace.h"
>  #include "qapi/error.h"
>  #include "migration/blocker.h"
> +#include "migration/qemu-file.h"
>  
>  #define TYPE_VFIO_PCI "vfio-pci"
>  #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> @@ -1632,6 +1633,50 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
>      }
>  }
>  
> +static int vfio_bar_validate(VFIOPCIDevice *vdev, int nr)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +    VFIOBAR *bar = &vdev->bars[nr];
> +    uint64_t addr;
> +    uint32_t addr_lo, addr_hi = 0;
> +
> +    /* Skip unimplemented BARs and the upper half of 64bit BARS. */
> +    if (!bar->size) {
> +        return 0;
> +    }
> +
> +    addr_lo = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + nr * 4, 4);
> +
> +    addr_lo = addr_lo & (bar->ioport ? PCI_BASE_ADDRESS_IO_MASK :
> +                                       PCI_BASE_ADDRESS_MEM_MASK);
> +    if (bar->type == PCI_BASE_ADDRESS_MEM_TYPE_64) {
> +        addr_hi = pci_default_read_config(pdev,
> +                                         PCI_BASE_ADDRESS_0 + (nr + 1) * 4, 4);
> +    }
> +
> +    addr = ((uint64_t)addr_hi << 32) | addr_lo;
> +
> +    if (!QEMU_IS_ALIGNED(addr, bar->size)) {
> +        return -EINVAL;
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_bars_validate(VFIOPCIDevice *vdev)
> +{
> +    int i, ret;
> +
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        ret = vfio_bar_validate(vdev, i);
> +        if (ret) {
> +            error_report("vfio: BAR address %d validation failed", i);
> +            return ret;
> +        }
> +    }
> +    return 0;
> +}
> +
>  static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
>  {
>      VFIOBAR *bar = &vdev->bars[nr];
> @@ -2414,11 +2459,129 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
>      return OBJECT(vdev);
>  }
>  
> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> +{
> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint16_t pci_cmd;
> +    int i;
> +
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        uint32_t bar;
> +
> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> +        qemu_put_be32(f, bar);
> +    }
> +
> +    qemu_put_be32(f, vdev->interrupt);
> +    if (vdev->interrupt == VFIO_INT_MSI) {
> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> +        bool msi_64bit;
> +
> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                                            2);
> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> +
> +        msi_addr_lo = pci_default_read_config(pdev,
> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> +        qemu_put_be32(f, msi_addr_lo);
> +
> +        if (msi_64bit) {
> +            msi_addr_hi = pci_default_read_config(pdev,
> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                                             4);
> +        }
> +        qemu_put_be32(f, msi_addr_hi);
> +
> +        msi_data = pci_default_read_config(pdev,
> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +                2);
> +        qemu_put_be32(f, msi_data);
> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
> +        uint16_t offset;
> +
> +        /* save enable bit and maskall bit */
> +        offset = pci_default_read_config(pdev,
> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> +        qemu_put_be16(f, offset);
> +        msix_save(pdev, f);
> +    }
> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> +    qemu_put_be16(f, pci_cmd);
> +}
> +
> +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> +{
> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint32_t interrupt_type;
> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> +    uint16_t pci_cmd;
> +    bool msi_64bit;
> +    int i, ret;
> +
> +    /* restore pci bar configuration */
> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> +                        pci_cmd & (~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        uint32_t bar = qemu_get_be32(f);
> +
> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> +    }
> +
> +    ret = vfio_bars_validate(vdev);

This isn't quite what I'd expected, since that validate is reading back what
you've just written; I'd have thought you'd validate the BAR value before
writing it to the device.
(I'm also surprised you're only reading 32 bits here?)
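
To make the ordering concern concrete, here is a minimal sketch of checking a
BAR value before it reaches the device, with both halves of a 64-bit BAR
combined first. The helper name and the BAR_MEM_MASK constant are hypothetical
and not part of the patch; the mask assumes the usual memory BAR layout where
the low four bits are flag bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constant: the low 4 bits of a memory BAR hold flag bits. */
#define BAR_MEM_MASK 0xfffffff0u

/*
 * Hypothetical helper, not part of the patch: decide whether a BAR value
 * taken from the migration stream is acceptable *before* it is written.
 * lo and hi are the two 32-bit halves (hi is 0 for a 32-bit BAR) and size
 * is the BAR size in bytes.
 */
static bool bar_value_is_valid(uint32_t lo, uint32_t hi, uint64_t size)
{
    uint64_t addr;

    if (size == 0) {
        return true;                 /* unimplemented BAR: nothing to check */
    }
    if (size & (size - 1)) {
        return false;                /* BAR sizes are powers of two */
    }

    addr = ((uint64_t)hi << 32) | (lo & BAR_MEM_MASK);
    return (addr & (size - 1)) == 0; /* address must be naturally aligned */
}
```

A caller would read the BAR value (both halves for a 64-bit BAR) from the
stream, run this check, and only then write it to config space.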

> +    if (ret) {
> +        return ret;
> +    }
> +
> +    interrupt_type = qemu_get_be32(f);
> +
> +    if (interrupt_type == VFIO_INT_MSI) {
> +        /* restore msi configuration */
> +        msi_flags = pci_default_read_config(pdev,
> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> +
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                              msi_flags & (~PCI_MSI_FLAGS_ENABLE), 2);
> +
> +        msi_addr_lo = qemu_get_be32(f);
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
> +                              msi_addr_lo, 4);
> +
> +        msi_addr_hi = qemu_get_be32(f);
> +        if (msi_64bit) {
> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                                  msi_addr_hi, 4);
> +        }
> +        msi_data = qemu_get_be32(f);
> +        vfio_pci_write_config(pdev,
> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +                msi_data, 2);
> +
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
> +    } else if (interrupt_type == VFIO_INT_MSIX) {
> +        uint16_t offset = qemu_get_be16(f);
> +
> +        /* load enable bit and maskall bit */
> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> +                              offset, 2);
> +        msix_load(pdev, f);
> +    }
> +    pci_cmd = qemu_get_be16(f);
> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
> +    return 0;
> +}
> +

While I don't know PCI as well as Alex, I share the worry about what
happens when you decide you want to save more information about the
device; you've not got any placeholders where you can add anything, and
since it's all hand-coded (rather than using vmstate) it's only going to
get hairier.
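
As an illustration of the extensibility worry (and explicitly not the QEMU
vmstate API), the sketch below prefixes a hand-coded record with a version and
a payload length, so an older reader can skip fields a newer writer appended.
All names here (struct buf, save_v2, load_v1, pm_state) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory stream standing in for a QEMUFile. */
struct buf {
    uint8_t data[64];
    size_t len;   /* bytes written so far */
    size_t pos;   /* read cursor */
};

/* Append a big-endian 32-bit value; the caller guarantees capacity. */
static void put_be32(struct buf *b, uint32_t v)
{
    b->data[b->len++] = (uint8_t)(v >> 24);
    b->data[b->len++] = (uint8_t)(v >> 16);
    b->data[b->len++] = (uint8_t)(v >> 8);
    b->data[b->len++] = (uint8_t)v;
}

/* Read a big-endian 32-bit value; the caller has checked 4 bytes remain. */
static uint32_t get_be32(struct buf *b)
{
    const uint8_t *p = &b->data[b->pos];

    b->pos += 4;
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Version 2 writer: version 1 carried only pci_cmd, version 2 adds pm_state. */
static void save_v2(struct buf *b, uint32_t pci_cmd, uint32_t pm_state)
{
    put_be32(b, 2);          /* record version */
    put_be32(b, 8);          /* payload length in bytes */
    put_be32(b, pci_cmd);
    put_be32(b, pm_state);
}

/* Version 1 reader: takes the field it knows, then skips the remainder. */
static int load_v1(struct buf *b, uint32_t *pci_cmd)
{
    uint32_t version, length;
    size_t end;

    if (b->len - b->pos < 8) {
        return -1;           /* header truncated */
    }
    version = get_be32(b);
    length = get_be32(b);
    if (version < 1 || length < 4 || length > b->len - b->pos) {
        return -1;           /* unknown layout or truncated payload */
    }
    end = b->pos + length;
    *pci_cmd = get_be32(b);
    b->pos = end;            /* skip fields added by newer versions */
    return 0;
}
```

The length prefix is what gives the reader a place to skip to; without it, any
added field shifts everything that follows in the stream.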

Dave

>  static VFIODeviceOps vfio_pci_ops = {
>      .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
>      .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
>      .vfio_eoi = vfio_intx_eoi,
>      .vfio_get_object = vfio_pci_get_object,
> +    .vfio_save_config = vfio_pci_save_config,
> +    .vfio_load_config = vfio_pci_load_config,
>  };
>  
>  int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 74261feaeac9..d69a7f3ae31e 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -120,6 +120,8 @@ struct VFIODeviceOps {
>      int (*vfio_hot_reset_multi)(VFIODevice *vdev);
>      void (*vfio_eoi)(VFIODevice *vdev);
>      Object *(*vfio_get_object)(VFIODevice *vdev);
> +    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
> +    int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
>  };
>  
>  typedef struct VFIOGroup {
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v16 QEMU 05/16] vfio: Add migration region initialization and finalize function
  2020-03-24 21:09 ` [PATCH v16 QEMU 05/16] vfio: Add migration region initialization and finalize function Kirti Wankhede
@ 2020-03-26 17:52   ` Dr. David Alan Gilbert
  2020-05-04 23:19     ` Kirti Wankhede
  0 siblings, 1 reply; 74+ messages in thread
From: Dr. David Alan Gilbert @ 2020-03-26 17:52 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> - Migration functions are implemented for VFIO_DEVICE_TYPE_PCI devices in this
>   patch series.
> - Whether a VFIO device supports migration is decided based on the migration
>   region query. If the migration region query and the migration region
>   initialization both succeed, migration is supported; otherwise migration is
>   blocked.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/migration.c           | 138 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |   3 +
>  include/hw/vfio/vfio-common.h |   9 +++
>  4 files changed, 151 insertions(+), 1 deletion(-)
>  create mode 100644 hw/vfio/migration.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index 9bb1c09e8477..8b296c889ed9 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -1,4 +1,4 @@
> -obj-y += common.o spapr.o
> +obj-y += common.o spapr.o migration.o
>  obj-$(CONFIG_VFIO_PCI) += pci.o pci-quirks.o display.o
>  obj-$(CONFIG_VFIO_CCW) += ccw.o
>  obj-$(CONFIG_VFIO_PLATFORM) += platform.o
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> new file mode 100644
> index 000000000000..a078dcf1dd8f
> --- /dev/null
> +++ b/hw/vfio/migration.c
> @@ -0,0 +1,138 @@
> +/*
> + * Migration support for VFIO devices
> + *
> + * Copyright NVIDIA, Inc. 2019

Time flies by...

> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.

Are you sure you want this to be V2 only? Most code added to qemu now is
v2 or later.

> + */
> +
> +#include "qemu/osdep.h"
> +#include <linux/vfio.h>
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "cpu.h"
> +#include "migration/migration.h"
> +#include "migration/qemu-file.h"
> +#include "migration/register.h"
> +#include "migration/blocker.h"
> +#include "migration/misc.h"
> +#include "qapi/error.h"
> +#include "exec/ramlist.h"
> +#include "exec/ram_addr.h"
> +#include "pci.h"
> +#include "trace.h"
> +
> +static void vfio_migration_region_exit(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    if (!migration) {
> +        return;
> +    }
> +
> +    if (migration->region.size) {
> +        vfio_region_exit(&migration->region);
> +        vfio_region_finalize(&migration->region);
> +    }
> +}
> +
> +static int vfio_migration_region_init(VFIODevice *vbasedev, int index)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    Object *obj = NULL;
> +    int ret = -EINVAL;
> +
> +    if (!vbasedev->ops->vfio_get_object) {
> +        return ret;
> +    }
> +
> +    obj = vbasedev->ops->vfio_get_object(vbasedev);
> +    if (!obj) {
> +        return ret;
> +    }
> +
> +    ret = vfio_region_setup(obj, vbasedev, &migration->region, index,
> +                            "migration");
> +    if (ret) {
> +        error_report("%s: Failed to setup VFIO migration region %d: %s",
> +                     vbasedev->name, index, strerror(-ret));
> +        goto err;
> +    }
> +
> +    if (!migration->region.size) {
> +        ret = -EINVAL;
> +        error_report("%s: Invalid region size of VFIO migration region %d: %s",
> +                     vbasedev->name, index, strerror(-ret));
> +        goto err;
> +    }
> +
> +    return 0;
> +
> +err:
> +    vfio_migration_region_exit(vbasedev);
> +    return ret;
> +}
> +
> +static int vfio_migration_init(VFIODevice *vbasedev,
> +                               struct vfio_region_info *info)
> +{
> +    int ret;
> +
> +    vbasedev->migration = g_new0(VFIOMigration, 1);
> +
> +    ret = vfio_migration_region_init(vbasedev, info->index);
> +    if (ret) {
> +        error_report("%s: Failed to initialise migration region",
> +                     vbasedev->name);
> +        g_free(vbasedev->migration);
> +        vbasedev->migration = NULL;
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
> +/* ---------------------------------------------------------------------- */
> +
> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
> +{
> +    struct vfio_region_info *info;
> +    Error *local_err = NULL;
> +    int ret;
> +
> +    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
> +                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
> +    if (ret) {
> +        goto add_blocker;
> +    }
> +
> +    ret = vfio_migration_init(vbasedev, info);
> +    if (ret) {
> +        goto add_blocker;
> +    }
> +
> +    trace_vfio_migration_probe(vbasedev->name, info->index);
> +    return 0;
> +
> +add_blocker:
> +    error_setg(&vbasedev->migration_blocker,
> +               "VFIO device doesn't support migration");
> +    ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        error_free(vbasedev->migration_blocker);
> +    }
> +    return ret;
> +}
> +
> +void vfio_migration_finalize(VFIODevice *vbasedev)
> +{
> +    if (vbasedev->migration_blocker) {
> +        migrate_del_blocker(vbasedev->migration_blocker);
> +        error_free(vbasedev->migration_blocker);
> +    }
> +
> +    vfio_migration_region_exit(vbasedev);
> +    g_free(vbasedev->migration);
> +}
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 8cdc27946cb8..191a726a1312 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -143,3 +143,6 @@ vfio_display_edid_link_up(void) ""
>  vfio_display_edid_link_down(void) ""
>  vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
>  vfio_display_edid_write_error(void) ""
> +
> +# migration.c
> +vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index d69a7f3ae31e..d4b268641173 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -57,6 +57,10 @@ typedef struct VFIORegion {
>      uint8_t nr; /* cache the region number for debug */
>  } VFIORegion;
>  
> +typedef struct VFIOMigration {
> +    VFIORegion region;
> +} VFIOMigration;
> +
>  typedef struct VFIOAddressSpace {
>      AddressSpace *as;
>      QLIST_HEAD(, VFIOContainer) containers;
> @@ -113,6 +117,8 @@ typedef struct VFIODevice {
>      unsigned int num_irqs;
>      unsigned int num_regions;
>      unsigned int flags;
> +    VFIOMigration *migration;
> +    Error *migration_blocker;
>  } VFIODevice;
>  
>  struct VFIODeviceOps {
> @@ -204,4 +210,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
>  int vfio_spapr_remove_window(VFIOContainer *container,
>                               hwaddr offset_within_address_space);
>  
> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
> +void vfio_migration_finalize(VFIODevice *vbasedev);
> +
>  #endif /* HW_VFIO_VFIO_COMMON_H */
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v16 QEMU 13/16] vfio: Add function to start and stop dirty pages tracking
  2020-03-24 21:09 ` [PATCH v16 QEMU 13/16] vfio: Add function to start and stop dirty pages tracking Kirti Wankhede
@ 2020-03-26 19:10   ` Alex Williamson
  2020-05-04 23:20     ` Kirti Wankhede
  2020-04-01 19:03   ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 74+ messages in thread
From: Alex Williamson @ 2020-03-26 19:10 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Wed, 25 Mar 2020 02:39:11 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Call VFIO_IOMMU_DIRTY_PAGES ioctl to start and stop dirty pages tracking
> for VFIO devices.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> ---
>  hw/vfio/migration.c | 36 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index ab295d25620e..1827b7cfb316 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -9,6 +9,7 @@
>  
>  #include "qemu/osdep.h"
>  #include "qemu/main-loop.h"
> +#include <sys/ioctl.h>
>  #include <linux/vfio.h>
>  
>  #include "sysemu/runstate.h"
> @@ -296,6 +297,32 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>      return qemu_file_get_error(f);
>  }
>  
> +static int vfio_start_dirty_page_tracking(VFIODevice *vbasedev, bool start)
> +{
> +    int ret;
> +    VFIOContainer *container = vbasedev->group->container;
> +    struct vfio_iommu_type1_dirty_bitmap dirty = {
> +        .argsz = sizeof(dirty),
> +    };
> +
> +    if (start) {
> +        if (vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) {
> +            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
> +        } else {
> +            return 0;
> +        }
> +    } else {
> +        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
> +    }

Dirty logging and device saving are logically separate, why do we link
them here?

Why do we return success when we want to start logging if we haven't
started logging?
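
One possible reading of the two questions, as a sketch only: the start/stop
helper ignores the device state entirely and returns success only when tracking
really is in the requested state. The struct, the callback and the two example
backends are hypothetical; set_tracking merely models the
VFIO_IOMMU_DIRTY_PAGES ioctl and is assumed to return 0 or a negative errno.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the container and its dirty-tracking ioctl. */
struct tracker {
    bool active;                 /* is dirty tracking currently on? */
    int (*set_tracking)(bool on);
};

/* Example backends for illustration: one succeeds, one always fails. */
static int backend_ok(bool on)
{
    (void)on;
    return 0;
}

static int backend_fail(bool on)
{
    (void)on;
    return -ENOTTY;
}

/*
 * Start or stop dirty tracking without looking at the device state.
 * A return of 0 means tracking is in the requested state; any failure
 * from the backend is passed to the caller and the state is unchanged.
 */
static int dirty_tracking_set(struct tracker *t, bool on)
{
    int ret;

    if (t->active == on) {
        return 0;                /* already in the requested state */
    }
    ret = t->set_tracking(on);
    if (ret) {
        return ret;
    }
    t->active = on;
    return 0;
}
```

With this shape, a caller that asked for logging never sees success unless
logging is on, and whether the device is saving becomes the caller's decision.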

> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
> +    if (ret) {
> +        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
> +                     dirty.flags, errno);
> +    }
> +    return ret;
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -330,6 +357,11 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>       */
>      qemu_put_be64(f, migration->region.size);
>  
> +    ret = vfio_start_dirty_page_tracking(vbasedev, true);
> +    if (ret) {
> +        return ret;
> +    }
> +

Haven't we corrupted the migration stream by exiting here?  Maybe this
implies the entire migration fails, therefore we don't need to add the
end marker?  Thanks,

Alex

>      qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>  
>      ret = qemu_file_get_error(f);
> @@ -346,6 +378,8 @@ static void vfio_save_cleanup(void *opaque)
>      VFIODevice *vbasedev = opaque;
>      VFIOMigration *migration = vbasedev->migration;
>  
> +    vfio_start_dirty_page_tracking(vbasedev, false);
> +
>      if (migration->region.mmaps) {
>          vfio_region_unmap(&migration->region);
>      }
> @@ -669,6 +703,8 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>          if (ret) {
>              error_report("%s: Failed to set state RUNNING", vbasedev->name);
>          }
> +
> +        vfio_start_dirty_page_tracking(vbasedev, false);
>      }
>  }
>  




* Re: [PATCH v16 QEMU 14/16] vfio: Add vfio_listener_log_sync to mark dirty pages
  2020-03-24 21:09 ` [PATCH v16 QEMU 14/16] vfio: Add vfio_listener_log_sync to mark dirty pages Kirti Wankhede
  2020-03-25  2:19   ` Yan Zhao
@ 2020-03-26 19:46   ` Alex Williamson
  2020-04-01 19:08     ` Dr. David Alan Gilbert
  2020-04-01  5:50   ` Yan Zhao
  2 siblings, 1 reply; 74+ messages in thread
From: Alex Williamson @ 2020-03-26 19:46 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Wed, 25 Mar 2020 02:39:12 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> vfio_listener_log_sync gets the list of dirty pages from the container using
> the VFIO_IOMMU_GET_DIRTY_BITMAP ioctl and marks those pages dirty when all
> devices are stopped and saving state.
> Return early for the RAM block section of a mapped MMIO region.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/common.c     | 200 +++++++++++++++++++++++++++++++++++++++++++++++++--
>  hw/vfio/trace-events |   1 +
>  2 files changed, 196 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 4a2f0d6a2233..6d41e1ac5c2f 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -29,6 +29,7 @@
>  #include "hw/vfio/vfio.h"
>  #include "exec/address-spaces.h"
>  #include "exec/memory.h"
> +#include "exec/ram_addr.h"
>  #include "hw/hw.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
> @@ -38,6 +39,7 @@
>  #include "sysemu/reset.h"
>  #include "trace.h"
>  #include "qapi/error.h"
> +#include "migration/migration.h"
>  
>  VFIOGroupList vfio_group_list =
>      QLIST_HEAD_INITIALIZER(vfio_group_list);
> @@ -288,6 +290,28 @@ const MemoryRegionOps vfio_region_ops = {
>  };
>  
>  /*
> + * Device state interfaces
> + */
> +
> +static bool vfio_devices_are_stopped_and_saving(void)
> +{
> +    VFIOGroup *group;
> +    VFIODevice *vbasedev;
> +
> +    QLIST_FOREACH(group, &vfio_group_list, next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            if ((vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) &&
> +                !(vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
> +                continue;
> +            } else {
> +                return false;
> +            }
> +        }
> +    }
> +    return true;
> +}
> +
> +/*
>   * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
>   */
>  static int vfio_dma_unmap(VFIOContainer *container,
> @@ -408,8 +432,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>  }
>  
>  /* Called with rcu_read_lock held.  */
> -static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
> -                           bool *read_only)
> +static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> +                               ram_addr_t *ram_addr, bool *read_only)
>  {
>      MemoryRegion *mr;
>      hwaddr xlat;
> @@ -440,9 +464,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
>          return false;
>      }
>  
> -    *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> -    *read_only = !writable || mr->readonly;
> +    if (vaddr) {
> +        *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> +    }
>  
> +    if (ram_addr) {
> +        *ram_addr = memory_region_get_ram_addr(mr) + xlat;
> +    }
> +
> +    if (read_only) {
> +        *read_only = !writable || mr->readonly;
> +    }
>      return true;
>  }
>  
> @@ -467,7 +499,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>      rcu_read_lock();
>  
>      if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> -        if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
> +        if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) {
>              goto out;
>          }
>          /*
> @@ -813,9 +845,167 @@ static void vfio_listener_region_del(MemoryListener *listener,
>      }
>  }
>  
> +static int vfio_get_dirty_bitmap(MemoryListener *listener,
> +                                 MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> +    VFIOGuestIOMMU *giommu;
> +    IOMMUTLBEntry iotlb;
> +    hwaddr granularity, address_limit, iova;
> +    int ret;
> +
> +    if (memory_region_is_iommu(section->mr)) {
> +        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> +            if (MEMORY_REGION(giommu->iommu) == section->mr &&
> +                giommu->n.start == section->offset_within_region) {
> +                break;
> +            }
> +        }
> +
> +        if (!giommu) {
> +            return -EINVAL;
> +        }
> +    }
> +
> +    if (memory_region_is_iommu(section->mr)) {
> +        granularity = memory_region_iommu_get_min_page_size(giommu->iommu);
> +
> +        address_limit = MIN(int128_get64(section->size),
> +                            memory_region_iommu_get_address_limit(giommu->iommu,
> +                                                 int128_get64(section->size)));
> +    } else {
> +        granularity = memory_region_size(section->mr);
> +        address_limit = int128_get64(section->size);
> +    }
> +
> +    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> +
> +    RCU_READ_LOCK_GUARD();
> +
> +    while (iova < address_limit) {
> +        struct vfio_iommu_type1_dirty_bitmap *dbitmap;
> +        struct vfio_iommu_type1_dirty_bitmap_get *range;
> +        ram_addr_t start, pages;
> +        uint64_t iova_xlat, size;
> +
> +        if (memory_region_is_iommu(section->mr)) {
> +            iotlb = address_space_get_iotlb_entry(container->space->as, iova,
> +                                                 true, MEMTXATTRS_UNSPECIFIED);
> +            if ((iotlb.target_as == NULL) || (iotlb.addr_mask == 0)) {
> +                if ((iova + granularity) < iova) {
> +                    break;
> +                }
> +                iova += granularity;
> +                continue;
> +            }
> +            iova_xlat = iotlb.iova + giommu->iommu_offset;
> +            size = iotlb.addr_mask + 1;
> +        } else {
> +            iova_xlat = iova;
> +            size = address_limit;
> +        }
> +
> +        dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
> +        if (!dbitmap) {

AIUI, QEMU aborts if this fails, so no need to check.  Because reasons.

> +            return -ENOMEM;
> +        }
> +
> +        dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
> +        dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> +        range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data;
> +        range->iova = iova_xlat;
> +        range->size = size;
> +
> +        /*
> +         * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
> +         * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap's pgsize to
> +         * TARGET_PAGE_SIZE.
> +         */
> +        range->bitmap.pgsize = TARGET_PAGE_SIZE;
> +
> +        /*
> +         * Comment from kvm_physical_sync_dirty_bitmap() since same applies here
> +         * XXX bad kernel interface alert
> +         * For dirty bitmap, kernel allocates array of size aligned to
> +         * bits-per-long.  But for case when the kernel is 64bits and
> +         * the userspace is 32bits, userspace can't align to the same
> +         * bits-per-long, since sizeof(long) is different between kernel
> +         * and user space.  This way, userspace will provide buffer which
> +         * may be 4 bytes less than the kernel will use, resulting in
> +         * userspace memory corruption (which is not detectable by valgrind
> +         * too, in most cases).
> +         * So for now, let's align to 64 instead of HOST_LONG_BITS here, in
> +         * a hope that sizeof(long) won't become >8 any time soon.
> +         */

This seems like the problem we've avoided by defining our bitmap as an
array of u64s rather than an array of longs.  Does this comment really
still apply?

> +
> +        pages = TARGET_PAGE_ALIGN(range->size) >> TARGET_PAGE_BITS;
> +        range->bitmap.size = ROUND_UP(pages, 64) / 8;

ROUND_UP(npages/8, sizeof(u64))?
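
A minimal sketch comparing the two size computations, assuming a local
ROUND_UP macro that behaves like QEMU's (round n up to a multiple of d).
If npages/8 is read as C integer division, any partial byte is dropped
before rounding, so the byte count would need to be rounded up first.
The three helper names are illustrative only and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for QEMU's ROUND_UP(); d must be non-zero. */
#define ROUND_UP(n, d) ((((n) + (d) - 1) / (d)) * (d))

/* As in the patch: round the bit count up to 64, then convert to bytes. */
uint64_t bitmap_bytes_patch(uint64_t pages)
{
    return ROUND_UP(pages, 64) / 8;
}

/* Suggested form taken literally: pages / 8 truncates partial bytes. */
uint64_t bitmap_bytes_truncating(uint64_t pages)
{
    return ROUND_UP(pages / 8, sizeof(uint64_t));
}

/* Suggested form with the byte count rounded up before aligning to u64. */
uint64_t bitmap_bytes_ceil(uint64_t pages)
{
    return ROUND_UP((pages + 7) / 8, sizeof(uint64_t));
}

/* Example: 65 pages need 9 bytes of bits, i.e. two u64 words. */
void bitmap_size_example(void)
{
    assert(bitmap_bytes_patch(65) == 16);
    assert(bitmap_bytes_truncating(65) == 8);   /* one word short */
    assert(bitmap_bytes_ceil(65) == 16);
}
```

For page counts that are a multiple of 64 all three forms agree, so the
difference only shows up for ranges that do not fill a whole u64 word.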

> +        range->bitmap.data = g_malloc0(range->bitmap.size);

We don't require this to be pre-zero'd currently.

> +        if (range->bitmap.data == NULL) {

Same as above.  Seems strange to me too.

> +            error_report("Error allocating bitmap of size 0x%llx",
> +                         range->bitmap.size);
> +            ret = -ENOMEM;
> +            goto err_out;
> +        }
> +
> +        ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
> +        if (ret) {
> +            error_report("Failed to get dirty bitmap for iova: 0x%llx "
> +                         "size: 0x%llx err: %d",
> +                         range->iova, range->size, errno);
> +            goto err_out;
> +        }
> +
> +        if (memory_region_is_iommu(section->mr)) {
> +            if (!vfio_get_xlat_addr(&iotlb, NULL, &start, NULL)) {
> +                ret = -EINVAL;
> +                goto err_out;
> +            }
> +        } else {
> +            start = memory_region_get_ram_addr(section->mr) +
> +                    section->offset_within_region + iova -
> +                    TARGET_PAGE_ALIGN(section->offset_within_address_space);
> +        }
> +
> +        cpu_physical_memory_set_dirty_lebitmap((uint64_t *)range->bitmap.data,
> +                                               start, pages);
> +
> +        trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
> +                                    range->bitmap.size, start);
> +err_out:
> +        g_free(range->bitmap.data);
> +        g_free(dbitmap);
> +
> +        if (ret) {
> +            return ret;
> +        }
> +
> +        if ((iova + size) < iova) {
> +            break;
> +        }
> +
> +        iova += size;
> +    }
> +
> +    return 0;
> +}
> +
> +static void vfio_listerner_log_sync(MemoryListener *listener,
> +        MemoryRegionSection *section)
> +{
> +    if (vfio_listener_skipped_section(section)) {
> +        return;
> +    }
> +
> +    if (vfio_devices_are_stopped_and_saving()) {
> +        vfio_get_dirty_bitmap(listener, section);
> +    }
> +}
> +
>  static const MemoryListener vfio_memory_listener = {
>      .region_add = vfio_listener_region_add,
>      .region_del = vfio_listener_region_del,
> +    .log_sync = vfio_listerner_log_sync,
>  };
>  
>  static void vfio_listener_release(VFIOContainer *container)
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index ac065b559f4e..bc8f35ee9356 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -160,3 +160,4 @@ vfio_save_complete_precopy(char *name) " (%s)"
>  vfio_load_device_config_state(char *name) " (%s)"
>  vfio_load_state(char *name, uint64_t data) " (%s) data 0x%"PRIx64
>  vfio_load_state_device_data(char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
> +vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64




* Re: [PATCH v16 QEMU 00/16] Add migration support for VFIO devices
  2020-03-24 21:08 [PATCH v16 QEMU 00/16] Add migration support for VFIO devices Kirti Wankhede
                   ` (16 preceding siblings ...)
  2020-03-24 23:36 ` [PATCH v16 QEMU 00/16] Add migration support for VFIO devices no-reply
@ 2020-03-31 18:34 ` Alex Williamson
  2020-04-01  6:41   ` Yan Zhao
  17 siblings, 1 reply; 74+ messages in thread
From: Alex Williamson @ 2020-03-31 18:34 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Wed, 25 Mar 2020 02:38:58 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Hi,
> 
> This Patch set adds migration support for VFIO devices in QEMU.

Hi Kirti,

Do you have any migration data you can share to show that this solution
is viable and useful?  I was chatting with Dave Gilbert and there still
seems to be a concern that we actually have a real-world practical
solution.  We know this is inefficient with QEMU today, vendor pinned
memory will get copied multiple times if we're lucky.  If we're not
lucky we may be copying all of guest RAM repeatedly.  There are known
inefficiencies with vIOMMU, etc.  QEMU could learn new heuristics to
account for some of this and we could potentially report different
bitmaps in different phases through vfio, but let's make sure that
there are useful cases enabled by this first implementation.

With a reasonably sized VM, running a reasonable graphics demo or
workload, can we achieve reasonably live migration?  What kind of
downtime do we achieve and what is the working set size of the pinned
memory?  Intel folks, if you've been able to port to this or similar
code base, please report your results as well, open source consumers
are arguably even more important.  Thanks,

Alex




* Re: [PATCH v16 QEMU 14/16] vfio: Add vfio_listener_log_sync to mark dirty pages
  2020-03-24 21:09 ` [PATCH v16 QEMU 14/16] vfio: Add vfio_listener_log_sync to mark dirty pages Kirti Wankhede
  2020-03-25  2:19   ` Yan Zhao
  2020-03-26 19:46   ` Alex Williamson
@ 2020-04-01  5:50   ` Yan Zhao
  2020-04-03 20:11     ` Kirti Wankhede
  2 siblings, 1 reply; 74+ messages in thread
From: Yan Zhao @ 2020-04-01  5:50 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Wed, Mar 25, 2020 at 05:09:12AM +0800, Kirti Wankhede wrote:
> vfio_listener_log_sync gets list of dirty pages from container using
> VFIO_IOMMU_GET_DIRTY_BITMAP ioctl and mark those pages dirty when all
> devices are stopped and saving state.
> Return early for the RAM block section of mapped MMIO region.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/common.c     | 200 +++++++++++++++++++++++++++++++++++++++++++++++++--
>  hw/vfio/trace-events |   1 +
>  2 files changed, 196 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 4a2f0d6a2233..6d41e1ac5c2f 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -29,6 +29,7 @@
>  #include "hw/vfio/vfio.h"
>  #include "exec/address-spaces.h"
>  #include "exec/memory.h"
> +#include "exec/ram_addr.h"
>  #include "hw/hw.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
> @@ -38,6 +39,7 @@
>  #include "sysemu/reset.h"
>  #include "trace.h"
>  #include "qapi/error.h"
> +#include "migration/migration.h"
>  
>  VFIOGroupList vfio_group_list =
>      QLIST_HEAD_INITIALIZER(vfio_group_list);
> @@ -288,6 +290,28 @@ const MemoryRegionOps vfio_region_ops = {
>  };
>  
>  /*
> + * Device state interfaces
> + */
> +
> +static bool vfio_devices_are_stopped_and_saving(void)
> +{
> +    VFIOGroup *group;
> +    VFIODevice *vbasedev;
> +
> +    QLIST_FOREACH(group, &vfio_group_list, next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            if ((vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) &&
> +                !(vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
> +                continue;
> +            } else {
> +                return false;
> +            }
> +        }
> +    }
> +    return true;
> +}
> +
> +/*
>   * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
>   */
>  static int vfio_dma_unmap(VFIOContainer *container,
> @@ -408,8 +432,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>  }
>  
>  /* Called with rcu_read_lock held.  */
> -static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
> -                           bool *read_only)
> +static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> +                               ram_addr_t *ram_addr, bool *read_only)
>  {
>      MemoryRegion *mr;
>      hwaddr xlat;
> @@ -440,9 +464,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
>          return false;
>      }
>  
> -    *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> -    *read_only = !writable || mr->readonly;
> +    if (vaddr) {
> +        *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> +    }
>  
> +    if (ram_addr) {
> +        *ram_addr = memory_region_get_ram_addr(mr) + xlat;
> +    }
> +
> +    if (read_only) {
> +        *read_only = !writable || mr->readonly;
> +    }
>      return true;
>  }
>  
> @@ -467,7 +499,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>      rcu_read_lock();
>  
>      if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> -        if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
> +        if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) {
>              goto out;
>          }
>          /*
> @@ -813,9 +845,167 @@ static void vfio_listener_region_del(MemoryListener *listener,
>      }
>  }
>  
> +static int vfio_get_dirty_bitmap(MemoryListener *listener,
> +                                 MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> +    VFIOGuestIOMMU *giommu;
> +    IOMMUTLBEntry iotlb;
> +    hwaddr granularity, address_limit, iova;
> +    int ret;
> +
> +    if (memory_region_is_iommu(section->mr)) {
> +        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> +            if (MEMORY_REGION(giommu->iommu) == section->mr &&
> +                giommu->n.start == section->offset_within_region) {
> +                break;
> +            }
> +        }
> +
> +        if (!giommu) {
> +            return -EINVAL;
> +        }
> +    }
> +
> +    if (memory_region_is_iommu(section->mr)) {
> +        granularity = memory_region_iommu_get_min_page_size(giommu->iommu);
> +
> +        address_limit = MIN(int128_get64(section->size),
> +                            memory_region_iommu_get_address_limit(giommu->iommu,
> +                                                 int128_get64(section->size)));
> +    } else {
> +        granularity = memory_region_size(section->mr);
> +        address_limit = int128_get64(section->size);
> +    }
> +
> +    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> +
> +    RCU_READ_LOCK_GUARD();
> +

The loop condition iova < address_limit is not right, for the reason below.
When vIOMMU is NOT on,
iova is section->offset_within_address_space and
address_limit is section->size,
but iova has no reason to be less than address_limit.

For example, when the VM memory size is larger than 3G (e.g. 4G), the
memory region section covering (0x100000000-0x13fffffff) has
iova 0x100000000 and address_limit 0x40000000.
As iova is not less than address_limit, the dirty page query for memory
3G-4G is skipped.
Therefore dirty pages in 3G-4G will be lost.
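
The arithmetic can be checked in isolation.  A minimal sketch, assuming
the non-vIOMMU path and reducing the loop to its bound check:
chunks_with_size_bound() models the condition used in the patch, and
chunks_with_end_bound() models one possible fix that compares against an
end address (start + size).  Both helper names and the chunk parameter
are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bound as in the patch: iova is compared against the section size. */
uint64_t chunks_with_size_bound(uint64_t start, uint64_t size, uint64_t chunk)
{
    uint64_t iova = start;
    uint64_t queries = 0;

    assert(chunk > 0);
    while (iova < size) {
        queries++;                  /* one dirty bitmap query per chunk */
        if (iova + chunk < iova) {  /* stop on wrap-around */
            break;
        }
        iova += chunk;
    }
    return queries;
}

/* Possible fix: compare against the end address of the section. */
uint64_t chunks_with_end_bound(uint64_t start, uint64_t size, uint64_t chunk)
{
    uint64_t end = start + size;    /* assumes start + size does not wrap */
    uint64_t iova = start;
    uint64_t queries = 0;

    assert(chunk > 0);
    while (iova < end) {
        queries++;
        if (iova + chunk < iova) {
            break;
        }
        iova += chunk;
    }
    return queries;
}

/* Example: the section at 0x100000000 with size 0x40000000 from above. */
void loop_bound_example(void)
{
    assert(chunks_with_size_bound(0x100000000ULL, 0x40000000ULL,
                                  0x40000000ULL) == 0);
    assert(chunks_with_end_bound(0x100000000ULL, 0x40000000ULL,
                                 0x40000000ULL) == 1);
}
```

For a section that starts at address 0 both bounds give the same result,
which is why the problem only shows up for sections above the first one.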


> +    while (iova < address_limit) {
> +        struct vfio_iommu_type1_dirty_bitmap *dbitmap;
> +        struct vfio_iommu_type1_dirty_bitmap_get *range;
> +        ram_addr_t start, pages;
> +        uint64_t iova_xlat, size;
> +
> +        if (memory_region_is_iommu(section->mr)) {
> +            iotlb = address_space_get_iotlb_entry(container->space->as, iova,
> +                                                 true, MEMTXATTRS_UNSPECIFIED);
> +            if ((iotlb.target_as == NULL) || (iotlb.addr_mask == 0)) {
> +                if ((iova + granularity) < iova) {
> +                    break;
> +                }
> +                iova += granularity;
> +                continue;
> +            }
> +            iova_xlat = iotlb.iova + giommu->iommu_offset;
> +            size = iotlb.addr_mask + 1;
> +        } else {
> +            iova_xlat = iova;
> +            size = address_limit;
> +        }
> +
> +        dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
> +        if (!dbitmap) {
> +            return -ENOMEM;
> +        }
> +
> +        dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
> +        dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> +        range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data;
> +        range->iova = iova_xlat;
> +        range->size = size;
> +
> +        /*
> +         * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
> +         * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap's pgsize to
> +         * TARGET_PAGE_SIZE.
> +         */
> +        range->bitmap.pgsize = TARGET_PAGE_SIZE;
> +
> +        /*
> +         * Comment from kvm_physical_sync_dirty_bitmap() since same applies here
> +         * XXX bad kernel interface alert
> +         * For dirty bitmap, kernel allocates array of size aligned to
> +         * bits-per-long.  But for case when the kernel is 64bits and
> +         * the userspace is 32bits, userspace can't align to the same
> +         * bits-per-long, since sizeof(long) is different between kernel
> +         * and user space.  This way, userspace will provide buffer which
> +         * may be 4 bytes less than the kernel will use, resulting in
> +         * userspace memory corruption (which is not detectable by valgrind
> +         * too, in most cases).
> +         * So for now, let's align to 64 instead of HOST_LONG_BITS here, in
> +         * a hope that sizeof(long) won't become >8 any time soon.
> +         */
> +
> +        pages = TARGET_PAGE_ALIGN(range->size) >> TARGET_PAGE_BITS;
> +        range->bitmap.size = ROUND_UP(pages, 64) / 8;
> +        range->bitmap.data = g_malloc0(range->bitmap.size);
> +        if (range->bitmap.data == NULL) {
> +            error_report("Error allocating bitmap of size 0x%llx",
> +                         range->bitmap.size);
> +            ret = -ENOMEM;
> +            goto err_out;
> +        }
> +
> +        ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
> +        if (ret) {
> +            error_report("Failed to get dirty bitmap for iova: 0x%llx "
> +                         "size: 0x%llx err: %d",
> +                         range->iova, range->size, errno);
> +            goto err_out;
> +        }
> +
> +        if (memory_region_is_iommu(section->mr)) {
> +            if (!vfio_get_xlat_addr(&iotlb, NULL, &start, NULL)) {
> +                ret = -EINVAL;
> +                goto err_out;
> +            }
> +        } else {
> +            start = memory_region_get_ram_addr(section->mr) +
> +                    section->offset_within_region + iova -
> +                    TARGET_PAGE_ALIGN(section->offset_within_address_space);
> +        }
> +
> +        cpu_physical_memory_set_dirty_lebitmap((uint64_t *)range->bitmap.data,
> +                                               start, pages);
> +
> +        trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
> +                                    range->bitmap.size, start);
> +err_out:
> +        g_free(range->bitmap.data);
> +        g_free(dbitmap);
> +
> +        if (ret) {
> +            return ret;
> +        }
> +
> +        if ((iova + size) < iova) {
> +            break;
> +        }
> +
> +        iova += size;
> +    }
> +
> +    return 0;
> +}
> +
> +static void vfio_listerner_log_sync(MemoryListener *listener,
> +        MemoryRegionSection *section)
> +{
> +    if (vfio_listener_skipped_section(section)) {
> +        return;
> +    }
> +
> +    if (vfio_devices_are_stopped_and_saving()) {
> +        vfio_get_dirty_bitmap(listener, section);
> +    }
> +}
> +
>  static const MemoryListener vfio_memory_listener = {
>      .region_add = vfio_listener_region_add,
>      .region_del = vfio_listener_region_del,
> +    .log_sync = vfio_listerner_log_sync,
>  };
>  
>  static void vfio_listener_release(VFIOContainer *container)
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index ac065b559f4e..bc8f35ee9356 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -160,3 +160,4 @@ vfio_save_complete_precopy(char *name) " (%s)"
>  vfio_load_device_config_state(char *name) " (%s)"
>  vfio_load_state(char *name, uint64_t data) " (%s) data 0x%"PRIx64
>  vfio_load_state_device_data(char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
> +vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
> -- 
> 2.7.0
> 



* Re: [PATCH v16 QEMU 00/16] Add migration support for VFIO devices
  2020-03-31 18:34 ` Alex Williamson
@ 2020-04-01  6:41   ` Yan Zhao
  2020-04-01 18:34     ` Alex Williamson
  0 siblings, 1 reply; 74+ messages in thread
From: Yan Zhao @ 2020-04-01  6:41 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Wed, Apr 01, 2020 at 02:34:24AM +0800, Alex Williamson wrote:
> On Wed, 25 Mar 2020 02:38:58 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > Hi,
> > 
> > This Patch set adds migration support for VFIO devices in QEMU.
> 
> Hi Kirti,
> 
> Do you have any migration data you can share to show that this solution
> is viable and useful?  I was chatting with Dave Gilbert and there still
> seems to be a concern that we actually have a real-world practical
> solution.  We know this is inefficient with QEMU today, vendor pinned
> memory will get copied multiple times if we're lucky.  If we're not
> lucky we may be copying all of guest RAM repeatedly.  There are known
> inefficiencies with vIOMMU, etc.  QEMU could learn new heuristics to
> account for some of this and we could potentially report different
> bitmaps in different phases through vfio, but let's make sure that
> there are useful cases enabled by this first implementation.
> 
> With a reasonably sized VM, running a reasonable graphics demo or
> workload, can we achieve reasonably live migration?  What kind of
> downtime do we achieve and what is the working set size of the pinned
> memory?  Intel folks, if you've been able to port to this or similar
> code base, please report your results as well, open source consumers
> are arguably even more important.  Thanks,
> 
Hi Alex,
we're in the process of porting to this code, and it is now able to
migrate successfully when there are no dirty pages.

When there are dirty pages, we hit several issues.
One of them is reported here
(https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg00004.html):
dirty pages for some regions cannot be collected correctly,
especially for the memory range from 3G to 4G.

Even without this bug, QEMU still gets stuck partway, before
reaching the stop-and-copy phase, and cannot be killed by the admin.
We are still debugging this problem.

Thanks
Yan




* Re: [PATCH v16 QEMU 07/16] vfio: Add migration state change notifier
  2020-03-24 21:09 ` [PATCH v16 QEMU 07/16] vfio: Add migration state change notifier Kirti Wankhede
@ 2020-04-01 11:27   ` Dr. David Alan Gilbert
  2020-05-04 23:20     ` Kirti Wankhede
  0 siblings, 1 reply; 74+ messages in thread
From: Dr. David Alan Gilbert @ 2020-04-01 11:27 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> Added migration state change notifier to get notification on migration state
> change. These states are translated to VFIO device state and conveyed to vendor
> driver.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c           | 29 +++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |  1 +
>  include/hw/vfio/vfio-common.h |  1 +
>  3 files changed, 31 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index af9443c275fb..22ded9d28cf3 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -154,6 +154,27 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state)
>      }
>  }
>  
> +static void vfio_migration_state_notifier(Notifier *notifier, void *data)
> +{
> +    MigrationState *s = data;
> +    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
> +    int ret;
> +
> +    trace_vfio_migration_state_notifier(vbasedev->name, s->state);

You might want to use MigrationStatus_str(s->state) to make that
readable.

> +    switch (s->state) {
> +    case MIGRATION_STATUS_CANCELLING:
> +    case MIGRATION_STATUS_CANCELLED:
> +    case MIGRATION_STATUS_FAILED:
> +        ret = vfio_migration_set_state(vbasedev,
> +                      ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING),
> +                      VFIO_DEVICE_STATE_RUNNING);
> +        if (ret) {
> +            error_report("%s: Failed to set state RUNNING", vbasedev->name);
> +        }

In the migration code we check to see if the VM was running prior to the
start of the migration before we start the CPUs going again (see
migration_iteration_finish):
    case MIGRATION_STATUS_FAILED:
    case MIGRATION_STATUS_CANCELLED:
    case MIGRATION_STATUS_CANCELLING:
        if (s->vm_was_running) {
            vm_start();
        } else {
            if (runstate_check(RUN_STATE_FINISH_MIGRATE)) {
                runstate_set(RUN_STATE_POSTMIGRATE);
            }

so if the guest was paused before a migration we don't falsely restart
it.  Maybe you need something similar?
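
One way to carry that idea over, sketched under assumptions: the flag
values below are assumed to mirror the VFIO_DEVICE_STATE_* bits from the
proposed kernel header, vm_was_running stands for the same information
the migration core keeps in s->vm_was_running, and state_after_cancel()
is a hypothetical helper, not something the patch provides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match the VFIO_DEVICE_STATE_* bits in the proposed header. */
#define DEV_STATE_RUNNING  (1u << 0)
#define DEV_STATE_SAVING   (1u << 1)
#define DEV_STATE_RESUMING (1u << 2)

/*
 * Hypothetical helper: compute the device state after a failed or
 * cancelled migration.  _SAVING and _RESUMING are always cleared;
 * _RUNNING follows whether the VM was running before migration started.
 */
uint32_t state_after_cancel(uint32_t cur, bool vm_was_running)
{
    uint32_t next = cur & ~(DEV_STATE_SAVING | DEV_STATE_RESUMING);

    if (vm_was_running) {
        next |= DEV_STATE_RUNNING;
    } else {
        next &= ~DEV_STATE_RUNNING;
    }
    return next;
}

/* Example: device in stop-and-copy (_SAVING only) when cancel arrives. */
void state_after_cancel_example(void)
{
    assert(state_after_cancel(DEV_STATE_SAVING, true) == DEV_STATE_RUNNING);
    assert(state_after_cancel(DEV_STATE_SAVING, false) == 0);
}
```

With this shape, a guest that was paused before the migration keeps its
device stopped after a cancel instead of being set back to _RUNNING.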

Dave

> +    }
> +}
> +
>  static int vfio_migration_init(VFIODevice *vbasedev,
>                                 struct vfio_region_info *info)
>  {
> @@ -173,6 +194,9 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>      vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>                                                            vbasedev);
>  
> +    vbasedev->migration_state.notify = vfio_migration_state_notifier;
> +    add_migration_state_change_notifier(&vbasedev->migration_state);
> +
>      return 0;
>  }
>  
> @@ -211,6 +235,11 @@ add_blocker:
>  
>  void vfio_migration_finalize(VFIODevice *vbasedev)
>  {
> +
> +    if (vbasedev->migration_state.notify) {
> +        remove_migration_state_change_notifier(&vbasedev->migration_state);
> +    }
> +
>      if (vbasedev->vm_state) {
>          qemu_del_vm_change_state_handler(vbasedev->vm_state);
>      }
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 3d15bacd031a..69503228f20e 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -148,3 +148,4 @@ vfio_display_edid_write_error(void) ""
>  vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
>  vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
>  vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> +vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 3d18eb146b33..28f55f66d019 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -123,6 +123,7 @@ typedef struct VFIODevice {
>      VMChangeStateEntry *vm_state;
>      uint32_t device_state;
>      int vm_running;
> +    Notifier migration_state;
>  } VFIODevice;
>  
>  struct VFIODeviceOps {
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v16 QEMU 08/16] vfio: Register SaveVMHandlers for VFIO device
  2020-03-24 21:09 ` [PATCH v16 QEMU 08/16] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
  2020-03-25 21:02   ` Alex Williamson
@ 2020-04-01 17:36   ` Dr. David Alan Gilbert
  2020-05-04 23:20     ` Kirti Wankhede
  1 sibling, 1 reply; 74+ messages in thread
From: Dr. David Alan Gilbert @ 2020-04-01 17:36 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> Define flags to be used as delimeter in migration file stream.
> Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
> region from these functions at source during saving or pre-copy phase.
> Set VFIO device state depending on VM's state. During live migration, VM is
> running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
> device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c  | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events |  2 ++
>  2 files changed, 78 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 22ded9d28cf3..033f76526e49 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -8,6 +8,7 @@
>   */
>  
>  #include "qemu/osdep.h"
> +#include "qemu/main-loop.h"
>  #include <linux/vfio.h>
>  
>  #include "sysemu/runstate.h"
> @@ -24,6 +25,17 @@
>  #include "pci.h"
>  #include "trace.h"
>  
> +/*
> + * Flags used as delimiter:
> + * 0xffffffff => MSB 32-bit all 1s
> + * 0xef10     => emulated (virtual) function IO
> + * 0x0000     => 16-bits reserved for flags
> + */
> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> +
>  static void vfio_migration_region_exit(VFIODevice *vbasedev)
>  {
>      VFIOMigration *migration = vbasedev->migration;
> @@ -126,6 +138,69 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>      return 0;
>  }
>  
> +/* ---------------------------------------------------------------------- */
> +
> +static int vfio_save_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> +
> +    if (migration->region.mmaps) {
> +        qemu_mutex_lock_iothread();
> +        ret = vfio_region_mmap(&migration->region);
> +        qemu_mutex_unlock_iothread();
> +        if (ret) {
> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> +                         vbasedev->name, migration->region.index,
> +                         strerror(-ret));
> +            return ret;
> +        }
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, ~0, VFIO_DEVICE_STATE_SAVING);
> +    if (ret) {
> +        error_report("%s: Failed to set state SAVING", vbasedev->name);
> +        return ret;
> +    }
> +
> +    /*
> +     * Save migration region size. This is used to verify migration region size
> +     * is greater than or equal to migration region size at destination
> +     */
> +    qemu_put_be64(f, migration->region.size);
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);

OK, good, so now we can change that to something else if you want to
migrate something extra in the future.
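
For reference, a self-contained sketch of the setup section as this
function writes it: the SETUP_STATE flag, the region size, then the
END_OF_STATE marker, each as a big-endian 64-bit value.  The two flag
values are copied from the patch; put_be64(), get_be64() and the
reader/writer helpers are local stand-ins written for this example, not
QEMU functions, and the reader is only an assumption about how a
destination could validate the section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VFIO_MIG_FLAG_END_OF_STATE     (0xffffffffef100001ULL)
#define VFIO_MIG_FLAG_DEV_SETUP_STATE  (0xffffffffef100003ULL)
#define SETUP_SECTION_BYTES            24

/* Store a 64-bit value in big-endian byte order. */
void put_be64(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++) {
        p[i] = (uint8_t)(v >> (56 - 8 * i));
    }
}

/* Load a 64-bit value stored in big-endian byte order. */
uint64_t get_be64(const uint8_t *p)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++) {
        v = (v << 8) | p[i];
    }
    return v;
}

/* Write flag, region size and end marker; buf needs 24 bytes. */
void write_setup_section(uint8_t *buf, uint64_t region_size)
{
    put_be64(buf, VFIO_MIG_FLAG_DEV_SETUP_STATE);
    put_be64(buf + 8, region_size);
    put_be64(buf + 16, VFIO_MIG_FLAG_END_OF_STATE);
}

/* Returns 0 and stores the region size, or -1 on an unexpected marker. */
int read_setup_section(const uint8_t *buf, size_t len, uint64_t *region_size)
{
    if (len < SETUP_SECTION_BYTES ||
        get_be64(buf) != VFIO_MIG_FLAG_DEV_SETUP_STATE) {
        return -1;
    }
    *region_size = get_be64(buf + 8);
    /* A later format could place a different flag here to announce more data. */
    if (get_be64(buf + 16) != VFIO_MIG_FLAG_END_OF_STATE) {
        return -1;
    }
    return 0;
}

/* Example: round-trip a 1 MiB region size through the section. */
void setup_section_example(void)
{
    uint8_t buf[SETUP_SECTION_BYTES];
    uint64_t size = 0;

    write_setup_section(buf, 0x100000);
    assert(read_setup_section(buf, sizeof(buf), &size) == 0);
    assert(size == 0x100000);
}
```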

> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    trace_vfio_save_setup(vbasedev->name);

I'd put that trace at the start of the function.

> +    return 0;
> +}
> +
> +static void vfio_save_cleanup(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    if (migration->region.mmaps) {
> +        vfio_region_unmap(&migration->region);
> +    }
> +    trace_vfio_save_cleanup(vbasedev->name);
> +}
> +
> +static SaveVMHandlers savevm_vfio_handlers = {
> +    .save_setup = vfio_save_setup,
> +    .save_cleanup = vfio_save_cleanup,
> +};
> +
> +/* ---------------------------------------------------------------------- */
> +
>  static void vfio_vmstate_change(void *opaque, int running, RunState state)
>  {
>      VFIODevice *vbasedev = opaque;
> @@ -191,6 +266,7 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>          return ret;
>      }
>  
> +    register_savevm_live("vfio", -1, 1, &savevm_vfio_handlers, vbasedev);

That doesn't look right to me;  firstly the -1 should now be
VMSTATE_INSTANCE_ID_ANY - after the recent change in commit 1df2c9a

Have you tried this with two vfio devices?
This is quite rare - it's an iterative device that can have
multiple instances;  if you look at 'ram' for example, all the RAM
instances are handled inside the save_setup/save for the one instance of
'ram'.  I think here you're trying to register an individual vfio
device, so if you had multiple devices you'd see this called twice.

So either you need to make vfio_save_* do all of the devices in a loop -
which feels like a bad idea;  or replace "vfio" in that call by a unique
device name;  as long as your device has a bus path then you should be
able to use the same trick vmstate_register_with_alias_id does, and use,
I think, vmstate_if_get_id(VMSTATE_IF(vbasedev)).

but it might take some experimentation since this is an odd use.
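
A minimal sketch of the naming idea, assuming the bus path is available as
a plain string. build_vfio_idstr() is a hypothetical helper, not a QEMU
API, and bus_path stands in for whatever vmstate_if_get_id() would return
(it may be NULL for a device with no bus path):

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not a QEMU API): build a per-device section name
 * so that two vfio devices do not both register as plain "vfio".
 * Returns 0 on success, -1 if the name does not fit in out.
 */
static int build_vfio_idstr(char *out, size_t outlen, const char *bus_path)
{
    int n;

    if (bus_path && bus_path[0] != '\0') {
        /* Prefix with the bus path to make the name unique per device. */
        n = snprintf(out, outlen, "%s/vfio", bus_path);
    } else {
        /* No bus path: fall back to the shared name. */
        n = snprintf(out, outlen, "vfio");
    }
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

The resulting string would then be passed in place of the literal "vfio";
whether the section name buffer is large enough for every bus path is
something the caller still has to check.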

>      vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>                                                            vbasedev);
>  
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 69503228f20e..4bb43f18f315 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -149,3 +149,5 @@ vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
>  vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
>  vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>  vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
> +vfio_save_setup(char *name) " (%s)"
> +vfio_save_cleanup(char *name) " (%s)"
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v16 QEMU 00/16] Add migration support for VFIO devices
  2020-04-01  6:41   ` Yan Zhao
@ 2020-04-01 18:34     ` Alex Williamson
  0 siblings, 0 replies; 74+ messages in thread
From: Alex Williamson @ 2020-04-01 18:34 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Wed, 1 Apr 2020 02:41:54 -0400
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Wed, Apr 01, 2020 at 02:34:24AM +0800, Alex Williamson wrote:
> > On Wed, 25 Mar 2020 02:38:58 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> > > Hi,
> > > 
> > > This Patch set adds migration support for VFIO devices in QEMU.  
> > 
> > Hi Kirti,
> > 
> > Do you have any migration data you can share to show that this solution
> > is viable and useful?  I was chatting with Dave Gilbert and there still
> > seems to be a concern that we actually have a real-world practical
> > solution.  We know this is inefficient with QEMU today, vendor pinned
> > memory will get copied multiple times if we're lucky.  If we're not
> > lucky we may be copying all of guest RAM repeatedly.  There are known
> > inefficiencies with vIOMMU, etc.  QEMU could learn new heuristics to
> > account for some of this and we could potentially report different
> > bitmaps in different phases through vfio, but let's make sure that
> > there are useful cases enabled by this first implementation.
> > 
> > With a reasonably sized VM, running a reasonable graphics demo or
> > workload, can we achieve reasonably live migration?  What kind of
> > downtime do we achieve and what is the working set size of the pinned
> > memory?  Intel folks, if you've been able to port to this or similar
> > code base, please report your results as well, open source consumers
> > are arguably even more important.  Thanks,
> >   
> hi Alex
> we're in the process of porting to this code, and now it's able to
> migrate successfully without dirty pages.
> 
> when there're dirty pages, we met several issues.
> one of them is reported here
> (https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg00004.html).
> dirty pages for some regions are not able to be collected correctly,
> especially for memory range from 3G to 4G.
> 
> even without this bug, qemu still got stuck in middle before
> reaching stop-and-copy phase and cannot be killed by admin.
> still in debugging of this problem.

Thanks, Yan.  So it seems we have various bugs, known limitations, and
we haven't actually proven that this implementation provides a useful
feature, at least for the open source consumer.  This doesn't give me
much confidence to consider the kernel portion ready for v5.7 given how
late we are already :-\  Thanks,

Alex




* Re: [PATCH v16 QEMU 10/16] vfio: Add load state functions to SaveVMHandlers
  2020-03-24 21:09 ` [PATCH v16 QEMU 10/16] vfio: Add load " Kirti Wankhede
  2020-03-25 22:36   ` Alex Williamson
@ 2020-04-01 18:58   ` Dr. David Alan Gilbert
  2020-05-04 23:20     ` Kirti Wankhede
  1 sibling, 1 reply; 74+ messages in thread
From: Dr. David Alan Gilbert @ 2020-04-01 18:58 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> Sequence  during _RESUMING device state:
> While data for this device is available, repeat below steps:
> a. read data_offset from where user application should write data.
> b. write data of data_size to migration region from data_offset.
> c. write data_size which indicates vendor driver that data is written in
>    staging buffer.
> 
> For user, data is opaque. User should write data in the same order as
> received.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c  | 179 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events |   3 +
>  2 files changed, 182 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index ecbeed5182c2..ab295d25620e 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -269,6 +269,33 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
>      return qemu_file_get_error(f);
>  }
>  
> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    uint64_t data;
> +
> +    if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
> +        int ret;
> +
> +        ret = vbasedev->ops->vfio_load_config(vbasedev, f);
> +        if (ret) {
> +            error_report("%s: Failed to load device config space",
> +                         vbasedev->name);
> +            return ret;
> +        }
> +    }
> +
> +    data = qemu_get_be64(f);
> +    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
> +        error_report("%s: Failed loading device config space, "
> +                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
> +        return -EINVAL;
> +    }
> +
> +    trace_vfio_load_device_config_state(vbasedev->name);
> +    return qemu_file_get_error(f);
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -434,12 +461,164 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>      return ret;
>  }
>  
> +static int vfio_load_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret = 0;
> +
> +    if (migration->region.mmaps) {
> +        ret = vfio_region_mmap(&migration->region);
> +        if (ret) {
> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> +                         vbasedev->name, migration->region.nr,
> +                         strerror(-ret));
> +            return ret;
> +        }
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, ~0, VFIO_DEVICE_STATE_RESUMING);
> +    if (ret) {
> +        error_report("%s: Failed to set state RESUMING", vbasedev->name);
> +    }
> +    return ret;
> +}
> +
> +static int vfio_load_cleanup(void *opaque)
> +{
> +    vfio_save_cleanup(opaque);
> +    return 0;
> +}
> +
> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret = 0;
> +    uint64_t data, data_size;
> +
> +    data = qemu_get_be64(f);
> +    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
> +
> +        trace_vfio_load_state(vbasedev->name, data);
> +
> +        switch (data) {
> +        case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
> +        {
> +            ret = vfio_load_device_config_state(f, opaque);
> +            if (ret) {
> +                return ret;
> +            }
> +            break;
> +        }
> +        case VFIO_MIG_FLAG_DEV_SETUP_STATE:
> +        {
> +            uint64_t region_size = qemu_get_be64(f);
> +
> +            if (migration->region.size < region_size) {
> +                error_report("%s: SETUP STATE: migration region too small, "
> +                             "0x%"PRIx64 " < 0x%"PRIx64, vbasedev->name,
> +                             migration->region.size, region_size);
> +                return -EINVAL;
> +            }
> +
> +            data = qemu_get_be64(f);
> +            if (data == VFIO_MIG_FLAG_END_OF_STATE) {

Can you explain why you're reading this here rather than letting it drop
through to the read at the end of the loop?

> +                return ret;
> +            } else {
> +                error_report("%s: SETUP STATE: EOS not found 0x%"PRIx64,
> +                             vbasedev->name, data);
> +                return -EINVAL;
> +            }
> +            break;
> +        }
> +        case VFIO_MIG_FLAG_DEV_DATA_STATE:
> +        {
> +            VFIORegion *region = &migration->region;
> +            void *buf = NULL;
> +            bool buffer_mmaped = false;
> +            uint64_t data_offset = 0;
> +
> +            data_size = qemu_get_be64(f);
> +            if (data_size == 0) {
> +                break;
> +            }
> +
> +            ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> +                        region->fd_offset +
> +                        offsetof(struct vfio_device_migration_info,
> +                        data_offset));
> +            if (ret != sizeof(data_offset)) {
> +                error_report("%s:Failed to get migration buffer data offset %d",
> +                             vbasedev->name, ret);
> +                return -EINVAL;
> +            }
> +
> +            if (region->mmaps) {
> +                buf = find_data_region(region, data_offset, data_size);
> +            }
> +
> +            buffer_mmaped = (buf != NULL) ? true : false;
> +
> +            if (!buffer_mmaped) {
> +                buf = g_try_malloc0(data_size);

data_size has been read off the wire at this point; can we sanity check
it?
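
A minimal sketch of one possible check, assuming the migration region size
is the right upper bound (that choice is an assumption, not something the
patch states). vfio_data_range_ok() is a hypothetical name:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if [data_offset, data_offset + data_size)
 * lies entirely inside a migration region of region_size bytes.
 * Written so that data_offset + data_size is never computed and so
 * cannot overflow.
 */
static bool vfio_data_range_ok(uint64_t region_size, uint64_t data_offset,
                               uint64_t data_size)
{
    if (data_offset > region_size) {
        return false;
    }
    return data_size <= region_size - data_offset;
}
```

Calling something like this before the allocation would turn a corrupt or
hostile data_size into -EINVAL instead of a huge g_try_malloc0() request.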

> +                if (!buf) {
> +                    error_report("%s: Error allocating buffer ", __func__);
> +                    return -ENOMEM;
> +                }
> +            }
> +
> +            qemu_get_buffer(f, buf, data_size);
> +
> +            if (!buffer_mmaped) {
> +                ret = pwrite(vbasedev->fd, buf, data_size,
> +                             region->fd_offset + data_offset);
> +                g_free(buf);
> +
> +                if (ret != data_size) {
> +                    error_report("%s: Failed to set migration buffer %d",
> +                                 vbasedev->name, ret);
> +                    return -EINVAL;
> +                }
> +            }
> +
> +            ret = pwrite(vbasedev->fd, &data_size, sizeof(data_size),
> +                         region->fd_offset +
> +                       offsetof(struct vfio_device_migration_info, data_size));
> +            if (ret != sizeof(data_size)) {
> +                error_report("%s: Failed to set migration buffer data size %d",
> +                             vbasedev->name, ret);
> +                if (!buffer_mmaped) {
> +                    g_free(buf);
> +                }
> +                return -EINVAL;
> +            }
> +
> +            trace_vfio_load_state_device_data(vbasedev->name, data_offset,
> +                                              data_size);
> +            break;
> +        }

I'd add here a default:  that complains about an unknown tag.

> +        }
> +
> +        ret = qemu_file_get_error(f);
> +        if (ret) {
> +            return ret;
> +        }
> +        data = qemu_get_be64(f);

I'd also check qemu_file_get_error() again at this point; if you're unlucky
you get junk in 'data' and things get more confusing.
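
A toy sketch of the loop shape with both suggestions applied (a default:
case and an error check right after the tag read). Everything here is a
stand-in: struct toy_stream models QEMUFile only loosely, and the TAG_*
values are illustrative, not the VFIO_MIG_FLAG_* constants:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative tag values only. */
#define TAG_END   0xef10u
#define TAG_DATA  0xef13u

/* Toy stand-in for QEMUFile: an array of words plus a sticky error. */
struct toy_stream {
    const uint64_t *words;
    size_t count;
    size_t pos;
    int error;
};

/* Returns 0 and latches an error when the stream runs out. */
static uint64_t toy_get_u64(struct toy_stream *s)
{
    if (s->pos >= s->count) {
        s->error = -EIO;
        return 0;
    }
    return s->words[s->pos++];
}

static int toy_load_state(struct toy_stream *s)
{
    uint64_t tag = toy_get_u64(s);

    if (s->error) {
        return s->error;
    }
    while (tag != TAG_END) {
        switch (tag) {
        case TAG_DATA:
            (void)toy_get_u64(s);   /* payload word, ignored here */
            break;
        default:
            /* Unknown tag: refuse rather than silently skip it. */
            return -EINVAL;
        }
        if (s->error) {
            return s->error;
        }
        tag = toy_get_u64(s);
        /* Check again so a failed read is not mistaken for a tag. */
        if (s->error) {
            return s->error;
        }
    }
    return 0;
}
```

With the second check in place, a truncated stream reports the I/O error;
without it, the zero returned by the failed read would fall into default:
and be reported as an unknown tag.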

> +    }
> +
> +    return ret;
> +}
> +
>  static SaveVMHandlers savevm_vfio_handlers = {
>      .save_setup = vfio_save_setup,
>      .save_cleanup = vfio_save_cleanup,
>      .save_live_pending = vfio_save_pending,
>      .save_live_iterate = vfio_save_iterate,
>      .save_live_complete_precopy = vfio_save_complete_precopy,
> +    .load_setup = vfio_load_setup,
> +    .load_cleanup = vfio_load_cleanup,
> +    .load_state = vfio_load_state,
>  };
>  
>  /* ---------------------------------------------------------------------- */
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index bdf40ba368c7..ac065b559f4e 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -157,3 +157,6 @@ vfio_save_device_config_state(char *name) " (%s)"
>  vfio_save_pending(char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
>  vfio_save_iterate(char *name, int data_size) " (%s) data_size %d"
>  vfio_save_complete_precopy(char *name) " (%s)"
> +vfio_load_device_config_state(char *name) " (%s)"
> +vfio_load_state(char *name, uint64_t data) " (%s) data 0x%"PRIx64

Please use const char*'s in traces.

> +vfio_load_state_device_data(char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v16 QEMU 12/16] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled
  2020-03-24 21:09 ` [PATCH v16 QEMU 12/16] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled Kirti Wankhede
@ 2020-04-01 19:00   ` Dr. David Alan Gilbert
  2020-04-01 19:42     ` Alex Williamson
  0 siblings, 1 reply; 74+ messages in thread
From: Dr. David Alan Gilbert @ 2020-04-01 19:00 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> ---
>  memory.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/memory.c b/memory.c
> index acb7546971c3..285ca2ed6dd9 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1788,7 +1788,7 @@ bool memory_region_is_ram_device(MemoryRegion *mr)
>  uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
>  {
>      uint8_t mask = mr->dirty_log_mask;
> -    if (global_dirty_log && mr->ram_block) {
> +    if (global_dirty_log && (mr->ram_block || memory_region_is_iommu(mr))) {
>          mask |= (1 << DIRTY_MEMORY_MIGRATION);

I'm missing why the two go together here.
What does 'is_iommu' really mean?

Dave

>      }
>      return mask;
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v16 QEMU 13/16] vfio: Add function to start and stop dirty pages tracking
  2020-03-24 21:09 ` [PATCH v16 QEMU 13/16] vfio: Add function to start and stop dirty pages tracking Kirti Wankhede
  2020-03-26 19:10   ` Alex Williamson
@ 2020-04-01 19:03   ` Dr. David Alan Gilbert
  2020-05-04 23:21     ` Kirti Wankhede
  1 sibling, 1 reply; 74+ messages in thread
From: Dr. David Alan Gilbert @ 2020-04-01 19:03 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> Call VFIO_IOMMU_DIRTY_PAGES ioctl to start and stop dirty pages tracking
> for VFIO devices.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> ---
>  hw/vfio/migration.c | 36 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index ab295d25620e..1827b7cfb316 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -9,6 +9,7 @@
>  
>  #include "qemu/osdep.h"
>  #include "qemu/main-loop.h"
> +#include <sys/ioctl.h>
>  #include <linux/vfio.h>
>  
>  #include "sysemu/runstate.h"
> @@ -296,6 +297,32 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>      return qemu_file_get_error(f);
>  }
>  
> +static int vfio_start_dirty_page_tracking(VFIODevice *vbasedev, bool start)
> +{
> +    int ret;
> +    VFIOContainer *container = vbasedev->group->container;
> +    struct vfio_iommu_type1_dirty_bitmap dirty = {
> +        .argsz = sizeof(dirty),
> +    };
> +
> +    if (start) {
> +        if (vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) {
> +            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
> +        } else {
> +            return 0;
> +        }
> +    } else {
> +            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
> +    }
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
> +    if (ret) {
> +        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
> +                     dirty.flags, errno);
> +    }
> +    return ret;
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -330,6 +357,11 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>       */
>      qemu_put_be64(f, migration->region.size);
>  
> +    ret = vfio_start_dirty_page_tracking(vbasedev, true);
> +    if (ret) {
> +        return ret;
> +    }
> +
>      qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>  
>      ret = qemu_file_get_error(f);
> @@ -346,6 +378,8 @@ static void vfio_save_cleanup(void *opaque)
>      VFIODevice *vbasedev = opaque;
>      VFIOMigration *migration = vbasedev->migration;
>  
> +    vfio_start_dirty_page_tracking(vbasedev, false);

Shouldn't you check the return value?

> +
>      if (migration->region.mmaps) {
>          vfio_region_unmap(&migration->region);
>      }
> @@ -669,6 +703,8 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>          if (ret) {
>              error_report("%s: Failed to set state RUNNING", vbasedev->name);
>          }
> +
> +        vfio_start_dirty_page_tracking(vbasedev, false);
>      }
>  }
>  
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v16 QEMU 14/16] vfio: Add vfio_listener_log_sync to mark dirty pages
  2020-03-26 19:46   ` Alex Williamson
@ 2020-04-01 19:08     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 74+ messages in thread
From: Dr. David Alan Gilbert @ 2020-04-01 19:08 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, Kirti Wankhede, eauger, felipe, jonathan.davies, yan.y.zhao,
	changpeng.liu, Ken.Xue

* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Wed, 25 Mar 2020 02:39:12 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > vfio_listener_log_sync gets list of dirty pages from container using
> > VFIO_IOMMU_GET_DIRTY_BITMAP ioctl and mark those pages dirty when all
> > devices are stopped and saving state.
> > Return early for the RAM block section of mapped MMIO region.
> > 
> > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > ---
> >  hw/vfio/common.c     | 200 +++++++++++++++++++++++++++++++++++++++++++++++++--
> >  hw/vfio/trace-events |   1 +
> >  2 files changed, 196 insertions(+), 5 deletions(-)
> > 
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > index 4a2f0d6a2233..6d41e1ac5c2f 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -29,6 +29,7 @@
> >  #include "hw/vfio/vfio.h"
> >  #include "exec/address-spaces.h"
> >  #include "exec/memory.h"
> > +#include "exec/ram_addr.h"
> >  #include "hw/hw.h"
> >  #include "qemu/error-report.h"
> >  #include "qemu/main-loop.h"
> > @@ -38,6 +39,7 @@
> >  #include "sysemu/reset.h"
> >  #include "trace.h"
> >  #include "qapi/error.h"
> > +#include "migration/migration.h"
> >  
> >  VFIOGroupList vfio_group_list =
> >      QLIST_HEAD_INITIALIZER(vfio_group_list);
> > @@ -288,6 +290,28 @@ const MemoryRegionOps vfio_region_ops = {
> >  };
> >  
> >  /*
> > + * Device state interfaces
> > + */
> > +
> > +static bool vfio_devices_are_stopped_and_saving(void)
> > +{
> > +    VFIOGroup *group;
> > +    VFIODevice *vbasedev;
> > +
> > +    QLIST_FOREACH(group, &vfio_group_list, next) {
> > +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> > +            if ((vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) &&
> > +                !(vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
> > +                continue;
> > +            } else {
> > +                return false;
> > +            }
> > +        }
> > +    }
> > +    return true;
> > +}
> > +
> > +/*
> >   * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
> >   */
> >  static int vfio_dma_unmap(VFIOContainer *container,
> > @@ -408,8 +432,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> >  }
> >  
> >  /* Called with rcu_read_lock held.  */
> > -static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
> > -                           bool *read_only)
> > +static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> > +                               ram_addr_t *ram_addr, bool *read_only)
> >  {
> >      MemoryRegion *mr;
> >      hwaddr xlat;
> > @@ -440,9 +464,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
> >          return false;
> >      }
> >  
> > -    *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> > -    *read_only = !writable || mr->readonly;
> > +    if (vaddr) {
> > +        *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> > +    }
> >  
> > +    if (ram_addr) {
> > +        *ram_addr = memory_region_get_ram_addr(mr) + xlat;
> > +    }
> > +
> > +    if (read_only) {
> > +        *read_only = !writable || mr->readonly;
> > +    }
> >      return true;
> >  }
> >  
> > @@ -467,7 +499,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> >      rcu_read_lock();
> >  
> >      if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> > -        if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
> > +        if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) {
> >              goto out;
> >          }
> >          /*
> > @@ -813,9 +845,167 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >      }
> >  }
> >  
> > +static int vfio_get_dirty_bitmap(MemoryListener *listener,
> > +                                 MemoryRegionSection *section)
> > +{
> > +    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> > +    VFIOGuestIOMMU *giommu;
> > +    IOMMUTLBEntry iotlb;
> > +    hwaddr granularity, address_limit, iova;
> > +    int ret;
> > +
> > +    if (memory_region_is_iommu(section->mr)) {
> > +        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> > +            if (MEMORY_REGION(giommu->iommu) == section->mr &&
> > +                giommu->n.start == section->offset_within_region) {
> > +                break;
> > +            }
> > +        }
> > +
> > +        if (!giommu) {
> > +            return -EINVAL;
> > +        }
> > +    }
> > +
> > +    if (memory_region_is_iommu(section->mr)) {
> > +        granularity = memory_region_iommu_get_min_page_size(giommu->iommu);
> > +
> > +        address_limit = MIN(int128_get64(section->size),
> > +                            memory_region_iommu_get_address_limit(giommu->iommu,
> > +                                                 int128_get64(section->size)));
> > +    } else {
> > +        granularity = memory_region_size(section->mr);
> > +        address_limit = int128_get64(section->size);
> > +    }
> > +
> > +    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> > +
> > +    RCU_READ_LOCK_GUARD();
> > +
> > +    while (iova < address_limit) {
> > +        struct vfio_iommu_type1_dirty_bitmap *dbitmap;
> > +        struct vfio_iommu_type1_dirty_bitmap_get *range;
> > +        ram_addr_t start, pages;
> > +        uint64_t iova_xlat, size;
> > +
> > +        if (memory_region_is_iommu(section->mr)) {
> > +            iotlb = address_space_get_iotlb_entry(container->space->as, iova,
> > +                                                 true, MEMTXATTRS_UNSPECIFIED);
> > +            if ((iotlb.target_as == NULL) || (iotlb.addr_mask == 0)) {
> > +                if ((iova + granularity) < iova) {
> > +                    break;
> > +                }
> > +                iova += granularity;
> > +                continue;
> > +            }
> > +            iova_xlat = iotlb.iova + giommu->iommu_offset;
> > +            size = iotlb.addr_mask + 1;
> > +        } else {
> > +            iova_xlat = iova;
> > +            size = address_limit;
> > +        }
> > +
> > +        dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
> > +        if (!dbitmap) {
> 
> AIUI, QEMU aborts if this fails, so no need to check.  Because reasons.

It does if you use g_malloc0; however if the data is large you can
use g_try_malloc0 and then it will return NULL and you can fail the
migration rather than nuking the VM.
(We often argue whether it's worth testing or not; I think generally,
if the size is 'large' or user-defined then use try, and if it's small
then assume it works.  We've never defined small or large, but somewhere
around a few pages is about right.)
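
A minimal sketch of the "try" pattern for the bitmap allocation. To keep it
buildable without GLib, calloc() stands in for g_try_malloc0(); both return
NULL on failure rather than aborting. alloc_dirty_bitmap() is a hypothetical
helper, not part of the patch:

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: allocate a zeroed dirty bitmap for npages pages,
 * one bit per page, rounded up to whole 64-bit words.
 * Returns 0 and sets *out on success, a negative errno otherwise.
 */
static int alloc_dirty_bitmap(uint64_t npages, uint64_t **out)
{
    uint64_t nwords = npages / 64 + (npages % 64 != 0);

    *out = NULL;
    if (npages == 0) {
        return -EINVAL;
    }
    if (nwords > SIZE_MAX / sizeof(uint64_t)) {
        /* Request cannot be expressed as a size_t byte count. */
        return -ENOMEM;
    }
    /* Stand-in for g_try_malloc0(): NULL on failure, no abort. */
    *out = calloc((size_t)nwords, sizeof(uint64_t));
    return *out ? 0 : -ENOMEM;
}
```

The caller can then fail the migration on -ENOMEM and leave the VM running.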

Dave

> > +            return -ENOMEM;
> > +        }
> > +
> > +        dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
> > +        dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> > +        range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data;
> > +        range->iova = iova_xlat;
> > +        range->size = size;
> > +
> > +        /*
> > +         * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
> > +         * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap's pgsize to
> > +         * TARGET_PAGE_SIZE.
> > +         */
> > +        range->bitmap.pgsize = TARGET_PAGE_SIZE;
> > +
> > +        /*
> > +         * Comment from kvm_physical_sync_dirty_bitmap() since same applies here
> > +         * XXX bad kernel interface alert
> > +         * For dirty bitmap, kernel allocates array of size aligned to
> > +         * bits-per-long.  But for case when the kernel is 64bits and
> > +         * the userspace is 32bits, userspace can't align to the same
> > +         * bits-per-long, since sizeof(long) is different between kernel
> > +         * and user space.  This way, userspace will provide buffer which
> > +         * may be 4 bytes less than the kernel will use, resulting in
> > +         * userspace memory corruption (which is not detectable by valgrind
> > +         * too, in most cases).
> > +         * So for now, let's align to 64 instead of HOST_LONG_BITS here, in
> > +         * a hope that sizeof(long) won't become >8 any time soon.
> > +         */
> 
> This seems like the problem we've avoided by defining our bitmap as an
> array of u64s rather than an array of longs.  Does this comment really
> still apply?
> 
> > +
> > +        pages = TARGET_PAGE_ALIGN(range->size) >> TARGET_PAGE_BITS;
> > +        range->bitmap.size = ROUND_UP(pages, 64) / 8;
> 
> ROUND_UP(npages/8, sizeof(u64))?
> 
> > +        range->bitmap.data = g_malloc0(range->bitmap.size);
> 
> We don't require this to be pre-zero'd currently.
> 
> > +        if (range->bitmap.data == NULL) {
> 
> Same as above.  Seems strange to me too.
> 
> > +            error_report("Error allocating bitmap of size 0x%llx",
> > +                         range->bitmap.size);
> > +            ret = -ENOMEM;
> > +            goto err_out;
> > +        }
> > +
> > +        ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
> > +        if (ret) {
> > +            error_report("Failed to get dirty bitmap for iova: 0x%llx "
> > +                         "size: 0x%llx err: %d",
> > +                         range->iova, range->size, errno);
> > +            goto err_out;
> > +        }
> > +
> > +        if (memory_region_is_iommu(section->mr)) {
> > +            if (!vfio_get_xlat_addr(&iotlb, NULL, &start, NULL)) {
> > +                ret = -EINVAL;
> > +                goto err_out;
> > +            }
> > +        } else {
> > +            start = memory_region_get_ram_addr(section->mr) +
> > +                    section->offset_within_region + iova -
> > +                    TARGET_PAGE_ALIGN(section->offset_within_address_space);
> > +        }
> > +
> > +        cpu_physical_memory_set_dirty_lebitmap((uint64_t *)range->bitmap.data,
> > +                                               start, pages);
> > +
> > +        trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
> > +                                    range->bitmap.size, start);
> > +err_out:
> > +        g_free(range->bitmap.data);
> > +        g_free(dbitmap);
> > +
> > +        if (ret) {
> > +            return ret;
> > +        }
> > +
> > +        if ((iova + size) < iova) {
> > +            break;
> > +        }
> > +
> > +        iova += size;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static void vfio_listerner_log_sync(MemoryListener *listener,
> > +        MemoryRegionSection *section)
> > +{
> > +    if (vfio_listener_skipped_section(section)) {
> > +        return;
> > +    }
> > +
> > +    if (vfio_devices_are_stopped_and_saving()) {
> > +        vfio_get_dirty_bitmap(listener, section);
> > +    }
> > +}
> > +
> >  static const MemoryListener vfio_memory_listener = {
> >      .region_add = vfio_listener_region_add,
> >      .region_del = vfio_listener_region_del,
> > +    .log_sync = vfio_listerner_log_sync,
> >  };
> >  
> >  static void vfio_listener_release(VFIOContainer *container)
> > diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> > index ac065b559f4e..bc8f35ee9356 100644
> > --- a/hw/vfio/trace-events
> > +++ b/hw/vfio/trace-events
> > @@ -160,3 +160,4 @@ vfio_save_complete_precopy(char *name) " (%s)"
> >  vfio_load_device_config_state(char *name) " (%s)"
> >  vfio_load_state(char *name, uint64_t data) " (%s) data 0x%"PRIx64
> >  vfio_load_state_device_data(char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
> > +vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v16 QEMU 12/16] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled
  2020-04-01 19:00   ` Dr. David Alan Gilbert
@ 2020-04-01 19:42     ` Alex Williamson
  0 siblings, 0 replies; 74+ messages in thread
From: Alex Williamson @ 2020-04-01 19:42 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, Kirti Wankhede, eauger, felipe, jonathan.davies, yan.y.zhao,
	changpeng.liu, Ken.Xue

On Wed, 1 Apr 2020 20:00:32 +0100
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > ---
> >  memory.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/memory.c b/memory.c
> > index acb7546971c3..285ca2ed6dd9 100644
> > --- a/memory.c
> > +++ b/memory.c
> > @@ -1788,7 +1788,7 @@ bool memory_region_is_ram_device(MemoryRegion *mr)
> >  uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
> >  {
> >      uint8_t mask = mr->dirty_log_mask;
> > -    if (global_dirty_log && mr->ram_block) {
> > +    if (global_dirty_log && (mr->ram_block || memory_region_is_iommu(mr))) {
> >          mask |= (1 << DIRTY_MEMORY_MIGRATION);  
> 
> I'm missing why the two go together here.
> What does 'is_iommu' really mean?

I take that to mean the MemoryRegion is translated by an IOMMU, i.e. it's
an IOVA range of the IOMMU.  Therefore we're adding it to dirty log
tracking, just as we do for ram blocks.  At least that's my
interpretation of what it's supposed to do; I'm not an expert here on
whether it's the right way to do that.  Thanks,

Alex
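
For illustration only, the condition changed by the hunk quoted above can
be modelled in a standalone sketch.  The struct, the function name and the
bit index below are assumptions made for this example; they are not the
QEMU definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: bit index chosen for this sketch only. */
#define SKETCH_DIRTY_MEMORY_MIGRATION 2

/* Hypothetical, flattened stand-in for the MemoryRegion fields involved. */
struct sketch_region {
    uint8_t dirty_log_mask;
    bool has_ram_block;   /* models mr->ram_block != NULL */
    bool is_iommu;        /* models memory_region_is_iommu(mr) */
};

/* Mirrors the patched condition: RAM-backed or IOMMU regions get the bit. */
static uint8_t sketch_dirty_log_mask(const struct sketch_region *mr,
                                     bool global_dirty_log)
{
    uint8_t mask;

    assert(mr);
    mask = mr->dirty_log_mask;
    if (global_dirty_log && (mr->has_ram_block || mr->is_iommu)) {
        mask |= (1 << SKETCH_DIRTY_MEMORY_MIGRATION);
    }
    return mask;
}
```

With the pre-patch condition, a region that has no ram block would never
get the migration bit, even if it is an IOMMU region.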




* Re: [PATCH v16 QEMU 14/16] vfio: Add vfio_listener_log_sync to mark dirty pages
  2020-04-01  5:50   ` Yan Zhao
@ 2020-04-03 20:11     ` Kirti Wankhede
  0 siblings, 0 replies; 74+ messages in thread
From: Kirti Wankhede @ 2020-04-03 20:11 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang,  Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue



On 4/1/2020 11:20 AM, Yan Zhao wrote:
> On Wed, Mar 25, 2020 at 05:09:12AM +0800, Kirti Wankhede wrote:
>> vfio_listener_log_sync gets list of dirty pages from container using
>> VFIO_IOMMU_GET_DIRTY_BITMAP ioctl and mark those pages dirty when all
>> devices are stopped and saving state.
>> Return early for the RAM block section of mapped MMIO region.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   hw/vfio/common.c     | 200 +++++++++++++++++++++++++++++++++++++++++++++++++--
>>   hw/vfio/trace-events |   1 +
>>   2 files changed, 196 insertions(+), 5 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 4a2f0d6a2233..6d41e1ac5c2f 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -29,6 +29,7 @@
>>   #include "hw/vfio/vfio.h"
>>   #include "exec/address-spaces.h"
>>   #include "exec/memory.h"
>> +#include "exec/ram_addr.h"
>>   #include "hw/hw.h"
>>   #include "qemu/error-report.h"
>>   #include "qemu/main-loop.h"
>> @@ -38,6 +39,7 @@
>>   #include "sysemu/reset.h"
>>   #include "trace.h"
>>   #include "qapi/error.h"
>> +#include "migration/migration.h"
>>   
>>   VFIOGroupList vfio_group_list =
>>       QLIST_HEAD_INITIALIZER(vfio_group_list);
>> @@ -288,6 +290,28 @@ const MemoryRegionOps vfio_region_ops = {
>>   };
>>   
>>   /*
>> + * Device state interfaces
>> + */
>> +
>> +static bool vfio_devices_are_stopped_and_saving(void)
>> +{
>> +    VFIOGroup *group;
>> +    VFIODevice *vbasedev;
>> +
>> +    QLIST_FOREACH(group, &vfio_group_list, next) {
>> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>> +            if ((vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) &&
>> +                !(vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
>> +                continue;
>> +            } else {
>> +                return false;
>> +            }
>> +        }
>> +    }
>> +    return true;
>> +}
>> +
>> +/*
>>    * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
>>    */
>>   static int vfio_dma_unmap(VFIOContainer *container,
>> @@ -408,8 +432,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>   }
>>   
>>   /* Called with rcu_read_lock held.  */
>> -static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
>> -                           bool *read_only)
>> +static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>> +                               ram_addr_t *ram_addr, bool *read_only)
>>   {
>>       MemoryRegion *mr;
>>       hwaddr xlat;
>> @@ -440,9 +464,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
>>           return false;
>>       }
>>   
>> -    *vaddr = memory_region_get_ram_ptr(mr) + xlat;
>> -    *read_only = !writable || mr->readonly;
>> +    if (vaddr) {
>> +        *vaddr = memory_region_get_ram_ptr(mr) + xlat;
>> +    }
>>   
>> +    if (ram_addr) {
>> +        *ram_addr = memory_region_get_ram_addr(mr) + xlat;
>> +    }
>> +
>> +    if (read_only) {
>> +        *read_only = !writable || mr->readonly;
>> +    }
>>       return true;
>>   }
>>   
>> @@ -467,7 +499,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>>       rcu_read_lock();
>>   
>>       if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
>> -        if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
>> +        if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) {
>>               goto out;
>>           }
>>           /*
>> @@ -813,9 +845,167 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>       }
>>   }
>>   
>> +static int vfio_get_dirty_bitmap(MemoryListener *listener,
>> +                                 MemoryRegionSection *section)
>> +{
>> +    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>> +    VFIOGuestIOMMU *giommu;
>> +    IOMMUTLBEntry iotlb;
>> +    hwaddr granularity, address_limit, iova;
>> +    int ret;
>> +
>> +    if (memory_region_is_iommu(section->mr)) {
>> +        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
>> +            if (MEMORY_REGION(giommu->iommu) == section->mr &&
>> +                giommu->n.start == section->offset_within_region) {
>> +                break;
>> +            }
>> +        }
>> +
>> +        if (!giommu) {
>> +            return -EINVAL;
>> +        }
>> +    }
>> +
>> +    if (memory_region_is_iommu(section->mr)) {
>> +        granularity = memory_region_iommu_get_min_page_size(giommu->iommu);
>> +
>> +        address_limit = MIN(int128_get64(section->size),
>> +                            memory_region_iommu_get_address_limit(giommu->iommu,
>> +                                                 int128_get64(section->size)));
>> +    } else {
>> +        granularity = memory_region_size(section->mr);
>> +        address_limit = int128_get64(section->size);
>> +    }
>> +
>> +    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
>> +
>> +    RCU_READ_LOCK_GUARD();
>> +
> 
> The requirement that iova < address_limit is not right, for the reason below.
> When vIOMMU is NOT on,
> iova is section->offset_within_address_space and
> address_limit is section->size,
> but iova has no reason to be less than address_limit.
> 
> For example, when the VM memory size is larger than 3G (e.g. 4G),
> for the memory region section of range (0x100000000-0x13fffffff),
> its iova is 0x100000000 and address_limit is 0x40000000.
> As iova is not less than address_limit, the dirty pages query for memory
> 3G-4G will be skipped.
> Therefore dirty pages in 3G-4G will be lost.
> 
> 

Right, thanks. Fixing it; address_limit should be iova + size - 1.

I did basic API testing with 2G memory. I'll make sure to test with more 
than 4G to catch such bugs.

Thanks,
Kirti
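
As a rough illustration of the bound problem discussed above, here is a
standalone sketch of the loop condition.  The helper name and the 4K page
step are assumptions for this example; the real loop issues ioctls rather
than counting pages.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: counts how many page-sized steps a loop of the
 * form "while (iova < address_limit)" takes.
 */
static uint64_t sketch_pages_scanned(uint64_t iova, uint64_t address_limit,
                                     uint64_t page_size)
{
    uint64_t pages = 0;

    assert(page_size != 0);
    while (iova < address_limit) {
        pages++;
        iova += page_size;
    }
    return pages;
}
```

For the section 0x100000000-0x13fffffff, passing the section size
(0x40000000) as the limit scans nothing, while passing
iova + size - 1 (0x13fffffff) covers the whole section.  A section that
starts at iova 0 is scanned either way, which would be consistent with the
2G test not hitting the problem.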


>> +    while (iova < address_limit) {
>> +        struct vfio_iommu_type1_dirty_bitmap *dbitmap;
>> +        struct vfio_iommu_type1_dirty_bitmap_get *range;
>> +        ram_addr_t start, pages;
>> +        uint64_t iova_xlat, size;
>> +
>> +        if (memory_region_is_iommu(section->mr)) {
>> +            iotlb = address_space_get_iotlb_entry(container->space->as, iova,
>> +                                                 true, MEMTXATTRS_UNSPECIFIED);
>> +            if ((iotlb.target_as == NULL) || (iotlb.addr_mask == 0)) {
>> +                if ((iova + granularity) < iova) {
>> +                    break;
>> +                }
>> +                iova += granularity;
>> +                continue;
>> +            }
>> +            iova_xlat = iotlb.iova + giommu->iommu_offset;
>> +            size = iotlb.addr_mask + 1;
>> +        } else {
>> +            iova_xlat = iova;
>> +            size = address_limit;
>> +        }
>> +
>> +        dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
>> +        if (!dbitmap) {
>> +            return -ENOMEM;
>> +        }
>> +
>> +        dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
>> +        dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
>> +        range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data;
>> +        range->iova = iova_xlat;
>> +        range->size = size;
>> +
>> +        /*
>> +         * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
>> +         * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap's pgsize to
>> +         * TARGET_PAGE_SIZE.
>> +         */
>> +        range->bitmap.pgsize = TARGET_PAGE_SIZE;
>> +
>> +        /*
>> +         * Comment from kvm_physical_sync_dirty_bitmap() since same applies here
>> +         * XXX bad kernel interface alert
>> +         * For dirty bitmap, kernel allocates array of size aligned to
>> +         * bits-per-long.  But for case when the kernel is 64bits and
>> +         * the userspace is 32bits, userspace can't align to the same
>> +         * bits-per-long, since sizeof(long) is different between kernel
>> +         * and user space.  This way, userspace will provide buffer which
>> +         * may be 4 bytes less than the kernel will use, resulting in
>> +         * userspace memory corruption (which is not detectable by valgrind
>> +         * too, in most cases).
>> +         * So for now, let's align to 64 instead of HOST_LONG_BITS here, in
>> +         * a hope that sizeof(long) won't become >8 any time soon.
>> +         */
>> +
>> +        pages = TARGET_PAGE_ALIGN(range->size) >> TARGET_PAGE_BITS;
>> +        range->bitmap.size = ROUND_UP(pages, 64) / 8;
>> +        range->bitmap.data = g_malloc0(range->bitmap.size);
>> +        if (range->bitmap.data == NULL) {
>> +            error_report("Error allocating bitmap of size 0x%llx",
>> +                         range->bitmap.size);
>> +            ret = -ENOMEM;
>> +            goto err_out;
>> +        }
>> +
>> +        ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
>> +        if (ret) {
>> +            error_report("Failed to get dirty bitmap for iova: 0x%llx "
>> +                         "size: 0x%llx err: %d",
>> +                         range->iova, range->size, errno);
>> +            goto err_out;
>> +        }
>> +
>> +        if (memory_region_is_iommu(section->mr)) {
>> +            if (!vfio_get_xlat_addr(&iotlb, NULL, &start, NULL)) {
>> +                ret = -EINVAL;
>> +                goto err_out;
>> +            }
>> +        } else {
>> +            start = memory_region_get_ram_addr(section->mr) +
>> +                    section->offset_within_region + iova -
>> +                    TARGET_PAGE_ALIGN(section->offset_within_address_space);
>> +        }
>> +
>> +        cpu_physical_memory_set_dirty_lebitmap((uint64_t *)range->bitmap.data,
>> +                                               start, pages);
>> +
>> +        trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
>> +                                    range->bitmap.size, start);
>> +err_out:
>> +        g_free(range->bitmap.data);
>> +        g_free(dbitmap);
>> +
>> +        if (ret) {
>> +            return ret;
>> +        }
>> +
>> +        if ((iova + size) < iova) {
>> +            break;
>> +        }
>> +
>> +        iova += size;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static void vfio_listerner_log_sync(MemoryListener *listener,
>> +        MemoryRegionSection *section)
>> +{
>> +    if (vfio_listener_skipped_section(section)) {
>> +        return;
>> +    }
>> +
>> +    if (vfio_devices_are_stopped_and_saving()) {
>> +        vfio_get_dirty_bitmap(listener, section);
>> +    }
>> +}
>> +
>>   static const MemoryListener vfio_memory_listener = {
>>       .region_add = vfio_listener_region_add,
>>       .region_del = vfio_listener_region_del,
>> +    .log_sync = vfio_listerner_log_sync,
>>   };
>>   
>>   static void vfio_listener_release(VFIOContainer *container)
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index ac065b559f4e..bc8f35ee9356 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -160,3 +160,4 @@ vfio_save_complete_precopy(char *name) " (%s)"
>>   vfio_load_device_config_state(char *name) " (%s)"
>>   vfio_load_state(char *name, uint64_t data) " (%s) data 0x%"PRIx64
>>   vfio_load_state_device_data(char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
>> +vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
>> -- 
>> 2.7.0
>>



* Re: [PATCH v16 QEMU 04/16] vfio: Add save and load functions for VFIO PCI devices
  2020-03-24 21:09 ` [PATCH v16 QEMU 04/16] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
  2020-03-25 19:56   ` Alex Williamson
  2020-03-26 17:46   ` Dr. David Alan Gilbert
@ 2020-04-07  4:10   ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
  2020-05-04 23:21     ` Kirti Wankhede
  2 siblings, 1 reply; 74+ messages in thread
From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) @ 2020-04-07  4:10 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson
  Cc: kevin.tian, yi.l.liu, yan.y.zhao, felipe, eskultet, ziye.yang,
	Ken.Xue, Zhengxiao.zx, shuangtai.tst, qemu-devel, dgilbert,
	pasic, aik, Gonglei (Arei),
	eauger, cohuck, jonathan.davies, cjia, mlevitsk, changpeng.liu,
	zhi.a.wang



On 2020/3/25 5:09, Kirti Wankhede wrote:
> These functions save and restore PCI device specific data - config
> space of PCI device.
> Tested save and restore with MSI and MSIX type.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/pci.c                 | 163 ++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |   2 +
>  2 files changed, 165 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 6c77c12e44b9..8deb11e87ef7 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -41,6 +41,7 @@
>  #include "trace.h"
>  #include "qapi/error.h"
>  #include "migration/blocker.h"
> +#include "migration/qemu-file.h"
>  
>  #define TYPE_VFIO_PCI "vfio-pci"
>  #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> @@ -1632,6 +1633,50 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
>      }
>  }
>  
> +static int vfio_bar_validate(VFIOPCIDevice *vdev, int nr)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +    VFIOBAR *bar = &vdev->bars[nr];
> +    uint64_t addr;
> +    uint32_t addr_lo, addr_hi = 0;
> +
> +    /* Skip unimplemented BARs and the upper half of 64bit BARS. */
> +    if (!bar->size) {
> +        return 0;
> +    }
> +
> +    addr_lo = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + nr * 4, 4);
> +
> +    addr_lo = addr_lo & (bar->ioport ? PCI_BASE_ADDRESS_IO_MASK :
> +                                       PCI_BASE_ADDRESS_MEM_MASK);
> +    if (bar->type == PCI_BASE_ADDRESS_MEM_TYPE_64) {
> +        addr_hi = pci_default_read_config(pdev,
> +                                         PCI_BASE_ADDRESS_0 + (nr + 1) * 4, 4);
> +    }
> +
> +    addr = ((uint64_t)addr_hi << 32) | addr_lo;
> +
> +    if (!QEMU_IS_ALIGNED(addr, bar->size)) {
> +        return -EINVAL;
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_bars_validate(VFIOPCIDevice *vdev)
> +{
> +    int i, ret;
> +
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        ret = vfio_bar_validate(vdev, i);
> +        if (ret) {
> +            error_report("vfio: BAR address %d validation failed", i);
> +            return ret;
> +        }
> +    }
> +    return 0;
> +}
> +
>  static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
>  {
>      VFIOBAR *bar = &vdev->bars[nr];
> @@ -2414,11 +2459,129 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
>      return OBJECT(vdev);
>  }
>  
> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> +{
> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint16_t pci_cmd;
> +    int i;
> +
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        uint32_t bar;
> +
> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> +        qemu_put_be32(f, bar);
> +    }
> +
> +    qemu_put_be32(f, vdev->interrupt);
> +    if (vdev->interrupt == VFIO_INT_MSI) {
> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> +        bool msi_64bit;
> +
> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                                            2);
> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> +
> +        msi_addr_lo = pci_default_read_config(pdev,
> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> +        qemu_put_be32(f, msi_addr_lo);
> +
> +        if (msi_64bit) {
> +            msi_addr_hi = pci_default_read_config(pdev,
> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                                             4);
> +        }
> +        qemu_put_be32(f, msi_addr_hi);
> +
> +        msi_data = pci_default_read_config(pdev,
> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +                2);
> +        qemu_put_be32(f, msi_data);
> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
> +        uint16_t offset;
> +
> +        /* save enable bit and maskall bit */
> +        offset = pci_default_read_config(pdev,
> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> +        qemu_put_be16(f, offset);
> +        msix_save(pdev, f);
> +    }
> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> +    qemu_put_be16(f, pci_cmd);
> +}
> +
> +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> +{
> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint32_t interrupt_type;
> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> +    uint16_t pci_cmd;
> +    bool msi_64bit;
> +    int i, ret;
> +
> +    /* retore pci bar configuration */
> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> +                        pci_cmd & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        uint32_t bar = qemu_get_be32(f);
> +
> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> +    }
> +
> +    ret = vfio_bars_validate(vdev);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    interrupt_type = qemu_get_be32(f);
> +
> +    if (interrupt_type == VFIO_INT_MSI) {
> +        /* restore msi configuration */
> +        msi_flags = pci_default_read_config(pdev,
> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> +
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                              msi_flags & (!PCI_MSI_FLAGS_ENABLE), 2);
> +
> +        msi_addr_lo = qemu_get_be32(f);
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
> +                              msi_addr_lo, 4);
> +
> +        msi_addr_hi = qemu_get_be32(f);
> +        if (msi_64bit) {
> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                                  msi_addr_hi, 4);
> +        }
> +        msi_data = qemu_get_be32(f);
> +        vfio_pci_write_config(pdev,
> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +                msi_data, 2);
> +
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
> +    } else if (interrupt_type == VFIO_INT_MSIX) {
> +        uint16_t offset = qemu_get_be16(f);
> +
> +        /* load enable bit and maskall bit */
> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> +                              offset, 2);
> +        msix_load(pdev, f);
Hi Kirti, Alex

'msix_load' here may increase the downtime. Our migration-capable device has
128 MSI-X interrupts and the guest OS enables all of them, so 'msix_load'
takes a long time (nearly 1s), because in 'vfio_msix_vector_do_use' we need
to disable all the old interrupts and then append a new one.

What's your opinion?
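
To make the scaling concern concrete, here is a back-of-the-envelope cost
model.  It is only a sketch: the function name and the unit of one
"operation" per vector disable or enable are assumptions for illustration,
not measurements of the actual code path.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical cost model: when vector k is brought into use, the k
 * vectors enabled so far are disabled, then all k + 1 are enabled again.
 */
static uint64_t sketch_msix_restore_ops(uint32_t nr_vectors)
{
    uint64_t ops = 0;
    uint32_t k;

    assert(nr_vectors <= 2048);   /* MSI-X table size limit */
    for (k = 0; k < nr_vectors; k++) {
        ops += k;        /* disable the vectors enabled so far */
        ops += k + 1;    /* enable them again plus the new one */
    }
    return ops;
}
```

Under this model the work grows with the square of the vector count, so
128 vectors cost 16384 operations instead of 128.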

> +    }
> +    pci_cmd = qemu_get_be16(f);
> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
> +    return 0;
> +}
> +
>  static VFIODeviceOps vfio_pci_ops = {
>      .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
>      .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
>      .vfio_eoi = vfio_intx_eoi,
>      .vfio_get_object = vfio_pci_get_object,
> +    .vfio_save_config = vfio_pci_save_config,
> +    .vfio_load_config = vfio_pci_load_config,
>  };
>  
>  int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 74261feaeac9..d69a7f3ae31e 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -120,6 +120,8 @@ struct VFIODeviceOps {
>      int (*vfio_hot_reset_multi)(VFIODevice *vdev);
>      void (*vfio_eoi)(VFIODevice *vdev);
>      Object *(*vfio_get_object)(VFIODevice *vdev);
> +    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
> +    int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
>  };
>  
>  typedef struct VFIOGroup {
> 

-- 
---
Regards,
Longpeng(Mike)



* Re: [PATCH v16 QEMU 09/16] vfio: Add save state functions to SaveVMHandlers
  2020-03-25 22:03   ` Alex Williamson
@ 2020-05-04 23:18     ` Kirti Wankhede
  2020-05-05  4:37       ` Alex Williamson
  0 siblings, 1 reply; 74+ messages in thread
From: Kirti Wankhede @ 2020-05-04 23:18 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue



On 3/26/2020 3:33 AM, Alex Williamson wrote:
> On Wed, 25 Mar 2020 02:39:07 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
>> functions. These functions handle the pre-copy and stop-and-copy phases.
>>
>> In _SAVING|_RUNNING device state or pre-copy phase:
>> - read pending_bytes. If pending_bytes > 0, go through below steps.
>> - read data_offset - indicates kernel driver to write data to staging
>>    buffer.
>> - read data_size - amount of data in bytes written by vendor driver in
>>    migration region.
>> - read data_size bytes of data from data_offset in the migration region.
>> - Write data packet to file stream as below:
>> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
>> VFIO_MIG_FLAG_END_OF_STATE }
>>
>> In _SAVING device state or stop-and-copy phase
>> a. read config space of device and save to migration file stream. This
>>     doesn't need to be from vendor driver. Any other special config state
>>     from driver can be saved as data in following iteration.
>> b. read pending_bytes. If pending_bytes > 0, go through below steps.
>> c. read data_offset - indicates kernel driver to write data to staging
>>     buffer.
>> d. read data_size - amount of data in bytes written by vendor driver in
>>     migration region.
>> e. read data_size bytes of data from data_offset in the migration region.
>> f. Write data packet as below:
>>     {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
>> g. iterate through steps b to f while (pending_bytes > 0)
>> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
>>
>> When the data region is mapped, it is the user's responsibility to read
>> data_size bytes of data from data_offset before moving to the next steps.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   hw/vfio/migration.c           | 245 +++++++++++++++++++++++++++++++++++++++++-
>>   hw/vfio/trace-events          |   6 ++
>>   include/hw/vfio/vfio-common.h |   1 +
>>   3 files changed, 251 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 033f76526e49..ecbeed5182c2 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -138,6 +138,137 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>>       return 0;
>>   }
>>   
>> +static void *find_data_region(VFIORegion *region,
>> +                              uint64_t data_offset,
>> +                              uint64_t data_size)
>> +{
>> +    void *ptr = NULL;
>> +    int i;
>> +
>> +    for (i = 0; i < region->nr_mmaps; i++) {
>> +        if ((data_offset >= region->mmaps[i].offset) &&
>> +            (data_offset < region->mmaps[i].offset + region->mmaps[i].size) &&
>> +            (data_size <= region->mmaps[i].size)) {
> 
> (data_offset - region->mmaps[i].offset) can be non-zero, so this test
> is invalid.  Additionally the uapi does not require that a given data
> chunk fits exclusively within an mmap'd area, it may overlap one or
> more mmap'd sections of the region, possibly with non-mmap'd areas
> included.
> 

What's the advantage of having mmap'd and non-mmap'd overlapped regions?
Isn't it better to have the data section either mapped or trapped?
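
For reference, the containment problem raised in the review comment above
can be sketched standalone.  The struct and function names are
hypothetical; this only contrasts the posted test with one that accounts
for the offset into the mmap'd window, and it does not handle a chunk that
spans several windows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one mmap'd window of the migration region. */
struct sketch_window {
    uint64_t offset;
    uint64_t size;
};

/* The test as posted: start inside the window, size no larger than it. */
static bool sketch_posted_test(const struct sketch_window *w,
                               uint64_t data_offset, uint64_t data_size)
{
    assert(w);
    return data_offset >= w->offset &&
           data_offset < w->offset + w->size &&
           data_size <= w->size;
}

/* Checks that the chunk ends inside the window as well. */
static bool sketch_chunk_within_window(const struct sketch_window *w,
                                       uint64_t data_offset,
                                       uint64_t data_size)
{
    uint64_t skip;

    assert(w);
    if (data_offset < w->offset) {
        return false;
    }
    skip = data_offset - w->offset;   /* bytes before the chunk starts */
    if (skip >= w->size) {
        return false;
    }
    return data_size <= w->size - skip;
}
```

A chunk at offset 0x1800 of size 0x1000 in a window covering
0x1000-0x1fff passes the posted test but runs past the end of the window.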

>> +            ptr = region->mmaps[i].mmap + (data_offset -
>> +                                           region->mmaps[i].offset);
>> +            break;
>> +        }
>> +    }
>> +    return ptr;
>> +}
>> +
>> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIORegion *region = &migration->region;
>> +    uint64_t data_offset = 0, data_size = 0;
>> +    int ret;
>> +
>> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             data_offset));
>> +    if (ret != sizeof(data_offset)) {
>> +        error_report("%s: Failed to get migration buffer data offset %d",
>> +                     vbasedev->name, ret);
>> +        return -EINVAL;
>> +    }
>> +
>> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             data_size));
>> +    if (ret != sizeof(data_size)) {
>> +        error_report("%s: Failed to get migration buffer data size %d",
>> +                     vbasedev->name, ret);
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (data_size > 0) {
>> +        void *buf = NULL;
>> +        bool buffer_mmaped;
>> +
>> +        if (region->mmaps) {
>> +            buf = find_data_region(region, data_offset, data_size);
>> +        }
>> +
>> +        buffer_mmaped = (buf != NULL) ? true : false;
> 
> The ternary is unnecessary, "? true : false" is redundant.
> 

Removing it.

>> +
>> +        if (!buffer_mmaped) {
>> +            buf = g_try_malloc0(data_size);
> 
> Why do we need zero'd memory?
> 

Zeroed memory is not required; removing the 0.

>> +            if (!buf) {
>> +                error_report("%s: Error allocating buffer ", __func__);
>> +                return -ENOMEM;
>> +            }
>> +
>> +            ret = pread(vbasedev->fd, buf, data_size,
>> +                        region->fd_offset + data_offset);
>> +            if (ret != data_size) {
>> +                error_report("%s: Failed to get migration data %d",
>> +                             vbasedev->name, ret);
>> +                g_free(buf);
>> +                return -EINVAL;
>> +            }
>> +        }
>> +
>> +        qemu_put_be64(f, data_size);
>> +        qemu_put_buffer(f, buf, data_size);
> 
> This can segfault when mmap'd given the above assumptions about size
> and layout.
> 
>> +
>> +        if (!buffer_mmaped) {
>> +            g_free(buf);
>> +        }
>> +    } else {
>> +        qemu_put_be64(f, data_size);
> 
> We insert a zero?  Couldn't we add the section header and end here and
> skip it entirely?
> 

This is used during resume; a data_size of 0 indicates the end of data.
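
A simplified sketch of a consumer that relies on such a zero-length
terminator is below.  It assumes a flat buffer of big-endian 64-bit
lengths, each followed by that many payload bytes; the helper names are
hypothetical and this is not the QEMUFile stream format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a big-endian 64-bit value. */
static uint64_t sketch_get_be64(const uint8_t *p)
{
    uint64_t v = 0;
    int i;

    for (i = 0; i < 8; i++) {
        v = (v << 8) | p[i];
    }
    return v;
}

/*
 * Walk length-prefixed chunks until a zero length is seen.  Returns the
 * total payload bytes, or -1 if the buffer ends before the terminator.
 */
static int64_t sketch_consume_chunks(const uint8_t *buf, size_t len)
{
    size_t pos = 0;
    int64_t total = 0;

    assert(buf != NULL);
    for (;;) {
        uint64_t data_size;

        if (len - pos < 8) {
            return -1;                /* no room for a length field */
        }
        data_size = sketch_get_be64(buf + pos);
        pos += 8;
        if (data_size == 0) {
            return total;             /* zero length: end of data */
        }
        if (data_size > len - pos) {
            return -1;                /* payload is truncated */
        }
        pos += (size_t)data_size;     /* skip over the payload */
        total += (int64_t)data_size;
    }
}
```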

>> +    }
>> +
>> +    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
>> +                           migration->pending_bytes);
>> +
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    return data_size;
>> +}
>> +
>> +static int vfio_update_pending(VFIODevice *vbasedev)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIORegion *region = &migration->region;
>> +    uint64_t pending_bytes = 0;
>> +    int ret;
>> +
>> +    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             pending_bytes));
>> +    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
>> +        error_report("%s: Failed to get pending bytes %d",
>> +                     vbasedev->name, ret);
>> +        migration->pending_bytes = 0;
>> +        return (ret < 0) ? ret : -EINVAL;
>> +    }
>> +
>> +    migration->pending_bytes = pending_bytes;
>> +    trace_vfio_update_pending(vbasedev->name, pending_bytes);
>> +    return 0;
>> +}
>> +
>> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
>> +
>> +    if (vbasedev->ops && vbasedev->ops->vfio_save_config) {
>> +        vbasedev->ops->vfio_save_config(vbasedev, f);
>> +    }
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    trace_vfio_save_device_config_state(vbasedev->name);
>> +
>> +    return qemu_file_get_error(f);
>> +}
>> +
>>   /* ---------------------------------------------------------------------- */
>>   
>>   static int vfio_save_setup(QEMUFile *f, void *opaque)
>> @@ -154,7 +285,7 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>>           qemu_mutex_unlock_iothread();
>>           if (ret) {
>>               error_report("%s: Failed to mmap VFIO migration region %d: %s",
>> -                         vbasedev->name, migration->region.index,
>> +                         vbasedev->name, migration->region.nr,
>>                            strerror(-ret));
>>               return ret;
>>           }
>> @@ -194,9 +325,121 @@ static void vfio_save_cleanup(void *opaque)
>>       trace_vfio_save_cleanup(vbasedev->name);
>>   }
>>   
>> +static void vfio_save_pending(QEMUFile *f, void *opaque,
>> +                              uint64_t threshold_size,
>> +                              uint64_t *res_precopy_only,
>> +                              uint64_t *res_compatible,
>> +                              uint64_t *res_postcopy_only)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret;
>> +
>> +    ret = vfio_update_pending(vbasedev);
>> +    if (ret) {
>> +        return;
>> +    }
>> +
>> +    *res_precopy_only += migration->pending_bytes;
>> +
>> +    trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
>> +                            *res_postcopy_only, *res_compatible);
>> +}
>> +
>> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    int ret, data_size;
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
>> +
>> +    data_size = vfio_save_buffer(f, vbasedev);
>> +
>> +    if (data_size < 0) {
>> +        error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
>> +                     strerror(errno));
>> +        return data_size;
>> +    }
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    trace_vfio_save_iterate(vbasedev->name, data_size);
>> +    if (data_size == 0) {
>> +        /* indicates data finished, goto complete phase */
>> +        return 1;
> 
> But it's pending_bytes not data_size that indicates we're done.  How do
> we get away with ignoring pending_bytes for the save_live_iterate phase?
> 

This is the requirement documented in the comment above
qemu_savevm_state_iterate(), which calls .save_live_iterate:

/*	
  * this function has three return values:
  *   negative: there was one error, and we have -errno.
  *   0 : We haven't finished, caller have to go again
  *   1 : We have finished, we can go to complete phase
  */
int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy)

This is to serialize savevm_state.handlers (or in other words devices).
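
To illustrate the convention, here is a minimal stand-alone sketch. It is not
QEMU code: iterate_fn, iterate_all() and countdown_iterate() are hypothetical
names invented for this example. The caller walks the handlers in order and
does not advance to the next one until the current one has returned 1.

```c
#include <stddef.h>

/*
 * Hypothetical handler type, mirroring the documented return values:
 *   negative: error, 0: not finished (call again), 1: finished.
 */
typedef int (*iterate_fn)(void *opaque);

/* Walk the handlers in order; stop at the first one that is not finished. */
static int iterate_all(iterate_fn *handlers, void **opaque, size_t n)
{
    int ret = 1;

    for (size_t i = 0; i < n; i++) {
        ret = handlers[i](opaque[i]);
        if (ret <= 0) {
            /* Error, or this device still has data: do not advance. */
            break;
        }
    }
    return ret;
}

/* Example handler: each call "sends" one chunk from a remaining-chunk count. */
static int countdown_iterate(void *opaque)
{
    int *remaining = opaque;

    if (*remaining > 0) {
        (*remaining)--;
    }
    return *remaining == 0 ? 1 : 0;
}
```

With two devices holding 2 and 1 chunks, the first pass returns 0 after
touching only the first device; the second pass finishes both and returns 1.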

Thanks,
Kirti

>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret;
>> +
>> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_RUNNING,
>> +                                   VFIO_DEVICE_STATE_SAVING);
>> +    if (ret) {
>> +        error_report("%s: Failed to set state STOP and SAVING",
>> +                     vbasedev->name);
>> +        return ret;
>> +    }
>> +
>> +    ret = vfio_save_device_config_state(f, opaque);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    ret = vfio_update_pending(vbasedev);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    while (migration->pending_bytes > 0) {
>> +        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
>> +        ret = vfio_save_buffer(f, vbasedev);
>> +        if (ret < 0) {
>> +            error_report("%s: Failed to save buffer", vbasedev->name);
>> +            return ret;
>> +        } else if (ret == 0) {
>> +            break;
>> +        }
>> +
>> +        ret = vfio_update_pending(vbasedev);
>> +        if (ret) {
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_SAVING, 0);
>> +    if (ret) {
>> +        error_report("%s: Failed to set state STOPPED", vbasedev->name);
>> +        return ret;
>> +    }
>> +
>> +    trace_vfio_save_complete_precopy(vbasedev->name);
>> +    return ret;
>> +}
>> +
>>   static SaveVMHandlers savevm_vfio_handlers = {
>>       .save_setup = vfio_save_setup,
>>       .save_cleanup = vfio_save_cleanup,
>> +    .save_live_pending = vfio_save_pending,
>> +    .save_live_iterate = vfio_save_iterate,
>> +    .save_live_complete_precopy = vfio_save_complete_precopy,
>>   };
>>   
>>   /* ---------------------------------------------------------------------- */
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 4bb43f18f315..bdf40ba368c7 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -151,3 +151,9 @@ vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_st
>>   vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
>>   vfio_save_setup(char *name) " (%s)"
>>   vfio_save_cleanup(char *name) " (%s)"
>> +vfio_save_buffer(char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
>> +vfio_update_pending(char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
>> +vfio_save_device_config_state(char *name) " (%s)"
>> +vfio_save_pending(char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
>> +vfio_save_iterate(char *name, int data_size) " (%s) data_size %d"
>> +vfio_save_complete_precopy(char *name) " (%s)"
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 28f55f66d019..c78033e4149d 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -60,6 +60,7 @@ typedef struct VFIORegion {
>>   
>>   typedef struct VFIOMigration {
>>       VFIORegion region;
>> +    uint64_t pending_bytes;
>>   } VFIOMigration;
>>   
>>   typedef struct VFIOAddressSpace {
> 



* Re: [PATCH v16 QEMU 04/16] vfio: Add save and load functions for VFIO PCI devices
  2020-03-25 19:56   ` Alex Williamson
  2020-03-26 17:29     ` Dr. David Alan Gilbert
@ 2020-05-04 23:18     ` Kirti Wankhede
  2020-05-05  4:37       ` Alex Williamson
  1 sibling, 1 reply; 74+ messages in thread
From: Kirti Wankhede @ 2020-05-04 23:18 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue



On 3/26/2020 1:26 AM, Alex Williamson wrote:
> On Wed, 25 Mar 2020 02:39:02 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> These functions save and restore PCI device specific data - config
>> space of PCI device.
>> Tested save and restore with MSI and MSIX type.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   hw/vfio/pci.c                 | 163 ++++++++++++++++++++++++++++++++++++++++++
>>   include/hw/vfio/vfio-common.h |   2 +
>>   2 files changed, 165 insertions(+)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 6c77c12e44b9..8deb11e87ef7 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -41,6 +41,7 @@
>>   #include "trace.h"
>>   #include "qapi/error.h"
>>   #include "migration/blocker.h"
>> +#include "migration/qemu-file.h"
>>   
>>   #define TYPE_VFIO_PCI "vfio-pci"
>>   #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
>> @@ -1632,6 +1633,50 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
>>       }
>>   }
>>   
>> +static int vfio_bar_validate(VFIOPCIDevice *vdev, int nr)
>> +{
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    VFIOBAR *bar = &vdev->bars[nr];
>> +    uint64_t addr;
>> +    uint32_t addr_lo, addr_hi = 0;
>> +
>> +    /* Skip unimplemented BARs and the upper half of 64bit BARS. */
>> +    if (!bar->size) {
>> +        return 0;
>> +    }
>> +
>> +    addr_lo = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + nr * 4, 4);
>> +
>> +    addr_lo = addr_lo & (bar->ioport ? PCI_BASE_ADDRESS_IO_MASK :
>> +                                       PCI_BASE_ADDRESS_MEM_MASK);
> 
> Nit, &= or combine with previous set.
> 
>> +    if (bar->type == PCI_BASE_ADDRESS_MEM_TYPE_64) {
>> +        addr_hi = pci_default_read_config(pdev,
>> +                                         PCI_BASE_ADDRESS_0 + (nr + 1) * 4, 4);
>> +    }
>> +
>> +    addr = ((uint64_t)addr_hi << 32) | addr_lo;
> 
> Could we use a union?
> 
>> +
>> +    if (!QEMU_IS_ALIGNED(addr, bar->size)) {
>> +        return -EINVAL;
>> +    }
> 
> What specifically are we validating here?  This should be true no
> matter what we wrote to the BAR or else BAR emulation is broken.  The
> bits that could make this unaligned are not implemented in the BAR.
> 
>> +
>> +    return 0;
>> +}
>> +
>> +static int vfio_bars_validate(VFIOPCIDevice *vdev)
>> +{
>> +    int i, ret;
>> +
>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
>> +        ret = vfio_bar_validate(vdev, i);
>> +        if (ret) {
>> +            error_report("vfio: BAR address %d validation failed", i);
>> +            return ret;
>> +        }
>> +    }
>> +    return 0;
>> +}
>> +
>>   static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
>>   {
>>       VFIOBAR *bar = &vdev->bars[nr];
>> @@ -2414,11 +2459,129 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
>>       return OBJECT(vdev);
>>   }
>>   
>> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
>> +{
>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    uint16_t pci_cmd;
>> +    int i;
>> +
>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
>> +        uint32_t bar;
>> +
>> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
>> +        qemu_put_be32(f, bar);
>> +    }
>> +
>> +    qemu_put_be32(f, vdev->interrupt);
>> +    if (vdev->interrupt == VFIO_INT_MSI) {
>> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
>> +        bool msi_64bit;
>> +
>> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>> +                                            2);
>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
>> +
>> +        msi_addr_lo = pci_default_read_config(pdev,
>> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
>> +        qemu_put_be32(f, msi_addr_lo);
>> +
>> +        if (msi_64bit) {
>> +            msi_addr_hi = pci_default_read_config(pdev,
>> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
>> +                                             4);
>> +        }
>> +        qemu_put_be32(f, msi_addr_hi);
>> +
>> +        msi_data = pci_default_read_config(pdev,
>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
>> +                2);
>> +        qemu_put_be32(f, msi_data);
> 
> Isn't the data field only a u16?
> 

Yes, fixing it.

>> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
>> +        uint16_t offset;
>> +
>> +        /* save enable bit and maskall bit */
>> +        offset = pci_default_read_config(pdev,
>> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
>> +        qemu_put_be16(f, offset);
>> +        msix_save(pdev, f);
>> +    }
>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
>> +    qemu_put_be16(f, pci_cmd);
>> +}
>> +
>> +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
>> +{
>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    uint32_t interrupt_type;
>> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
>> +    uint16_t pci_cmd;
>> +    bool msi_64bit;
>> +    int i, ret;
>> +
>> +    /* restore PCI BAR configuration */
>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
>> +    vfio_pci_write_config(pdev, PCI_COMMAND,
>> +                        pci_cmd & (~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
>> +        uint32_t bar = qemu_get_be32(f);
>> +
>> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
>> +    }
>> +
>> +    ret = vfio_bars_validate(vdev);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    interrupt_type = qemu_get_be32(f);
>> +
>> +    if (interrupt_type == VFIO_INT_MSI) {
>> +        /* restore msi configuration */
>> +        msi_flags = pci_default_read_config(pdev,
>> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
>> +
>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>> +                              msi_flags & (~PCI_MSI_FLAGS_ENABLE), 2);
>> +
>> +        msi_addr_lo = qemu_get_be32(f);
>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
>> +                              msi_addr_lo, 4);
>> +
>> +        msi_addr_hi = qemu_get_be32(f);
>> +        if (msi_64bit) {
>> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
>> +                                  msi_addr_hi, 4);
>> +        }
>> +        msi_data = qemu_get_be32(f);
>> +        vfio_pci_write_config(pdev,
>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
>> +                msi_data, 2);
>> +
>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
>> +    } else if (interrupt_type == VFIO_INT_MSIX) {
>> +        uint16_t offset = qemu_get_be16(f);
>> +
>> +        /* load enable bit and maskall bit */
>> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
>> +                              offset, 2);
>> +        msix_load(pdev, f);
>> +    }
>> +    pci_cmd = qemu_get_be16(f);
>> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
>> +    return 0;
>> +}
> 
> It always seems like there should be a lot more state than this, and I
> probably sound like a broken record because I ask every time, but maybe
> that's a good indication that we (or at least I) need a comment
> explaining why we only care about these.  For example, what if we
> migrate a device in the D3 power state, don't we need to account for
> the state stored in the PM capability or does the device wake up into
> D0 auto-magically after migration?  I think we could repeat that
> question for every capability that can be modified.  Even for the MSI/X
> cases, the interrupt may not be active, but there could be state in
> virtual config space that would be different on the target.  For
> example, if we migrate with a device in INTx mode where the guest had
> written vector fields on the source, but only writes the enable bit on
> the target, can we seamlessly figure out the rest?  For other
> capabilities, that state may represent config space changes written
> through to the physical device and represent a functional difference on
> the target.  Thanks,
>

This is a very basic set of registers from config space. The others are
mostly vendor specific, and the vendor driver can save and restore them in
its own data. I don't think we have to take care of all those vendor
specific fields here.

Thanks,
Kirti

> Alex
> 
>> +
>>   static VFIODeviceOps vfio_pci_ops = {
>>       .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
>>       .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
>>       .vfio_eoi = vfio_intx_eoi,
>>       .vfio_get_object = vfio_pci_get_object,
>> +    .vfio_save_config = vfio_pci_save_config,
>> +    .vfio_load_config = vfio_pci_load_config,
>>   };
>>   
>>   int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 74261feaeac9..d69a7f3ae31e 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -120,6 +120,8 @@ struct VFIODeviceOps {
>>       int (*vfio_hot_reset_multi)(VFIODevice *vdev);
>>       void (*vfio_eoi)(VFIODevice *vdev);
>>       Object *(*vfio_get_object)(VFIODevice *vdev);
>> +    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
>> +    int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
>>   };
>>   
>>   typedef struct VFIOGroup {
> 



* Re: [PATCH v16 QEMU 08/16] vfio: Register SaveVMHandlers for VFIO device
  2020-03-25 21:02   ` Alex Williamson
@ 2020-05-04 23:19     ` Kirti Wankhede
  2020-05-05  4:37       ` Alex Williamson
  0 siblings, 1 reply; 74+ messages in thread
From: Kirti Wankhede @ 2020-05-04 23:19 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue



On 3/26/2020 2:32 AM, Alex Williamson wrote:
> On Wed, 25 Mar 2020 02:39:06 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Define flags to be used as delimiters in the migration file stream.
>> Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
>> region from these functions at source during saving or pre-copy phase.
>> Set VFIO device state depending on VM's state. During live migration, VM is
>> running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
>> device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   hw/vfio/migration.c  | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/trace-events |  2 ++
>>   2 files changed, 78 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 22ded9d28cf3..033f76526e49 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -8,6 +8,7 @@
>>    */
>>   
>>   #include "qemu/osdep.h"
>> +#include "qemu/main-loop.h"
>>   #include <linux/vfio.h>
>>   
>>   #include "sysemu/runstate.h"
>> @@ -24,6 +25,17 @@
>>   #include "pci.h"
>>   #include "trace.h"
>>   
>> +/*
>> + * Flags used as delimiter:
>> + * 0xffffffff => MSB 32-bit all 1s
>> + * 0xef10     => emulated (virtual) function IO
>> + * 0x0000     => 16-bits reserved for flags
>> + */
>> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
>> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
>> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
>> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
>> +
>>   static void vfio_migration_region_exit(VFIODevice *vbasedev)
>>   {
>>       VFIOMigration *migration = vbasedev->migration;
>> @@ -126,6 +138,69 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>>       return 0;
>>   }
>>   
>> +/* ---------------------------------------------------------------------- */
>> +
>> +static int vfio_save_setup(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret;
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>> +
>> +    if (migration->region.mmaps) {
>> +        qemu_mutex_lock_iothread();
>> +        ret = vfio_region_mmap(&migration->region);
>> +        qemu_mutex_unlock_iothread();
>> +        if (ret) {
>> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
>> +                         vbasedev->name, migration->region.index,
>> +                         strerror(-ret));
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    ret = vfio_migration_set_state(vbasedev, ~0, VFIO_DEVICE_STATE_SAVING);
>> +    if (ret) {
>> +        error_report("%s: Failed to set state SAVING", vbasedev->name);
>> +        return ret;
>> +    }
>> +
>> +    /*
>> +     * Save migration region size. This is used to verify migration region size
>> +     * is greater than or equal to migration region size at destination
>> +     */
>> +    qemu_put_be64(f, migration->region.size);
> 
> Is this requirement supported by the uapi?  

Yes, we discussed this on the UAPI thread:

  * For the user application, data is opaque. The user application should
  * write data in the same order as the data is received and the data should
  * be of same transaction size at the source.

The data has to be written in the same transaction sizes as at the source, so
when verifying at the destination, the migration region size should be greater
than or equal to the size at the source.
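
As a rough sketch of that destination-side check (check_region_size() is a
hypothetical helper written for this mail, not a function from the patch; the
source size is assumed to be the value written with qemu_put_be64() in
vfio_save_setup()):

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: src_region_size is the value read from the migration
 * stream, dst_region_size is the size of the local migration region.
 * Returns 0 if a transaction of the source's size fits at the destination.
 */
static int check_region_size(uint64_t src_region_size,
                             uint64_t dst_region_size)
{
    if (dst_region_size < src_region_size) {
        return -EINVAL;
    }
    return 0;
}
```

An equal or larger destination region passes; a smaller one is rejected
before any device data is written.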

> The vendor driver operates
> within the migration region, but it has no requirement to use the full
> extent of the region.  Shouldn't we instead insert the version string
> from versioning API Yan proposed?  Is this where we might choose to use
> an interface via the vfio API rather than sysfs if we had one?
>

The VFIO API cannot be used by libvirt or a management tool stack. For
those we need the sysfs interface that Yan proposed.

Thanks,
Kirti

>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    trace_vfio_save_setup(vbasedev->name);
>> +    return 0;
>> +}
>> +
>> +static void vfio_save_cleanup(void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    if (migration->region.mmaps) {
>> +        vfio_region_unmap(&migration->region);
>> +    }
>> +    trace_vfio_save_cleanup(vbasedev->name);
>> +}
>> +
>> +static SaveVMHandlers savevm_vfio_handlers = {
>> +    .save_setup = vfio_save_setup,
>> +    .save_cleanup = vfio_save_cleanup,
>> +};
>> +
>> +/* ---------------------------------------------------------------------- */
>> +
>>   static void vfio_vmstate_change(void *opaque, int running, RunState state)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> @@ -191,6 +266,7 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>>           return ret;
>>       }
>>   
>> +    register_savevm_live("vfio", -1, 1, &savevm_vfio_handlers, vbasedev);
>>       vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>>                                                             vbasedev);
>>   
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 69503228f20e..4bb43f18f315 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -149,3 +149,5 @@ vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
>>   vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
>>   vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>>   vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
>> +vfio_save_setup(char *name) " (%s)"
>> +vfio_save_cleanup(char *name) " (%s)"
> 



* Re: [PATCH v16 QEMU 04/16] vfio: Add save and load functions for VFIO PCI devices
  2020-03-26 17:46   ` Dr. David Alan Gilbert
@ 2020-05-04 23:19     ` Kirti Wankhede
  0 siblings, 0 replies; 74+ messages in thread
From: Kirti Wankhede @ 2020-05-04 23:19 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue



On 3/26/2020 11:16 PM, Dr. David Alan Gilbert wrote:
> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
>> These functions save and restore PCI device specific data - config
>> space of PCI device.
>> Tested save and restore with MSI and MSIX type.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   hw/vfio/pci.c                 | 163 ++++++++++++++++++++++++++++++++++++++++++
>>   include/hw/vfio/vfio-common.h |   2 +
>>   2 files changed, 165 insertions(+)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 6c77c12e44b9..8deb11e87ef7 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -41,6 +41,7 @@
>>   #include "trace.h"
>>   #include "qapi/error.h"
>>   #include "migration/blocker.h"
>> +#include "migration/qemu-file.h"
>>   
>>   #define TYPE_VFIO_PCI "vfio-pci"
>>   #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
>> @@ -1632,6 +1633,50 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
>>       }
>>   }
>>   
>> +static int vfio_bar_validate(VFIOPCIDevice *vdev, int nr)
>> +{
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    VFIOBAR *bar = &vdev->bars[nr];
>> +    uint64_t addr;
>> +    uint32_t addr_lo, addr_hi = 0;
>> +
>> +    /* Skip unimplemented BARs and the upper half of 64bit BARS. */
>> +    if (!bar->size) {
>> +        return 0;
>> +    }
>> +
>> +    addr_lo = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + nr * 4, 4);
>> +
>> +    addr_lo = addr_lo & (bar->ioport ? PCI_BASE_ADDRESS_IO_MASK :
>> +                                       PCI_BASE_ADDRESS_MEM_MASK);
>> +    if (bar->type == PCI_BASE_ADDRESS_MEM_TYPE_64) {
>> +        addr_hi = pci_default_read_config(pdev,
>> +                                         PCI_BASE_ADDRESS_0 + (nr + 1) * 4, 4);
>> +    }
>> +
>> +    addr = ((uint64_t)addr_hi << 32) | addr_lo;
>> +
>> +    if (!QEMU_IS_ALIGNED(addr, bar->size)) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static int vfio_bars_validate(VFIOPCIDevice *vdev)
>> +{
>> +    int i, ret;
>> +
>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
>> +        ret = vfio_bar_validate(vdev, i);
>> +        if (ret) {
>> +            error_report("vfio: BAR address %d validation failed", i);
>> +            return ret;
>> +        }
>> +    }
>> +    return 0;
>> +}
>> +
>>   static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
>>   {
>>       VFIOBAR *bar = &vdev->bars[nr];
>> @@ -2414,11 +2459,129 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
>>       return OBJECT(vdev);
>>   }
>>   
>> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
>> +{
>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    uint16_t pci_cmd;
>> +    int i;
>> +
>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
>> +        uint32_t bar;
>> +
>> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
>> +        qemu_put_be32(f, bar);
>> +    }
>> +
>> +    qemu_put_be32(f, vdev->interrupt);
>> +    if (vdev->interrupt == VFIO_INT_MSI) {
>> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
>> +        bool msi_64bit;
>> +
>> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>> +                                            2);
>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
>> +
>> +        msi_addr_lo = pci_default_read_config(pdev,
>> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
>> +        qemu_put_be32(f, msi_addr_lo);
>> +
>> +        if (msi_64bit) {
>> +            msi_addr_hi = pci_default_read_config(pdev,
>> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
>> +                                             4);
>> +        }
>> +        qemu_put_be32(f, msi_addr_hi);
>> +
>> +        msi_data = pci_default_read_config(pdev,
>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
>> +                2);
>> +        qemu_put_be32(f, msi_data);
>> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
>> +        uint16_t offset;
>> +
>> +        /* save enable bit and maskall bit */
>> +        offset = pci_default_read_config(pdev,
>> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
>> +        qemu_put_be16(f, offset);
>> +        msix_save(pdev, f);
>> +    }
>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
>> +    qemu_put_be16(f, pci_cmd);
>> +}
>> +
>> +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
>> +{
>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    uint32_t interrupt_type;
>> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
>> +    uint16_t pci_cmd;
>> +    bool msi_64bit;
>> +    int i, ret;
>> +
>> +    /* restore PCI BAR configuration */
>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
>> +    vfio_pci_write_config(pdev, PCI_COMMAND,
>> +                        pci_cmd & (~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
>> +        uint32_t bar = qemu_get_be32(f);
>> +
>> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
>> +    }
>> +
>> +    ret = vfio_bars_validate(vdev);
> 
> This isn't quite what I'd expected, since that validate is reading what
> you read back; I'd have thought you'd validate the bar value before
> writing it to the device.
> (I'm also surprised you're only reading 32bit here?)
> 

Sorry, then I didn't understand exactly what validation needs to be done here.

I'm reading 32 bits here because all PCI_BASE_ADDRESS_* registers are 32-bit
registers. A 64-bit BAR occupies two adjacent registers, which hold its low
and high halves. For example, if BAR0 is 32-bit and BAR1 is 64-bit,
PCI_BASE_ADDRESS_0 is used for BAR0, PCI_BASE_ADDRESS_1 is BAR1_lo and
PCI_BASE_ADDRESS_2 is BAR1_hi; PCI_BASE_ADDRESS_3 will then have the BAR2
address. This is according to the PCIe specification.
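
A small stand-alone sketch of that layout follows. The helper names and the
BAR_* macros are made up for this example and only mirror the PCI memory BAR
flag bits (bit 0 = I/O space, bits 2:1 = type, low 4 bits not part of the
address); this is not the code from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define BAR_MEM_MASK    (~0x0fU) /* low 4 bits of a memory BAR are flags */
#define BAR_TYPE_BITS   0x07U    /* bit 0: I/O space, bits 2:1: memory type */
#define BAR_MEM_TYPE_64 0x04U    /* memory space, 64-bit decoder */

/* True if this register is the low half of a 64-bit memory BAR. */
static bool bar_is_64bit(uint32_t lo)
{
    return (lo & BAR_TYPE_BITS) == BAR_MEM_TYPE_64;
}

/*
 * regs: the six 32-bit BAR registers as read from config space.
 * nr:   index of the register holding the (low half of the) BAR.
 * Returns the BAR address, combining the next register as the high
 * half when the BAR is 64-bit.
 */
static uint64_t bar_address(const uint32_t regs[6], int nr)
{
    uint32_t lo = regs[nr];
    uint64_t addr = lo & BAR_MEM_MASK;

    if (bar_is_64bit(lo) && nr + 1 < 6) {
        addr |= (uint64_t)regs[nr + 1] << 32;
    }
    return addr;
}
```

For the example above, register 0 alone gives the 32-bit BAR0 address, while
registers 1 and 2 together give the 64-bit BAR1 address.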


>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    interrupt_type = qemu_get_be32(f);
>> +
>> +    if (interrupt_type == VFIO_INT_MSI) {
>> +        /* restore msi configuration */
>> +        msi_flags = pci_default_read_config(pdev,
>> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
>> +
>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>> +                              msi_flags & (~PCI_MSI_FLAGS_ENABLE), 2);
>> +
>> +        msi_addr_lo = qemu_get_be32(f);
>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
>> +                              msi_addr_lo, 4);
>> +
>> +        msi_addr_hi = qemu_get_be32(f);
>> +        if (msi_64bit) {
>> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
>> +                                  msi_addr_hi, 4);
>> +        }
>> +        msi_data = qemu_get_be32(f);
>> +        vfio_pci_write_config(pdev,
>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
>> +                msi_data, 2);
>> +
>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
>> +    } else if (interrupt_type == VFIO_INT_MSIX) {
>> +        uint16_t offset = qemu_get_be16(f);
>> +
>> +        /* load enable bit and maskall bit */
>> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
>> +                              offset, 2);
>> +        msix_load(pdev, f);
>> +    }
>> +    pci_cmd = qemu_get_be16(f);
>> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
>> +    return 0;
>> +}
>> +
> 
> While I don't know PCI as well as Alex, I share the worry about what
> happens when you decide to want to save more information about the
> device; you've not got any place holders where you can add anything; and
> since it's all hand-coded (rather than using vmstate) it's only going to
> get hairier.
> 

These are very basic registers of PCI config space. Any further state is
generally vendor specific, and the vendor driver should take care of
saving and restoring the rest of the data in config space.

Thanks,
Kirti

> Dave
> 
>>   static VFIODeviceOps vfio_pci_ops = {
>>       .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
>>       .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
>>       .vfio_eoi = vfio_intx_eoi,
>>       .vfio_get_object = vfio_pci_get_object,
>> +    .vfio_save_config = vfio_pci_save_config,
>> +    .vfio_load_config = vfio_pci_load_config,
>>   };
>>   
>>   int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 74261feaeac9..d69a7f3ae31e 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -120,6 +120,8 @@ struct VFIODeviceOps {
>>       int (*vfio_hot_reset_multi)(VFIODevice *vdev);
>>       void (*vfio_eoi)(VFIODevice *vdev);
>>       Object *(*vfio_get_object)(VFIODevice *vdev);
>> +    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
>> +    int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
>>   };
>>   
>>   typedef struct VFIOGroup {
>> -- 
>> 2.7.0
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 



* Re: [PATCH v16 QEMU 05/16] vfio: Add migration region initialization and finalize function
  2020-03-26 17:52   ` Dr. David Alan Gilbert
@ 2020-05-04 23:19     ` Kirti Wankhede
  2020-05-19 19:32       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 74+ messages in thread
From: Kirti Wankhede @ 2020-05-04 23:19 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue



On 3/26/2020 11:22 PM, Dr. David Alan Gilbert wrote:
> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
>> - Migration functions are implemented for VFIO_DEVICE_TYPE_PCI device in this
>>    patch series.
>> - Whether a VFIO device supports migration is decided based on the migration
>>    region query. If the migration region query and the migration region
>>    initialization both succeed, migration is supported; otherwise migration
>>    is blocked.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   hw/vfio/Makefile.objs         |   2 +-
>>   hw/vfio/migration.c           | 138 ++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/trace-events          |   3 +
>>   include/hw/vfio/vfio-common.h |   9 +++
>>   4 files changed, 151 insertions(+), 1 deletion(-)
>>   create mode 100644 hw/vfio/migration.c
>>
>> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
>> index 9bb1c09e8477..8b296c889ed9 100644
>> --- a/hw/vfio/Makefile.objs
>> +++ b/hw/vfio/Makefile.objs
>> @@ -1,4 +1,4 @@
>> -obj-y += common.o spapr.o
>> +obj-y += common.o spapr.o migration.o
>>   obj-$(CONFIG_VFIO_PCI) += pci.o pci-quirks.o display.o
>>   obj-$(CONFIG_VFIO_CCW) += ccw.o
>>   obj-$(CONFIG_VFIO_PLATFORM) += platform.o
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> new file mode 100644
>> index 000000000000..a078dcf1dd8f
>> --- /dev/null
>> +++ b/hw/vfio/migration.c
>> @@ -0,0 +1,138 @@
>> +/*
>> + * Migration support for VFIO devices
>> + *
>> + * Copyright NVIDIA, Inc. 2019
> 
> Time flies by...
> 
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2. See
>> + * the COPYING file in the top-level directory.
> 
> Are you sure you want this to be V2 only? Most code added to qemu now is
> v2 or later.
> 

I kept it the same as in the vfio-pci files and hw/vfio/common.c.

Should it be different? Can you give a reference for what it should be?

Thanks,
Kirti

>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include <linux/vfio.h>
>> +
>> +#include "hw/vfio/vfio-common.h"
>> +#include "cpu.h"
>> +#include "migration/migration.h"
>> +#include "migration/qemu-file.h"
>> +#include "migration/register.h"
>> +#include "migration/blocker.h"
>> +#include "migration/misc.h"
>> +#include "qapi/error.h"
>> +#include "exec/ramlist.h"
>> +#include "exec/ram_addr.h"
>> +#include "pci.h"
>> +#include "trace.h"
>> +
>> +static void vfio_migration_region_exit(VFIODevice *vbasedev)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    if (!migration) {
>> +        return;
>> +    }
>> +
>> +    if (migration->region.size) {
>> +        vfio_region_exit(&migration->region);
>> +        vfio_region_finalize(&migration->region);
>> +    }
>> +}
>> +
>> +static int vfio_migration_region_init(VFIODevice *vbasedev, int index)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    Object *obj = NULL;
>> +    int ret = -EINVAL;
>> +
>> +    if (!vbasedev->ops->vfio_get_object) {
>> +        return ret;
>> +    }
>> +
>> +    obj = vbasedev->ops->vfio_get_object(vbasedev);
>> +    if (!obj) {
>> +        return ret;
>> +    }
>> +
>> +    ret = vfio_region_setup(obj, vbasedev, &migration->region, index,
>> +                            "migration");
>> +    if (ret) {
>> +        error_report("%s: Failed to setup VFIO migration region %d: %s",
>> +                     vbasedev->name, index, strerror(-ret));
>> +        goto err;
>> +    }
>> +
>> +    if (!migration->region.size) {
>> +        ret = -EINVAL;
>> +        error_report("%s: Invalid region size of VFIO migration region %d: %s",
>> +                     vbasedev->name, index, strerror(-ret));
>> +        goto err;
>> +    }
>> +
>> +    return 0;
>> +
>> +err:
>> +    vfio_migration_region_exit(vbasedev);
>> +    return ret;
>> +}
>> +
>> +static int vfio_migration_init(VFIODevice *vbasedev,
>> +                               struct vfio_region_info *info)
>> +{
>> +    int ret;
>> +
>> +    vbasedev->migration = g_new0(VFIOMigration, 1);
>> +
>> +    ret = vfio_migration_region_init(vbasedev, info->index);
>> +    if (ret) {
>> +        error_report("%s: Failed to initialise migration region",
>> +                     vbasedev->name);
>> +        g_free(vbasedev->migration);
>> +        vbasedev->migration = NULL;
>> +        return ret;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +/* ---------------------------------------------------------------------- */
>> +
>> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>> +{
>> +    struct vfio_region_info *info;
>> +    Error *local_err = NULL;
>> +    int ret;
>> +
>> +    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
>> +                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
>> +    if (ret) {
>> +        goto add_blocker;
>> +    }
>> +
>> +    ret = vfio_migration_init(vbasedev, info);
>> +    if (ret) {
>> +        goto add_blocker;
>> +    }
>> +
>> +    trace_vfio_migration_probe(vbasedev->name, info->index);
>> +    return 0;
>> +
>> +add_blocker:
>> +    error_setg(&vbasedev->migration_blocker,
>> +               "VFIO device doesn't support migration");
>> +    ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        error_free(vbasedev->migration_blocker);
>> +    }
>> +    return ret;
>> +}
>> +
>> +void vfio_migration_finalize(VFIODevice *vbasedev)
>> +{
>> +    if (vbasedev->migration_blocker) {
>> +        migrate_del_blocker(vbasedev->migration_blocker);
>> +        error_free(vbasedev->migration_blocker);
>> +    }
>> +
>> +    vfio_migration_region_exit(vbasedev);
>> +    g_free(vbasedev->migration);
>> +}
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 8cdc27946cb8..191a726a1312 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -143,3 +143,6 @@ vfio_display_edid_link_up(void) ""
>>   vfio_display_edid_link_down(void) ""
>>   vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
>>   vfio_display_edid_write_error(void) ""
>> +
>> +# migration.c
>> +vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index d69a7f3ae31e..d4b268641173 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -57,6 +57,10 @@ typedef struct VFIORegion {
>>       uint8_t nr; /* cache the region number for debug */
>>   } VFIORegion;
>>   
>> +typedef struct VFIOMigration {
>> +    VFIORegion region;
>> +} VFIOMigration;
>> +
>>   typedef struct VFIOAddressSpace {
>>       AddressSpace *as;
>>       QLIST_HEAD(, VFIOContainer) containers;
>> @@ -113,6 +117,8 @@ typedef struct VFIODevice {
>>       unsigned int num_irqs;
>>       unsigned int num_regions;
>>       unsigned int flags;
>> +    VFIOMigration *migration;
>> +    Error *migration_blocker;
>>   } VFIODevice;
>>   
>>   struct VFIODeviceOps {
>> @@ -204,4 +210,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
>>   int vfio_spapr_remove_window(VFIOContainer *container,
>>                                hwaddr offset_within_address_space);
>>   
>> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
>> +void vfio_migration_finalize(VFIODevice *vbasedev);
>> +
>>   #endif /* HW_VFIO_VFIO_COMMON_H */
>> -- 
>> 2.7.0
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 



* Re: [PATCH v16 QEMU 13/16] vfio: Add function to start and stop dirty pages tracking
  2020-03-26 19:10   ` Alex Williamson
@ 2020-05-04 23:20     ` Kirti Wankhede
  0 siblings, 0 replies; 74+ messages in thread
From: Kirti Wankhede @ 2020-05-04 23:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue



On 3/27/2020 12:40 AM, Alex Williamson wrote:
> On Wed, 25 Mar 2020 02:39:11 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Call VFIO_IOMMU_DIRTY_PAGES ioctl to start and stop dirty pages tracking
>> for VFIO devices.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> ---
>>   hw/vfio/migration.c | 36 ++++++++++++++++++++++++++++++++++++
>>   1 file changed, 36 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index ab295d25620e..1827b7cfb316 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -9,6 +9,7 @@
>>   
>>   #include "qemu/osdep.h"
>>   #include "qemu/main-loop.h"
>> +#include <sys/ioctl.h>
>>   #include <linux/vfio.h>
>>   
>>   #include "sysemu/runstate.h"
>> @@ -296,6 +297,32 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>>       return qemu_file_get_error(f);
>>   }
>>   
>> +static int vfio_start_dirty_page_tracking(VFIODevice *vbasedev, bool start)
>> +{
>> +    int ret;
>> +    VFIOContainer *container = vbasedev->group->container;
>> +    struct vfio_iommu_type1_dirty_bitmap dirty = {
>> +        .argsz = sizeof(dirty),
>> +    };
>> +
>> +    if (start) {
>> +        if (vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) {
>> +            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
>> +        } else {
>> +            return 0;
>> +        }
>> +    } else {
>> +            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
>> +    }
> 
> Dirty logging and device saving are logically separate, why do we link
> them here?
> 

Dirty logging is associated with migration state, and in the vfio case we
learn that migration state per device. We don't know which device is first
or last, so dirty page logging is started in .save_setup. But this function
can be called from other places as well, so as a sanity check, dirty page
tracking is started only when the VFIO_DEVICE_STATE_SAVING flag is set.

> Why do we return success when we want to start logging if we haven't
> started logging?
> 

It should be -EINVAL, since dirty page tracking shouldn't start if the
VFIO_DEVICE_STATE_SAVING flag is not set, i.e. devices are not in the SAVING
state.
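
A minimal standalone sketch of the flag selection with -EINVAL returned for
that case. The DEMO_* constants are local stand-ins for illustration, not the
values from <linux/vfio.h>, and pick_dirty_flags() is a hypothetical helper
name that does not exist in the patch:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in values for illustration only; the real ones live in <linux/vfio.h>. */
#define DEMO_STATE_SAVING      (1u << 1)
#define DEMO_DIRTY_FLAG_START  (1u << 0)
#define DEMO_DIRTY_FLAG_STOP   (1u << 1)

/*
 * Pick the dirty-tracking flag for a start/stop request.
 * Returns 0 and fills *flags on success, or -EINVAL when asked to start
 * tracking while the device is not in the SAVING state.
 */
static int pick_dirty_flags(uint32_t device_state, bool start, uint32_t *flags)
{
    if (!start) {
        /* Stopping is always allowed, whatever the device state is. */
        *flags = DEMO_DIRTY_FLAG_STOP;
        return 0;
    }
    if (!(device_state & DEMO_STATE_SAVING)) {
        return -EINVAL;
    }
    *flags = DEMO_DIRTY_FLAG_START;
    return 0;
}
```

With this shape a caller can no longer mistake "tracking was not started" for
success, which is the concern raised in the question above.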

>> +
>> +    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
>> +    if (ret) {
>> +        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
>> +                     dirty.flags, errno);
>> +    }
>> +    return ret;
>> +}
>> +
>>   /* ---------------------------------------------------------------------- */
>>   
>>   static int vfio_save_setup(QEMUFile *f, void *opaque)
>> @@ -330,6 +357,11 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>>        */
>>       qemu_put_be64(f, migration->region.size);
>>   
>> +    ret = vfio_start_dirty_page_tracking(vbasedev, true);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
> 
> Haven't we corrupted the migration stream by exiting here?  Maybe this
> implies the entire migration fails, therefore we don't need to add the
> end marker?  Thanks,
> 

If an error is returned here, the migration fails.

Thanks,
Kirti

> Alex
> 
>>       qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>   
>>       ret = qemu_file_get_error(f);
>> @@ -346,6 +378,8 @@ static void vfio_save_cleanup(void *opaque)
>>       VFIODevice *vbasedev = opaque;
>>       VFIOMigration *migration = vbasedev->migration;
>>   
>> +    vfio_start_dirty_page_tracking(vbasedev, false);
>> +
>>       if (migration->region.mmaps) {
>>           vfio_region_unmap(&migration->region);
>>       }
>> @@ -669,6 +703,8 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>>           if (ret) {
>>               error_report("%s: Failed to set state RUNNING", vbasedev->name);
>>           }
>> +
>> +        vfio_start_dirty_page_tracking(vbasedev, false);
>>       }
>>   }
>>   
> 



* Re: [PATCH v16 QEMU 07/16] vfio: Add migration state change notifier
  2020-04-01 11:27   ` Dr. David Alan Gilbert
@ 2020-05-04 23:20     ` Kirti Wankhede
  0 siblings, 0 replies; 74+ messages in thread
From: Kirti Wankhede @ 2020-05-04 23:20 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue



On 4/1/2020 4:57 PM, Dr. David Alan Gilbert wrote:
> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
>> Added migration state change notifier to get notification on migration state
>> change. These states are translated to VFIO device state and conveyed to vendor
>> driver.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   hw/vfio/migration.c           | 29 +++++++++++++++++++++++++++++
>>   hw/vfio/trace-events          |  1 +
>>   include/hw/vfio/vfio-common.h |  1 +
>>   3 files changed, 31 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index af9443c275fb..22ded9d28cf3 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -154,6 +154,27 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state)
>>       }
>>   }
>>   
>> +static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>> +{
>> +    MigrationState *s = data;
>> +    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
>> +    int ret;
>> +
>> +    trace_vfio_migration_state_notifier(vbasedev->name, s->state);
> 
> You might want to use MigrationStatus_str(s->state) to make that
> readable.
> 

Yes.

>> +    switch (s->state) {
>> +    case MIGRATION_STATUS_CANCELLING:
>> +    case MIGRATION_STATUS_CANCELLED:
>> +    case MIGRATION_STATUS_FAILED:
>> +        ret = vfio_migration_set_state(vbasedev,
>> +                      ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING),
>> +                      VFIO_DEVICE_STATE_RUNNING);
>> +        if (ret) {
>> +            error_report("%s: Failed to set state RUNNING", vbasedev->name);
>> +        }
> 
> In the migration code we check to see if the VM was running prior to the
> start of the migration before we start the CPUs going again (see
> migration_iteration_finish):
>      case MIGRATION_STATUS_FAILED:
>      case MIGRATION_STATUS_CANCELLED:
>      case MIGRATION_STATUS_CANCELLING:
>          if (s->vm_was_running) {
>              vm_start();
>          } else {
>              if (runstate_check(RUN_STATE_FINISH_MIGRATE)) {
>                  runstate_set(RUN_STATE_POSTMIGRATE);
>              }
> 
> so if the guest was paused before a migration we don't falsely restart
> it.  Maybe you need something similar?
> 

A paused guest means the vCPUs are paused, but that doesn't pause the
device. The initial state of a VFIO device is also RUNNING, and the device
will not get any instructions until the vCPUs are running. So I think
putting the device in RUNNING is still fine.
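
A small sketch of the resulting transition, assuming the (mask, value) pair
passed to vfio_migration_set_state() is applied as (current & mask) | value
and that the state bits follow the v1 migration UAPI layout. The DEMO_* names
and both helpers are hypothetical and defined locally for illustration:

```c
#include <stdint.h>

/* Bit layout assumed to match the v1 VFIO migration UAPI; defined locally. */
#define DEMO_STATE_RUNNING   (1u << 0)
#define DEMO_STATE_SAVING    (1u << 1)
#define DEMO_STATE_RESUMING  (1u << 2)

/* Assumed semantics: keep the bits selected by mask, then OR in value. */
static uint32_t demo_next_state(uint32_t cur, uint32_t mask, uint32_t value)
{
    return (cur & mask) | value;
}

/* State after a cancelled or failed migration, as in the notifier above. */
static uint32_t demo_state_after_cancel(uint32_t cur)
{
    return demo_next_state(cur,
                           ~(DEMO_STATE_SAVING | DEMO_STATE_RESUMING),
                           DEMO_STATE_RUNNING);
}
```

Under these assumptions every cancelled or failed migration leaves the device
in plain RUNNING, whether it was in SAVING, SAVING | RUNNING or RESUMING
before.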

Thanks,
Kirti

> Dave
> 
>> +    }
>> +}
>> +
>>   static int vfio_migration_init(VFIODevice *vbasedev,
>>                                  struct vfio_region_info *info)
>>   {
>> @@ -173,6 +194,9 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>>       vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>>                                                             vbasedev);
>>   
>> +    vbasedev->migration_state.notify = vfio_migration_state_notifier;
>> +    add_migration_state_change_notifier(&vbasedev->migration_state);
>> +
>>       return 0;
>>   }
>>   
>> @@ -211,6 +235,11 @@ add_blocker:
>>   
>>   void vfio_migration_finalize(VFIODevice *vbasedev)
>>   {
>> +
>> +    if (vbasedev->migration_state.notify) {
>> +        remove_migration_state_change_notifier(&vbasedev->migration_state);
>> +    }
>> +
>>       if (vbasedev->vm_state) {
>>           qemu_del_vm_change_state_handler(vbasedev->vm_state);
>>       }
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 3d15bacd031a..69503228f20e 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -148,3 +148,4 @@ vfio_display_edid_write_error(void) ""
>>   vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
>>   vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
>>   vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>> +vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 3d18eb146b33..28f55f66d019 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -123,6 +123,7 @@ typedef struct VFIODevice {
>>       VMChangeStateEntry *vm_state;
>>       uint32_t device_state;
>>       int vm_running;
>> +    Notifier migration_state;
>>   } VFIODevice;
>>   
>>   struct VFIODeviceOps {
>> -- 
>> 2.7.0
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 



* Re: [PATCH v16 QEMU 08/16] vfio: Register SaveVMHandlers for VFIO device
  2020-04-01 17:36   ` Dr. David Alan Gilbert
@ 2020-05-04 23:20     ` Kirti Wankhede
  0 siblings, 0 replies; 74+ messages in thread
From: Kirti Wankhede @ 2020-05-04 23:20 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue



On 4/1/2020 11:06 PM, Dr. David Alan Gilbert wrote:
> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
>> Define flags to be used as delimiter in migration file stream.
>> Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
>> region from these functions at source during saving or pre-copy phase.
>> Set VFIO device state depending on VM's state. During live migration, VM is
>> running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
>> device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   hw/vfio/migration.c  | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/trace-events |  2 ++
>>   2 files changed, 78 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 22ded9d28cf3..033f76526e49 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -8,6 +8,7 @@
>>    */
>>   
>>   #include "qemu/osdep.h"
>> +#include "qemu/main-loop.h"
>>   #include <linux/vfio.h>
>>   
>>   #include "sysemu/runstate.h"
>> @@ -24,6 +25,17 @@
>>   #include "pci.h"
>>   #include "trace.h"
>>   
>> +/*
>> + * Flags used as delimiter:
>> + * 0xffffffff => MSB 32-bit all 1s
>> + * 0xef10     => emulated (virtual) function IO
>> + * 0x0000     => 16-bits reserved for flags
>> + */
>> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
>> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
>> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
>> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
>> +
>>   static void vfio_migration_region_exit(VFIODevice *vbasedev)
>>   {
>>       VFIOMigration *migration = vbasedev->migration;
>> @@ -126,6 +138,69 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>>       return 0;
>>   }
>>   
>> +/* ---------------------------------------------------------------------- */
>> +
>> +static int vfio_save_setup(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret;
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>> +
>> +    if (migration->region.mmaps) {
>> +        qemu_mutex_lock_iothread();
>> +        ret = vfio_region_mmap(&migration->region);
>> +        qemu_mutex_unlock_iothread();
>> +        if (ret) {
>> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
>> +                         vbasedev->name, migration->region.index,
>> +                         strerror(-ret));
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    ret = vfio_migration_set_state(vbasedev, ~0, VFIO_DEVICE_STATE_SAVING);
>> +    if (ret) {
>> +        error_report("%s: Failed to set state SAVING", vbasedev->name);
>> +        return ret;
>> +    }
>> +
>> +    /*
>> +     * Save migration region size. This is used to verify migration region size
>> +     * is greater than or equal to migration region size at destination
>> +     */
>> +    qemu_put_be64(f, migration->region.size);
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> 
> OK, good, so now we can change that to something else if you want to
> migrate something extra in the future.
> 
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    trace_vfio_save_setup(vbasedev->name);
> 
> I'd put that trace at the start of the function.
> 
>> +    return 0;
>> +}
>> +
>> +static void vfio_save_cleanup(void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    if (migration->region.mmaps) {
>> +        vfio_region_unmap(&migration->region);
>> +    }
>> +    trace_vfio_save_cleanup(vbasedev->name);
>> +}
>> +
>> +static SaveVMHandlers savevm_vfio_handlers = {
>> +    .save_setup = vfio_save_setup,
>> +    .save_cleanup = vfio_save_cleanup,
>> +};
>> +
>> +/* ---------------------------------------------------------------------- */
>> +
>>   static void vfio_vmstate_change(void *opaque, int running, RunState state)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> @@ -191,6 +266,7 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>>           return ret;
>>       }
>>   
>> +    register_savevm_live("vfio", -1, 1, &savevm_vfio_handlers, vbasedev);
> 
> That doesn't look right to me;  firstly the -1 should now be
> VMSTATE_INSTANCE_ID_ANY - after the recent change in commit 1df2c9a
> 
> Have you tried this with two vfio devices?

Yes. And it works with multiple vfio devices.

Thanks,
Kirti

> This is quite rare - it's an iterative device that can have
> multiple instances;  if you look at 'ram' for example, all the RAM
> instances are handled inside the save_setup/save for the one instance of
> 'ram'.  I think here you're trying to register an individual vfio
> device, so if you had multiple devices you'd see this called twice.
> 
> So either you need to make vfio_save_* do all of the devices in a loop -
> which feels like a bad idea;  or replace "vfio" in that call by a unique
> device name;  as long as your device has a bus path then you should be
> able to use the same trick vmstate_register_with_alias_id does, and use
> I think, vmstate_if_get_id(VMSTATE_IF(vbasedev)).
> 
> but it might take some experimentation since this is an odd use.
> 
>>       vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>>                                                             vbasedev);
>>   
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 69503228f20e..4bb43f18f315 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -149,3 +149,5 @@ vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
>>   vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
>>   vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>>   vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
>> +vfio_save_setup(char *name) " (%s)"
>> +vfio_save_cleanup(char *name) " (%s)"
>> -- 
>> 2.7.0
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 



* Re: [PATCH v16 QEMU 10/16] vfio: Add load state functions to SaveVMHandlers
  2020-04-01 18:58   ` Dr. David Alan Gilbert
@ 2020-05-04 23:20     ` Kirti Wankhede
  0 siblings, 0 replies; 74+ messages in thread
From: Kirti Wankhede @ 2020-05-04 23:20 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue



On 4/2/2020 12:28 AM, Dr. David Alan Gilbert wrote:
> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
>> Sequence during _RESUMING device state:
>> While data for this device is available, repeat the steps below:
>> a. read data_offset from where the user application should write data.
>> b. write data of data_size to the migration region from data_offset.
>> c. write data_size, which indicates to the vendor driver that data has been
>>     written to the staging buffer.
>>
>> For the user, data is opaque. The user should write data in the same order
>> as received.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   hw/vfio/migration.c  | 179 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/trace-events |   3 +
>>   2 files changed, 182 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index ecbeed5182c2..ab295d25620e 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -269,6 +269,33 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
>>       return qemu_file_get_error(f);
>>   }
>>   
>> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    uint64_t data;
>> +
>> +    if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
>> +        int ret;
>> +
>> +        ret = vbasedev->ops->vfio_load_config(vbasedev, f);
>> +        if (ret) {
>> +            error_report("%s: Failed to load device config space",
>> +                         vbasedev->name);
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    data = qemu_get_be64(f);
>> +    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
>> +        error_report("%s: Failed loading device config space, "
>> +                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
>> +        return -EINVAL;
>> +    }
>> +
>> +    trace_vfio_load_device_config_state(vbasedev->name);
>> +    return qemu_file_get_error(f);
>> +}
>> +
>>   /* ---------------------------------------------------------------------- */
>>   
>>   static int vfio_save_setup(QEMUFile *f, void *opaque)
>> @@ -434,12 +461,164 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>       return ret;
>>   }
>>   
>> +static int vfio_load_setup(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret = 0;
>> +
>> +    if (migration->region.mmaps) {
>> +        ret = vfio_region_mmap(&migration->region);
>> +        if (ret) {
>> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
>> +                         vbasedev->name, migration->region.nr,
>> +                         strerror(-ret));
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    ret = vfio_migration_set_state(vbasedev, ~0, VFIO_DEVICE_STATE_RESUMING);
>> +    if (ret) {
>> +        error_report("%s: Failed to set state RESUMING", vbasedev->name);
>> +    }
>> +    return ret;
>> +}
>> +
>> +static int vfio_load_cleanup(void *opaque)
>> +{
>> +    vfio_save_cleanup(opaque);
>> +    return 0;
>> +}
>> +
>> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret = 0;
>> +    uint64_t data, data_size;
>> +
>> +    data = qemu_get_be64(f);
>> +    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
>> +
>> +        trace_vfio_load_state(vbasedev->name, data);
>> +
>> +        switch (data) {
>> +        case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
>> +        {
>> +            ret = vfio_load_device_config_state(f, opaque);
>> +            if (ret) {
>> +                return ret;
>> +            }
>> +            break;
>> +        }
>> +        case VFIO_MIG_FLAG_DEV_SETUP_STATE:
>> +        {
>> +            uint64_t region_size = qemu_get_be64(f);
>> +
>> +            if (migration->region.size < region_size) {
>> +                error_report("%s: SETUP STATE: migration region too small, "
>> +                             "0x%"PRIx64 " < 0x%"PRIx64, vbasedev->name,
>> +                             migration->region.size, region_size);
>> +                return -EINVAL;
>> +            }
>> +
>> +            data = qemu_get_be64(f);
>> +            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
> 
> Can you explain why you're reading this here rather than letting it drop
> through to the read at the end of the loop?
> 

To make sure the sequence is followed; otherwise an error is thrown.
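
The expected sequence for a SETUP_STATE record can be sketched standalone as
below: a big-endian region size followed immediately by the end-of-state
marker. The marker value is the one defined in patch 08/16; demo_get_be64()
and demo_check_setup_record() are hypothetical helpers that work on a plain
byte buffer rather than a QEMUFile:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define VFIO_MIG_FLAG_END_OF_STATE  (0xffffffffef100001ULL)

/* Decode one big-endian 64-bit value from a byte buffer. */
static uint64_t demo_get_be64(const uint8_t *p)
{
    uint64_t v = 0;

    for (size_t i = 0; i < 8; i++) {
        v = (v << 8) | p[i];
    }
    return v;
}

/*
 * Check the body of a SETUP_STATE record: region size, then end marker.
 * Returns 0 on success, or -EINVAL if the buffer is short, the local
 * region is smaller than the source's, or the marker is missing.
 */
static int demo_check_setup_record(const uint8_t *buf, size_t len,
                                   uint64_t local_region_size)
{
    uint64_t region_size, marker;

    if (len < 16) {
        return -EINVAL;
    }
    region_size = demo_get_be64(buf);
    marker = demo_get_be64(buf + 8);

    if (local_region_size < region_size) {
        return -EINVAL;
    }
    return marker == VFIO_MIG_FLAG_END_OF_STATE ? 0 : -EINVAL;
}
```

Reading the marker inside the case, rather than at the bottom of the loop,
means any other tag directly after the region size is rejected instead of
being dispatched as the next record.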

>> +                return ret;
>> +            } else {
>> +                error_report("%s: SETUP STATE: EOS not found 0x%"PRIx64,
>> +                             vbasedev->name, data);
>> +                return -EINVAL;
>> +            }
>> +            break;
>> +        }
>> +        case VFIO_MIG_FLAG_DEV_DATA_STATE:
>> +        {
>> +            VFIORegion *region = &migration->region;
>> +            void *buf = NULL;
>> +            bool buffer_mmaped = false;
>> +            uint64_t data_offset = 0;
>> +
>> +            data_size = qemu_get_be64(f);
>> +            if (data_size == 0) {
>> +                break;
>> +            }
>> +
>> +            ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
>> +                        region->fd_offset +
>> +                        offsetof(struct vfio_device_migration_info,
>> +                        data_offset));
>> +            if (ret != sizeof(data_offset)) {
>> +                error_report("%s:Failed to get migration buffer data offset %d",
>> +                             vbasedev->name, ret);
>> +                return -EINVAL;
>> +            }
>> +
>> +            if (region->mmaps) {
>> +                buf = find_data_region(region, data_offset, data_size);
>> +            }
>> +
>> +            buffer_mmaped = (buf != NULL) ? true : false;
>> +
>> +            if (!buffer_mmaped) {
>> +                buf = g_try_malloc0(data_size);
> 
> data_size has been read off the wire at this point; can we sanity check
> it?
> 

I did add a check above (data_size == 0), but what should it be sanity checked against here?
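
One candidate bound, sketched under the assumption that a chunk written at
data_offset must lie entirely inside the migration region. This is only an
illustration of the kind of check being discussed, not something the patch
does, and demo_chunk_fits() is a hypothetical helper name:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical bound: the chunk described by (data_offset, data_size) must
 * fit inside a region of region_size bytes. The comparison is arranged so
 * that data_offset + data_size is never computed and cannot overflow.
 */
static bool demo_chunk_fits(uint64_t region_size, uint64_t data_offset,
                            uint64_t data_size)
{
    return data_offset <= region_size &&
           data_size <= region_size - data_offset;
}
```

A check of this kind would run before the g_try_malloc0(data_size) call, so
that a corrupt or hostile stream cannot request an arbitrarily large buffer.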

Thanks,
Kirti

>> +                if (!buf) {
>> +                    error_report("%s: Error allocating buffer ", __func__);
>> +                    return -ENOMEM;
>> +                }
>> +            }
>> +
>> +            qemu_get_buffer(f, buf, data_size);
>> +
>> +            if (!buffer_mmaped) {
>> +                ret = pwrite(vbasedev->fd, buf, data_size,
>> +                             region->fd_offset + data_offset);
>> +                g_free(buf);
>> +
>> +                if (ret != data_size) {
>> +                    error_report("%s: Failed to set migration buffer %d",
>> +                                 vbasedev->name, ret);
>> +                    return -EINVAL;
>> +                }
>> +            }
>> +
>> +            ret = pwrite(vbasedev->fd, &data_size, sizeof(data_size),
>> +                         region->fd_offset +
>> +                       offsetof(struct vfio_device_migration_info, data_size));
>> +            if (ret != sizeof(data_size)) {
>> +                error_report("%s: Failed to set migration buffer data size %d",
>> +                             vbasedev->name, ret);
>> +                if (!buffer_mmaped) {
>> +                    g_free(buf);
>> +                }
>> +                return -EINVAL;
>> +            }
>> +
>> +            trace_vfio_load_state_device_data(vbasedev->name, data_offset,
>> +                                              data_size);
>> +            break;
>> +        }
> 
> I'd add here a default:  that complains about an unknown tag.
> 
>> +        }
>> +
>> +        ret = qemu_file_get_error(f);
>> +        if (ret) {
>> +            return ret;
>> +        }
>> +        data = qemu_get_be64(f);
> 
> I'd also check file_get_error again at this point; if you're unlucky you
> get junk in 'data' and things get more confusing.
> 
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>>   static SaveVMHandlers savevm_vfio_handlers = {
>>       .save_setup = vfio_save_setup,
>>       .save_cleanup = vfio_save_cleanup,
>>       .save_live_pending = vfio_save_pending,
>>       .save_live_iterate = vfio_save_iterate,
>>       .save_live_complete_precopy = vfio_save_complete_precopy,
>> +    .load_setup = vfio_load_setup,
>> +    .load_cleanup = vfio_load_cleanup,
>> +    .load_state = vfio_load_state,
>>   };
>>   
>>   /* ---------------------------------------------------------------------- */
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index bdf40ba368c7..ac065b559f4e 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -157,3 +157,6 @@ vfio_save_device_config_state(char *name) " (%s)"
>>   vfio_save_pending(char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
>>   vfio_save_iterate(char *name, int data_size) " (%s) data_size %d"
>>   vfio_save_complete_precopy(char *name) " (%s)"
>> +vfio_load_device_config_state(char *name) " (%s)"
>> +vfio_load_state(char *name, uint64_t data) " (%s) data 0x%"PRIx64
> 
> Please use const char*'s in traces.
> 
>> +vfio_load_state_device_data(char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
>> -- 
>> 2.7.0
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 



* Re: [PATCH v16 QEMU 13/16] vfio: Add function to start and stop dirty pages tracking
  2020-04-01 19:03   ` Dr. David Alan Gilbert
@ 2020-05-04 23:21     ` Kirti Wankhede
  0 siblings, 0 replies; 74+ messages in thread
From: Kirti Wankhede @ 2020-05-04 23:21 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue



On 4/2/2020 12:33 AM, Dr. David Alan Gilbert wrote:
> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
>> Call VFIO_IOMMU_DIRTY_PAGES ioctl to start and stop dirty pages tracking
>> for VFIO devices.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> ---
>>   hw/vfio/migration.c | 36 ++++++++++++++++++++++++++++++++++++
>>   1 file changed, 36 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index ab295d25620e..1827b7cfb316 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -9,6 +9,7 @@
>>   
>>   #include "qemu/osdep.h"
>>   #include "qemu/main-loop.h"
>> +#include <sys/ioctl.h>
>>   #include <linux/vfio.h>
>>   
>>   #include "sysemu/runstate.h"
>> @@ -296,6 +297,32 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>>       return qemu_file_get_error(f);
>>   }
>>   
>> +static int vfio_start_dirty_page_tracking(VFIODevice *vbasedev, bool start)
>> +{
>> +    int ret;
>> +    VFIOContainer *container = vbasedev->group->container;
>> +    struct vfio_iommu_type1_dirty_bitmap dirty = {
>> +        .argsz = sizeof(dirty),
>> +    };
>> +
>> +    if (start) {
>> +        if (vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) {
>> +            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
>> +        } else {
>> +            return 0;
>> +        }
>> +    } else {
>> +        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
>> +    }
>> +
>> +    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
>> +    if (ret) {
>> +        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
>> +                     dirty.flags, errno);
>> +    }
>> +    return ret;
>> +}
>> +
>>   /* ---------------------------------------------------------------------- */
>>   
>>   static int vfio_save_setup(QEMUFile *f, void *opaque)
>> @@ -330,6 +357,11 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>>        */
>>       qemu_put_be64(f, migration->region.size);
>>   
>> +    ret = vfio_start_dirty_page_tracking(vbasedev, true);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>>       qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>   
>>       ret = qemu_file_get_error(f);
>> @@ -346,6 +378,8 @@ static void vfio_save_cleanup(void *opaque)
>>       VFIODevice *vbasedev = opaque;
>>       VFIOMigration *migration = vbasedev->migration;
>>   
>> +    vfio_start_dirty_page_tracking(vbasedev, false);
> 
> Shouldn't you check the return value?
> 

Even if the return value is checked, it will be ignored, since this
function returns void.

Thanks,
Kirti
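
A void caller can still surface the failure. A minimal sketch, using
stand-in functions (stop_tracking() and cleanup() are hypothetical, not QEMU
APIs), of reporting the error even though it cannot be propagated:

```c
#include <stdio.h>

/* Stand-in for the ioctl wrapper; fails when fail_with is non-zero. */
static int stop_tracking(int fail_with)
{
    return fail_with ? -fail_with : 0;
}

/* Counts reported failures so the behaviour can be observed. */
static int reported_failures;

/* void cleanup path: cannot return the error, but does not drop it silently. */
static void cleanup(int fail_with)
{
    int ret = stop_tracking(fail_with);

    if (ret) {
        fprintf(stderr, "failed to stop dirty page tracking: %d\n", ret);
        reported_failures++;
    }
}
```

Note that in the patch itself vfio_start_dirty_page_tracking() already calls
error_report() on failure, so the failure is logged either way.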

>> +
>>       if (migration->region.mmaps) {
>>           vfio_region_unmap(&migration->region);
>>       }
>> @@ -669,6 +703,8 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>>           if (ret) {
>>               error_report("%s: Failed to set state RUNNING", vbasedev->name);
>>           }
>> +
>> +        vfio_start_dirty_page_tracking(vbasedev, false);
>>       }
>>   }
>>   
>> -- 
>> 2.7.0
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 



* Re: [PATCH v16 QEMU 04/16] vfio: Add save and load functions for VFIO PCI devices
  2020-04-07  4:10   ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
@ 2020-05-04 23:21     ` Kirti Wankhede
  0 siblings, 0 replies; 74+ messages in thread
From: Kirti Wankhede @ 2020-05-04 23:21 UTC (permalink / raw)
  To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.),
	alex.williamson
  Cc: kevin.tian, yi.l.liu, yan.y.zhao, felipe, eskultet, ziye.yang,
	Ken.Xue, Zhengxiao.zx, shuangtai.tst, qemu-devel, dgilbert,
	pasic, aik, Gonglei (Arei),
	eauger, cohuck, jonathan.davies, cjia, mlevitsk, changpeng.liu,
	zhi.a.wang



On 4/7/2020 9:40 AM, Longpeng (Mike, Cloud Infrastructure Service 
Product Dept.) wrote:
> 
> 
> On 2020/3/25 5:09, Kirti Wankhede wrote:
>> These functions save and restore PCI device specific data - config
>> space of PCI device.
>> Tested save and restore with MSI and MSIX type.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   hw/vfio/pci.c                 | 163 ++++++++++++++++++++++++++++++++++++++++++
>>   include/hw/vfio/vfio-common.h |   2 +
>>   2 files changed, 165 insertions(+)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 6c77c12e44b9..8deb11e87ef7 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -41,6 +41,7 @@
>>   #include "trace.h"
>>   #include "qapi/error.h"
>>   #include "migration/blocker.h"
>> +#include "migration/qemu-file.h"
>>   
>>   #define TYPE_VFIO_PCI "vfio-pci"
>>   #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
>> @@ -1632,6 +1633,50 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
>>       }
>>   }
>>   
>> +static int vfio_bar_validate(VFIOPCIDevice *vdev, int nr)
>> +{
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    VFIOBAR *bar = &vdev->bars[nr];
>> +    uint64_t addr;
>> +    uint32_t addr_lo, addr_hi = 0;
>> +
>> +    /* Skip unimplemented BARs and the upper half of 64bit BARS. */
>> +    if (!bar->size) {
>> +        return 0;
>> +    }
>> +
>> +    addr_lo = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + nr * 4, 4);
>> +
>> +    addr_lo = addr_lo & (bar->ioport ? PCI_BASE_ADDRESS_IO_MASK :
>> +                                       PCI_BASE_ADDRESS_MEM_MASK);
>> +    if (bar->type == PCI_BASE_ADDRESS_MEM_TYPE_64) {
>> +        addr_hi = pci_default_read_config(pdev,
>> +                                         PCI_BASE_ADDRESS_0 + (nr + 1) * 4, 4);
>> +    }
>> +
>> +    addr = ((uint64_t)addr_hi << 32) | addr_lo;
>> +
>> +    if (!QEMU_IS_ALIGNED(addr, bar->size)) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static int vfio_bars_validate(VFIOPCIDevice *vdev)
>> +{
>> +    int i, ret;
>> +
>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
>> +        ret = vfio_bar_validate(vdev, i);
>> +        if (ret) {
>> +            error_report("vfio: BAR address %d validation failed", i);
>> +            return ret;
>> +        }
>> +    }
>> +    return 0;
>> +}
>> +
>>   static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
>>   {
>>       VFIOBAR *bar = &vdev->bars[nr];
>> @@ -2414,11 +2459,129 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
>>       return OBJECT(vdev);
>>   }
>>   
>> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
>> +{
>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    uint16_t pci_cmd;
>> +    int i;
>> +
>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
>> +        uint32_t bar;
>> +
>> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
>> +        qemu_put_be32(f, bar);
>> +    }
>> +
>> +    qemu_put_be32(f, vdev->interrupt);
>> +    if (vdev->interrupt == VFIO_INT_MSI) {
>> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
>> +        bool msi_64bit;
>> +
>> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>> +                                            2);
>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
>> +
>> +        msi_addr_lo = pci_default_read_config(pdev,
>> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
>> +        qemu_put_be32(f, msi_addr_lo);
>> +
>> +        if (msi_64bit) {
>> +            msi_addr_hi = pci_default_read_config(pdev,
>> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
>> +                                             4);
>> +        }
>> +        qemu_put_be32(f, msi_addr_hi);
>> +
>> +        msi_data = pci_default_read_config(pdev,
>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
>> +                2);
>> +        qemu_put_be32(f, msi_data);
>> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
>> +        uint16_t offset;
>> +
>> +        /* save enable bit and maskall bit */
>> +        offset = pci_default_read_config(pdev,
>> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
>> +        qemu_put_be16(f, offset);
>> +        msix_save(pdev, f);
>> +    }
>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
>> +    qemu_put_be16(f, pci_cmd);
>> +}
>> +
>> +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
>> +{
>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    uint32_t interrupt_type;
>> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
>> +    uint16_t pci_cmd;
>> +    bool msi_64bit;
>> +    int i, ret;
>> +
>> +    /* restore pci bar configuration */
>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
>> +    vfio_pci_write_config(pdev, PCI_COMMAND,
>> +                        pci_cmd & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
>> +        uint32_t bar = qemu_get_be32(f);
>> +
>> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
>> +    }
>> +
>> +    ret = vfio_bars_validate(vdev);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    interrupt_type = qemu_get_be32(f);
>> +
>> +    if (interrupt_type == VFIO_INT_MSI) {
>> +        /* restore msi configuration */
>> +        msi_flags = pci_default_read_config(pdev,
>> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
>> +
>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>> +                              msi_flags & ~PCI_MSI_FLAGS_ENABLE, 2);
>> +
>> +        msi_addr_lo = qemu_get_be32(f);
>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
>> +                              msi_addr_lo, 4);
>> +
>> +        msi_addr_hi = qemu_get_be32(f);
>> +        if (msi_64bit) {
>> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
>> +                                  msi_addr_hi, 4);
>> +        }
>> +        msi_data = qemu_get_be32(f);
>> +        vfio_pci_write_config(pdev,
>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
>> +                msi_data, 2);
>> +
>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
>> +    } else if (interrupt_type == VFIO_INT_MSIX) {
>> +        uint16_t offset = qemu_get_be16(f);
>> +
>> +        /* load enable bit and maskall bit */
>> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
>> +                              offset, 2);
>> +        msix_load(pdev, f);
> Hi Kirti, Alex
> 
> 'msix_load' here may increase the downtime. Our migrate-cap device has 128 MSI-X
> interrupts and the guest OS enables all of them, so 'msix_load' takes a long
> time (nearly 1s), because in 'vfio_msix_vector_do_use' we need to disable all
> the old interrupts and then append a new one.
> 
> What's your opinion?
> 

This will be at the destination, so the device should not be running, which
means interrupts should already be disabled, right?
You only need to append the new ones.

Thanks,
Kirti

>> +    }
>> +    pci_cmd = qemu_get_be16(f);
>> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
>> +    return 0;
>> +}
>> +
>>   static VFIODeviceOps vfio_pci_ops = {
>>       .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
>>       .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
>>       .vfio_eoi = vfio_intx_eoi,
>>       .vfio_get_object = vfio_pci_get_object,
>> +    .vfio_save_config = vfio_pci_save_config,
>> +    .vfio_load_config = vfio_pci_load_config,
>>   };
>>   
>>   int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 74261feaeac9..d69a7f3ae31e 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -120,6 +120,8 @@ struct VFIODeviceOps {
>>       int (*vfio_hot_reset_multi)(VFIODevice *vdev);
>>       void (*vfio_eoi)(VFIODevice *vdev);
>>       Object *(*vfio_get_object)(VFIODevice *vdev);
>> +    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
>> +    int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
>>   };
>>   
>>   typedef struct VFIOGroup {
>>
> 



* Re: [PATCH v16 QEMU 09/16] vfio: Add save state functions to SaveVMHandlers
  2020-05-04 23:18     ` Kirti Wankhede
@ 2020-05-05  4:37       ` Alex Williamson
  2020-05-11  9:53         ` Kirti Wankhede
  0 siblings, 1 reply; 74+ messages in thread
From: Alex Williamson @ 2020-05-05  4:37 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Tue, 5 May 2020 04:48:14 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 3/26/2020 3:33 AM, Alex Williamson wrote:
> > On Wed, 25 Mar 2020 02:39:07 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
> >> functions. These functions handle the pre-copy and stop-and-copy phases.
> >>
> >> In _SAVING|_RUNNING device state or pre-copy phase:
> >> - read pending_bytes. If pending_bytes > 0, go through below steps.
> >> - read data_offset - indicates kernel driver to write data to staging
> >>    buffer.
> >> - read data_size - amount of data in bytes written by vendor driver in
> >>    migration region.
> >> - read data_size bytes of data from data_offset in the migration region.
> >> - Write data packet to file stream as below:
> >> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> >> VFIO_MIG_FLAG_END_OF_STATE }
> >>
> >> In _SAVING device state or stop-and-copy phase
> >> a. read config space of device and save to migration file stream. This
> >>     doesn't need to be from vendor driver. Any other special config state
> >>     from driver can be saved as data in following iteration.
> >> b. read pending_bytes. If pending_bytes > 0, go through below steps.
> >> c. read data_offset - indicates kernel driver to write data to staging
> >>     buffer.
> >> d. read data_size - amount of data in bytes written by vendor driver in
> >>     migration region.
> >> e. read data_size bytes of data from data_offset in the migration region.
> >> f. Write data packet as below:
> >>     {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> >> g. iterate through steps b to f while (pending_bytes > 0)
> >> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
> >>
> >> When the data region is mapped, it is the user's responsibility to read
> >> data_size bytes of data from data_offset before moving to the next steps.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>   hw/vfio/migration.c           | 245 +++++++++++++++++++++++++++++++++++++++++-
> >>   hw/vfio/trace-events          |   6 ++
> >>   include/hw/vfio/vfio-common.h |   1 +
> >>   3 files changed, 251 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >> index 033f76526e49..ecbeed5182c2 100644
> >> --- a/hw/vfio/migration.c
> >> +++ b/hw/vfio/migration.c
> >> @@ -138,6 +138,137 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
> >>       return 0;
> >>   }
> >>   
> >> +static void *find_data_region(VFIORegion *region,
> >> +                              uint64_t data_offset,
> >> +                              uint64_t data_size)
> >> +{
> >> +    void *ptr = NULL;
> >> +    int i;
> >> +
> >> +    for (i = 0; i < region->nr_mmaps; i++) {
> >> +        if ((data_offset >= region->mmaps[i].offset) &&
> >> +            (data_offset < region->mmaps[i].offset + region->mmaps[i].size) &&
> >> +            (data_size <= region->mmaps[i].size)) {  
> > 
> > (data_offset - region->mmaps[i].offset) can be non-zero, so this test
> > > is invalid.  Additionally the uapi does not require that a given data
> > chunk fits exclusively within an mmap'd area, it may overlap one or
> > more mmap'd sections of the region, possibly with non-mmap'd areas
> > included.
> >   
> 
> What's the advantage of having mmap and non-mmap overlapped regions?
> Isn't it better to have data section either mapped or trapped?

The spec allows for it, therefore we need to support it.  A vendor
driver might choose to include a header with sequence and checksum
information for each transaction; it might accomplish this by setting
data_offset to a trapped area backed by kernel memory followed by an
area supporting direct mmap to the device.  The target end could then
fault on writing the header if the sequence information is incorrect.
A trapped area at the end of the transaction could allow the vendor
driver to validate a checksum.
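
To make the layout concrete, here is a rough standalone sketch of reading a
data chunk that spans trapped and mmap'd areas piece by piece. Everything in
it is an assumption for illustration: struct window and struct region are
simplified stand-ins for the QEMU structures, and trapped_read() simulates
what would be a pread() on the device fd.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for VFIORegion and its mmaps (hypothetical types). */
struct window {
    uint64_t offset;   /* start of the mmap'd window within the region */
    uint64_t size;     /* length of the window */
    uint8_t *ptr;      /* host mapping of the window */
};

struct region {
    uint8_t *backing;          /* simulated device memory behind the region */
    uint64_t size;
    const struct window *win;
    int nr_win;
    int trapped_reads;         /* number of simulated pread() calls */
};

/* Simulated trapped access; real code would pread() from the device fd. */
static void trapped_read(struct region *r, uint64_t off, uint8_t *dst,
                         uint64_t len)
{
    memcpy(dst, r->backing + off, len);
    r->trapped_reads++;
}

/*
 * Copy [off, off + len) out of the region piece by piece: use the mapping
 * where one covers the current position, and a trapped read up to the next
 * window (or the end of the chunk) where none does.
 */
static int region_read(struct region *r, uint64_t off, uint64_t len,
                       uint8_t *dst)
{
    if (off > r->size || len > r->size - off) {
        return -1;
    }

    while (len > 0) {
        uint64_t span = len;   /* default: trapped read of everything left */
        const struct window *hit = NULL;

        for (int i = 0; i < r->nr_win; i++) {
            const struct window *w = &r->win[i];

            if (off >= w->offset && off - w->offset < w->size) {
                hit = w;
                break;
            }
            /* A later window caps how far the trapped read may go. */
            if (w->offset > off && w->offset - off < span) {
                span = w->offset - off;
            }
        }

        if (hit) {
            uint64_t avail = hit->size - (off - hit->offset);

            span = len < avail ? len : avail;
            memcpy(dst, hit->ptr + (off - hit->offset), span);
        } else {
            trapped_read(r, off, dst, span);
        }

        off += span;
        dst += span;
        len -= span;
    }
    return 0;
}
```

With a trapped header, an mmap'd body and a trapped trailer, a single chunk
is served by two trapped reads and one memcpy from the mapping, which is the
case a single-window lookup like find_data_region() cannot express.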

> >> +            ptr = region->mmaps[i].mmap + (data_offset -
> >> +                                           region->mmaps[i].offset);
> >> +            break;
> >> +        }
> >> +    }
> >> +    return ptr;
> >> +}
> >> +
> >> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
> >> +{
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    VFIORegion *region = &migration->region;
> >> +    uint64_t data_offset = 0, data_size = 0;
> >> +    int ret;
> >> +
> >> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> >> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> >> +                                             data_offset));
> >> +    if (ret != sizeof(data_offset)) {
> >> +        error_report("%s: Failed to get migration buffer data offset %d",
> >> +                     vbasedev->name, ret);
> >> +        return -EINVAL;
> >> +    }
> >> +
> >> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
> >> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> >> +                                             data_size));
> >> +    if (ret != sizeof(data_size)) {
> >> +        error_report("%s: Failed to get migration buffer data size %d",
> >> +                     vbasedev->name, ret);
> >> +        return -EINVAL;
> >> +    }
> >> +
> >> +    if (data_size > 0) {
> >> +        void *buf = NULL;
> >> +        bool buffer_mmaped;
> >> +
> >> +        if (region->mmaps) {
> >> +            buf = find_data_region(region, data_offset, data_size);
> >> +        }
> >> +
> >> +        buffer_mmaped = (buf != NULL) ? true : false;  
> > 
> > The ternary is unnecessary, "? true : false" is redundant.
> >   
> 
> Removing it.
> 
> >> +
> >> +        if (!buffer_mmaped) {
> >> +            buf = g_try_malloc0(data_size);  
> > 
> > Why do we need zero'd memory?
> >   
> 
> Zeroed memory not required, removing 0
> 
> >> +            if (!buf) {
> >> +                error_report("%s: Error allocating buffer ", __func__);
> >> +                return -ENOMEM;
> >> +            }
> >> +
> >> +            ret = pread(vbasedev->fd, buf, data_size,
> >> +                        region->fd_offset + data_offset);
> >> +            if (ret != data_size) {
> >> +                error_report("%s: Failed to get migration data %d",
> >> +                             vbasedev->name, ret);
> >> +                g_free(buf);
> >> +                return -EINVAL;
> >> +            }
> >> +        }
> >> +
> >> +        qemu_put_be64(f, data_size);
> >> +        qemu_put_buffer(f, buf, data_size);  
> > 
> > This can segfault when mmap'd given the above assumptions about size
> > and layout.
> >   
> >> +
> >> +        if (!buffer_mmaped) {
> >> +            g_free(buf);
> >> +        }
> >> +    } else {
> >> +        qemu_put_be64(f, data_size);  
> > 
> > We insert a zero?  Couldn't we add the section header and end here and
> > skip it entirely?
> >   
> 
> This is used during resume; a data_size of 0 indicates end of data.
> 
> >> +    }
> >> +
> >> +    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
> >> +                           migration->pending_bytes);
> >> +
> >> +    ret = qemu_file_get_error(f);
> >> +    if (ret) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    return data_size;
> >> +}
> >> +
> >> +static int vfio_update_pending(VFIODevice *vbasedev)
> >> +{
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    VFIORegion *region = &migration->region;
> >> +    uint64_t pending_bytes = 0;
> >> +    int ret;
> >> +
> >> +    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
> >> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> >> +                                             pending_bytes));
> >> +    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
> >> +        error_report("%s: Failed to get pending bytes %d",
> >> +                     vbasedev->name, ret);
> >> +        migration->pending_bytes = 0;
> >> +        return (ret < 0) ? ret : -EINVAL;
> >> +    }
> >> +
> >> +    migration->pending_bytes = pending_bytes;
> >> +    trace_vfio_update_pending(vbasedev->name, pending_bytes);
> >> +    return 0;
> >> +}
> >> +
> >> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +
> >> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
> >> +
> >> +    if (vbasedev->ops && vbasedev->ops->vfio_save_config) {
> >> +        vbasedev->ops->vfio_save_config(vbasedev, f);
> >> +    }
> >> +
> >> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> >> +
> >> +    trace_vfio_save_device_config_state(vbasedev->name);
> >> +
> >> +    return qemu_file_get_error(f);
> >> +}
> >> +
> >>   /* ---------------------------------------------------------------------- */
> >>   
> >>   static int vfio_save_setup(QEMUFile *f, void *opaque)
> >> @@ -154,7 +285,7 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
> >>           qemu_mutex_unlock_iothread();
> >>           if (ret) {
> >>               error_report("%s: Failed to mmap VFIO migration region %d: %s",
> >> -                         vbasedev->name, migration->region.index,
> >> +                         vbasedev->name, migration->region.nr,
> >>                            strerror(-ret));
> >>               return ret;
> >>           }
> >> @@ -194,9 +325,121 @@ static void vfio_save_cleanup(void *opaque)
> >>       trace_vfio_save_cleanup(vbasedev->name);
> >>   }
> >>   
> >> +static void vfio_save_pending(QEMUFile *f, void *opaque,
> >> +                              uint64_t threshold_size,
> >> +                              uint64_t *res_precopy_only,
> >> +                              uint64_t *res_compatible,
> >> +                              uint64_t *res_postcopy_only)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    int ret;
> >> +
> >> +    ret = vfio_update_pending(vbasedev);
> >> +    if (ret) {
> >> +        return;
> >> +    }
> >> +
> >> +    *res_precopy_only += migration->pending_bytes;
> >> +
> >> +    trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
> >> +                            *res_postcopy_only, *res_compatible);
> >> +}
> >> +
> >> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +    int ret, data_size;
> >> +
> >> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> >> +
> >> +    data_size = vfio_save_buffer(f, vbasedev);
> >> +
> >> +    if (data_size < 0) {
> >> +        error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
> >> +                     strerror(errno));
> >> +        return data_size;
> >> +    }
> >> +
> >> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> >> +
> >> +    ret = qemu_file_get_error(f);
> >> +    if (ret) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    trace_vfio_save_iterate(vbasedev->name, data_size);
> >> +    if (data_size == 0) {
> >> +        /* indicates data finished, goto complete phase */
> >> +        return 1;  
> > 
> > But it's pending_bytes not data_size that indicates we're done.  How do
> > we get away with ignoring pending_bytes for the save_live_iterate phase?
> >   
> 
> This is the requirement documented above qemu_savevm_state_iterate(), which
> calls .save_live_iterate.
> 
> /*	
>   * this function has three return values:
>   *   negative: there was one error, and we have -errno.
>   *   0 : We haven't finished, caller have to go again
>   *   1 : We have finished, we can go to complete phase
>   */
> int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy)
> 
> This is to serialize savevm_state.handlers (or in other words devices).

I've lost all context on this question in the interim, but I think this
highlights my question.  We use pending_bytes to know how close we are
to the end of the stream and data_size to iterate each transaction
within that stream.  So how does data_size == 0 indicate we've
completed the current phase?  It seems like pending_bytes should
indicate that.  Thanks,

Alex
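
A small standalone sketch of the distinction, under stated assumptions:
struct fake_dev, read_chunk() and save_iterate() are hypothetical stand-ins,
not the functions in this patch. The iterate step follows the return
convention quoted above (0 means go again, 1 means ready for the complete
phase) but bases the decision on pending_bytes rather than on the size of
the last transaction.

```c
#include <stdint.h>

/* Simulated device (hypothetical): bytes left to send and per-read chunk. */
struct fake_dev {
    uint64_t pending_bytes;
    uint64_t chunk;
};

/* One transaction: returns data_size, the bytes produced by this read. */
static uint64_t read_chunk(struct fake_dev *d)
{
    uint64_t n = d->pending_bytes < d->chunk ? d->pending_bytes : d->chunk;

    d->pending_bytes -= n;
    return n;
}

/*
 * 0 means "call again", 1 means "ready for the complete phase".
 * Completion is decided by pending_bytes, not by the last data_size.
 */
static int save_iterate(struct fake_dev *d, uint64_t *sent)
{
    *sent += read_chunk(d);
    return d->pending_bytes == 0 ? 1 : 0;
}
```

In this toy model pending_bytes only shrinks; on a real device in the
pre-copy phase it may grow again between calls, which is one more reason to
re-read it rather than infer completion from data_size.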




* Re: [PATCH v16 QEMU 04/16] vfio: Add save and load functions for VFIO PCI devices
  2020-05-04 23:18     ` Kirti Wankhede
@ 2020-05-05  4:37       ` Alex Williamson
  2020-05-06  6:11         ` Yan Zhao
  0 siblings, 1 reply; 74+ messages in thread
From: Alex Williamson @ 2020-05-05  4:37 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Tue, 5 May 2020 04:48:37 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 3/26/2020 1:26 AM, Alex Williamson wrote:
> > On Wed, 25 Mar 2020 02:39:02 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> These functions save and restore PCI device specific data - config
> >> space of PCI device.
> >> Tested save and restore with MSI and MSIX type.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>   hw/vfio/pci.c                 | 163 ++++++++++++++++++++++++++++++++++++++++++
> >>   include/hw/vfio/vfio-common.h |   2 +
> >>   2 files changed, 165 insertions(+)
> >>
> >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >> index 6c77c12e44b9..8deb11e87ef7 100644
> >> --- a/hw/vfio/pci.c
> >> +++ b/hw/vfio/pci.c
> >> @@ -41,6 +41,7 @@
> >>   #include "trace.h"
> >>   #include "qapi/error.h"
> >>   #include "migration/blocker.h"
> >> +#include "migration/qemu-file.h"
> >>   
> >>   #define TYPE_VFIO_PCI "vfio-pci"
> >>   #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> >> @@ -1632,6 +1633,50 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
> >>       }
> >>   }
> >>   
> >> +static int vfio_bar_validate(VFIOPCIDevice *vdev, int nr)
> >> +{
> >> +    PCIDevice *pdev = &vdev->pdev;
> >> +    VFIOBAR *bar = &vdev->bars[nr];
> >> +    uint64_t addr;
> >> +    uint32_t addr_lo, addr_hi = 0;
> >> +
> >> +    /* Skip unimplemented BARs and the upper half of 64bit BARS. */
> >> +    if (!bar->size) {
> >> +        return 0;
> >> +    }
> >> +
> >> +    addr_lo = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + nr * 4, 4);
> >> +
> >> +    addr_lo = addr_lo & (bar->ioport ? PCI_BASE_ADDRESS_IO_MASK :
> >> +                                       PCI_BASE_ADDRESS_MEM_MASK);  
> > 
> > Nit, &= or combine with previous set.
> >   
> >> +    if (bar->type == PCI_BASE_ADDRESS_MEM_TYPE_64) {
> >> +        addr_hi = pci_default_read_config(pdev,
> >> +                                         PCI_BASE_ADDRESS_0 + (nr + 1) * 4, 4);
> >> +    }
> >> +
> >> +    addr = ((uint64_t)addr_hi << 32) | addr_lo;  
> > 
> > Could we use a union?
> >   
> >> +
> >> +    if (!QEMU_IS_ALIGNED(addr, bar->size)) {
> >> +        return -EINVAL;
> >> +    }  
> > 
> > What specifically are we validating here?  This should be true no
> > matter what we wrote to the BAR or else BAR emulation is broken.  The
> > bits that could make this unaligned are not implemented in the BAR.
> >   
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +static int vfio_bars_validate(VFIOPCIDevice *vdev)
> >> +{
> >> +    int i, ret;
> >> +
> >> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> >> +        ret = vfio_bar_validate(vdev, i);
> >> +        if (ret) {
> >> +            error_report("vfio: BAR address %d validation failed", i);
> >> +            return ret;
> >> +        }
> >> +    }
> >> +    return 0;
> >> +}
> >> +
> >>   static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
> >>   {
> >>       VFIOBAR *bar = &vdev->bars[nr];
> >> @@ -2414,11 +2459,129 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
> >>       return OBJECT(vdev);
> >>   }
> >>   
> >> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> >> +{
> >> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> >> +    PCIDevice *pdev = &vdev->pdev;
> >> +    uint16_t pci_cmd;
> >> +    int i;
> >> +
> >> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> >> +        uint32_t bar;
> >> +
> >> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> >> +        qemu_put_be32(f, bar);
> >> +    }
> >> +
> >> +    qemu_put_be32(f, vdev->interrupt);
> >> +    if (vdev->interrupt == VFIO_INT_MSI) {
> >> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> >> +        bool msi_64bit;
> >> +
> >> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> >> +                                            2);
> >> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> >> +
> >> +        msi_addr_lo = pci_default_read_config(pdev,
> >> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> >> +        qemu_put_be32(f, msi_addr_lo);
> >> +
> >> +        if (msi_64bit) {
> >> +            msi_addr_hi = pci_default_read_config(pdev,
> >> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> >> +                                             4);
> >> +        }
> >> +        qemu_put_be32(f, msi_addr_hi);
> >> +
> >> +        msi_data = pci_default_read_config(pdev,
> >> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> >> +                2);
> >> +        qemu_put_be32(f, msi_data);  
> > 
> > Isn't the data field only a u16?
> >   
> 
> Yes, fixing it.
> 
> >> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
> >> +        uint16_t offset;
> >> +
> >> +        /* save enable bit and maskall bit */
> >> +        offset = pci_default_read_config(pdev,
> >> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> >> +        qemu_put_be16(f, offset);
> >> +        msix_save(pdev, f);
> >> +    }
> >> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> >> +    qemu_put_be16(f, pci_cmd);
> >> +}
> >> +
> >> +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> >> +{
> >> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> >> +    PCIDevice *pdev = &vdev->pdev;
> >> +    uint32_t interrupt_type;
> >> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> >> +    uint16_t pci_cmd;
> >> +    bool msi_64bit;
> >> +    int i, ret;
> >> +
> >> +    /* restore pci bar configuration */
> >> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> >> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> >> +                        pci_cmd & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
> >> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> >> +        uint32_t bar = qemu_get_be32(f);
> >> +
> >> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> >> +    }
> >> +
> >> +    ret = vfio_bars_validate(vdev);
> >> +    if (ret) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    interrupt_type = qemu_get_be32(f);
> >> +
> >> +    if (interrupt_type == VFIO_INT_MSI) {
> >> +        /* restore msi configuration */
> >> +        msi_flags = pci_default_read_config(pdev,
> >> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> >> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> >> +
> >> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> >> +                              msi_flags & ~PCI_MSI_FLAGS_ENABLE, 2);
> >> +
> >> +        msi_addr_lo = qemu_get_be32(f);
> >> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
> >> +                              msi_addr_lo, 4);
> >> +
> >> +        msi_addr_hi = qemu_get_be32(f);
> >> +        if (msi_64bit) {
> >> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> >> +                                  msi_addr_hi, 4);
> >> +        }
> >> +        msi_data = qemu_get_be32(f);
> >> +        vfio_pci_write_config(pdev,
> >> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> >> +                msi_data, 2);
> >> +
> >> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> >> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
> >> +    } else if (interrupt_type == VFIO_INT_MSIX) {
> >> +        uint16_t offset = qemu_get_be16(f);
> >> +
> >> +        /* load enable bit and maskall bit */
> >> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> >> +                              offset, 2);
> >> +        msix_load(pdev, f);
> >> +    }
> >> +    pci_cmd = qemu_get_be16(f);
> >> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
> >> +    return 0;
> >> +}  
> > 
> > It always seems like there should be a lot more state than this, and I
> > probably sound like a broken record because I ask every time, but maybe
> > that's a good indication that we (or at least I) need a comment
> > explaining why we only care about these.  For example, what if we
> > migrate a device in the D3 power state, don't we need to account for
> > the state stored in the PM capability or does the device wake up into
> > D0 auto-magically after migration?  I think we could repeat that
> > question for every capability that can be modified.  Even for the MSI/X
> > cases, the interrupt may not be active, but there could be state in
> > virtual config space that would be different on the target.  For
> > example, if we migrate with a device in INTx mode where the guest had
> > written vector fields on the source, but only writes the enable bit on
> > the target, can we seamlessly figure out the rest?  For other
> > capabilities, that state may represent config space changes written
> > through to the physical device and represent a functional difference on
> > the target.  Thanks,
> >  
> 
> These are a very basic set of registers from config space. The others are
> mostly vendor specific, and the vendor driver can save and restore them in
> its own data. I don't think we have to take care of all those vendor
> specific fields here.

That had not been clear to me.  Intel folks, is this your understanding
regarding the responsibility of the user to save and restore config
space of the device as part of the vendor provided migration stream
data?  Thanks,

Alex




* Re: [PATCH v16 QEMU 08/16] vfio: Register SaveVMHandlers for VFIO device
  2020-05-04 23:19     ` Kirti Wankhede
@ 2020-05-05  4:37       ` Alex Williamson
  2020-05-06  6:38         ` Yan Zhao
  0 siblings, 1 reply; 74+ messages in thread
From: Alex Williamson @ 2020-05-05  4:37 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Tue, 5 May 2020 04:49:10 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 3/26/2020 2:32 AM, Alex Williamson wrote:
> > On Wed, 25 Mar 2020 02:39:06 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> Define flags to be used as delimiter in migration file stream.
> >> Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
> >> region from these functions at source during saving or pre-copy phase.
> >> Set VFIO device state depending on VM's state. During live migration, VM is
> >> running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
> >> device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>   hw/vfio/migration.c  | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>   hw/vfio/trace-events |  2 ++
> >>   2 files changed, 78 insertions(+)
> >>
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >> index 22ded9d28cf3..033f76526e49 100644
> >> --- a/hw/vfio/migration.c
> >> +++ b/hw/vfio/migration.c
> >> @@ -8,6 +8,7 @@
> >>    */
> >>   
> >>   #include "qemu/osdep.h"
> >> +#include "qemu/main-loop.h"
> >>   #include <linux/vfio.h>
> >>   
> >>   #include "sysemu/runstate.h"
> >> @@ -24,6 +25,17 @@
> >>   #include "pci.h"
> >>   #include "trace.h"
> >>   
> >> +/*
> >> + * Flags used as delimiter:
> >> + * 0xffffffff => MSB 32-bit all 1s
> >> + * 0xef10     => emulated (virtual) function IO
> >> + * 0x0000     => 16-bits reserved for flags
> >> + */
> >> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> >> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> >> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> >> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> >> +
> >>   static void vfio_migration_region_exit(VFIODevice *vbasedev)
> >>   {
> >>       VFIOMigration *migration = vbasedev->migration;
> >> @@ -126,6 +138,69 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
> >>       return 0;
> >>   }
> >>   
> >> +/* ---------------------------------------------------------------------- */
> >> +
> >> +static int vfio_save_setup(QEMUFile *f, void *opaque)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    int ret;
> >> +
> >> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> >> +
> >> +    if (migration->region.mmaps) {
> >> +        qemu_mutex_lock_iothread();
> >> +        ret = vfio_region_mmap(&migration->region);
> >> +        qemu_mutex_unlock_iothread();
> >> +        if (ret) {
> >> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> >> +                         vbasedev->name, migration->region.index,
> >> +                         strerror(-ret));
> >> +            return ret;
> >> +        }
> >> +    }
> >> +
> >> +    ret = vfio_migration_set_state(vbasedev, ~0, VFIO_DEVICE_STATE_SAVING);
> >> +    if (ret) {
> >> +        error_report("%s: Failed to set state SAVING", vbasedev->name);
> >> +        return ret;
> >> +    }
> >> +
> >> +    /*
> >> +     * Save migration region size. This is used to verify migration region size
> >> +     * is greater than or equal to migration region size at destination
> >> +     */
> >> +    qemu_put_be64(f, migration->region.size);  
> > 
> > Is this requirement supported by the uapi?    
> 
> Yes, we discussed this on the UAPI thread:
> 
>   * For the user application, data is opaque. The user application should write
>   * data in the same order as the data is received and the data should be of
>   * same transaction size at the source.
> 
> Data should be of the same transaction size, so when verifying at the
> destination, the migration region size should be greater than or equal to
> the size at the source.

We are that user application for which the data is opaque, therefore we
should make no assumptions about how the vendor driver makes use of
their region.  If we get a transaction that exceeds the end of the
region, I agree, that would be an error.  But we have no business
predicting that such a transaction might occur if the vendor driver
indicates it can support the migration.
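
As a rough illustration of the one check that does seem reasonable on our
side, the sketch below rejects a transaction only when it actually runs
past the end of the region. The helper name and its parameters are
hypothetical and do not appear in the patch:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: returns true when a
 * transaction described by (data_offset, data_size) lies entirely inside
 * a region of region_size bytes. Written so that the comparison cannot
 * overflow, unlike a naive "data_offset + data_size <= region_size".
 */
static bool mig_chunk_in_region(uint64_t region_size, uint64_t data_offset,
                                uint64_t data_size)
{
    if (data_offset > region_size) {
        return false;
    }
    return data_size <= region_size - data_offset;
}
```

A check of this kind only fires when an individual transaction is out of
bounds; it makes no prediction from the region sizes alone.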

> > The vendor driver operates
> > within the migration region, but it has no requirement to use the full
> > extent of the region.  Shouldn't we instead insert the version string
> > from versioning API Yan proposed?  Is this where we might choose to use
> > an interface via the vfio API rather than sysfs if we had one?
> >  
> 
> The VFIO API cannot be used by libvirt or the management tool stack. We
> need sysfs, as Yan proposed, for use by libvirt or the management tool stack.

It's been a long time, but that doesn't seem like what I was asking.
The sysfs version checking is used to select a target that is likely to
succeed, but the migration stream is still generated by a user and the
vendor driver is still ultimately responsible for validating that
stream.  I would hope that a vendor migration stream therefore starts
with information similar to that found in the sysfs interface, allowing
the receiving vendor driver to validate the source device and vendor
software version, such that we can fail an incoming migration that the
vendor driver deems incompatible.  Ideally the vendor driver might also
include consistency and sequence checking throughout the stream to
prevent a malicious user from exploiting the internal operation of the
vendor driver.  Thanks,
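
A minimal sketch of what such validation could look like inside a vendor
driver follows. Every name, the header layout, and the magic value are
assumptions made up for illustration; nothing here is defined by the uapi
or by this series:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stream header; field names and layout are assumptions. */
struct mig_stream_hdr {
    uint32_t magic;       /* identifies the vendor stream format */
    uint16_t device_id;   /* source device model */
    uint16_t sw_version;  /* vendor software version on the source */
};

/* Hypothetical receive-side state kept by the vendor driver. */
struct mig_rx_state {
    uint32_t next_seq;    /* sequence number expected for the next chunk */
};

#define MIG_HDR_MAGIC 0x4d494730u /* arbitrary value for this sketch */

/* Accept the stream only if the source matches what this target supports. */
static bool mig_hdr_compatible(const struct mig_stream_hdr *hdr,
                               uint16_t local_device_id,
                               uint16_t min_version, uint16_t max_version)
{
    return hdr->magic == MIG_HDR_MAGIC &&
           hdr->device_id == local_device_id &&
           hdr->sw_version >= min_version &&
           hdr->sw_version <= max_version;
}

/* Reject chunks that are replayed, skipped, or delivered out of order. */
static bool mig_chunk_seq_ok(struct mig_rx_state *rx, uint32_t seq)
{
    if (seq != rx->next_seq) {
        return false;
    }
    rx->next_seq++;
    return true;
}
```

The header check lets the target fail an incompatible stream before any
device state is consumed, and the sequence check is one simple form of
the consistency checking mentioned above.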

Alex




* Re: [PATCH v16 QEMU 04/16] vfio: Add save and load functions for VFIO PCI devices
  2020-05-05  4:37       ` Alex Williamson
@ 2020-05-06  6:11         ` Yan Zhao
  2020-05-06 19:48           ` Kirti Wankhede
  0 siblings, 1 reply; 74+ messages in thread
From: Yan Zhao @ 2020-05-06  6:11 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Tue, May 05, 2020 at 12:37:11PM +0800, Alex Williamson wrote:
> On Tue, 5 May 2020 04:48:37 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > On 3/26/2020 1:26 AM, Alex Williamson wrote:
> > > On Wed, 25 Mar 2020 02:39:02 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >   
> > >> These functions save and restore PCI device specific data - config
> > >> space of PCI device.
> > >> Tested save and restore with MSI and MSIX type.
> > >>
> > >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> > >> ---
> > >>   hw/vfio/pci.c                 | 163 ++++++++++++++++++++++++++++++++++++++++++
> > >>   include/hw/vfio/vfio-common.h |   2 +
> > >>   2 files changed, 165 insertions(+)
> > >>
> > >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > >> index 6c77c12e44b9..8deb11e87ef7 100644
> > >> --- a/hw/vfio/pci.c
> > >> +++ b/hw/vfio/pci.c
> > >> @@ -41,6 +41,7 @@
> > >>   #include "trace.h"
> > >>   #include "qapi/error.h"
> > >>   #include "migration/blocker.h"
> > >> +#include "migration/qemu-file.h"
> > >>   
> > >>   #define TYPE_VFIO_PCI "vfio-pci"
> > >>   #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> > >> @@ -1632,6 +1633,50 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
> > >>       }
> > >>   }
> > >>   
> > >> +static int vfio_bar_validate(VFIOPCIDevice *vdev, int nr)
> > >> +{
> > >> +    PCIDevice *pdev = &vdev->pdev;
> > >> +    VFIOBAR *bar = &vdev->bars[nr];
> > >> +    uint64_t addr;
> > >> +    uint32_t addr_lo, addr_hi = 0;
> > >> +
> > >> +    /* Skip unimplemented BARs and the upper half of 64bit BARS. */
> > >> +    if (!bar->size) {
> > >> +        return 0;
> > >> +    }
> > >> +
> > >> +    addr_lo = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + nr * 4, 4);
> > >> +
> > >> +    addr_lo = addr_lo & (bar->ioport ? PCI_BASE_ADDRESS_IO_MASK :
> > >> +                                       PCI_BASE_ADDRESS_MEM_MASK);  
> > > 
> > > Nit, &= or combine with previous set.
> > >   
> > >> +    if (bar->type == PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > >> +        addr_hi = pci_default_read_config(pdev,
> > >> +                                         PCI_BASE_ADDRESS_0 + (nr + 1) * 4, 4);
> > >> +    }
> > >> +
> > >> +    addr = ((uint64_t)addr_hi << 32) | addr_lo;  
> > > 
> > > Could we use a union?
> > >   
> > >> +
> > >> +    if (!QEMU_IS_ALIGNED(addr, bar->size)) {
> > >> +        return -EINVAL;
> > >> +    }  
> > > 
> > > What specifically are we validating here?  This should be true no
> > > matter what we wrote to the BAR or else BAR emulation is broken.  The
> > > bits that could make this unaligned are not implemented in the BAR.
> > >   
> > >> +
> > >> +    return 0;
> > >> +}
> > >> +
> > >> +static int vfio_bars_validate(VFIOPCIDevice *vdev)
> > >> +{
> > >> +    int i, ret;
> > >> +
> > >> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > >> +        ret = vfio_bar_validate(vdev, i);
> > >> +        if (ret) {
> > >> +            error_report("vfio: BAR address %d validation failed", i);
> > >> +            return ret;
> > >> +        }
> > >> +    }
> > >> +    return 0;
> > >> +}
> > >> +
> > >>   static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
> > >>   {
> > >>       VFIOBAR *bar = &vdev->bars[nr];
> > >> @@ -2414,11 +2459,129 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
> > >>       return OBJECT(vdev);
> > >>   }
> > >>   
> > >> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> > >> +{
> > >> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> > >> +    PCIDevice *pdev = &vdev->pdev;
> > >> +    uint16_t pci_cmd;
> > >> +    int i;
> > >> +
> > >> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > >> +        uint32_t bar;
> > >> +
> > >> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> > >> +        qemu_put_be32(f, bar);
> > >> +    }
> > >> +
> > >> +    qemu_put_be32(f, vdev->interrupt);
> > >> +    if (vdev->interrupt == VFIO_INT_MSI) {
> > >> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> > >> +        bool msi_64bit;
> > >> +
> > >> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > >> +                                            2);
> > >> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> > >> +
> > >> +        msi_addr_lo = pci_default_read_config(pdev,
> > >> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> > >> +        qemu_put_be32(f, msi_addr_lo);
> > >> +
> > >> +        if (msi_64bit) {
> > >> +            msi_addr_hi = pci_default_read_config(pdev,
> > >> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > >> +                                             4);
> > >> +        }
> > >> +        qemu_put_be32(f, msi_addr_hi);
> > >> +
> > >> +        msi_data = pci_default_read_config(pdev,
> > >> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > >> +                2);
> > >> +        qemu_put_be32(f, msi_data);  
> > > 
> > > Isn't the data field only a u16?
> > >   
> > 
> > Yes, fixing it.
> > 
> > >> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
> > >> +        uint16_t offset;
> > >> +
> > >> +        /* save enable bit and maskall bit */
> > >> +        offset = pci_default_read_config(pdev,
> > >> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> > >> +        qemu_put_be16(f, offset);
> > >> +        msix_save(pdev, f);
> > >> +    }
> > >> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > >> +    qemu_put_be16(f, pci_cmd);
> > >> +}
> > >> +
> > >> +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> > >> +{
> > >> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> > >> +    PCIDevice *pdev = &vdev->pdev;
> > >> +    uint32_t interrupt_type;
> > >> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> > >> +    uint16_t pci_cmd;
> > >> +    bool msi_64bit;
> > >> +    int i, ret;
> > >> +
> > >> +    /* restore pci bar configuration */
> > >> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > >> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> > >> +                        pci_cmd & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
> > >> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > >> +        uint32_t bar = qemu_get_be32(f);
> > >> +
> > >> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> > >> +    }
> > >> +
> > >> +    ret = vfio_bars_validate(vdev);
> > >> +    if (ret) {
> > >> +        return ret;
> > >> +    }
> > >> +
> > >> +    interrupt_type = qemu_get_be32(f);
> > >> +
> > >> +    if (interrupt_type == VFIO_INT_MSI) {
> > >> +        /* restore msi configuration */
> > >> +        msi_flags = pci_default_read_config(pdev,
> > >> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> > >> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> > >> +
> > >> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > >> +                              msi_flags & ~PCI_MSI_FLAGS_ENABLE, 2);
> > >> +
> > >> +        msi_addr_lo = qemu_get_be32(f);
> > >> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
> > >> +                              msi_addr_lo, 4);
> > >> +
> > >> +        msi_addr_hi = qemu_get_be32(f);
> > >> +        if (msi_64bit) {
> > >> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > >> +                                  msi_addr_hi, 4);
> > >> +        }
> > >> +        msi_data = qemu_get_be32(f);
> > >> +        vfio_pci_write_config(pdev,
> > >> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > >> +                msi_data, 2);
> > >> +
> > >> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > >> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
> > >> +    } else if (interrupt_type == VFIO_INT_MSIX) {
> > >> +        uint16_t offset = qemu_get_be16(f);
> > >> +
> > >> +        /* load enable bit and maskall bit */
> > >> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> > >> +                              offset, 2);
> > >> +        msix_load(pdev, f);
> > >> +    }
> > >> +    pci_cmd = qemu_get_be16(f);
> > >> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
> > >> +    return 0;
> > >> +}  
> > > 
> > > It always seems like there should be a lot more state than this, and I
> > > probably sound like a broken record because I ask every time, but maybe
> > > that's a good indication that we (or at least I) need a comment
> > > explaining why we only care about these.  For example, what if we
> > > migrate a device in the D3 power state, don't we need to account for
> > > the state stored in the PM capability or does the device wake up into
> > > D0 auto-magically after migration?  I think we could repeat that
> > > question for every capability that can be modified.  Even for the MSI/X
> > > cases, the interrupt may not be active, but there could be state in
> > > virtual config space that would be different on the target.  For
> > > example, if we migrate with a device in INTx mode where the guest had
> > > written vector fields on the source, but only writes the enable bit on
> > > the target, can we seamlessly figure out the rest?  For other
> > > capabilities, that state may represent config space changes written
> > > through to the physical device and represent a functional difference on
> > > the target.  Thanks,
> > >  
> > 
> > These are a very basic set of registers from config space. The others are
> > mostly vendor specific, and the vendor driver can save and restore them in
> > its own data. I don't think we have to take care of all those vendor
> > specific fields here.
> 
> That had not been clear to me.  Intel folks, is this your understanding
> regarding the responsibility of the user to save and restore config
> space of the device as part of the vendor provided migration stream
> data?  Thanks,
> 
Currently, the code works for us, but I agree with you that there should
be more state to save, at least for the emulated config bits.
I think we should call pci_device_save() to serve that purpose.

Thanks
Yan



* Re: [PATCH v16 QEMU 08/16] vfio: Register SaveVMHandlers for VFIO device
  2020-05-05  4:37       ` Alex Williamson
@ 2020-05-06  6:38         ` Yan Zhao
  2020-05-06  9:58           ` Cornelia Huck
  0 siblings, 1 reply; 74+ messages in thread
From: Yan Zhao @ 2020-05-06  6:38 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Tue, May 05, 2020 at 12:37:26PM +0800, Alex Williamson wrote:
> On Tue, 5 May 2020 04:49:10 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > On 3/26/2020 2:32 AM, Alex Williamson wrote:
> > > On Wed, 25 Mar 2020 02:39:06 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >   
> > >> Define flags to be used as delimiter in migration file stream.
> > >> Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
> > >> region from these functions at source during saving or pre-copy phase.
> > >> Set VFIO device state depending on VM's state. During live migration, VM is
> > >> running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
> > >> device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
> > >>
> > >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> > >> ---
> > >>   hw/vfio/migration.c  | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >>   hw/vfio/trace-events |  2 ++
> > >>   2 files changed, 78 insertions(+)
> > >>
> > >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > >> index 22ded9d28cf3..033f76526e49 100644
> > >> --- a/hw/vfio/migration.c
> > >> +++ b/hw/vfio/migration.c
> > >> @@ -8,6 +8,7 @@
> > >>    */
> > >>   
> > >>   #include "qemu/osdep.h"
> > >> +#include "qemu/main-loop.h"
> > >>   #include <linux/vfio.h>
> > >>   
> > >>   #include "sysemu/runstate.h"
> > >> @@ -24,6 +25,17 @@
> > >>   #include "pci.h"
> > >>   #include "trace.h"
> > >>   
> > >> +/*
> > >> + * Flags used as delimiter:
> > >> + * 0xffffffff => MSB 32-bit all 1s
> > >> + * 0xef10     => emulated (virtual) function IO
> > >> + * 0x0000     => 16-bits reserved for flags
> > >> + */
> > >> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> > >> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> > >> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> > >> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> > >> +
> > >>   static void vfio_migration_region_exit(VFIODevice *vbasedev)
> > >>   {
> > >>       VFIOMigration *migration = vbasedev->migration;
> > >> @@ -126,6 +138,69 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
> > >>       return 0;
> > >>   }
> > >>   
> > >> +/* ---------------------------------------------------------------------- */
> > >> +
> > >> +static int vfio_save_setup(QEMUFile *f, void *opaque)
> > >> +{
> > >> +    VFIODevice *vbasedev = opaque;
> > >> +    VFIOMigration *migration = vbasedev->migration;
> > >> +    int ret;
> > >> +
> > >> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> > >> +
> > >> +    if (migration->region.mmaps) {
> > >> +        qemu_mutex_lock_iothread();
> > >> +        ret = vfio_region_mmap(&migration->region);
> > >> +        qemu_mutex_unlock_iothread();
> > >> +        if (ret) {
> > >> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> > >> +                         vbasedev->name, migration->region.index,
> > >> +                         strerror(-ret));
> > >> +            return ret;
> > >> +        }
> > >> +    }
> > >> +
> > >> +    ret = vfio_migration_set_state(vbasedev, ~0, VFIO_DEVICE_STATE_SAVING);
> > >> +    if (ret) {
> > >> +        error_report("%s: Failed to set state SAVING", vbasedev->name);
> > >> +        return ret;
> > >> +    }
> > >> +
> > >> +    /*
> > >> +     * Save migration region size. This is used to verify migration region size
> > >> +     * is greater than or equal to migration region size at destination
> > >> +     */
> > >> +    qemu_put_be64(f, migration->region.size);  
> > > 
> > > Is this requirement supported by the uapi?    
> > 
> > Yes, we discussed this on the UAPI thread:
> > 
> >   * For the user application, data is opaque. The user application should write
> >   * data in the same order as the data is received and the data should be of
> >   * same transaction size at the source.
> > 
> > Data should be of the same transaction size, so when verifying at the
> > destination, the migration region size should be greater than or equal to
> > the size at the source.
> 
> We are that user application for which the data is opaque, therefore we
> should make no assumptions about how the vendor driver makes use of
> their region.  If we get a transaction that exceeds the end of the
> region, I agree, that would be an error.  But we have no business
> predicting that such a transaction might occur if the vendor driver
> indicates it can support the migration.
> 
> > > The vendor driver operates
> > > within the migration region, but it has no requirement to use the full
> > > extent of the region.  Shouldn't we instead insert the version string
> > > from versioning API Yan proposed?  Is this where we might choose to use
> > > an interface via the vfio API rather than sysfs if we had one?
> > >  
> > 
> > The VFIO API cannot be used by libvirt or the management tool stack. We
> > need sysfs, as Yan proposed, for use by libvirt or the management tool stack.
> 
> It's been a long time, but that doesn't seem like what I was asking.
> The sysfs version checking is used to select a target that is likely to
> succeed, but the migration stream is still generated by a user and the
> vendor driver is still ultimately responsible for validating that
> stream.  I would hope that a vendor migration stream therefore starts
> with information similar to that found in the sysfs interface, allowing
> the receiving vendor driver to validate the source device and vendor
> software version, such that we can fail an incoming migration that the
> vendor driver deems incompatible.  Ideally the vendor driver might also
> include consistency and sequence checking throughout the stream to
> prevent a malicious user from exploiting the internal operation of the
> vendor driver.  Thanks,
> 
Maybe we can add a read-write field, migration_version, to
struct vfio_device_migration_info, in addition to the sysfs interface?

When read on the source, it returns the same string as sysfs does;
when written on the target, the write succeeds or fails according to
compatibility, so an incompatible migration fails early in the setup phase.
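
The intended semantics might be sketched roughly as below. The device
structure, the function names, and the exact-match policy are all
placeholders for illustration; a real vendor driver would apply its own
compatibility rules:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define MIG_VERSION_LEN 64

/* Hypothetical stand-in for vendor driver state. */
struct fake_vendor_dev {
    char version[MIG_VERSION_LEN]; /* same string sysfs would expose */
};

/* Source side: reading the field yields the version string. */
static int mig_version_read(const struct fake_vendor_dev *dev,
                            char *buf, size_t len)
{
    int n = snprintf(buf, len, "%s", dev->version);

    /* Treat truncation as an error rather than passing on a partial string. */
    return (n < 0 || (size_t)n >= len) ? -ENOSPC : 0;
}

/*
 * Target side: writing the source's string succeeds only if the target
 * considers it compatible. Exact match is a placeholder policy here.
 */
static int mig_version_write(const struct fake_vendor_dev *dev,
                             const char *src_version)
{
    return strcmp(dev->version, src_version) == 0 ? 0 : -EINVAL;
}
```

With this shape, the string read on the source can be carried in the
setup section of the stream and written on the target before any device
data is loaded.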

Thanks
Yan.



* Re: [PATCH v16 QEMU 08/16] vfio: Register SaveVMHandlers for VFIO device
  2020-05-06  6:38         ` Yan Zhao
@ 2020-05-06  9:58           ` Cornelia Huck
  2020-05-06 16:53             ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 74+ messages in thread
From: Cornelia Huck @ 2020-05-06  9:58 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye, Ken.Xue,
	Zhengxiao.zx, shuangtai.tst, Kirti Wankhede, Wang, Zhi A,
	mlevitsk, pasic, aik, Alex Williamson, eauger, qemu-devel,
	felipe, jonathan.davies, Liu,  Changpeng, dgilbert

On Wed, 6 May 2020 02:38:46 -0400
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Tue, May 05, 2020 at 12:37:26PM +0800, Alex Williamson wrote:
> > It's been a long time, but that doesn't seem like what I was asking.
> > The sysfs version checking is used to select a target that is likely to
> > succeed, but the migration stream is still generated by a user and the
> > vendor driver is still ultimately responsible for validating that
> > stream.  I would hope that a vendor migration stream therefore starts
> > with information similar to that found in the sysfs interface, allowing
> > the receiving vendor driver to validate the source device and vendor
> > software version, such that we can fail an incoming migration that the
> > vendor driver deems incompatible.  Ideally the vendor driver might also
> > include consistency and sequence checking throughout the stream to
> > prevent a malicious user from exploiting the internal operation of the
> > vendor driver.  Thanks,

Some kind of somewhat standardized marker for driver/version seems like
a good idea. Further checking is also a good idea, but I think the
details of that need to be left to the individual drivers.
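
As an illustration only, a standardized marker plus driver-specific checking
could be split like this: a small common header (magic and version) followed
by checks the driver chooses, here a per-chunk sequence number and a length
bound. The layout, the MIG_MAGIC value and all names are assumptions made for
this sketch, not an existing format:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MIG_MAGIC 0x56464d47u   /* arbitrary marker chosen for this sketch */

/* Hypothetical header placed in front of every chunk of device data. */
struct mig_chunk_hdr {
    uint32_t magic;             /* common marker */
    uint32_t driver_version;    /* common driver/version field */
    uint32_t seq;               /* driver-specific: counts up by one */
    uint32_t length;            /* payload bytes that follow the header */
};

/* Receiver-side state kept by the (hypothetical) target driver. */
struct mig_rx_state {
    uint32_t accepted_version;
    uint32_t next_seq;
};

/* Returns 0 and advances the sequence counter only if every check passes. */
static int mig_check_chunk(struct mig_rx_state *rx,
                           const struct mig_chunk_hdr *hdr,
                           uint32_t max_len)
{
    assert(rx && hdr);

    if (hdr->magic != MIG_MAGIC ||
        hdr->driver_version != rx->accepted_version) {
        return -EINVAL;         /* wrong source device or driver version */
    }
    if (hdr->seq != rx->next_seq) {
        return -EILSEQ;         /* replayed, dropped or reordered chunk */
    }
    if (hdr->length > max_len) {
        return -ERANGE;         /* payload larger than the receiver allows */
    }
    rx->next_seq++;
    return 0;
}
```

The distinct error codes also give QEMU something specific to report when a
migration fails.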

> >   
> Maybe we can add a rw field, migration_version, to
> struct vfio_device_migration_info, in addition to the sysfs interface?
> 
> When read on the source, it returns the same string as sysfs does;
> when written on the target, the write succeeds or fails to indicate
> compatibility, so an incompatible migration fails early, in the setup phase.

Getting both populated from the same source seems like a good idea.

Not sure if a string is the best value to put into a migration stream;
maybe the sysfs interface can derive a human-readable string from a
more compact value to be put into the migration region (and ultimately
the stream)? Might be overengineering, just thinking out aloud here.
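
A rough sketch of that idea, assuming a 32-bit value with an invented 8/8/16
bit major/minor/patch layout: the compact value is what would travel in the
migration region, and the sysfs side would format it for humans. The function
names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed layout: bits 31..24 major, bits 23..16 minor, bits 15..0 patch. */
static uint32_t mig_ver_pack(uint8_t major, uint8_t minor, uint16_t patch)
{
    return ((uint32_t)major << 24) | ((uint32_t)minor << 16) | patch;
}

/*
 * Derive the human-readable form from the compact value.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int mig_ver_to_string(uint32_t ver, char *buf, size_t len)
{
    int n;

    assert(buf);
    n = snprintf(buf, len, "%u.%u.%u",
                 (unsigned)(ver >> 24),
                 (unsigned)((ver >> 16) & 0xff),
                 (unsigned)(ver & 0xffff));
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With both representations derived from one value, the sysfs string and the
marker in the stream cannot drift apart.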



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v16 QEMU 08/16] vfio: Register SaveVMHandlers for VFIO device
  2020-05-06  9:58           ` Cornelia Huck
@ 2020-05-06 16:53             ` Dr. David Alan Gilbert
  2020-05-06 19:30               ` Kirti Wankhede
  0 siblings, 1 reply; 74+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-06 16:53 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye, qemu-devel,
	Zhengxiao.zx@Alibaba-inc.com, shuangtai.tst, Kirti Wankhede,
	Wang, Zhi A, mlevitsk, pasic, aik, Alex Williamson, eauger,
	felipe, jonathan.davies, Yan Zhao, Liu, Changpeng, Ken.Xue

* Cornelia Huck (cohuck@redhat.com) wrote:
> On Wed, 6 May 2020 02:38:46 -0400
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Tue, May 05, 2020 at 12:37:26PM +0800, Alex Williamson wrote:
> > > It's been a long time, but that doesn't seem like what I was asking.
> > > The sysfs version checking is used to select a target that is likely to
> > > succeed, but the migration stream is still generated by a user and the
> > > vendor driver is still ultimately responsible for validating that
> > > stream.  I would hope that a vendor migration stream therefore starts
> > > with information similar to that found in the sysfs interface, allowing
> > > the receiving vendor driver to validate the source device and vendor
> > > software version, such that we can fail an incoming migration that the
> > > vendor driver deems incompatible.  Ideally the vendor driver might also
> > > include consistency and sequence checking throughout the stream to
> > > prevent a malicious user from exploiting the internal operation of the
> > > vendor driver.  Thanks,
> 
> Some kind of somewhat standardized marker for driver/version seems like
> a good idea. Further checking is also a good idea, but I think the
> details of that need to be left to the individual drivers.

Standardised markers like that would be useful; although the rules of
how to compare them might be a bit vendor specific; but still - it would
be good for us to be able to dump something out when it all goes wrong.

> > >   
> > Maybe we can add a rw field, migration_version, to
> > struct vfio_device_migration_info, in addition to the sysfs interface?
> > 
> > When read on the source, it returns the same string as sysfs does;
> > when written on the target, the write succeeds or fails to indicate
> > compatibility, so an incompatible migration fails early, in the setup phase.
> 
> Getting both populated from the same source seems like a good idea.
> 
> Not sure if a string is the best value to put into a migration stream;
> maybe the sysfs interface can derive a human-readable string from a
> more compact value to be put into the migration region (and ultimately
> the stream)? Might be overengineering, just thinking out aloud here.

A string might be OK if you specify a little about it.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v16 QEMU 08/16] vfio: Register SaveVMHandlers for VFIO device
  2020-05-06 16:53             ` Dr. David Alan Gilbert
@ 2020-05-06 19:30               ` Kirti Wankhede
  2020-05-07  6:37                 ` Cornelia Huck
  2020-05-07 20:29                 ` Alex Williamson
  0 siblings, 2 replies; 74+ messages in thread
From: Kirti Wankhede @ 2020-05-06 19:30 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Cornelia Huck
  Cc: Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	Zhengxiao.zx@Alibaba-inc.com, shuangtai.tst, qemu-devel, Wang,
	 Zhi A, mlevitsk, pasic, aik, Alex Williamson, eauger, felipe,
	jonathan.davies, Yan Zhao, Liu,  Changpeng, Ken.Xue



On 5/6/2020 10:23 PM, Dr. David Alan Gilbert wrote:
> * Cornelia Huck (cohuck@redhat.com) wrote:
>> On Wed, 6 May 2020 02:38:46 -0400
>> Yan Zhao <yan.y.zhao@intel.com> wrote:
>>
>>> On Tue, May 05, 2020 at 12:37:26PM +0800, Alex Williamson wrote:
>>>> It's been a long time, but that doesn't seem like what I was asking.
>>>> The sysfs version checking is used to select a target that is likely to
>>>> succeed, but the migration stream is still generated by a user and the
>>>> vendor driver is still ultimately responsible for validating that
>>>> stream.  I would hope that a vendor migration stream therefore starts
>>>> with information similar to that found in the sysfs interface, allowing
>>>> the receiving vendor driver to validate the source device and vendor
>>>> software version, such that we can fail an incoming migration that the
>>>> vendor driver deems incompatible.  Ideally the vendor driver might also
>>>> include consistency and sequence checking throughout the stream to
>>>> prevent a malicious user from exploiting the internal operation of the
>>>> vendor driver.  Thanks,
>>
>> Some kind of somewhat standardized marker for driver/version seems like
>> a good idea. Further checking is also a good idea, but I think the
>> details of that need to be left to the individual drivers.
> 
> Standardised markers like that would be useful; although the rules of
> how to compare them might be a bit vendor specific; but still - it would
> be good for us to be able to dump something out when it all goes wrong.
> 

Such checking should already be there in the vendor driver. The vendor driver
might also support cross-version migration. I think checking in QEMU again
would be redundant. Let the vendor driver handle version checks.
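
To make the cross-version point concrete, here is one rule a vendor driver
might apply on the target. It is purely illustrative (same major version,
source minor not newer than the target's); the struct and function name are
invented, and a real driver would define its own policy:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Invented version tuple; real drivers define their own format. */
struct drv_version {
    uint16_t major;
    uint16_t minor;
};

/*
 * Example policy only: accept a stream from the same major version whose
 * minor version is not newer than what the target driver understands.
 */
static int vendor_version_compatible(const struct drv_version *src,
                                     const struct drv_version *dst)
{
    assert(src && dst);

    if (src->major != dst->major) {
        return -EINVAL;
    }
    if (src->minor > dst->minor) {
        return -EINVAL;
    }
    return 0;
}
```

A rule like this is asymmetric (older to newer passes, newer to older fails),
which is why a plain equality check outside the driver could not replace it.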

Thanks,
Kirti

>>>>    
>>> Maybe we can add a rw field, migration_version, to
>>> struct vfio_device_migration_info, in addition to the sysfs interface?
>>>
>>> When read on the source, it returns the same string as sysfs does;
>>> when written on the target, the write succeeds or fails to indicate
>>> compatibility, so an incompatible migration fails early, in the setup phase.
>>
>> Getting both populated from the same source seems like a good idea.
>>
>> Not sure if a string is the best value to put into a migration stream;
>> maybe the sysfs interface can derive a human-readable string from a
>> more compact value to be put into the migration region (and ultimately
>> the stream)? Might be overengineering, just thinking out aloud here.
> 
> A string might be OK if you specify a little about it.
> 
> Dave
> 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v16 QEMU 04/16] vfio: Add save and load functions for VFIO PCI devices
  2020-05-06  6:11         ` Yan Zhao
@ 2020-05-06 19:48           ` Kirti Wankhede
  2020-05-06 20:03             ` Alex Williamson
  0 siblings, 1 reply; 74+ messages in thread
From: Kirti Wankhede @ 2020-05-06 19:48 UTC (permalink / raw)
  To: Yan Zhao, Alex Williamson
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang,  Zhi A,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies, Liu,
	Changpeng, Ken.Xue



On 5/6/2020 11:41 AM, Yan Zhao wrote:
> On Tue, May 05, 2020 at 12:37:11PM +0800, Alex Williamson wrote:
>> On Tue, 5 May 2020 04:48:37 +0530
>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>
>>> On 3/26/2020 1:26 AM, Alex Williamson wrote:
>>>> On Wed, 25 Mar 2020 02:39:02 +0530
>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>    
>>>>> These functions save and restore PCI device specific data - config
>>>>> space of PCI device.
>>>>> Tested save and restore with MSI and MSIX type.
>>>>>
>>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>>>> ---
>>>>>    hw/vfio/pci.c                 | 163 ++++++++++++++++++++++++++++++++++++++++++
>>>>>    include/hw/vfio/vfio-common.h |   2 +
>>>>>    2 files changed, 165 insertions(+)
>>>>>
>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>> index 6c77c12e44b9..8deb11e87ef7 100644
>>>>> --- a/hw/vfio/pci.c
>>>>> +++ b/hw/vfio/pci.c
>>>>> @@ -41,6 +41,7 @@
>>>>>    #include "trace.h"
>>>>>    #include "qapi/error.h"
>>>>>    #include "migration/blocker.h"
>>>>> +#include "migration/qemu-file.h"
>>>>>    
>>>>>    #define TYPE_VFIO_PCI "vfio-pci"
>>>>>    #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
>>>>> @@ -1632,6 +1633,50 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
>>>>>        }
>>>>>    }
>>>>>    
>>>>> +static int vfio_bar_validate(VFIOPCIDevice *vdev, int nr)
>>>>> +{
>>>>> +    PCIDevice *pdev = &vdev->pdev;
>>>>> +    VFIOBAR *bar = &vdev->bars[nr];
>>>>> +    uint64_t addr;
>>>>> +    uint32_t addr_lo, addr_hi = 0;
>>>>> +
>>>>> +    /* Skip unimplemented BARs and the upper half of 64bit BARS. */
>>>>> +    if (!bar->size) {
>>>>> +        return 0;
>>>>> +    }
>>>>> +
>>>>> +    addr_lo = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + nr * 4, 4);
>>>>> +
>>>>> +    addr_lo = addr_lo & (bar->ioport ? PCI_BASE_ADDRESS_IO_MASK :
>>>>> +                                       PCI_BASE_ADDRESS_MEM_MASK);
>>>>
>>>> Nit, &= or combine with previous set.
>>>>    
>>>>> +    if (bar->type == PCI_BASE_ADDRESS_MEM_TYPE_64) {
>>>>> +        addr_hi = pci_default_read_config(pdev,
>>>>> +                                         PCI_BASE_ADDRESS_0 + (nr + 1) * 4, 4);
>>>>> +    }
>>>>> +
>>>>> +    addr = ((uint64_t)addr_hi << 32) | addr_lo;
>>>>
>>>> Could we use a union?
>>>>    
>>>>> +
>>>>> +    if (!QEMU_IS_ALIGNED(addr, bar->size)) {
>>>>> +        return -EINVAL;
>>>>> +    }
>>>>
>>>> What specifically are we validating here?  This should be true no
>>>> matter what we wrote to the BAR or else BAR emulation is broken.  The
>>>> bits that could make this unaligned are not implemented in the BAR.
>>>>    
>>>>> +
>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>> +static int vfio_bars_validate(VFIOPCIDevice *vdev)
>>>>> +{
>>>>> +    int i, ret;
>>>>> +
>>>>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
>>>>> +        ret = vfio_bar_validate(vdev, i);
>>>>> +        if (ret) {
>>>>> +            error_report("vfio: BAR address %d validation failed", i);
>>>>> +            return ret;
>>>>> +        }
>>>>> +    }
>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>>    static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
>>>>>    {
>>>>>        VFIOBAR *bar = &vdev->bars[nr];
>>>>> @@ -2414,11 +2459,129 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
>>>>>        return OBJECT(vdev);
>>>>>    }
>>>>>    
>>>>> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
>>>>> +{
>>>>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>>>>> +    PCIDevice *pdev = &vdev->pdev;
>>>>> +    uint16_t pci_cmd;
>>>>> +    int i;
>>>>> +
>>>>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
>>>>> +        uint32_t bar;
>>>>> +
>>>>> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
>>>>> +        qemu_put_be32(f, bar);
>>>>> +    }
>>>>> +
>>>>> +    qemu_put_be32(f, vdev->interrupt);
>>>>> +    if (vdev->interrupt == VFIO_INT_MSI) {
>>>>> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
>>>>> +        bool msi_64bit;
>>>>> +
>>>>> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>>>>> +                                            2);
>>>>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
>>>>> +
>>>>> +        msi_addr_lo = pci_default_read_config(pdev,
>>>>> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
>>>>> +        qemu_put_be32(f, msi_addr_lo);
>>>>> +
>>>>> +        if (msi_64bit) {
>>>>> +            msi_addr_hi = pci_default_read_config(pdev,
>>>>> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
>>>>> +                                             4);
>>>>> +        }
>>>>> +        qemu_put_be32(f, msi_addr_hi);
>>>>> +
>>>>> +        msi_data = pci_default_read_config(pdev,
>>>>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
>>>>> +                2);
>>>>> +        qemu_put_be32(f, msi_data);
>>>>
>>>> Isn't the data field only a u16?
>>>>    
>>>
>>> Yes, fixing it.
>>>
>>>>> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
>>>>> +        uint16_t offset;
>>>>> +
>>>>> +        /* save enable bit and maskall bit */
>>>>> +        offset = pci_default_read_config(pdev,
>>>>> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
>>>>> +        qemu_put_be16(f, offset);
>>>>> +        msix_save(pdev, f);
>>>>> +    }
>>>>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
>>>>> +    qemu_put_be16(f, pci_cmd);
>>>>> +}
>>>>> +
>>>>> +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
>>>>> +{
>>>>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>>>>> +    PCIDevice *pdev = &vdev->pdev;
>>>>> +    uint32_t interrupt_type;
>>>>> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
>>>>> +    uint16_t pci_cmd;
>>>>> +    bool msi_64bit;
>>>>> +    int i, ret;
>>>>> +
>>>>> +    /* restore pci bar configuration */
>>>>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
>>>>> +    vfio_pci_write_config(pdev, PCI_COMMAND,
>>>>> +                        pci_cmd & (~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
>>>>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
>>>>> +        uint32_t bar = qemu_get_be32(f);
>>>>> +
>>>>> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
>>>>> +    }
>>>>> +
>>>>> +    ret = vfio_bars_validate(vdev);
>>>>> +    if (ret) {
>>>>> +        return ret;
>>>>> +    }
>>>>> +
>>>>> +    interrupt_type = qemu_get_be32(f);
>>>>> +
>>>>> +    if (interrupt_type == VFIO_INT_MSI) {
>>>>> +        /* restore msi configuration */
>>>>> +        msi_flags = pci_default_read_config(pdev,
>>>>> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
>>>>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
>>>>> +
>>>>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>>>>> +                              msi_flags & (~PCI_MSI_FLAGS_ENABLE), 2);
>>>>> +
>>>>> +        msi_addr_lo = qemu_get_be32(f);
>>>>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
>>>>> +                              msi_addr_lo, 4);
>>>>> +
>>>>> +        msi_addr_hi = qemu_get_be32(f);
>>>>> +        if (msi_64bit) {
>>>>> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
>>>>> +                                  msi_addr_hi, 4);
>>>>> +        }
>>>>> +        msi_data = qemu_get_be32(f);
>>>>> +        vfio_pci_write_config(pdev,
>>>>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
>>>>> +                msi_data, 2);
>>>>> +
>>>>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>>>>> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
>>>>> +    } else if (interrupt_type == VFIO_INT_MSIX) {
>>>>> +        uint16_t offset = qemu_get_be16(f);
>>>>> +
>>>>> +        /* load enable bit and maskall bit */
>>>>> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
>>>>> +                              offset, 2);
>>>>> +        msix_load(pdev, f);
>>>>> +    }
>>>>> +    pci_cmd = qemu_get_be16(f);
>>>>> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
>>>>> +    return 0;
>>>>> +}
>>>>
>>>> It always seems like there should be a lot more state than this, and I
>>>> probably sound like a broken record because I ask every time, but maybe
>>>> that's a good indication that we (or at least I) need a comment
>>>> explaining why we only care about these.  For example, what if we
>>>> migrate a device in the D3 power state, don't we need to account for
>>>> the state stored in the PM capability or does the device wake up into
>>>> D0 auto-magically after migration?  I think we could repeat that
>>>> question for every capability that can be modified.  Even for the MSI/X
>>>> cases, the interrupt may not be active, but there could be state in
>>>> virtual config space that would be different on the target.  For
>>>> example, if we migrate with a device in INTx mode where the guest had
>>>> written vector fields on the source, but only writes the enable bit on
>>>> the target, can we seamlessly figure out the rest?  For other
>>>> capabilities, that state may represent config space changes written
>>>> through to the physical device and represent a functional difference on
>>>> the target.  Thanks,
>>>>   
>>>
>>> These are a very basic set of registers from config state. The others are
>>> more vendor specific, and the vendor driver can save and restore them in
>>> its own data. I don't think we have to take care of all those vendor
>>> specific fields here.
>>
>> That had not been clear to me.  Intel folks, is this your understanding
>> regarding the responsibility of the user to save and restore config
>> space of the device as part of the vendor provided migration stream
>> data?  Thanks,
>>
> Currently, the code works for us, but I agree with you that there should
> be more state to save, at least for emulated config bits.
> I think we should call pci_device_save() to serve that purpose.
> 

If the vendor driver can restore all vendor specific config space, then
adding it again in QEMU might be redundant. As an example, I had mailed
the mtty sample code, in which config space has vendor specific information
that is restored in an easy way.

Thanks,
Kirti


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v16 QEMU 04/16] vfio: Add save and load functions for VFIO PCI devices
  2020-05-06 19:48           ` Kirti Wankhede
@ 2020-05-06 20:03             ` Alex Williamson
  2020-05-07  5:40               ` Kirti Wankhede
  0 siblings, 1 reply; 74+ messages in thread
From: Alex Williamson @ 2020-05-06 20:03 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies, Yan Zhao,
	Liu, Changpeng, Ken.Xue

On Thu, 7 May 2020 01:18:19 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 5/6/2020 11:41 AM, Yan Zhao wrote:
> > On Tue, May 05, 2020 at 12:37:11PM +0800, Alex Williamson wrote:  
> >> On Tue, 5 May 2020 04:48:37 +0530
> >> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>  
> >>> On 3/26/2020 1:26 AM, Alex Williamson wrote:  
> >>>> On Wed, 25 Mar 2020 02:39:02 +0530
> >>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>      
> >>>>> These functions save and restore PCI device specific data - config
> >>>>> space of PCI device.
> >>>>> Tested save and restore with MSI and MSIX type.
> >>>>>
> >>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >>>>> ---
> >>>>>    hw/vfio/pci.c                 | 163 ++++++++++++++++++++++++++++++++++++++++++
> >>>>>    include/hw/vfio/vfio-common.h |   2 +
> >>>>>    2 files changed, 165 insertions(+)
> >>>>>
> >>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >>>>> index 6c77c12e44b9..8deb11e87ef7 100644
> >>>>> --- a/hw/vfio/pci.c
> >>>>> +++ b/hw/vfio/pci.c
> >>>>> @@ -41,6 +41,7 @@
> >>>>>    #include "trace.h"
> >>>>>    #include "qapi/error.h"
> >>>>>    #include "migration/blocker.h"
> >>>>> +#include "migration/qemu-file.h"
> >>>>>    
> >>>>>    #define TYPE_VFIO_PCI "vfio-pci"
> >>>>>    #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> >>>>> @@ -1632,6 +1633,50 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
> >>>>>        }
> >>>>>    }
> >>>>>    
> >>>>> +static int vfio_bar_validate(VFIOPCIDevice *vdev, int nr)
> >>>>> +{
> >>>>> +    PCIDevice *pdev = &vdev->pdev;
> >>>>> +    VFIOBAR *bar = &vdev->bars[nr];
> >>>>> +    uint64_t addr;
> >>>>> +    uint32_t addr_lo, addr_hi = 0;
> >>>>> +
> >>>>> +    /* Skip unimplemented BARs and the upper half of 64bit BARS. */
> >>>>> +    if (!bar->size) {
> >>>>> +        return 0;
> >>>>> +    }
> >>>>> +
> >>>>> +    addr_lo = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + nr * 4, 4);
> >>>>> +
> >>>>> +    addr_lo = addr_lo & (bar->ioport ? PCI_BASE_ADDRESS_IO_MASK :
> >>>>> +                                       PCI_BASE_ADDRESS_MEM_MASK);  
> >>>>
> >>>> Nit, &= or combine with previous set.
> >>>>      
> >>>>> +    if (bar->type == PCI_BASE_ADDRESS_MEM_TYPE_64) {
> >>>>> +        addr_hi = pci_default_read_config(pdev,
> >>>>> +                                         PCI_BASE_ADDRESS_0 + (nr + 1) * 4, 4);
> >>>>> +    }
> >>>>> +
> >>>>> +    addr = ((uint64_t)addr_hi << 32) | addr_lo;  
> >>>>
> >>>> Could we use a union?
> >>>>      
> >>>>> +
> >>>>> +    if (!QEMU_IS_ALIGNED(addr, bar->size)) {
> >>>>> +        return -EINVAL;
> >>>>> +    }  
> >>>>
> >>>> What specifically are we validating here?  This should be true no
> >>>> matter what we wrote to the BAR or else BAR emulation is broken.  The
> >>>> bits that could make this unaligned are not implemented in the BAR.
> >>>>      
> >>>>> +
> >>>>> +    return 0;
> >>>>> +}
> >>>>> +
> >>>>> +static int vfio_bars_validate(VFIOPCIDevice *vdev)
> >>>>> +{
> >>>>> +    int i, ret;
> >>>>> +
> >>>>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> >>>>> +        ret = vfio_bar_validate(vdev, i);
> >>>>> +        if (ret) {
> >>>>> +            error_report("vfio: BAR address %d validation failed", i);
> >>>>> +            return ret;
> >>>>> +        }
> >>>>> +    }
> >>>>> +    return 0;
> >>>>> +}
> >>>>> +
> >>>>>    static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
> >>>>>    {
> >>>>>        VFIOBAR *bar = &vdev->bars[nr];
> >>>>> @@ -2414,11 +2459,129 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
> >>>>>        return OBJECT(vdev);
> >>>>>    }
> >>>>>    
> >>>>> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> >>>>> +{
> >>>>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> >>>>> +    PCIDevice *pdev = &vdev->pdev;
> >>>>> +    uint16_t pci_cmd;
> >>>>> +    int i;
> >>>>> +
> >>>>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> >>>>> +        uint32_t bar;
> >>>>> +
> >>>>> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> >>>>> +        qemu_put_be32(f, bar);
> >>>>> +    }
> >>>>> +
> >>>>> +    qemu_put_be32(f, vdev->interrupt);
> >>>>> +    if (vdev->interrupt == VFIO_INT_MSI) {
> >>>>> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> >>>>> +        bool msi_64bit;
> >>>>> +
> >>>>> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> >>>>> +                                            2);
> >>>>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> >>>>> +
> >>>>> +        msi_addr_lo = pci_default_read_config(pdev,
> >>>>> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> >>>>> +        qemu_put_be32(f, msi_addr_lo);
> >>>>> +
> >>>>> +        if (msi_64bit) {
> >>>>> +            msi_addr_hi = pci_default_read_config(pdev,
> >>>>> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> >>>>> +                                             4);
> >>>>> +        }
> >>>>> +        qemu_put_be32(f, msi_addr_hi);
> >>>>> +
> >>>>> +        msi_data = pci_default_read_config(pdev,
> >>>>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> >>>>> +                2);
> >>>>> +        qemu_put_be32(f, msi_data);  
> >>>>
> >>>> Isn't the data field only a u16?
> >>>>      
> >>>
> >>> Yes, fixing it.
> >>>  
> >>>>> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
> >>>>> +        uint16_t offset;
> >>>>> +
> >>>>> +        /* save enable bit and maskall bit */
> >>>>> +        offset = pci_default_read_config(pdev,
> >>>>> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> >>>>> +        qemu_put_be16(f, offset);
> >>>>> +        msix_save(pdev, f);
> >>>>> +    }
> >>>>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> >>>>> +    qemu_put_be16(f, pci_cmd);
> >>>>> +}
> >>>>> +
> >>>>> +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> >>>>> +{
> >>>>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> >>>>> +    PCIDevice *pdev = &vdev->pdev;
> >>>>> +    uint32_t interrupt_type;
> >>>>> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> >>>>> +    uint16_t pci_cmd;
> >>>>> +    bool msi_64bit;
> >>>>> +    int i, ret;
> >>>>> +
> >>>>> +    /* restore pci bar configuration */
> >>>>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> >>>>> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> >>>>> +                        pci_cmd & (~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
> >>>>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> >>>>> +        uint32_t bar = qemu_get_be32(f);
> >>>>> +
> >>>>> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> >>>>> +    }
> >>>>> +
> >>>>> +    ret = vfio_bars_validate(vdev);
> >>>>> +    if (ret) {
> >>>>> +        return ret;
> >>>>> +    }
> >>>>> +
> >>>>> +    interrupt_type = qemu_get_be32(f);
> >>>>> +
> >>>>> +    if (interrupt_type == VFIO_INT_MSI) {
> >>>>> +        /* restore msi configuration */
> >>>>> +        msi_flags = pci_default_read_config(pdev,
> >>>>> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> >>>>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> >>>>> +
> >>>>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> >>>>> +                              msi_flags & (~PCI_MSI_FLAGS_ENABLE), 2);
> >>>>> +
> >>>>> +        msi_addr_lo = qemu_get_be32(f);
> >>>>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
> >>>>> +                              msi_addr_lo, 4);
> >>>>> +
> >>>>> +        msi_addr_hi = qemu_get_be32(f);
> >>>>> +        if (msi_64bit) {
> >>>>> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> >>>>> +                                  msi_addr_hi, 4);
> >>>>> +        }
> >>>>> +        msi_data = qemu_get_be32(f);
> >>>>> +        vfio_pci_write_config(pdev,
> >>>>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> >>>>> +                msi_data, 2);
> >>>>> +
> >>>>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> >>>>> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
> >>>>> +    } else if (interrupt_type == VFIO_INT_MSIX) {
> >>>>> +        uint16_t offset = qemu_get_be16(f);
> >>>>> +
> >>>>> +        /* load enable bit and maskall bit */
> >>>>> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> >>>>> +                              offset, 2);
> >>>>> +        msix_load(pdev, f);
> >>>>> +    }
> >>>>> +    pci_cmd = qemu_get_be16(f);
> >>>>> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
> >>>>> +    return 0;
> >>>>> +}  
> >>>>
> >>>> It always seems like there should be a lot more state than this, and I
> >>>> probably sound like a broken record because I ask every time, but maybe
> >>>> that's a good indication that we (or at least I) need a comment
> >>>> explaining why we only care about these.  For example, what if we
> >>>> migrate a device in the D3 power state, don't we need to account for
> >>>> the state stored in the PM capability or does the device wake up into
> >>>> D0 auto-magically after migration?  I think we could repeat that
> >>>> question for every capability that can be modified.  Even for the MSI/X
> >>>> cases, the interrupt may not be active, but there could be state in
> >>>> virtual config space that would be different on the target.  For
> >>>> example, if we migrate with a device in INTx mode where the guest had
> >>>> written vector fields on the source, but only writes the enable bit on
> >>>> the target, can we seamlessly figure out the rest?  For other
> >>>> capabilities, that state may represent config space changes written
> >>>> through to the physical device and represent a functional difference on
> >>>> the target.  Thanks,
> >>>>     
> >>>
> >>> These are a very basic set of registers from config state. The others are
> >>> more vendor specific, and the vendor driver can save and restore them in
> >>> its own data. I don't think we have to take care of all those vendor
> >>> specific fields here.
> >>
> >> That had not been clear to me.  Intel folks, is this your understanding
> >> regarding the responsibility of the user to save and restore config
> >> space of the device as part of the vendor provided migration stream
> >> data?  Thanks,
> >>  
> > Currently, the code works for us, but I agree with you that there should
> > be more state to save, at least for emulated config bits.
> > I think we should call pci_device_save() to serve that purpose.
> >   
> 
> If the vendor driver can restore all vendor specific config space, then
> adding it again in QEMU might be redundant. As an example, I had mailed
> the mtty sample code, in which config space has vendor specific information
> that is restored in an easy way.

The redundancy is implementing it in each vendor driver.  Thanks,

Alex
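
Read that way, the basic registers the patch touches could be handled once in
common code instead of in every vendor driver. A toy sketch of that sequence
(disable decoding, write the BARs, restore the command register), using a
plain byte array in place of QEMU's PCIDevice. Every toy_* and cfg_* name is
invented for illustration, and real config space has emulated and read-only
bits that this ignores:

```c
#include <assert.h>
#include <stdint.h>

/* Standard PCI config-space offsets and command-register bits. */
#define PCI_COMMAND         0x04
#define PCI_COMMAND_IO      0x1
#define PCI_COMMAND_MEMORY  0x2
#define PCI_BASE_ADDRESS_0  0x10
#define TOY_NUM_BARS        6

/* Toy stand-in for a device's config space; not QEMU's PCIDevice. */
struct toy_cfg {
    uint8_t space[256];
};

/* What this sketch carries across migration. */
struct toy_saved {
    uint32_t bar[TOY_NUM_BARS];
    uint16_t cmd;
};

/* Little-endian accessors, since PCI config space is little-endian. */
static uint32_t cfg_read(const struct toy_cfg *c, unsigned off, unsigned len)
{
    uint32_t v = 0;

    for (unsigned i = 0; i < len; i++) {
        v |= (uint32_t)c->space[off + i] << (8 * i);
    }
    return v;
}

static void cfg_write(struct toy_cfg *c, unsigned off, uint32_t v, unsigned len)
{
    for (unsigned i = 0; i < len; i++) {
        c->space[off + i] = (uint8_t)(v >> (8 * i));
    }
}

/* Bitwise NOT (~) clears only the decode bits; logical NOT (!) would not. */
static uint16_t cmd_disable_decode(uint16_t cmd)
{
    return cmd & (uint16_t)~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY);
}

static void toy_save(const struct toy_cfg *c, struct toy_saved *s)
{
    assert(c && s);
    for (unsigned i = 0; i < TOY_NUM_BARS; i++) {
        s->bar[i] = cfg_read(c, PCI_BASE_ADDRESS_0 + i * 4, 4);
    }
    s->cmd = (uint16_t)cfg_read(c, PCI_COMMAND, 2);
}

static void toy_restore(struct toy_cfg *c, const struct toy_saved *s)
{
    assert(c && s);
    /* 1. Stop I/O and memory decoding while the BARs change. */
    cfg_write(c, PCI_COMMAND,
              cmd_disable_decode((uint16_t)cfg_read(c, PCI_COMMAND, 2)), 2);
    /* 2. Write the BAR values taken from the stream. */
    for (unsigned i = 0; i < TOY_NUM_BARS; i++) {
        cfg_write(c, PCI_BASE_ADDRESS_0 + i * 4, s->bar[i], 4);
    }
    /* 3. Restore the command register last. */
    cfg_write(c, PCI_COMMAND, s->cmd, 2);
}
```

Nothing in the sequence depends on the vendor, which is the sense in which a
per-driver copy of it would be the redundant one.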



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v16 QEMU 04/16] vfio: Add save and load functions for VFIO PCI devices
  2020-05-06 20:03             ` Alex Williamson
@ 2020-05-07  5:40               ` Kirti Wankhede
  2020-05-07 18:14                 ` Alex Williamson
  0 siblings, 1 reply; 74+ messages in thread
From: Kirti Wankhede @ 2020-05-07  5:40 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang,  Zhi A,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies, Yan Zhao,
	Liu, Changpeng, Ken.Xue



On 5/7/2020 1:33 AM, Alex Williamson wrote:
> On Thu, 7 May 2020 01:18:19 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 5/6/2020 11:41 AM, Yan Zhao wrote:
>>> On Tue, May 05, 2020 at 12:37:11PM +0800, Alex Williamson wrote:
>>>> On Tue, 5 May 2020 04:48:37 +0530
>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>   
>>>>> On 3/26/2020 1:26 AM, Alex Williamson wrote:
>>>>>> On Wed, 25 Mar 2020 02:39:02 +0530
>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>       
>>>>>>> These functions save and restore PCI device specific data - config
>>>>>>> space of PCI device.
>>>>>>> Tested save and restore with MSI and MSIX type.
>>>>>>>
>>>>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>>>>>> ---
>>>>>>>     hw/vfio/pci.c                 | 163 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>>     include/hw/vfio/vfio-common.h |   2 +
>>>>>>>     2 files changed, 165 insertions(+)
>>>>>>>
>>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>>>> index 6c77c12e44b9..8deb11e87ef7 100644
>>>>>>> --- a/hw/vfio/pci.c
>>>>>>> +++ b/hw/vfio/pci.c
>>>>>>> @@ -41,6 +41,7 @@
>>>>>>>     #include "trace.h"
>>>>>>>     #include "qapi/error.h"
>>>>>>>     #include "migration/blocker.h"
>>>>>>> +#include "migration/qemu-file.h"
>>>>>>>     
>>>>>>>     #define TYPE_VFIO_PCI "vfio-pci"
>>>>>>>     #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
>>>>>>> @@ -1632,6 +1633,50 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
>>>>>>>         }
>>>>>>>     }
>>>>>>>     
>>>>>>> +static int vfio_bar_validate(VFIOPCIDevice *vdev, int nr)
>>>>>>> +{
>>>>>>> +    PCIDevice *pdev = &vdev->pdev;
>>>>>>> +    VFIOBAR *bar = &vdev->bars[nr];
>>>>>>> +    uint64_t addr;
>>>>>>> +    uint32_t addr_lo, addr_hi = 0;
>>>>>>> +
>>>>>>> +    /* Skip unimplemented BARs and the upper half of 64bit BARS. */
>>>>>>> +    if (!bar->size) {
>>>>>>> +        return 0;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    addr_lo = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + nr * 4, 4);
>>>>>>> +
>>>>>>> +    addr_lo = addr_lo & (bar->ioport ? PCI_BASE_ADDRESS_IO_MASK :
>>>>>>> +                                       PCI_BASE_ADDRESS_MEM_MASK);
>>>>>>
>>>>>> Nit, &= or combine with previous set.
>>>>>>       
>>>>>>> +    if (bar->type == PCI_BASE_ADDRESS_MEM_TYPE_64) {
>>>>>>> +        addr_hi = pci_default_read_config(pdev,
>>>>>>> +                                         PCI_BASE_ADDRESS_0 + (nr + 1) * 4, 4);
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    addr = ((uint64_t)addr_hi << 32) | addr_lo;
>>>>>>
>>>>>> Could we use a union?
>>>>>>       
>>>>>>> +
>>>>>>> +    if (!QEMU_IS_ALIGNED(addr, bar->size)) {
>>>>>>> +        return -EINVAL;
>>>>>>> +    }
>>>>>>
>>>>>> What specifically are we validating here?  This should be true no
>>>>>> matter what we wrote to the BAR or else BAR emulation is broken.  The
>>>>>> bits that could make this unaligned are not implemented in the BAR.
>>>>>>       
>>>>>>> +
>>>>>>> +    return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static int vfio_bars_validate(VFIOPCIDevice *vdev)
>>>>>>> +{
>>>>>>> +    int i, ret;
>>>>>>> +
>>>>>>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
>>>>>>> +        ret = vfio_bar_validate(vdev, i);
>>>>>>> +        if (ret) {
>>>>>>> +            error_report("vfio: BAR address %d validation failed", i);
>>>>>>> +            return ret;
>>>>>>> +        }
>>>>>>> +    }
>>>>>>> +    return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>>     static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
>>>>>>>     {
>>>>>>>         VFIOBAR *bar = &vdev->bars[nr];
>>>>>>> @@ -2414,11 +2459,129 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
>>>>>>>         return OBJECT(vdev);
>>>>>>>     }
>>>>>>>     
>>>>>>> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
>>>>>>> +{
>>>>>>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>>>>>>> +    PCIDevice *pdev = &vdev->pdev;
>>>>>>> +    uint16_t pci_cmd;
>>>>>>> +    int i;
>>>>>>> +
>>>>>>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
>>>>>>> +        uint32_t bar;
>>>>>>> +
>>>>>>> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
>>>>>>> +        qemu_put_be32(f, bar);
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    qemu_put_be32(f, vdev->interrupt);
>>>>>>> +    if (vdev->interrupt == VFIO_INT_MSI) {
>>>>>>> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
>>>>>>> +        bool msi_64bit;
>>>>>>> +
>>>>>>> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>>>>>>> +                                            2);
>>>>>>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
>>>>>>> +
>>>>>>> +        msi_addr_lo = pci_default_read_config(pdev,
>>>>>>> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
>>>>>>> +        qemu_put_be32(f, msi_addr_lo);
>>>>>>> +
>>>>>>> +        if (msi_64bit) {
>>>>>>> +            msi_addr_hi = pci_default_read_config(pdev,
>>>>>>> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
>>>>>>> +                                             4);
>>>>>>> +        }
>>>>>>> +        qemu_put_be32(f, msi_addr_hi);
>>>>>>> +
>>>>>>> +        msi_data = pci_default_read_config(pdev,
>>>>>>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
>>>>>>> +                2);
>>>>>>> +        qemu_put_be32(f, msi_data);
>>>>>>
>>>>>> Isn't the data field only a u16?
>>>>>>       
>>>>>
>>>>> Yes, fixing it.
>>>>>   
>>>>>>> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
>>>>>>> +        uint16_t offset;
>>>>>>> +
>>>>>>> +        /* save enable bit and maskall bit */
>>>>>>> +        offset = pci_default_read_config(pdev,
>>>>>>> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
>>>>>>> +        qemu_put_be16(f, offset);
>>>>>>> +        msix_save(pdev, f);
>>>>>>> +    }
>>>>>>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
>>>>>>> +    qemu_put_be16(f, pci_cmd);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
>>>>>>> +{
>>>>>>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>>>>>>> +    PCIDevice *pdev = &vdev->pdev;
>>>>>>> +    uint32_t interrupt_type;
>>>>>>> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
>>>>>>> +    uint16_t pci_cmd;
>>>>>>> +    bool msi_64bit;
>>>>>>> +    int i, ret;
>>>>>>> +
>>>>>>> +    /* restore pci bar configuration */
>>>>>>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
>>>>>>> +    vfio_pci_write_config(pdev, PCI_COMMAND,
>>>>>>> +                        pci_cmd & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
>>>>>>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
>>>>>>> +        uint32_t bar = qemu_get_be32(f);
>>>>>>> +
>>>>>>> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    ret = vfio_bars_validate(vdev);
>>>>>>> +    if (ret) {
>>>>>>> +        return ret;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    interrupt_type = qemu_get_be32(f);
>>>>>>> +
>>>>>>> +    if (interrupt_type == VFIO_INT_MSI) {
>>>>>>> +        /* restore msi configuration */
>>>>>>> +        msi_flags = pci_default_read_config(pdev,
>>>>>>> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
>>>>>>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
>>>>>>> +
>>>>>>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>>>>>>> +                              msi_flags & ~PCI_MSI_FLAGS_ENABLE, 2);
>>>>>>> +
>>>>>>> +        msi_addr_lo = qemu_get_be32(f);
>>>>>>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
>>>>>>> +                              msi_addr_lo, 4);
>>>>>>> +
>>>>>>> +        msi_addr_hi = qemu_get_be32(f);
>>>>>>> +        if (msi_64bit) {
>>>>>>> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
>>>>>>> +                                  msi_addr_hi, 4);
>>>>>>> +        }
>>>>>>> +        msi_data = qemu_get_be32(f);
>>>>>>> +        vfio_pci_write_config(pdev,
>>>>>>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
>>>>>>> +                msi_data, 2);
>>>>>>> +
>>>>>>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>>>>>>> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
>>>>>>> +    } else if (interrupt_type == VFIO_INT_MSIX) {
>>>>>>> +        uint16_t offset = qemu_get_be16(f);
>>>>>>> +
>>>>>>> +        /* load enable bit and maskall bit */
>>>>>>> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
>>>>>>> +                              offset, 2);
>>>>>>> +        msix_load(pdev, f);
>>>>>>> +    }
>>>>>>> +    pci_cmd = qemu_get_be16(f);
>>>>>>> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
>>>>>>> +    return 0;
>>>>>>> +}
>>>>>>
>>>>>> It always seems like there should be a lot more state than this, and I
>>>>>> probably sound like a broken record because I ask every time, but maybe
>>>>>> that's a good indication that we (or at least I) need a comment
>>>>>> explaining why we only care about these.  For example, what if we
>>>>>> migrate a device in the D3 power state, don't we need to account for
>>>>>> the state stored in the PM capability or does the device wake up into
>>>>>> D0 auto-magically after migration?  I think we could repeat that
>>>>>> question for every capability that can be modified.  Even for the MSI/X
>>>>>> cases, the interrupt may not be active, but there could be state in
>>>>>> virtual config space that would be different on the target.  For
>>>>>> example, if we migrate with a device in INTx mode where the guest had
>>>>>> written vector fields on the source, but only writes the enable bit on
>>>>>> the target, can we seamlessly figure out the rest?  For other
>>>>>> capabilities, that state may represent config space changes written
>>>>>> through to the physical device and represent a functional difference on
>>>>>> the target.  Thanks,
>>>>>>      
>>>>>
>>>>> This is a very basic set of registers from config state. The others are
>>>>> mostly vendor specific, and the vendor driver can save and restore them in
>>>>> its own data. I don't think we have to take care of all those vendor
>>>>> specific fields here.
>>>>
>>>> That had not been clear to me.  Intel folks, is this your understanding
>>>> regarding the responsibility of the user to save and restore config
>>>> space of the device as part of the vendor provided migration stream
>>>> data?  Thanks,
>>>>   
>>> Currently, the code works for us, but I agree with you that there should
>>> be more states to save, at least for emulated config bits.
>>> I think we should call pci_device_save() to serve that purpose.
>>>    
>>
>> If vendor driver can restore all vendor specific config space, then
>> adding it again in QEMU might be redundant. As an example, I had mailed
>> mtty sample code, in which config space has vendor specific information
>> and that is restored in easy way.
> 
> The redundancy is implementing it in each vendor driver.  Thanks,
> 

The vendor driver knows its vendor specific configs best. Isn't it
better for the vendor driver to handle those at its end, rather than
adding vendor specific quirks in QEMU?

Thanks,
Kirti



* Re: [PATCH v16 QEMU 08/16] vfio: Register SaveVMHandlers for VFIO device
  2020-05-06 19:30               ` Kirti Wankhede
@ 2020-05-07  6:37                 ` Cornelia Huck
  2020-05-07 20:29                 ` Alex Williamson
  1 sibling, 0 replies; 74+ messages in thread
From: Cornelia Huck @ 2020-05-07  6:37 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye, qemu-devel,
	Zhengxiao.zx@Alibaba-inc.com, shuangtai.tst,
	Dr. David Alan Gilbert, Wang, Zhi A, mlevitsk, pasic, aik,
	Alex Williamson, eauger, felipe, jonathan.davies, Yan Zhao, Liu,
	Changpeng, Ken.Xue

On Thu, 7 May 2020 01:00:05 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 5/6/2020 10:23 PM, Dr. David Alan Gilbert wrote:
> > * Cornelia Huck (cohuck@redhat.com) wrote:  
> >> On Wed, 6 May 2020 02:38:46 -0400
> >> Yan Zhao <yan.y.zhao@intel.com> wrote:
> >>  
> >>> On Tue, May 05, 2020 at 12:37:26PM +0800, Alex Williamson wrote:  
> >>>> It's been a long time, but that doesn't seem like what I was asking.
> >>>> The sysfs version checking is used to select a target that is likely to
> >>>> succeed, but the migration stream is still generated by a user and the
> >>>> vendor driver is still ultimately responsible for validating that
> >>>> stream.  I would hope that a vendor migration stream therefore starts
> >>>> with information similar to that found in the sysfs interface, allowing
> >>>> the receiving vendor driver to validate the source device and vendor
> >>>> software version, such that we can fail an incoming migration that the
> >>>> vendor driver deems incompatible.  Ideally the vendor driver might also
> >>>> include consistency and sequence checking throughout the stream to
> >>>> prevent a malicious user from exploiting the internal operation of the
> >>>> vendor driver.  Thanks,  
> >>
> >> Some kind of somewhat standardized marker for driver/version seems like
> >> a good idea. Further checking is also a good idea, but I think the
> >> details of that need to be left to the individual drivers.  
> > 
> > Standardised markers like that would be useful; although the rules of
> > how to compare them might be a bit vendor specific; but still - it would
> > be good for us to be able to dump something out when it all goes wrong.
> >   
> 
> Such checking should already be there in the vendor driver. The vendor
> driver might also support cross-version migration. I think checking in
> QEMU again would be redundant. Let the vendor driver handle version checks.

Of course the actual rules of what is supported and what not are vendor
driver specific -- but we can still benefit from some standardization.
It ensures that this checking is not forgotten, and it can help with
figuring out what went wrong.
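
To make the idea of a standardized marker concrete, here is a minimal
sketch in C. Everything in it is an assumption made for illustration:
the mig_stream_header layout, the magic value, the result codes and
example_version_ok() are hypothetical and are not part of the VFIO
migration interface. It only shows the split discussed here: a common
header and common failure reporting, with the actual compatibility rule
left to a vendor-supplied callback.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical values and layout, chosen only for this example. */
#define MIG_HDR_MAGIC   0x5646494fU
#define MIG_VERSION_LEN 32

struct mig_stream_header {
    uint32_t magic;
    uint16_t vendor_id;
    uint16_t device_id;
    char     version[MIG_VERSION_LEN]; /* NUL-terminated, vendor defined */
};

enum mig_hdr_result {
    MIG_HDR_OK = 0,
    MIG_HDR_BAD_MAGIC,
    MIG_HDR_BAD_DEVICE,
    MIG_HDR_BAD_VERSION,
};

/* Vendor-supplied policy: nonzero means the source version is acceptable. */
typedef int (*mig_version_ok_fn)(const char *src_version);

/* Example vendor policy (hypothetical): accept any "1.x" source version. */
static int example_version_ok(const char *src_version)
{
    return strncmp(src_version, "1.", 2) == 0;
}

/* Common part: the same checks, in the same order, for every vendor. */
static enum mig_hdr_result
mig_header_check(const struct mig_stream_header *hdr,
                 uint16_t vendor_id, uint16_t device_id,
                 mig_version_ok_fn version_ok)
{
    if (hdr->magic != MIG_HDR_MAGIC) {
        return MIG_HDR_BAD_MAGIC;
    }
    if (hdr->vendor_id != vendor_id || hdr->device_id != device_id) {
        return MIG_HDR_BAD_DEVICE;
    }
    /* The stream is user controlled: do not assume the string is terminated. */
    if (memchr(hdr->version, '\0', MIG_VERSION_LEN) == NULL) {
        return MIG_HDR_BAD_VERSION;
    }
    if (!version_ok(hdr->version)) {
        return MIG_HDR_BAD_VERSION;
    }
    return MIG_HDR_OK;
}

/* Something to dump out when it all goes wrong. */
static const char *mig_hdr_strerror(enum mig_hdr_result r)
{
    switch (r) {
    case MIG_HDR_OK:          return "ok";
    case MIG_HDR_BAD_MAGIC:   return "bad magic";
    case MIG_HDR_BAD_DEVICE:  return "device mismatch";
    case MIG_HDR_BAD_VERSION: return "incompatible version";
    }
    return "unknown";
}
```

The comparison rule stays entirely inside the callback, so a vendor
driver that supports cross-version migration would only change
example_version_ok() and nothing in the common path.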




* Re: [PATCH v16 QEMU 04/16] vfio: Add save and load functions for VFIO PCI devices
  2020-05-07  5:40               ` Kirti Wankhede
@ 2020-05-07 18:14                 ` Alex Williamson
  0 siblings, 0 replies; 74+ messages in thread
From: Alex Williamson @ 2020-05-07 18:14 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies, Yan Zhao,
	Liu, Changpeng, Ken.Xue

On Thu, 7 May 2020 11:10:40 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 5/7/2020 1:33 AM, Alex Williamson wrote:
> > On Thu, 7 May 2020 01:18:19 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 5/6/2020 11:41 AM, Yan Zhao wrote:  
> >>> On Tue, May 05, 2020 at 12:37:11PM +0800, Alex Williamson wrote:  
> >>>> On Tue, 5 May 2020 04:48:37 +0530
> >>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>     
> >>>>> On 3/26/2020 1:26 AM, Alex Williamson wrote:  
> >>>>>> On Wed, 25 Mar 2020 02:39:02 +0530
> >>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>>         
> >>>>>>> These functions save and restore PCI device specific data - config
> >>>>>>> space of PCI device.
> >>>>>>> Tested save and restore with MSI and MSIX type.
> >>>>>>>
> >>>>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>>>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >>>>>>> ---
> >>>>>>>     hw/vfio/pci.c                 | 163 ++++++++++++++++++++++++++++++++++++++++++
> >>>>>>>     include/hw/vfio/vfio-common.h |   2 +
> >>>>>>>     2 files changed, 165 insertions(+)
> >>>>>>>
> >>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >>>>>>> index 6c77c12e44b9..8deb11e87ef7 100644
> >>>>>>> --- a/hw/vfio/pci.c
> >>>>>>> +++ b/hw/vfio/pci.c
> >>>>>>> @@ -41,6 +41,7 @@
> >>>>>>>     #include "trace.h"
> >>>>>>>     #include "qapi/error.h"
> >>>>>>>     #include "migration/blocker.h"
> >>>>>>> +#include "migration/qemu-file.h"
> >>>>>>>     
> >>>>>>>     #define TYPE_VFIO_PCI "vfio-pci"
> >>>>>>>     #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> >>>>>>> @@ -1632,6 +1633,50 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
> >>>>>>>         }
> >>>>>>>     }
> >>>>>>>     
> >>>>>>> +static int vfio_bar_validate(VFIOPCIDevice *vdev, int nr)
> >>>>>>> +{
> >>>>>>> +    PCIDevice *pdev = &vdev->pdev;
> >>>>>>> +    VFIOBAR *bar = &vdev->bars[nr];
> >>>>>>> +    uint64_t addr;
> >>>>>>> +    uint32_t addr_lo, addr_hi = 0;
> >>>>>>> +
> >>>>>>> +    /* Skip unimplemented BARs and the upper half of 64bit BARS. */
> >>>>>>> +    if (!bar->size) {
> >>>>>>> +        return 0;
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>> +    addr_lo = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + nr * 4, 4);
> >>>>>>> +
> >>>>>>> +    addr_lo = addr_lo & (bar->ioport ? PCI_BASE_ADDRESS_IO_MASK :
> >>>>>>> +                                       PCI_BASE_ADDRESS_MEM_MASK);  
> >>>>>>
> >>>>>> Nit, &= or combine with previous set.
> >>>>>>         
> >>>>>>> +    if (bar->type == PCI_BASE_ADDRESS_MEM_TYPE_64) {
> >>>>>>> +        addr_hi = pci_default_read_config(pdev,
> >>>>>>> +                                         PCI_BASE_ADDRESS_0 + (nr + 1) * 4, 4);
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>> +    addr = ((uint64_t)addr_hi << 32) | addr_lo;  
> >>>>>>
> >>>>>> Could we use a union?
> >>>>>>         
> >>>>>>> +
> >>>>>>> +    if (!QEMU_IS_ALIGNED(addr, bar->size)) {
> >>>>>>> +        return -EINVAL;
> >>>>>>> +    }  
> >>>>>>
> >>>>>> What specifically are we validating here?  This should be true no
> >>>>>> matter what we wrote to the BAR or else BAR emulation is broken.  The
> >>>>>> bits that could make this unaligned are not implemented in the BAR.
> >>>>>>         
> >>>>>>> +
> >>>>>>> +    return 0;
> >>>>>>> +}
> >>>>>>> +
> >>>>>>> +static int vfio_bars_validate(VFIOPCIDevice *vdev)
> >>>>>>> +{
> >>>>>>> +    int i, ret;
> >>>>>>> +
> >>>>>>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> >>>>>>> +        ret = vfio_bar_validate(vdev, i);
> >>>>>>> +        if (ret) {
> >>>>>>> +            error_report("vfio: BAR address %d validation failed", i);
> >>>>>>> +            return ret;
> >>>>>>> +        }
> >>>>>>> +    }
> >>>>>>> +    return 0;
> >>>>>>> +}
> >>>>>>> +
> >>>>>>>     static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
> >>>>>>>     {
> >>>>>>>         VFIOBAR *bar = &vdev->bars[nr];
> >>>>>>> @@ -2414,11 +2459,129 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
> >>>>>>>         return OBJECT(vdev);
> >>>>>>>     }
> >>>>>>>     
> >>>>>>> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> >>>>>>> +{
> >>>>>>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> >>>>>>> +    PCIDevice *pdev = &vdev->pdev;
> >>>>>>> +    uint16_t pci_cmd;
> >>>>>>> +    int i;
> >>>>>>> +
> >>>>>>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> >>>>>>> +        uint32_t bar;
> >>>>>>> +
> >>>>>>> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> >>>>>>> +        qemu_put_be32(f, bar);
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>> +    qemu_put_be32(f, vdev->interrupt);
> >>>>>>> +    if (vdev->interrupt == VFIO_INT_MSI) {
> >>>>>>> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> >>>>>>> +        bool msi_64bit;
> >>>>>>> +
> >>>>>>> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> >>>>>>> +                                            2);
> >>>>>>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> >>>>>>> +
> >>>>>>> +        msi_addr_lo = pci_default_read_config(pdev,
> >>>>>>> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> >>>>>>> +        qemu_put_be32(f, msi_addr_lo);
> >>>>>>> +
> >>>>>>> +        if (msi_64bit) {
> >>>>>>> +            msi_addr_hi = pci_default_read_config(pdev,
> >>>>>>> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> >>>>>>> +                                             4);
> >>>>>>> +        }
> >>>>>>> +        qemu_put_be32(f, msi_addr_hi);
> >>>>>>> +
> >>>>>>> +        msi_data = pci_default_read_config(pdev,
> >>>>>>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> >>>>>>> +                2);
> >>>>>>> +        qemu_put_be32(f, msi_data);  
> >>>>>>
> >>>>>> Isn't the data field only a u16?
> >>>>>>         
> >>>>>
> >>>>> Yes, fixing it.
> >>>>>     
> >>>>>>> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
> >>>>>>> +        uint16_t offset;
> >>>>>>> +
> >>>>>>> +        /* save enable bit and maskall bit */
> >>>>>>> +        offset = pci_default_read_config(pdev,
> >>>>>>> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> >>>>>>> +        qemu_put_be16(f, offset);
> >>>>>>> +        msix_save(pdev, f);
> >>>>>>> +    }
> >>>>>>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> >>>>>>> +    qemu_put_be16(f, pci_cmd);
> >>>>>>> +}
> >>>>>>> +
> >>>>>>> +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> >>>>>>> +{
> >>>>>>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> >>>>>>> +    PCIDevice *pdev = &vdev->pdev;
> >>>>>>> +    uint32_t interrupt_type;
> >>>>>>> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> >>>>>>> +    uint16_t pci_cmd;
> >>>>>>> +    bool msi_64bit;
> >>>>>>> +    int i, ret;
> >>>>>>> +
> >>>>>>> +    /* restore pci bar configuration */
> >>>>>>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> >>>>>>> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> >>>>>>> +                        pci_cmd & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
> >>>>>>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> >>>>>>> +        uint32_t bar = qemu_get_be32(f);
> >>>>>>> +
> >>>>>>> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>> +    ret = vfio_bars_validate(vdev);
> >>>>>>> +    if (ret) {
> >>>>>>> +        return ret;
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>> +    interrupt_type = qemu_get_be32(f);
> >>>>>>> +
> >>>>>>> +    if (interrupt_type == VFIO_INT_MSI) {
> >>>>>>> +        /* restore msi configuration */
> >>>>>>> +        msi_flags = pci_default_read_config(pdev,
> >>>>>>> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> >>>>>>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> >>>>>>> +
> >>>>>>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> >>>>>>> +                              msi_flags & ~PCI_MSI_FLAGS_ENABLE, 2);
> >>>>>>> +
> >>>>>>> +        msi_addr_lo = qemu_get_be32(f);
> >>>>>>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
> >>>>>>> +                              msi_addr_lo, 4);
> >>>>>>> +
> >>>>>>> +        msi_addr_hi = qemu_get_be32(f);
> >>>>>>> +        if (msi_64bit) {
> >>>>>>> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> >>>>>>> +                                  msi_addr_hi, 4);
> >>>>>>> +        }
> >>>>>>> +        msi_data = qemu_get_be32(f);
> >>>>>>> +        vfio_pci_write_config(pdev,
> >>>>>>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> >>>>>>> +                msi_data, 2);
> >>>>>>> +
> >>>>>>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> >>>>>>> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
> >>>>>>> +    } else if (interrupt_type == VFIO_INT_MSIX) {
> >>>>>>> +        uint16_t offset = qemu_get_be16(f);
> >>>>>>> +
> >>>>>>> +        /* load enable bit and maskall bit */
> >>>>>>> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> >>>>>>> +                              offset, 2);
> >>>>>>> +        msix_load(pdev, f);
> >>>>>>> +    }
> >>>>>>> +    pci_cmd = qemu_get_be16(f);
> >>>>>>> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
> >>>>>>> +    return 0;
> >>>>>>> +}  
> >>>>>>
> >>>>>> It always seems like there should be a lot more state than this, and I
> >>>>>> probably sound like a broken record because I ask every time, but maybe
> >>>>>> that's a good indication that we (or at least I) need a comment
> >>>>>> explaining why we only care about these.  For example, what if we
> >>>>>> migrate a device in the D3 power state, don't we need to account for
> >>>>>> the state stored in the PM capability or does the device wake up into
> >>>>>> D0 auto-magically after migration?  I think we could repeat that
> >>>>>> question for every capability that can be modified.  Even for the MSI/X
> >>>>>> cases, the interrupt may not be active, but there could be state in
> >>>>>> virtual config space that would be different on the target.  For
> >>>>>> example, if we migrate with a device in INTx mode where the guest had
> >>>>>> written vector fields on the source, but only writes the enable bit on
> >>>>>> the target, can we seamlessly figure out the rest?  For other
> >>>>>> capabilities, that state may represent config space changes written
> >>>>>> through to the physical device and represent a functional difference on
> >>>>>> the target.  Thanks,
> >>>>>>        
> >>>>>
> >>>>> This is a very basic set of registers from config state. The others are
> >>>>> mostly vendor specific, and the vendor driver can save and restore them in
> >>>>> its own data. I don't think we have to take care of all those vendor
> >>>>> specific fields here.
> >>>>
> >>>> That had not been clear to me.  Intel folks, is this your understanding
> >>>> regarding the responsibility of the user to save and restore config
> >>>> space of the device as part of the vendor provided migration stream
> >>>> data?  Thanks,
> >>>>     
> >>> Currently, the code works for us, but I agree with you that there should
> >>> be more states to save, at least for emulated config bits.
> >>> I think we should call pci_device_save() to serve that purpose.
> >>>      
> >>
> >> If vendor driver can restore all vendor specific config space, then
> >> adding it again in QEMU might be redundant. As an example, I had mailed
> >> mtty sample code, in which config space has vendor specific information
> >> and that is restored in easy way.  
> > 
> > The redundancy is implementing it in each vendor driver.  Thanks,
> >   
> 
> Vendor driver knows better about vendor specific configs, isn't it 
> better vendor driver handle those at their end rather than adding vendor 
> specific quirks in QEMU?

Some capabilities, e.g. the vendor specific capability, interact with
hardware in ways that are opaque to QEMU, so the vendor driver needs to
include that underlying state as part of its migration stream.  Other
capabilities are completely standard, for example the PM capability.
Do we really want every vendor driver to implement their own power
state save and restore?  I think we want to centralize anything we can
in the save/restore process so that we don't see different behavior and
different bugs from every single vendor driver.  Thanks,

Alex
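
As an illustration of what centralizing a standard capability could look
like, here is a minimal, self-contained sketch in C for the PM
capability's power state. It is not QEMU code: config space is modelled
as a plain byte array, and cfg_read16(), cfg_write16(), pm_state_save()
and pm_state_load() are hypothetical helpers. The two constants mirror
PCI_PM_CTRL and PCI_PM_CTRL_STATE_MASK from the PCI register
definitions. A real implementation would presumably go through the
device's config write path so that the physical device is transitioned
as well.

```c
#include <stdint.h>

#define PM_CTRL_OFFSET     4      /* PMCSR offset within the PM capability */
#define PM_CTRL_STATE_MASK 0x0003 /* power state D0..D3hot, bits 1:0 */

/* Read a little-endian 16-bit register from emulated config space. */
static uint16_t cfg_read16(const uint8_t *cfg, unsigned off)
{
    return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

/* Write a little-endian 16-bit register to emulated config space. */
static void cfg_write16(uint8_t *cfg, unsigned off, uint16_t val)
{
    cfg[off] = (uint8_t)(val & 0xff);
    cfg[off + 1] = (uint8_t)(val >> 8);
}

/* Source side: extract the power state to put into the migration stream. */
static uint8_t pm_state_save(const uint8_t *cfg, unsigned pm_cap)
{
    return (uint8_t)(cfg_read16(cfg, pm_cap + PM_CTRL_OFFSET) &
                     PM_CTRL_STATE_MASK);
}

/*
 * Target side: apply the saved power state, leaving the other PMCSR bits
 * untouched. Returns 0 on success, -1 if the value from the stream is not
 * a valid power state.
 */
static int pm_state_load(uint8_t *cfg, unsigned pm_cap, uint8_t state)
{
    uint16_t pmcsr;

    if (state & ~PM_CTRL_STATE_MASK) {
        return -1;
    }
    pmcsr = cfg_read16(cfg, pm_cap + PM_CTRL_OFFSET);
    pmcsr = (uint16_t)((pmcsr & ~PM_CTRL_STATE_MASK) | state);
    cfg_write16(cfg, pm_cap + PM_CTRL_OFFSET, pmcsr);
    return 0;
}
```

Written once in common code, the validation of the incoming value and
the read-modify-write of PMCSR behave the same for every vendor driver.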




* Re: [PATCH v16 QEMU 08/16] vfio: Register SaveVMHandlers for VFIO device
  2020-05-06 19:30               ` Kirti Wankhede
  2020-05-07  6:37                 ` Cornelia Huck
@ 2020-05-07 20:29                 ` Alex Williamson
  1 sibling, 0 replies; 74+ messages in thread
From: Alex Williamson @ 2020-05-07 20:29 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx@Alibaba-inc.com, Tian, Kevin, Liu, Yi L, cjia,
	eskultet, Yang, Ziye, qemu-devel, Cornelia Huck, shuangtai.tst,
	Dr. David Alan Gilbert, Wang, Zhi A, mlevitsk, pasic, aik,
	eauger, felipe, jonathan.davies, Yan Zhao, Liu, Changpeng,
	Ken.Xue

On Thu, 7 May 2020 01:00:05 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 5/6/2020 10:23 PM, Dr. David Alan Gilbert wrote:
> > * Cornelia Huck (cohuck@redhat.com) wrote:  
> >> On Wed, 6 May 2020 02:38:46 -0400
> >> Yan Zhao <yan.y.zhao@intel.com> wrote:
> >>  
> >>> On Tue, May 05, 2020 at 12:37:26PM +0800, Alex Williamson wrote:  
> >>>> It's been a long time, but that doesn't seem like what I was asking.
> >>>> The sysfs version checking is used to select a target that is likely to
> >>>> succeed, but the migration stream is still generated by a user and the
> >>>> vendor driver is still ultimately responsible for validating that
> >>>> stream.  I would hope that a vendor migration stream therefore starts
> >>>> with information similar to that found in the sysfs interface, allowing
> >>>> the receiving vendor driver to validate the source device and vendor
> >>>> software version, such that we can fail an incoming migration that the
> >>>> vendor driver deems incompatible.  Ideally the vendor driver might also
> >>>> include consistency and sequence checking throughout the stream to
> >>>> prevent a malicious user from exploiting the internal operation of the
> >>>> vendor driver.  Thanks,  
> >>
> >> Some kind of somewhat standardized marker for driver/version seems like
> >> a good idea. Further checking is also a good idea, but I think the
> >> details of that need to be left to the individual drivers.  
> > 
> > Standardised markers like that would be useful; although the rules of
> > how to compare them might be a bit vendor specific; but still - it would
> > be good for us to be able to dump something out when it all goes wrong.
> >   
> 
> Such checking should already be there in the vendor driver. The vendor
> driver might also support cross-version migration. I think checking in
> QEMU again would be redundant. Let the vendor driver handle version checks.
>
> >>>>      
> >>> maybe we can add a rw field migration_version in
> >>> struct vfio_device_migration_info besides sysfs interface ?
> >>>
> >>> when reading it in src, it gets the same string as that from sysfs;
> >>> when writing it in target, it returns success or not to check
> >>> compatibility and fails the migration early in setup phase.  
> >>
> >> Getting both populated from the same source seems like a good idea.
> >>
> >> Not sure if a string is the best value to put into a migration stream;
> >> maybe the sysfs interface can derive a human-readable string from a
> >> more compact value to be put into the migration region (and ultimately
> >> the stream)? Might be overengineering, just thinking out aloud here.  
> > 
> > > A string might be OK if you specify a little about it.

I think we've already hashed through that the version is represented by
a string, but interpretation of that string is reserved for the vendor
driver.  I believe this particular thread started out as a question of
whether QEMU is right to validate target compatibility by comparing the
migration region size versus the source, which I see as an overstep of
leaving the compatibility testing to the vendor driver.  A write
exceeding the migration region is clearly a protocol violation, but
unless we're going to scan the entire migration stream to look for that
violation, it's the vendor driver's business where and how it exposes
data within the region.  IOW, differing migration region sizes might be
grounds for suspicion, but nothing in our specification requires that
the target region be at least as big as the source.
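
The protocol violation mentioned above, a write that exceeds the
migration region, can be caught chunk by chunk on the target without
comparing region sizes up front. A minimal sketch in C, assuming each
chunk is described by the data_offset and data_size fields used in this
series; the helper name mig_chunk_in_region() and the region_size
parameter are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if the byte range [data_offset, data_offset + data_size)
 * lies entirely within a migration region of region_size bytes.
 * The comparison is arranged so that it cannot overflow, since the
 * values may come from an untrusted stream.
 */
static bool mig_chunk_in_region(uint64_t region_size,
                                uint64_t data_offset,
                                uint64_t data_size)
{
    if (data_offset > region_size) {
        return false;
    }
    return data_size <= region_size - data_offset;
}
```

With a check like this at the point of each write, a smaller target
region only fails the migration if a chunk actually does not fit.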

If we had a mechanism to report and test the migration version through
this migration API, using similar semantics to the sysfs interface,
what would we actually do with it?  The vendor driver's processing of
an incoming migration stream cannot rely on the user.  I initially
struggled with Kirti's use of "should" rather than "must" in describing
this checking, but I think that might actually be correct.  If a user
chooses to ignore the sysfs interface for compatibility testing, or
otherwise chooses to allow the data stream to be corrupted or
manipulated, I think the only requirement of the vendor driver is to
contain the damage to the user's device.  So, I think we're really
looking at whether it's a benefit to the user to be able to retrieve
the version and test it on the target through the migration API.  IOW,
is it sufficient for QEMU to presume that a well-informed agent, that
has already tested the source and target device compatibility, has set up
this migration and that a well-supported mdev vendor driver should fail
the migration gracefully if the versions are incompatible, or contain
the error within the user's device otherwise, or is there value to be
gained if QEMU performs a separate compatibility test?  Thanks,

Alex




* Re: [PATCH v16 QEMU 09/16] vfio: Add save state functions to SaveVMHandlers
  2020-03-24 21:09 ` [PATCH v16 QEMU 09/16] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
  2020-03-25 22:03   ` Alex Williamson
@ 2020-05-09  5:31   ` Yan Zhao
  2020-05-11 10:22     ` Kirti Wankhede
  1 sibling, 1 reply; 74+ messages in thread
From: Yan Zhao @ 2020-05-09  5:31 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Wed, Mar 25, 2020 at 05:09:07AM +0800, Kirti Wankhede wrote:
> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
> functions. These functions handle the pre-copy and stop-and-copy phases.
> 
> In _SAVING|_RUNNING device state or pre-copy phase:
> - read pending_bytes. If pending_bytes > 0, go through below steps.
> - read data_offset - indicates kernel driver to write data to staging
>   buffer.
> - read data_size - amount of data in bytes written by vendor driver in
>   migration region.
I think we should swap the order in which data_size and data_offset are
read. See the next comment below.

> - read data_size bytes of data from data_offset in the migration region.
> - Write data packet to file stream as below:
> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> VFIO_MIG_FLAG_END_OF_STATE }
> 
> In _SAVING device state or stop-and-copy phase
> a. read config space of device and save to migration file stream. This
>    doesn't need to be from vendor driver. Any other special config state
>    from driver can be saved as data in following iteration.
> b. read pending_bytes. If pending_bytes > 0, go through below steps.
> c. read data_offset - indicates kernel driver to write data to staging
>    buffer.
> d. read data_size - amount of data in bytes written by vendor driver in
>    migration region.
> e. read data_size bytes of data from data_offset in the migration region.
> f. Write data packet as below:
>    {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> g. iterate through steps b to f while (pending_bytes > 0)
> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
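For reference, the stream framing in steps f to h above can be modeled
roughly as below. This is only an illustrative sketch, not code from the
patch: put_be64(), put_data_packet() and put_end_of_state() are
hypothetical helpers standing in for qemu_put_be64()/qemu_put_buffer(),
and the two flag values are placeholders assumed for the example.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Placeholder flag values, assumed for illustration only. */
#define MIG_FLAG_END_OF_STATE   0xffffffffef100001ULL
#define MIG_FLAG_DEV_DATA_STATE 0xffffffffef100004ULL

/*
 * Append a 64-bit value in big-endian order and return the new write
 * position. The caller must make sure 'out' has enough room.
 */
static size_t put_be64(uint8_t *out, size_t pos, uint64_t v)
{
    for (int i = 0; i < 8; i++) {
        out[pos + i] = (uint8_t)(v >> (56 - 8 * i));
    }
    return pos + 8;
}

/* Step f: {DEV_DATA_STATE, data_size, actual data}. */
static size_t put_data_packet(uint8_t *out, size_t pos,
                              const void *data, uint64_t data_size)
{
    pos = put_be64(out, pos, MIG_FLAG_DEV_DATA_STATE);
    pos = put_be64(out, pos, data_size);
    memcpy(out + pos, data, (size_t)data_size);
    return pos + (size_t)data_size;
}

/* Step h: close the section once pending_bytes has dropped to 0. */
static size_t put_end_of_state(uint8_t *out, size_t pos)
{
    return put_be64(out, pos, MIG_FLAG_END_OF_STATE);
}
```

In this model the stop-and-copy loop would call put_data_packet() once
per chunk and put_end_of_state() exactly once, after the last chunk.
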
> 
> When the data region is mapped, it is the user's responsibility to read
> data_size bytes of data from data_offset before moving to the next steps.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c           | 245 +++++++++++++++++++++++++++++++++++++++++-
>  hw/vfio/trace-events          |   6 ++
>  include/hw/vfio/vfio-common.h |   1 +
>  3 files changed, 251 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 033f76526e49..ecbeed5182c2 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -138,6 +138,137 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>      return 0;
>  }
>  
> +static void *find_data_region(VFIORegion *region,
> +                              uint64_t data_offset,
> +                              uint64_t data_size)
> +{
> +    void *ptr = NULL;
> +    int i;
> +
> +    for (i = 0; i < region->nr_mmaps; i++) {
> +        if ((data_offset >= region->mmaps[i].offset) &&
> +            (data_offset < region->mmaps[i].offset + region->mmaps[i].size) &&
> +            (data_size <= region->mmaps[i].size)) {
> +            ptr = region->mmaps[i].mmap + (data_offset -
> +                                           region->mmaps[i].offset);
> +            break;
> +        }
> +    }
> +    return ptr;
> +}
> +
> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region;
> +    uint64_t data_offset = 0, data_size = 0;
> +    int ret;
> +
> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data_offset));
> +    if (ret != sizeof(data_offset)) {
> +        error_report("%s: Failed to get migration buffer data offset %d",
> +                     vbasedev->name, ret);
> +        return -EINVAL;
> +    }
> +
> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data_size));
> +    if (ret != sizeof(data_size)) {
> +        error_report("%s: Failed to get migration buffer data size %d",
> +                     vbasedev->name, ret);
> +        return -EINVAL;
> +    }
data_size should be read first; if it is 0, data_offset should not be
read at all.

The reasons are:
1. If the vendor driver provides no data region, there is no valid
data_offset to return, so reading or writing data_offset should fail.
That failure should not be treated as a migration error.

2. Even if pending_bytes is 0, vfio_save_iterate() can still be called,
and it in turn calls vfio_save_buffer().
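
A minimal sketch of the ordering proposed here, with the migration region
modeled as an in-memory struct. struct mig_info, region_read() and
read_chunk_header() are hypothetical names for this example; in the patch
the reads are pread() calls on vbasedev->fd, and the real layout of
struct vfio_device_migration_info may differ.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the migration region header (assumed layout). */
struct mig_info {
    uint64_t pending_bytes;
    uint64_t data_offset;
    uint64_t data_size;
};

/* Counts data_offset reads so the ordering is observable. */
static int data_offset_reads;

/* Stand-in for pread() on the region: copy one 64-bit field out. */
static int region_read(const struct mig_info *info, size_t field_off,
                       uint64_t *val)
{
    if (field_off == offsetof(struct mig_info, data_offset)) {
        data_offset_reads++;
    }
    memcpy(val, (const uint8_t *)info + field_off, sizeof(*val));
    return (int)sizeof(*val);
}

/*
 * Read data_size first and touch data_offset only when there is data.
 * Returns 0 on success, -1 on a short read.
 */
static int read_chunk_header(const struct mig_info *info,
                             uint64_t *data_offset, uint64_t *data_size)
{
    *data_offset = 0;
    if (region_read(info, offsetof(struct mig_info, data_size),
                    data_size) != (int)sizeof(*data_size)) {
        return -1;
    }
    if (*data_size == 0) {
        return 0;   /* nothing staged: skip the data_offset read */
    }
    if (region_read(info, offsetof(struct mig_info, data_offset),
                    data_offset) != (int)sizeof(*data_offset)) {
        return -1;
    }
    return 0;
}
```

One caveat: the commit message describes the data_offset read as the
step that tells the driver to write data to the staging buffer, so this
ordering only works if the driver can report data_size before that read.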

Thanks
Yan
> +
> +    if (data_size > 0) {
> +        void *buf = NULL;
> +        bool buffer_mmaped;
> +
> +        if (region->mmaps) {
> +            buf = find_data_region(region, data_offset, data_size);
> +        }
> +
> +        buffer_mmaped = (buf != NULL) ? true : false;
> +
> +        if (!buffer_mmaped) {
> +            buf = g_try_malloc0(data_size);
> +            if (!buf) {
> +                error_report("%s: Error allocating buffer ", __func__);
> +                return -ENOMEM;
> +            }
> +
> +            ret = pread(vbasedev->fd, buf, data_size,
> +                        region->fd_offset + data_offset);
> +            if (ret != data_size) {
> +                error_report("%s: Failed to get migration data %d",
> +                             vbasedev->name, ret);
> +                g_free(buf);
> +                return -EINVAL;
> +            }
> +        }
> +
> +        qemu_put_be64(f, data_size);
> +        qemu_put_buffer(f, buf, data_size);
> +
> +        if (!buffer_mmaped) {
> +            g_free(buf);
> +        }
> +    } else {
> +        qemu_put_be64(f, data_size);
> +    }
> +
> +    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
> +                           migration->pending_bytes);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return data_size;
> +}
> +
> +static int vfio_update_pending(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region;
> +    uint64_t pending_bytes = 0;
> +    int ret;
> +
> +    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             pending_bytes));
> +    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
> +        error_report("%s: Failed to get pending bytes %d",
> +                     vbasedev->name, ret);
> +        migration->pending_bytes = 0;
> +        return (ret < 0) ? ret : -EINVAL;
> +    }
> +
> +    migration->pending_bytes = pending_bytes;
> +    trace_vfio_update_pending(vbasedev->name, pending_bytes);
> +    return 0;
> +}
> +
> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
> +
> +    if (vbasedev->ops && vbasedev->ops->vfio_save_config) {
> +        vbasedev->ops->vfio_save_config(vbasedev, f);
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    trace_vfio_save_device_config_state(vbasedev->name);
> +
> +    return qemu_file_get_error(f);
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -154,7 +285,7 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>          qemu_mutex_unlock_iothread();
>          if (ret) {
>              error_report("%s: Failed to mmap VFIO migration region %d: %s",
> -                         vbasedev->name, migration->region.index,
> +                         vbasedev->name, migration->region.nr,
>                           strerror(-ret));
>              return ret;
>          }
> @@ -194,9 +325,121 @@ static void vfio_save_cleanup(void *opaque)
>      trace_vfio_save_cleanup(vbasedev->name);
>  }
>  
> +static void vfio_save_pending(QEMUFile *f, void *opaque,
> +                              uint64_t threshold_size,
> +                              uint64_t *res_precopy_only,
> +                              uint64_t *res_compatible,
> +                              uint64_t *res_postcopy_only)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    ret = vfio_update_pending(vbasedev);
> +    if (ret) {
> +        return;
> +    }
> +
> +    *res_precopy_only += migration->pending_bytes;
> +
> +    trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
> +                            *res_postcopy_only, *res_compatible);
> +}
> +
> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    int ret, data_size;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +
> +    data_size = vfio_save_buffer(f, vbasedev);
> +
> +    if (data_size < 0) {
> +        error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
> +                     strerror(errno));
> +        return data_size;
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    trace_vfio_save_iterate(vbasedev->name, data_size);
> +    if (data_size == 0) {
> +        /* indicates data finished, goto complete phase */
> +        return 1;
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_RUNNING,
> +                                   VFIO_DEVICE_STATE_SAVING);
> +    if (ret) {
> +        error_report("%s: Failed to set state STOP and SAVING",
> +                     vbasedev->name);
> +        return ret;
> +    }
> +
> +    ret = vfio_save_device_config_state(f, opaque);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_update_pending(vbasedev);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    while (migration->pending_bytes > 0) {
> +        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +        ret = vfio_save_buffer(f, vbasedev);
> +        if (ret < 0) {
> +            error_report("%s: Failed to save buffer", vbasedev->name);
> +            return ret;
> +        } else if (ret == 0) {
> +            break;
> +        }
> +
> +        ret = vfio_update_pending(vbasedev);
> +        if (ret) {
> +            return ret;
> +        }
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_SAVING, 0);
> +    if (ret) {
> +        error_report("%s: Failed to set state STOPPED", vbasedev->name);
> +        return ret;
> +    }
> +
> +    trace_vfio_save_complete_precopy(vbasedev->name);
> +    return ret;
> +}
> +
>  static SaveVMHandlers savevm_vfio_handlers = {
>      .save_setup = vfio_save_setup,
>      .save_cleanup = vfio_save_cleanup,
> +    .save_live_pending = vfio_save_pending,
> +    .save_live_iterate = vfio_save_iterate,
> +    .save_live_complete_precopy = vfio_save_complete_precopy,
>  };
>  
>  /* ---------------------------------------------------------------------- */
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 4bb43f18f315..bdf40ba368c7 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -151,3 +151,9 @@ vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_st
>  vfio_migration_state_notifier(char *name, int state) " (%s) state %d"
>  vfio_save_setup(char *name) " (%s)"
>  vfio_save_cleanup(char *name) " (%s)"
> +vfio_save_buffer(char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
> +vfio_update_pending(char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
> +vfio_save_device_config_state(char *name) " (%s)"
> +vfio_save_pending(char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
> +vfio_save_iterate(char *name, int data_size) " (%s) data_size %d"
> +vfio_save_complete_precopy(char *name) " (%s)"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 28f55f66d019..c78033e4149d 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -60,6 +60,7 @@ typedef struct VFIORegion {
>  
>  typedef struct VFIOMigration {
>      VFIORegion region;
> +    uint64_t pending_bytes;
>  } VFIOMigration;
>  
>  typedef struct VFIOAddressSpace {
> -- 
> 2.7.0
> 



* Re: [PATCH v16 QEMU 09/16] vfio: Add save state functions to SaveVMHandlers
  2020-05-05  4:37       ` Alex Williamson
@ 2020-05-11  9:53         ` Kirti Wankhede
  2020-05-11 15:59           ` Alex Williamson
  2020-05-12  2:06           ` Yan Zhao
  0 siblings, 2 replies; 74+ messages in thread
From: Kirti Wankhede @ 2020-05-11  9:53 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue



On 5/5/2020 10:07 AM, Alex Williamson wrote:
> On Tue, 5 May 2020 04:48:14 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 3/26/2020 3:33 AM, Alex Williamson wrote:
>>> On Wed, 25 Mar 2020 02:39:07 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>    
>>>> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
>>>> functions. These functions handle the pre-copy and stop-and-copy phases.
>>>>
>>>> In _SAVING|_RUNNING device state or pre-copy phase:
>>>> - read pending_bytes. If pending_bytes > 0, go through below steps.
>>>> - read data_offset - indicates kernel driver to write data to staging
>>>>     buffer.
>>>> - read data_size - amount of data in bytes written by vendor driver in
>>>>     migration region.
>>>> - read data_size bytes of data from data_offset in the migration region.
>>>> - Write data packet to file stream as below:
>>>> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
>>>> VFIO_MIG_FLAG_END_OF_STATE }
>>>>
>>>> In _SAVING device state or stop-and-copy phase
>>>> a. read config space of device and save to migration file stream. This
>>>>      doesn't need to be from vendor driver. Any other special config state
>>>>      from driver can be saved as data in following iteration.
>>>> b. read pending_bytes. If pending_bytes > 0, go through below steps.
>>>> c. read data_offset - indicates kernel driver to write data to staging
>>>>      buffer.
>>>> d. read data_size - amount of data in bytes written by vendor driver in
>>>>      migration region.
>>>> e. read data_size bytes of data from data_offset in the migration region.
>>>> f. Write data packet as below:
>>>>      {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
>>>> g. iterate through steps b to f while (pending_bytes > 0)
>>>> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
>>>>
>>>> When the data region is mapped, it is the user's responsibility to read
>>>> data_size bytes of data from data_offset before moving to the next steps.
>>>>
>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>>> ---
>>>>    hw/vfio/migration.c           | 245 +++++++++++++++++++++++++++++++++++++++++-
>>>>    hw/vfio/trace-events          |   6 ++
>>>>    include/hw/vfio/vfio-common.h |   1 +
>>>>    3 files changed, 251 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>> index 033f76526e49..ecbeed5182c2 100644
>>>> --- a/hw/vfio/migration.c
>>>> +++ b/hw/vfio/migration.c
>>>> @@ -138,6 +138,137 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>>>>        return 0;
>>>>    }
>>>>    
>>>> +static void *find_data_region(VFIORegion *region,
>>>> +                              uint64_t data_offset,
>>>> +                              uint64_t data_size)
>>>> +{
>>>> +    void *ptr = NULL;
>>>> +    int i;
>>>> +
>>>> +    for (i = 0; i < region->nr_mmaps; i++) {
>>>> +        if ((data_offset >= region->mmaps[i].offset) &&
>>>> +            (data_offset < region->mmaps[i].offset + region->mmaps[i].size) &&
>>>> +            (data_size <= region->mmaps[i].size)) {
>>>
>>> (data_offset - region->mmaps[i].offset) can be non-zero, so this test
>>> is invalid.  Additionally the uapi does not require that a given data
>>> chunk fits exclusively within an mmap'd area, it may overlap one or
>>> more mmap'd sections of the region, possibly with non-mmap'd areas
>>> included.
>>>    
>>
>> What's the advantage of having mmap and non-mmap overlapped regions?
>> Isn't it better to have data section either mapped or trapped?
> 
> The spec allows for it, therefore we need to support it.  A vendor
> driver might choose to include a header with sequence and checksum
> information for each transaction, they might accomplish this by setting
> data_offset to a trapped area backed by kernel memory followed by an
> area supporting direct mmap to the device.  The target end could then
> fault on writing the header if the sequence information is incorrect.
> A trapped area at the end of the transaction could allow the vendor
> driver to validate a checksum.
> 

If overlapping mmap'd and non-mmap'd areas are allowed, then read()
must be used here: a buffer is allocated, the data is read into it
(first memcpy), and then qemu_put_buffer(f, buf, data_size) is called
(second memcpy).

The advantage of a fully mmap'd data region is that
qemu_put_buffer(f, buf, data_size) uses the pointer into the mmap'd
region directly, which saves one memcpy.
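
For the direct-pointer path to be safe, the lookup would need to confirm
that the whole chunk lies inside a single mmap'd area, not just its first
byte. A rough sketch of such a check follows. struct mmap_area and
find_mapped_chunk() are hypothetical stand-ins for the mmaps[] entries of
VFIORegion and for find_data_region(); a NULL return means falling back
to the read() path.

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified stand-in for one entry of region->mmaps[] (assumed fields). */
struct mmap_area {
    uint64_t offset;   /* offset of the area within the region */
    uint64_t size;     /* length of the area in bytes */
    uint8_t *mmap;     /* mapped address of the area */
};

/*
 * Return a pointer to the chunk only if [data_offset, data_offset + data_size)
 * fits entirely within one mmap'd area; otherwise return NULL.
 * Areas are assumed not to overlap each other.
 */
static void *find_mapped_chunk(const struct mmap_area *areas, int nr_areas,
                               uint64_t data_offset, uint64_t data_size)
{
    for (int i = 0; i < nr_areas; i++) {
        const struct mmap_area *a = &areas[i];
        uint64_t skip;

        if (data_offset < a->offset || data_offset - a->offset >= a->size) {
            continue;   /* chunk does not start in this area */
        }
        skip = data_offset - a->offset;
        /* Written as a subtraction so offset + size cannot overflow. */
        if (data_size <= a->size - skip) {
            return a->mmap + skip;
        }
        return NULL;    /* starts here but runs past the end of the area */
    }
    return NULL;
}
```

With one area at offset 0x1000 of size 0x1000, a chunk at 0x1800 of size
0x1000 passes the original (data_size <= size) test but is rejected here,
because only 0x800 bytes remain in the area from that offset.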

>>>> +            ptr = region->mmaps[i].mmap + (data_offset -
>>>> +                                           region->mmaps[i].offset);
>>>> +            break;
>>>> +        }
>>>> +    }
>>>> +    return ptr;
>>>> +}
>>>> +
>>>> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
>>>> +{
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>> +    VFIORegion *region = &migration->region;
>>>> +    uint64_t data_offset = 0, data_size = 0;
>>>> +    int ret;
>>>> +
>>>> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
>>>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>>>> +                                             data_offset));
>>>> +    if (ret != sizeof(data_offset)) {
>>>> +        error_report("%s: Failed to get migration buffer data offset %d",
>>>> +                     vbasedev->name, ret);
>>>> +        return -EINVAL;
>>>> +    }
>>>> +
>>>> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
>>>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>>>> +                                             data_size));
>>>> +    if (ret != sizeof(data_size)) {
>>>> +        error_report("%s: Failed to get migration buffer data size %d",
>>>> +                     vbasedev->name, ret);
>>>> +        return -EINVAL;
>>>> +    }
>>>> +
>>>> +    if (data_size > 0) {
>>>> +        void *buf = NULL;
>>>> +        bool buffer_mmaped;
>>>> +
>>>> +        if (region->mmaps) {
>>>> +            buf = find_data_region(region, data_offset, data_size);
>>>> +        }
>>>> +
>>>> +        buffer_mmaped = (buf != NULL) ? true : false;
>>>
>>> The ternary is unnecessary, "? true : false" is redundant.
>>>    
>>
>> Removing it.
>>
>>>> +
>>>> +        if (!buffer_mmaped) {
>>>> +            buf = g_try_malloc0(data_size);
>>>
>>> Why do we need zero'd memory?
>>>    
>>
>> Zeroed memory not required, removing 0
>>
>>>> +            if (!buf) {
>>>> +                error_report("%s: Error allocating buffer ", __func__);
>>>> +                return -ENOMEM;
>>>> +            }
>>>> +
>>>> +            ret = pread(vbasedev->fd, buf, data_size,
>>>> +                        region->fd_offset + data_offset);
>>>> +            if (ret != data_size) {
>>>> +                error_report("%s: Failed to get migration data %d",
>>>> +                             vbasedev->name, ret);
>>>> +                g_free(buf);
>>>> +                return -EINVAL;
>>>> +            }
>>>> +        }
>>>> +
>>>> +        qemu_put_be64(f, data_size);
>>>> +        qemu_put_buffer(f, buf, data_size);
>>>
>>> This can segfault when mmap'd given the above assumptions about size
>>> and layout.
>>>    
>>>> +
>>>> +        if (!buffer_mmaped) {
>>>> +            g_free(buf);
>>>> +        }
>>>> +    } else {
>>>> +        qemu_put_be64(f, data_size);
>>>
>>> We insert a zero?  Couldn't we add the section header and end here and
>>> skip it entirely?
>>>    
>>
>> This is used during resuming, data_size 0 indicates end of data.
>>
>>>> +    }
>>>> +
>>>> +    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
>>>> +                           migration->pending_bytes);
>>>> +
>>>> +    ret = qemu_file_get_error(f);
>>>> +    if (ret) {
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    return data_size;
>>>> +}
>>>> +
>>>> +static int vfio_update_pending(VFIODevice *vbasedev)
>>>> +{
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>> +    VFIORegion *region = &migration->region;
>>>> +    uint64_t pending_bytes = 0;
>>>> +    int ret;
>>>> +
>>>> +    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
>>>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>>>> +                                             pending_bytes));
>>>> +    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
>>>> +        error_report("%s: Failed to get pending bytes %d",
>>>> +                     vbasedev->name, ret);
>>>> +        migration->pending_bytes = 0;
>>>> +        return (ret < 0) ? ret : -EINVAL;
>>>> +    }
>>>> +
>>>> +    migration->pending_bytes = pending_bytes;
>>>> +    trace_vfio_update_pending(vbasedev->name, pending_bytes);
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
>>>> +{
>>>> +    VFIODevice *vbasedev = opaque;
>>>> +
>>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
>>>> +
>>>> +    if (vbasedev->ops && vbasedev->ops->vfio_save_config) {
>>>> +        vbasedev->ops->vfio_save_config(vbasedev, f);
>>>> +    }
>>>> +
>>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>>> +
>>>> +    trace_vfio_save_device_config_state(vbasedev->name);
>>>> +
>>>> +    return qemu_file_get_error(f);
>>>> +}
>>>> +
>>>>    /* ---------------------------------------------------------------------- */
>>>>    
>>>>    static int vfio_save_setup(QEMUFile *f, void *opaque)
>>>> @@ -154,7 +285,7 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>>>>            qemu_mutex_unlock_iothread();
>>>>            if (ret) {
>>>>                error_report("%s: Failed to mmap VFIO migration region %d: %s",
>>>> -                         vbasedev->name, migration->region.index,
>>>> +                         vbasedev->name, migration->region.nr,
>>>>                             strerror(-ret));
>>>>                return ret;
>>>>            }
>>>> @@ -194,9 +325,121 @@ static void vfio_save_cleanup(void *opaque)
>>>>        trace_vfio_save_cleanup(vbasedev->name);
>>>>    }
>>>>    
>>>> +static void vfio_save_pending(QEMUFile *f, void *opaque,
>>>> +                              uint64_t threshold_size,
>>>> +                              uint64_t *res_precopy_only,
>>>> +                              uint64_t *res_compatible,
>>>> +                              uint64_t *res_postcopy_only)
>>>> +{
>>>> +    VFIODevice *vbasedev = opaque;
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>> +    int ret;
>>>> +
>>>> +    ret = vfio_update_pending(vbasedev);
>>>> +    if (ret) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    *res_precopy_only += migration->pending_bytes;
>>>> +
>>>> +    trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
>>>> +                            *res_postcopy_only, *res_compatible);
>>>> +}
>>>> +
>>>> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
>>>> +{
>>>> +    VFIODevice *vbasedev = opaque;
>>>> +    int ret, data_size;
>>>> +
>>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
>>>> +
>>>> +    data_size = vfio_save_buffer(f, vbasedev);
>>>> +
>>>> +    if (data_size < 0) {
>>>> +        error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
>>>> +                     strerror(errno));
>>>> +        return data_size;
>>>> +    }
>>>> +
>>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>>> +
>>>> +    ret = qemu_file_get_error(f);
>>>> +    if (ret) {
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    trace_vfio_save_iterate(vbasedev->name, data_size);
>>>> +    if (data_size == 0) {
>>>> +        /* indicates data finished, goto complete phase */
>>>> +        return 1;
>>>
>>> But it's pending_bytes not data_size that indicates we're done.  How do
>>> we get away with ignoring pending_bytes for the save_live_iterate phase?
>>>    
>>
>> This is the requirement documented above qemu_savevm_state_iterate(), which
>> calls .save_live_iterate.
>>
>> /*	
>>    * this function has three return values:
>>    *   negative: there was one error, and we have -errno.
>>    *   0 : We haven't finished, caller have to go again
>>    *   1 : We have finished, we can go to complete phase
>>    */
>> int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy)
>>
>> This is to serialize savevm_state.handlers (or in other words devices).
> 
> I've lost all context on this question in the interim, but I think this
> highlights my question.  We use pending_bytes to know how close we are
> to the end of the stream and data_size to iterate each transaction
> within that stream.  So how does data_size == 0 indicate we've
> completed the current phase?  It seems like pending_bytes should
> indicate that.  Thanks,
> 

Fixing this by re-reading pending_bytes when the cached value is 0 and
returning accordingly:
     if (migration->pending_bytes == 0) {
         ret = vfio_update_pending(vbasedev);
         if (ret) {
             return ret;
         }

         if (migration->pending_bytes == 0) {
             /* indicates data finished, goto complete phase */
             return 1;
         }
     }

Thanks,
Kirti





* Re: [PATCH v16 QEMU 09/16] vfio: Add save state functions to SaveVMHandlers
  2020-05-09  5:31   ` Yan Zhao
@ 2020-05-11 10:22     ` Kirti Wankhede
  2020-05-12  0:50       ` Yan Zhao
  0 siblings, 1 reply; 74+ messages in thread
From: Kirti Wankhede @ 2020-05-11 10:22 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang,  Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue



On 5/9/2020 11:01 AM, Yan Zhao wrote:
> On Wed, Mar 25, 2020 at 05:09:07AM +0800, Kirti Wankhede wrote:
>> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
>> functions. These functions handle the pre-copy and stop-and-copy phases.
>>
>> In _SAVING|_RUNNING device state or pre-copy phase:
>> - read pending_bytes. If pending_bytes > 0, go through below steps.
>> - read data_offset - indicates kernel driver to write data to staging
>>    buffer.
>> - read data_size - amount of data in bytes written by vendor driver in
>>    migration region.
> I think we should swap the order in which data_size and data_offset are
> read. See the next comment below.
> 
>> - read data_size bytes of data from data_offset in the migration region.
>> - Write data packet to file stream as below:
>> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
>> VFIO_MIG_FLAG_END_OF_STATE }
>>
>> In _SAVING device state or stop-and-copy phase
>> a. read config space of device and save to migration file stream. This
>>     doesn't need to be from vendor driver. Any other special config state
>>     from driver can be saved as data in following iteration.
>> b. read pending_bytes. If pending_bytes > 0, go through below steps.
>> c. read data_offset - indicates kernel driver to write data to staging
>>     buffer.
>> d. read data_size - amount of data in bytes written by vendor driver in
>>     migration region.
>> e. read data_size bytes of data from data_offset in the migration region.
>> f. Write data packet as below:
>>     {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
>> g. iterate through steps b to f while (pending_bytes > 0)
>> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
>>
>> When the data region is mapped, it is the user's responsibility to read
>> data_size bytes of data from data_offset before moving to the next steps.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   hw/vfio/migration.c           | 245 +++++++++++++++++++++++++++++++++++++++++-
>>   hw/vfio/trace-events          |   6 ++
>>   include/hw/vfio/vfio-common.h |   1 +
>>   3 files changed, 251 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 033f76526e49..ecbeed5182c2 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -138,6 +138,137 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>>       return 0;
>>   }
>>   
>> +static void *find_data_region(VFIORegion *region,
>> +                              uint64_t data_offset,
>> +                              uint64_t data_size)
>> +{
>> +    void *ptr = NULL;
>> +    int i;
>> +
>> +    for (i = 0; i < region->nr_mmaps; i++) {
>> +        if ((data_offset >= region->mmaps[i].offset) &&
>> +            (data_offset < region->mmaps[i].offset + region->mmaps[i].size) &&
>> +            (data_size <= region->mmaps[i].size)) {
>> +            ptr = region->mmaps[i].mmap + (data_offset -
>> +                                           region->mmaps[i].offset);
>> +            break;
>> +        }
>> +    }
>> +    return ptr;
>> +}
>> +
>> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIORegion *region = &migration->region;
>> +    uint64_t data_offset = 0, data_size = 0;
>> +    int ret;
>> +
>> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             data_offset));
>> +    if (ret != sizeof(data_offset)) {
>> +        error_report("%s: Failed to get migration buffer data offset %d",
>> +                     vbasedev->name, ret);
>> +        return -EINVAL;
>> +    }
>> +
>> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             data_size));
>> +    if (ret != sizeof(data_size)) {
>> +        error_report("%s: Failed to get migration buffer data size %d",
>> +                     vbasedev->name, ret);
>> +        return -EINVAL;
>> +    }
> data_size should be read first, and if it's 0, data_offset will not
> be read further.
> 
> the reasons are below:
> 1. if there's no data region provided by vendor driver, there's no
> reason to get a valid data_offset, so reading/writing of data_offset
> should fail. And this should not be treated as a migration error.
> 
> 2. even if pending_bytes is 0, vfio_save_iterate() is still possible to be
> called and therefore vfio_save_buffer() is called.
> 

As I mentioned in reply to Alex in:
https://lists.nongnu.org/archive/html/qemu-devel/2020-05/msg02476.html

With that change, vfio_save_iterate() re-reads pending_bytes when its
cached value is 0, and calls vfio_save_buffer() only if pending_bytes is
then non-zero. That should resolve both of your concerns above.
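
For anyone reading this without the linked message open, the control flow
can be sketched roughly as below. This is a standalone illustration, not the
patch itself: DemoDevice, DEMO_CHUNK and the demo_* helpers are hypothetical
stand-ins for VFIOMigration, vfio_update_pending() and vfio_save_buffer(),
and decrementing the cached value after each buffer is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CHUNK 4096u  /* hypothetical per-call transfer size */

typedef struct {
    uint64_t driver_pending;  /* what the stand-in vendor driver reports */
    uint64_t cached_pending;  /* plays the role of migration->pending_bytes */
    int buffers_saved;        /* number of demo_save_buffer() calls */
} DemoDevice;

/* Stand-in for vfio_update_pending(): refresh the cached value. */
static int demo_update_pending(DemoDevice *d)
{
    d->cached_pending = d->driver_pending;
    return 0;
}

/* Stand-in for vfio_save_buffer(): consume at most one chunk. */
static int64_t demo_save_buffer(DemoDevice *d)
{
    uint64_t chunk = d->cached_pending < DEMO_CHUNK ? d->cached_pending
                                                    : DEMO_CHUNK;

    d->driver_pending -= chunk;
    d->cached_pending -= chunk;  /* assumption: cache tracks what was sent */
    d->buffers_saved++;
    return (int64_t)chunk;
}

/* Returns 1 when nothing is pending, 0 when another pass is needed. */
static int demo_save_iterate(DemoDevice *d)
{
    int ret;

    assert(d != NULL);

    if (d->cached_pending == 0) {
        ret = demo_update_pending(d);
        if (ret) {
            return ret;
        }
        if (d->cached_pending == 0) {
            return 1;  /* data finished, go to complete phase */
        }
    }

    if (demo_save_buffer(d) < 0) {
        return -1;
    }
    return 0;
}
```

The point of the sketch is only that the buffer is never read when the
driver reports nothing pending.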

Thanks,
Kirti



* Re: [PATCH v16 QEMU 09/16] vfio: Add save state functions to SaveVMHandlers
  2020-05-11  9:53         ` Kirti Wankhede
@ 2020-05-11 15:59           ` Alex Williamson
  2020-05-12  2:06           ` Yan Zhao
  1 sibling, 0 replies; 74+ messages in thread
From: Alex Williamson @ 2020-05-11 15:59 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

On Mon, 11 May 2020 15:23:37 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 5/5/2020 10:07 AM, Alex Williamson wrote:
> > On Tue, 5 May 2020 04:48:14 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 3/26/2020 3:33 AM, Alex Williamson wrote:  
> >>> On Wed, 25 Mar 2020 02:39:07 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>      
> >>>> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
> >>>> functions. These functions handles pre-copy and stop-and-copy phase.
> >>>>
> >>>> In _SAVING|_RUNNING device state or pre-copy phase:
> >>>> - read pending_bytes. If pending_bytes > 0, go through below steps.
> >>>> - read data_offset - indicates kernel driver to write data to staging
> >>>>     buffer.
> >>>> - read data_size - amount of data in bytes written by vendor driver in
> >>>>     migration region.
> >>>> - read data_size bytes of data from data_offset in the migration region.
> >>>> - Write data packet to file stream as below:
> >>>> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> >>>> VFIO_MIG_FLAG_END_OF_STATE }
> >>>>
> >>>> In _SAVING device state or stop-and-copy phase
> >>>> a. read config space of device and save to migration file stream. This
> >>>>      doesn't need to be from vendor driver. Any other special config state
> >>>>      from driver can be saved as data in following iteration.
> >>>> b. read pending_bytes. If pending_bytes > 0, go through below steps.
> >>>> c. read data_offset - indicates kernel driver to write data to staging
> >>>>      buffer.
> >>>> d. read data_size - amount of data in bytes written by vendor driver in
> >>>>      migration region.
> >>>> e. read data_size bytes of data from data_offset in the migration region.
> >>>> f. Write data packet as below:
> >>>>      {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> >>>> g. iterate through steps b to f while (pending_bytes > 0)
> >>>> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
> >>>>
> >>>> When data region is mapped, its user's responsibility to read data from
> >>>> data_offset of data_size before moving to next steps.
> >>>>
> >>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >>>> ---
> >>>>    hw/vfio/migration.c           | 245 +++++++++++++++++++++++++++++++++++++++++-
> >>>>    hw/vfio/trace-events          |   6 ++
> >>>>    include/hw/vfio/vfio-common.h |   1 +
> >>>>    3 files changed, 251 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >>>> index 033f76526e49..ecbeed5182c2 100644
> >>>> --- a/hw/vfio/migration.c
> >>>> +++ b/hw/vfio/migration.c
> >>>> @@ -138,6 +138,137 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
> >>>>        return 0;
> >>>>    }
> >>>>    
> >>>> +static void *find_data_region(VFIORegion *region,
> >>>> +                              uint64_t data_offset,
> >>>> +                              uint64_t data_size)
> >>>> +{
> >>>> +    void *ptr = NULL;
> >>>> +    int i;
> >>>> +
> >>>> +    for (i = 0; i < region->nr_mmaps; i++) {
> >>>> +        if ((data_offset >= region->mmaps[i].offset) &&
> >>>> +            (data_offset < region->mmaps[i].offset + region->mmaps[i].size) &&
> >>>> +            (data_size <= region->mmaps[i].size)) {  
> >>>
> >>> (data_offset - region->mmaps[i].offset) can be non-zero, so this test
> >>> is invalid.  Additionally the uapi does not require that a give data
> >>> chunk fits exclusively within an mmap'd area, it may overlap one or
> >>> more mmap'd sections of the region, possibly with non-mmap'd areas
> >>> included.
> >>>      
> >>
> >> What's the advantage of having mmap and non-mmap overlapped regions?
> >> Isn't it better to have data section either mapped or trapped?  
> > 
> > The spec allows for it, therefore we need to support it.  A vendor
> > driver might choose to include a header with sequence and checksum
> > information for each transaction, they might accomplish this by setting
> > data_offset to a trapped area backed by kernel memory followed by an
> > area supporting direct mmap to the device.  The target end could then
> > fault on writing the header if the sequence information is incorrect.
> > A trapped area at the end of the transaction could allow the vendor
> > driver to validate a checksum.
> >   
> 
> If mmap and non-mmap regions overlapped is allowed then here read() 
> should be used, which means buffer is allocated, then get data in buffer 
> (first memcpy) and then call qemu_put_buffer(f, buf, data_size) (second 
> memcpy)
> 
> Advantage of using full mmaped region for data, qemu_put_buffer(f, buf, 
> data_size) directly uses pointer to mmaped region and so we reduce one 
> memcpy.

If userspace wants to use read() for intermixed data ranges, they may.
Using the mmap access for portions that allow it can improve
efficiency.  None of this changes the fact that the code I pointed out
here has a bug and that bug still exists in v18:

+static void *find_data_region(VFIORegion *region,
+                              uint64_t data_offset,
+                              uint64_t data_size)
+{
+    void *ptr = NULL;
+    int i;
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        if ((data_offset >= region->mmaps[i].offset) &&
+            (data_offset < region->mmaps[i].offset + region->mmaps[i].size) &&

These two tests verify that we have:

region->mmaps[i].offset <= data_offset < (region->mmaps[i].offset + region->mmaps[i].size)

i.e. data_offset falls somewhere within the mmap capable range.

+            (data_size <= region->mmaps[i].size)) {

This makes the wild assumption that data_offset == region->mmaps[i].offset

+            ptr = region->mmaps[i].mmap + (data_offset -
+                                           region->mmaps[i].offset);
+            break;
+        }
+    }
+    return ptr;
+}

Therefore, if the *start* of the data offset falls within an mmap, we
pass back an mmap pointer to the caller who goes on to assume the
entire data_size is available through that pointer.  I think the latter
test needs to be something like:

(data_size <= region->mmaps[i].size - (data_offset - region->mmaps[i].offset))
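
As a standalone illustration of that test (a sketch only, not a drop-in
replacement): DemoMmap and DemoRegion below are hypothetical reductions of
the QEMU structures to the three fields the lookup uses, and the function
name is made up. The second comparison is written as a subtraction so that
offset + size cannot overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the mmap bookkeeping in VFIORegion. */
typedef struct {
    uint64_t offset;  /* offset of the mmap'd area within the region */
    uint64_t size;    /* length of the mmap'd area */
    uint8_t *mmap;    /* userspace mapping of the area */
} DemoMmap;

typedef struct {
    int nr_mmaps;
    DemoMmap *mmaps;
} DemoRegion;

/*
 * Return a pointer only if [data_offset, data_offset + data_size) lies
 * entirely inside a single mmap'd area.  Otherwise return NULL so the
 * caller can fall back to pread().
 */
static void *find_data_region_checked(const DemoRegion *region,
                                      uint64_t data_offset,
                                      uint64_t data_size)
{
    int i;

    assert(region != NULL);

    for (i = 0; i < region->nr_mmaps; i++) {
        const DemoMmap *m = &region->mmaps[i];

        if (data_offset >= m->offset &&
            data_offset - m->offset < m->size &&
            data_size <= m->size - (data_offset - m->offset)) {
            return m->mmap + (data_offset - m->offset);
        }
    }
    return NULL;
}
```

With an area at offset 8 of size 16, a 12 byte chunk at offset 12 is
accepted, while a 13 byte chunk at the same offset is rejected because it
would run one byte past the mapping.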

Or a more optimal solution might be to pass back a size and iterate
over the data in chunks, using mmap for the extents available.  Thanks,

Alex
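
A rough sketch of the chunked alternative mentioned above, under the same
caveats: DemoMmap, DemoRegion and next_extent() are hypothetical names, and
how the caller feeds each extent to the migration stream is left out. The
helper reports one extent at a time, with a pointer when the extent starts
inside an mmap'd area and NULL when it has to be fetched with pread().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the mmap bookkeeping in VFIORegion. */
typedef struct {
    uint64_t offset;
    uint64_t size;
    uint8_t *mmap;
} DemoMmap;

typedef struct {
    int nr_mmaps;
    DemoMmap *mmaps;
} DemoRegion;

/*
 * Describe the extent that starts at data_offset.
 * - If it starts inside an mmap'd area, *ptr points into the mapping and
 *   the returned length is clamped to the end of that area.
 * - Otherwise *ptr is NULL and the returned length runs up to the start of
 *   the next mmap'd area, or to the end of the data if there is none.
 */
static uint64_t next_extent(const DemoRegion *region, uint64_t data_offset,
                            uint64_t remaining, uint8_t **ptr)
{
    uint64_t len = remaining;
    int i;

    assert(region != NULL && ptr != NULL);
    *ptr = NULL;

    for (i = 0; i < region->nr_mmaps; i++) {
        const DemoMmap *m = &region->mmaps[i];

        if (data_offset >= m->offset && data_offset - m->offset < m->size) {
            uint64_t avail = m->size - (data_offset - m->offset);

            *ptr = m->mmap + (data_offset - m->offset);
            return remaining < avail ? remaining : avail;
        }
        if (m->offset > data_offset && m->offset - data_offset < len) {
            len = m->offset - data_offset;  /* stop before the next mapping */
        }
    }
    return len;
}
```

A caller would loop, advancing data_offset and shrinking remaining by the
returned length until remaining reaches 0.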

> >>>> +            ptr = region->mmaps[i].mmap + (data_offset -
> >>>> +                                           region->mmaps[i].offset);
> >>>> +            break;
> >>>> +        }
> >>>> +    }
> >>>> +    return ptr;
> >>>> +}
> >>>> +
> >>>> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
> >>>> +{
> >>>> +    VFIOMigration *migration = vbasedev->migration;
> >>>> +    VFIORegion *region = &migration->region;
> >>>> +    uint64_t data_offset = 0, data_size = 0;
> >>>> +    int ret;
> >>>> +
> >>>> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> >>>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> >>>> +                                             data_offset));
> >>>> +    if (ret != sizeof(data_offset)) {
> >>>> +        error_report("%s: Failed to get migration buffer data offset %d",
> >>>> +                     vbasedev->name, ret);
> >>>> +        return -EINVAL;
> >>>> +    }
> >>>> +
> >>>> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
> >>>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> >>>> +                                             data_size));
> >>>> +    if (ret != sizeof(data_size)) {
> >>>> +        error_report("%s: Failed to get migration buffer data size %d",
> >>>> +                     vbasedev->name, ret);
> >>>> +        return -EINVAL;
> >>>> +    }
> >>>> +
> >>>> +    if (data_size > 0) {
> >>>> +        void *buf = NULL;
> >>>> +        bool buffer_mmaped;
> >>>> +
> >>>> +        if (region->mmaps) {
> >>>> +            buf = find_data_region(region, data_offset, data_size);
> >>>> +        }
> >>>> +
> >>>> +        buffer_mmaped = (buf != NULL) ? true : false;  
> >>>
> >>> The ternary is unnecessary, "? true : false" is redundant.
> >>>      
> >>
> >> Removing it.
> >>  
> >>>> +
> >>>> +        if (!buffer_mmaped) {
> >>>> +            buf = g_try_malloc0(data_size);  
> >>>
> >>> Why do we need zero'd memory?
> >>>      
> >>
> >> Zeroed memory not required, removing 0
> >>  
> >>>> +            if (!buf) {
> >>>> +                error_report("%s: Error allocating buffer ", __func__);
> >>>> +                return -ENOMEM;
> >>>> +            }
> >>>> +
> >>>> +            ret = pread(vbasedev->fd, buf, data_size,
> >>>> +                        region->fd_offset + data_offset);
> >>>> +            if (ret != data_size) {
> >>>> +                error_report("%s: Failed to get migration data %d",
> >>>> +                             vbasedev->name, ret);
> >>>> +                g_free(buf);
> >>>> +                return -EINVAL;
> >>>> +            }
> >>>> +        }
> >>>> +
> >>>> +        qemu_put_be64(f, data_size);
> >>>> +        qemu_put_buffer(f, buf, data_size);  
> >>>
> >>> This can segfault when mmap'd given the above assumptions about size
> >>> and layout.
> >>>      
> >>>> +
> >>>> +        if (!buffer_mmaped) {
> >>>> +            g_free(buf);
> >>>> +        }
> >>>> +    } else {
> >>>> +        qemu_put_be64(f, data_size);  
> >>>
> >>> We insert a zero?  Couldn't we add the section header and end here and
> >>> skip it entirely?
> >>>      
> >>
> >> This is used during resuming, data_size 0 indicates end of data.
> >>  
> >>>> +    }
> >>>> +
> >>>> +    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
> >>>> +                           migration->pending_bytes);
> >>>> +
> >>>> +    ret = qemu_file_get_error(f);
> >>>> +    if (ret) {
> >>>> +        return ret;
> >>>> +    }
> >>>> +
> >>>> +    return data_size;
> >>>> +}
> >>>> +
> >>>> +static int vfio_update_pending(VFIODevice *vbasedev)
> >>>> +{
> >>>> +    VFIOMigration *migration = vbasedev->migration;
> >>>> +    VFIORegion *region = &migration->region;
> >>>> +    uint64_t pending_bytes = 0;
> >>>> +    int ret;
> >>>> +
> >>>> +    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
> >>>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> >>>> +                                             pending_bytes));
> >>>> +    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
> >>>> +        error_report("%s: Failed to get pending bytes %d",
> >>>> +                     vbasedev->name, ret);
> >>>> +        migration->pending_bytes = 0;
> >>>> +        return (ret < 0) ? ret : -EINVAL;
> >>>> +    }
> >>>> +
> >>>> +    migration->pending_bytes = pending_bytes;
> >>>> +    trace_vfio_update_pending(vbasedev->name, pending_bytes);
> >>>> +    return 0;
> >>>> +}
> >>>> +
> >>>> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> >>>> +{
> >>>> +    VFIODevice *vbasedev = opaque;
> >>>> +
> >>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
> >>>> +
> >>>> +    if (vbasedev->ops && vbasedev->ops->vfio_save_config) {
> >>>> +        vbasedev->ops->vfio_save_config(vbasedev, f);
> >>>> +    }
> >>>> +
> >>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> >>>> +
> >>>> +    trace_vfio_save_device_config_state(vbasedev->name);
> >>>> +
> >>>> +    return qemu_file_get_error(f);
> >>>> +}
> >>>> +
> >>>>    /* ---------------------------------------------------------------------- */
> >>>>    
> >>>>    static int vfio_save_setup(QEMUFile *f, void *opaque)
> >>>> @@ -154,7 +285,7 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
> >>>>            qemu_mutex_unlock_iothread();
> >>>>            if (ret) {
> >>>>                error_report("%s: Failed to mmap VFIO migration region %d: %s",
> >>>> -                         vbasedev->name, migration->region.index,
> >>>> +                         vbasedev->name, migration->region.nr,
> >>>>                             strerror(-ret));
> >>>>                return ret;
> >>>>            }
> >>>> @@ -194,9 +325,121 @@ static void vfio_save_cleanup(void *opaque)
> >>>>        trace_vfio_save_cleanup(vbasedev->name);
> >>>>    }
> >>>>    
> >>>> +static void vfio_save_pending(QEMUFile *f, void *opaque,
> >>>> +                              uint64_t threshold_size,
> >>>> +                              uint64_t *res_precopy_only,
> >>>> +                              uint64_t *res_compatible,
> >>>> +                              uint64_t *res_postcopy_only)
> >>>> +{
> >>>> +    VFIODevice *vbasedev = opaque;
> >>>> +    VFIOMigration *migration = vbasedev->migration;
> >>>> +    int ret;
> >>>> +
> >>>> +    ret = vfio_update_pending(vbasedev);
> >>>> +    if (ret) {
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    *res_precopy_only += migration->pending_bytes;
> >>>> +
> >>>> +    trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
> >>>> +                            *res_postcopy_only, *res_compatible);
> >>>> +}
> >>>> +
> >>>> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> >>>> +{
> >>>> +    VFIODevice *vbasedev = opaque;
> >>>> +    int ret, data_size;
> >>>> +
> >>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> >>>> +
> >>>> +    data_size = vfio_save_buffer(f, vbasedev);
> >>>> +
> >>>> +    if (data_size < 0) {
> >>>> +        error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
> >>>> +                     strerror(errno));
> >>>> +        return data_size;
> >>>> +    }
> >>>> +
> >>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> >>>> +
> >>>> +    ret = qemu_file_get_error(f);
> >>>> +    if (ret) {
> >>>> +        return ret;
> >>>> +    }
> >>>> +
> >>>> +    trace_vfio_save_iterate(vbasedev->name, data_size);
> >>>> +    if (data_size == 0) {
> >>>> +        /* indicates data finished, goto complete phase */
> >>>> +        return 1;  
> >>>
> >>> But it's pending_bytes not data_size that indicates we're done.  How do
> >>> we get away with ignoring pending_bytes for the save_live_iterate phase?
> >>>      
> >>
> >> This is requirement mentioned above qemu_savevm_state_iterate() which
> >> calls .save_live_iterate.
> >>
> >> /*	
> >>    * this function has three return values:
> >>    *   negative: there was one error, and we have -errno.
> >>    *   0 : We haven't finished, caller have to go again
> >>    *   1 : We have finished, we can go to complete phase
> >>    */
> >> int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy)
> >>
> >> This is to serialize savevm_state.handlers (or in other words devices).  
> > 
> > I've lost all context on this question in the interim, but I think this
> > highlights my question.  We use pending_bytes to know how close we are
> > to the end of the stream and data_size to iterate each transaction
> > within that stream.  So how does data_size == 0 indicate we've
> > completed the current phase?  It seems like pending_bytes should
> > indicate that.  Thanks,
> >   
> 
> Fixing this by adding a read on pending_bytes if its 0 and return 
> accordingly.
>      if (migration->pending_bytes == 0) {
>          ret = vfio_update_pending(vbasedev);
>          if (ret) {
>              return ret;
>          }
> 
>          if (migration->pending_bytes == 0) {
>              /* indicates data finished, goto complete phase */
>              return 1;
>          }
>      }
> 
> Thanks,
> Kirti
> 
> 




* Re: [PATCH v16 QEMU 09/16] vfio: Add save state functions to SaveVMHandlers
  2020-05-11 10:22     ` Kirti Wankhede
@ 2020-05-12  0:50       ` Yan Zhao
  0 siblings, 0 replies; 74+ messages in thread
From: Yan Zhao @ 2020-05-12  0:50 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Mon, May 11, 2020 at 06:22:47PM +0800, Kirti Wankhede wrote:
> 
> 
> On 5/9/2020 11:01 AM, Yan Zhao wrote:
> > On Wed, Mar 25, 2020 at 05:09:07AM +0800, Kirti Wankhede wrote:
> >> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
> >> functions. These functions handles pre-copy and stop-and-copy phase.
> >>
> >> In _SAVING|_RUNNING device state or pre-copy phase:
> >> - read pending_bytes. If pending_bytes > 0, go through below steps.
> >> - read data_offset - indicates kernel driver to write data to staging
> >>    buffer.
> >> - read data_size - amount of data in bytes written by vendor driver in
> >>    migration region.
> > I think we should change the sequence of reading data_size and
> > data_offset. see the next comment below.
> > 
> >> - read data_size bytes of data from data_offset in the migration region.
> >> - Write data packet to file stream as below:
> >> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> >> VFIO_MIG_FLAG_END_OF_STATE }
> >>
> >> In _SAVING device state or stop-and-copy phase
> >> a. read config space of device and save to migration file stream. This
> >>     doesn't need to be from vendor driver. Any other special config state
> >>     from driver can be saved as data in following iteration.
> >> b. read pending_bytes. If pending_bytes > 0, go through below steps.
> >> c. read data_offset - indicates kernel driver to write data to staging
> >>     buffer.
> >> d. read data_size - amount of data in bytes written by vendor driver in
> >>     migration region.
> >> e. read data_size bytes of data from data_offset in the migration region.
> >> f. Write data packet as below:
> >>     {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> >> g. iterate through steps b to f while (pending_bytes > 0)
> >> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
> >>
> >> When data region is mapped, its user's responsibility to read data from
> >> data_offset of data_size before moving to next steps.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>   hw/vfio/migration.c           | 245 +++++++++++++++++++++++++++++++++++++++++-
> >>   hw/vfio/trace-events          |   6 ++
> >>   include/hw/vfio/vfio-common.h |   1 +
> >>   3 files changed, 251 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >> index 033f76526e49..ecbeed5182c2 100644
> >> --- a/hw/vfio/migration.c
> >> +++ b/hw/vfio/migration.c
> >> @@ -138,6 +138,137 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
> >>       return 0;
> >>   }
> >>   
> >> +static void *find_data_region(VFIORegion *region,
> >> +                              uint64_t data_offset,
> >> +                              uint64_t data_size)
> >> +{
> >> +    void *ptr = NULL;
> >> +    int i;
> >> +
> >> +    for (i = 0; i < region->nr_mmaps; i++) {
> >> +        if ((data_offset >= region->mmaps[i].offset) &&
> >> +            (data_offset < region->mmaps[i].offset + region->mmaps[i].size) &&
> >> +            (data_size <= region->mmaps[i].size)) {
> >> +            ptr = region->mmaps[i].mmap + (data_offset -
> >> +                                           region->mmaps[i].offset);
> >> +            break;
> >> +        }
> >> +    }
> >> +    return ptr;
> >> +}
> >> +
> >> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
> >> +{
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    VFIORegion *region = &migration->region;
> >> +    uint64_t data_offset = 0, data_size = 0;
> >> +    int ret;
> >> +
> >> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> >> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> >> +                                             data_offset));
> >> +    if (ret != sizeof(data_offset)) {
> >> +        error_report("%s: Failed to get migration buffer data offset %d",
> >> +                     vbasedev->name, ret);
> >> +        return -EINVAL;
> >> +    }
> >> +
> >> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
> >> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> >> +                                             data_size));
> >> +    if (ret != sizeof(data_size)) {
> >> +        error_report("%s: Failed to get migration buffer data size %d",
> >> +                     vbasedev->name, ret);
> >> +        return -EINVAL;
> >> +    }
> > data_size should be read first, and if it's 0, data_offset will not
> > be read further.
> > 
> > the reasons are below:
> > 1. if there's no data region provided by vendor driver, there's no
> > reason to get a valid data_offset, so reading/writing of data_offset
> > should fail. And this should not be treated as a migration error.
> > 
> > 2. even if pending_bytes is 0, vfio_save_iterate() is still possible to be
> > called and therefore vfio_save_buffer() is called.
> > 
> 
> As I mentioned in reply to Alex in:
> https://lists.nongnu.org/archive/html/qemu-devel/2020-05/msg02476.html
> 
> With that, vfio_save_iterate() will read pending_bytes if its 0 and then 
> if pending_bytes is not 0 then call vfio_save_buffer(). With that your 
> above concerns should get resolved.
>
What if pending_bytes is not 0, but the vendor driver just does not want to
send data in this iteration? Wouldn't it be right to read data_size first,
before reading data_offset?

Thanks
Yan



* Re: [PATCH v16 QEMU 09/16] vfio: Add save state functions to SaveVMHandlers
  2020-05-11  9:53         ` Kirti Wankhede
  2020-05-11 15:59           ` Alex Williamson
@ 2020-05-12  2:06           ` Yan Zhao
  1 sibling, 0 replies; 74+ messages in thread
From: Yan Zhao @ 2020-05-12  2:06 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, eskultet, Yang, Ziye,
	qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, Alex Williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Mon, May 11, 2020 at 05:53:37PM +0800, Kirti Wankhede wrote:
> 
> 
> On 5/5/2020 10:07 AM, Alex Williamson wrote:
> > On Tue, 5 May 2020 04:48:14 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > 
> >> On 3/26/2020 3:33 AM, Alex Williamson wrote:
> >>> On Wed, 25 Mar 2020 02:39:07 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>    

<...>

> >>>> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> >>>> +{
> >>>> +    VFIODevice *vbasedev = opaque;
> >>>> +    int ret, data_size;
> >>>> +
> >>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> >>>> +
> >>>> +    data_size = vfio_save_buffer(f, vbasedev);
> >>>> +
> >>>> +    if (data_size < 0) {
> >>>> +        error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
> >>>> +                     strerror(errno));
> >>>> +        return data_size;
> >>>> +    }
> >>>> +
> >>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> >>>> +
> >>>> +    ret = qemu_file_get_error(f);
> >>>> +    if (ret) {
> >>>> +        return ret;
> >>>> +    }
> >>>> +
> >>>> +    trace_vfio_save_iterate(vbasedev->name, data_size);
> >>>> +    if (data_size == 0) {
> >>>> +        /* indicates data finished, goto complete phase */
> >>>> +        return 1;
> >>>
> >>> But it's pending_bytes not data_size that indicates we're done.  How do
> >>> we get away with ignoring pending_bytes for the save_live_iterate phase?
> >>>    
> >>
> >> This is requirement mentioned above qemu_savevm_state_iterate() which
> >> calls .save_live_iterate.
> >>
> >> /*	
> >>    * this function has three return values:
> >>    *   negative: there was one error, and we have -errno.
> >>    *   0 : We haven't finished, caller have to go again
> >>    *   1 : We have finished, we can go to complete phase
> >>    */
> >> int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy)
> >>
> >> This is to serialize savevm_state.handlers (or in other words devices).
> > 
> > I've lost all context on this question in the interim, but I think this
> > highlights my question.  We use pending_bytes to know how close we are
> > to the end of the stream and data_size to iterate each transaction
> > within that stream.  So how does data_size == 0 indicate we've
> > completed the current phase?  It seems like pending_bytes should
> > indicate that.  Thanks,
> > 
> 
> Fixing this by adding a read on pending_bytes if its 0 and return 
> accordingly.
>      if (migration->pending_bytes == 0) {
>          ret = vfio_update_pending(vbasedev);
>          if (ret) {
>              return ret;
>          }
> 
>          if (migration->pending_bytes == 0) {
>              /* indicates data finished, goto complete phase */
>              return 1;
>          }
>      }
> 

Just a question: if 1 is only returned when migration->pending_bytes is 0,
does that mean .save_live_iterate of the vmstates after "vfio-pci"
would never be called until migration->pending_bytes is 0?

As in qemu_savevm_state_iterate():

qemu_savevm_state_iterate {
    ...
    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
        ...
        ret = se->ops->save_live_iterate(f, se->opaque);
        ...
        if (ret <= 0) {
            /* Do not proceed to the next vmstate before this one reported
               completion of the current stage. This serializes the migration
               and reduces the probability that a faster changing state is
               synchronized over and over again. */
            break;
        }
    }
    return ret;
}

In the RAM migration code, pending_bytes (remaining_size) is only updated in
ram_save_pending() when it is below the threshold, which means that
pending_bytes can be 0 in ram_save_iterate(), so the other vmstates get
their chance to be called.
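
To make the concern concrete, here is a small standalone model of the loop
quoted above. It only mirrors the early break on ret <= 0. DemoHandler and
the demo_* functions are hypothetical and leave out everything else the
real function does.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*DemoIterateFn)(void *opaque);

/* Hypothetical stand-in for one entry of savevm_state.handlers. */
typedef struct {
    DemoIterateFn save_live_iterate;
    void *opaque;
    int calls;  /* how many times this handler was invoked */
} DemoHandler;

/* Walk the handlers in order and stop at the first one that is not done. */
static int demo_state_iterate(DemoHandler *handlers, size_t n)
{
    int ret = 1;
    size_t i;

    assert(handlers != NULL);

    for (i = 0; i < n; i++) {
        handlers[i].calls++;
        ret = handlers[i].save_live_iterate(handlers[i].opaque);
        if (ret <= 0) {
            break;  /* later handlers are not reached in this pass */
        }
    }
    return ret;
}

/* A handler that still has data pending, like vfio-pci with pending_bytes > 0. */
static int demo_iterate_not_done(void *opaque)
{
    (void)opaque;
    return 0;
}

/* A handler that reports completion of the current stage. */
static int demo_iterate_done(void *opaque)
{
    (void)opaque;
    return 1;
}
```

If the first handler keeps returning 0, the second one is never called in
that pass, which is the behaviour I am asking about.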

Thanks
Yan




* Re: [PATCH v16 QEMU 05/16] vfio: Add migration region initialization and finalize function
  2020-05-04 23:19     ` Kirti Wankhede
@ 2020-05-19 19:32       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 74+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-19 19:32 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, eskultet, ziye.yang,
	cohuck, shuangtai.tst, qemu-devel, zhi.a.wang, mlevitsk, pasic,
	aik, alex.williamson, eauger, felipe, jonathan.davies,
	yan.y.zhao, changpeng.liu, Ken.Xue

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> 
> 
> On 3/26/2020 11:22 PM, Dr. David Alan Gilbert wrote:
> > * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> > > - Migration functions are implemented for VFIO_DEVICE_TYPE_PCI device in this
> > >    patch series.
> > > - VFIO device supports migration or not is decided based of migration region
> > >    query. If migration region query is successful and migration region
> > >    initialization is successful then migration is supported else migration is
> > >    blocked.
> > > 
> > > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > > ---
> > >   hw/vfio/Makefile.objs         |   2 +-
> > >   hw/vfio/migration.c           | 138 ++++++++++++++++++++++++++++++++++++++++++
> > >   hw/vfio/trace-events          |   3 +
> > >   include/hw/vfio/vfio-common.h |   9 +++
> > >   4 files changed, 151 insertions(+), 1 deletion(-)
> > >   create mode 100644 hw/vfio/migration.c
> > > 
> > > diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> > > index 9bb1c09e8477..8b296c889ed9 100644
> > > --- a/hw/vfio/Makefile.objs
> > > +++ b/hw/vfio/Makefile.objs
> > > @@ -1,4 +1,4 @@
> > > -obj-y += common.o spapr.o
> > > +obj-y += common.o spapr.o migration.o
> > >   obj-$(CONFIG_VFIO_PCI) += pci.o pci-quirks.o display.o
> > >   obj-$(CONFIG_VFIO_CCW) += ccw.o
> > >   obj-$(CONFIG_VFIO_PLATFORM) += platform.o
> > > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > > new file mode 100644
> > > index 000000000000..a078dcf1dd8f
> > > --- /dev/null
> > > +++ b/hw/vfio/migration.c
> > > @@ -0,0 +1,138 @@
> > > +/*
> > > + * Migration support for VFIO devices
> > > + *
> > > + * Copyright NVIDIA, Inc. 2019
> > 
> > Time flies by...
> > 
> > > + *
> > > + * This work is licensed under the terms of the GNU GPL, version 2. See
> > > + * the COPYING file in the top-level directory.
> > 
> > Are you sure you want this to be V2 only? Most code added to qemu now is
> > v2 or later.
> > 
> 
> I kept it same as in files vfio-pci and hw/vfio/common.c
> 
> Should it be different? Can you give some reference what it should be?

It's OK to be v2; there's a statement in LICENSE saying hw/vfio is
acceptable for that; most other new code we try and make
'version 2 or later' - if you want it as 2 that is fine though.

Dave

> Thanks,
> Kirti
> 
> > > + */
> > > +
> > > +#include "qemu/osdep.h"
> > > +#include <linux/vfio.h>
> > > +
> > > +#include "hw/vfio/vfio-common.h"
> > > +#include "cpu.h"
> > > +#include "migration/migration.h"
> > > +#include "migration/qemu-file.h"
> > > +#include "migration/register.h"
> > > +#include "migration/blocker.h"
> > > +#include "migration/misc.h"
> > > +#include "qapi/error.h"
> > > +#include "exec/ramlist.h"
> > > +#include "exec/ram_addr.h"
> > > +#include "pci.h"
> > > +#include "trace.h"
> > > +
> > > +static void vfio_migration_region_exit(VFIODevice *vbasedev)
> > > +{
> > > +    VFIOMigration *migration = vbasedev->migration;
> > > +
> > > +    if (!migration) {
> > > +        return;
> > > +    }
> > > +
> > > +    if (migration->region.size) {
> > > +        vfio_region_exit(&migration->region);
> > > +        vfio_region_finalize(&migration->region);
> > > +    }
> > > +}
> > > +
> > > +static int vfio_migration_region_init(VFIODevice *vbasedev, int index)
> > > +{
> > > +    VFIOMigration *migration = vbasedev->migration;
> > > +    Object *obj = NULL;
> > > +    int ret = -EINVAL;
> > > +
> > > +    if (!vbasedev->ops->vfio_get_object) {
> > > +        return ret;
> > > +    }
> > > +
> > > +    obj = vbasedev->ops->vfio_get_object(vbasedev);
> > > +    if (!obj) {
> > > +        return ret;
> > > +    }
> > > +
> > > +    ret = vfio_region_setup(obj, vbasedev, &migration->region, index,
> > > +                            "migration");
> > > +    if (ret) {
> > > +        error_report("%s: Failed to setup VFIO migration region %d: %s",
> > > +                     vbasedev->name, index, strerror(-ret));
> > > +        goto err;
> > > +    }
> > > +
> > > +    if (!migration->region.size) {
> > > +        ret = -EINVAL;
> > > +        error_report("%s: Invalid region size of VFIO migration region %d: %s",
> > > +                     vbasedev->name, index, strerror(-ret));
> > > +        goto err;
> > > +    }
> > > +
> > > +    return 0;
> > > +
> > > +err:
> > > +    vfio_migration_region_exit(vbasedev);
> > > +    return ret;
> > > +}
> > > +
> > > +static int vfio_migration_init(VFIODevice *vbasedev,
> > > +                               struct vfio_region_info *info)
> > > +{
> > > +    int ret;
> > > +
> > > +    vbasedev->migration = g_new0(VFIOMigration, 1);
> > > +
> > > +    ret = vfio_migration_region_init(vbasedev, info->index);
> > > +    if (ret) {
> > > +        error_report("%s: Failed to initialise migration region",
> > > +                     vbasedev->name);
> > > +        g_free(vbasedev->migration);
> > > +        vbasedev->migration = NULL;
> > > +        return ret;
> > > +    }
> > > +
> > > +    return 0;
> > > +}
> > > +
> > > +/* ---------------------------------------------------------------------- */
> > > +
> > > +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
> > > +{
> > > +    struct vfio_region_info *info;
> > > +    Error *local_err = NULL;
> > > +    int ret;
> > > +
> > > +    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
> > > +                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
> > > +    if (ret) {
> > > +        goto add_blocker;
> > > +    }
> > > +
> > > +    ret = vfio_migration_init(vbasedev, info);
> > > +    if (ret) {
> > > +        goto add_blocker;
> > > +    }
> > > +
> > > +    trace_vfio_migration_probe(vbasedev->name, info->index);
> > > +    return 0;
> > > +
> > > +add_blocker:
> > > +    error_setg(&vbasedev->migration_blocker,
> > > +               "VFIO device doesn't support migration");
> > > +    ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
> > > +    if (local_err) {
> > > +        error_propagate(errp, local_err);
> > > +        error_free(vbasedev->migration_blocker);
> > > +    }
> > > +    return ret;
> > > +}
> > > +
> > > +void vfio_migration_finalize(VFIODevice *vbasedev)
> > > +{
> > > +    if (vbasedev->migration_blocker) {
> > > +        migrate_del_blocker(vbasedev->migration_blocker);
> > > +        error_free(vbasedev->migration_blocker);
> > > +    }
> > > +
> > > +    vfio_migration_region_exit(vbasedev);
> > > +    g_free(vbasedev->migration);
> > > +}
> > > diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> > > index 8cdc27946cb8..191a726a1312 100644
> > > --- a/hw/vfio/trace-events
> > > +++ b/hw/vfio/trace-events
> > > @@ -143,3 +143,6 @@ vfio_display_edid_link_up(void) ""
> > >   vfio_display_edid_link_down(void) ""
> > >   vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
> > >   vfio_display_edid_write_error(void) ""
> > > +
> > > +# migration.c
> > > +vfio_migration_probe(char *name, uint32_t index) " (%s) Region %d"
> > > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> > > index d69a7f3ae31e..d4b268641173 100644
> > > --- a/include/hw/vfio/vfio-common.h
> > > +++ b/include/hw/vfio/vfio-common.h
> > > @@ -57,6 +57,10 @@ typedef struct VFIORegion {
> > >       uint8_t nr; /* cache the region number for debug */
> > >   } VFIORegion;
> > > +typedef struct VFIOMigration {
> > > +    VFIORegion region;
> > > +} VFIOMigration;
> > > +
> > >   typedef struct VFIOAddressSpace {
> > >       AddressSpace *as;
> > >       QLIST_HEAD(, VFIOContainer) containers;
> > > @@ -113,6 +117,8 @@ typedef struct VFIODevice {
> > >       unsigned int num_irqs;
> > >       unsigned int num_regions;
> > >       unsigned int flags;
> > > +    VFIOMigration *migration;
> > > +    Error *migration_blocker;
> > >   } VFIODevice;
> > >   struct VFIODeviceOps {
> > > @@ -204,4 +210,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
> > >   int vfio_spapr_remove_window(VFIOContainer *container,
> > >                                hwaddr offset_within_address_space);
> > > +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
> > > +void vfio_migration_finalize(VFIODevice *vbasedev);
> > > +
> > >   #endif /* HW_VFIO_VFIO_COMMON_H */
> > > -- 
> > > 2.7.0
> > > 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




Thread overview: 74+ messages
2020-03-24 21:08 [PATCH v16 QEMU 00/16] Add migration support for VFIO devices Kirti Wankhede
2020-03-24 21:08 ` [PATCH v16 QEMU 01/16] vfio: KABI for migration interface - Kernel header placeholder Kirti Wankhede
2020-03-24 21:09 ` [PATCH v16 QEMU 02/16] vfio: Add function to unmap VFIO region Kirti Wankhede
2020-03-24 21:09 ` [PATCH v16 QEMU 03/16] vfio: Add vfio_get_object callback to VFIODeviceOps Kirti Wankhede
2020-03-24 21:09 ` [PATCH v16 QEMU 04/16] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
2020-03-25 19:56   ` Alex Williamson
2020-03-26 17:29     ` Dr. David Alan Gilbert
2020-03-26 17:38       ` Alex Williamson
2020-05-04 23:18     ` Kirti Wankhede
2020-05-05  4:37       ` Alex Williamson
2020-05-06  6:11         ` Yan Zhao
2020-05-06 19:48           ` Kirti Wankhede
2020-05-06 20:03             ` Alex Williamson
2020-05-07  5:40               ` Kirti Wankhede
2020-05-07 18:14                 ` Alex Williamson
2020-03-26 17:46   ` Dr. David Alan Gilbert
2020-05-04 23:19     ` Kirti Wankhede
2020-04-07  4:10   ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2020-05-04 23:21     ` Kirti Wankhede
2020-03-24 21:09 ` [PATCH v16 QEMU 05/16] vfio: Add migration region initialization and finalize function Kirti Wankhede
2020-03-26 17:52   ` Dr. David Alan Gilbert
2020-05-04 23:19     ` Kirti Wankhede
2020-05-19 19:32       ` Dr. David Alan Gilbert
2020-03-24 21:09 ` [PATCH v16 QEMU 06/16] vfio: Add VM state change handler to know state of VM Kirti Wankhede
2020-03-24 21:09 ` [PATCH v16 QEMU 07/16] vfio: Add migration state change notifier Kirti Wankhede
2020-04-01 11:27   ` Dr. David Alan Gilbert
2020-05-04 23:20     ` Kirti Wankhede
2020-03-24 21:09 ` [PATCH v16 QEMU 08/16] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
2020-03-25 21:02   ` Alex Williamson
2020-05-04 23:19     ` Kirti Wankhede
2020-05-05  4:37       ` Alex Williamson
2020-05-06  6:38         ` Yan Zhao
2020-05-06  9:58           ` Cornelia Huck
2020-05-06 16:53             ` Dr. David Alan Gilbert
2020-05-06 19:30               ` Kirti Wankhede
2020-05-07  6:37                 ` Cornelia Huck
2020-05-07 20:29                 ` Alex Williamson
2020-04-01 17:36   ` Dr. David Alan Gilbert
2020-05-04 23:20     ` Kirti Wankhede
2020-03-24 21:09 ` [PATCH v16 QEMU 09/16] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
2020-03-25 22:03   ` Alex Williamson
2020-05-04 23:18     ` Kirti Wankhede
2020-05-05  4:37       ` Alex Williamson
2020-05-11  9:53         ` Kirti Wankhede
2020-05-11 15:59           ` Alex Williamson
2020-05-12  2:06           ` Yan Zhao
2020-05-09  5:31   ` Yan Zhao
2020-05-11 10:22     ` Kirti Wankhede
2020-05-12  0:50       ` Yan Zhao
2020-03-24 21:09 ` [PATCH v16 QEMU 10/16] vfio: Add load " Kirti Wankhede
2020-03-25 22:36   ` Alex Williamson
2020-04-01 18:58   ` Dr. David Alan Gilbert
2020-05-04 23:20     ` Kirti Wankhede
2020-03-24 21:09 ` [PATCH v16 QEMU 11/16] iommu: add callback to get address limit IOMMU supports Kirti Wankhede
2020-03-24 21:09 ` [PATCH v16 QEMU 12/16] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled Kirti Wankhede
2020-04-01 19:00   ` Dr. David Alan Gilbert
2020-04-01 19:42     ` Alex Williamson
2020-03-24 21:09 ` [PATCH v16 QEMU 13/16] vfio: Add function to start and stop dirty pages tracking Kirti Wankhede
2020-03-26 19:10   ` Alex Williamson
2020-05-04 23:20     ` Kirti Wankhede
2020-04-01 19:03   ` Dr. David Alan Gilbert
2020-05-04 23:21     ` Kirti Wankhede
2020-03-24 21:09 ` [PATCH v16 QEMU 14/16] vfio: Add vfio_listener_log_sync to mark dirty pages Kirti Wankhede
2020-03-25  2:19   ` Yan Zhao
2020-03-26 19:46   ` Alex Williamson
2020-04-01 19:08     ` Dr. David Alan Gilbert
2020-04-01  5:50   ` Yan Zhao
2020-04-03 20:11     ` Kirti Wankhede
2020-03-24 21:09 ` [PATCH v16 QEMU 15/16] vfio: Add ioctl to get dirty pages bitmap during dma unmap Kirti Wankhede
2020-03-24 21:09 ` [PATCH v16 QEMU 16/16] vfio: Make vfio-pci device migration capable Kirti Wankhede
2020-03-24 23:36 ` [PATCH v16 QEMU 00/16] Add migration support for VFIO devices no-reply
2020-03-31 18:34 ` Alex Williamson
2020-04-01  6:41   ` Yan Zhao
2020-04-01 18:34     ` Alex Williamson
