qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH QEMU v23 00/18] Add migration support for VFIO devices
@ 2020-05-20 18:24 Kirti Wankhede
  2020-05-20 18:24 ` [PATCH QEMU v23 01/18] vfio: KABI for migration interface - Kernel header placeholder Kirti Wankhede
                   ` (17 more replies)
  0 siblings, 18 replies; 39+ messages in thread
From: Kirti Wankhede @ 2020-05-20 18:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Hi,

I had not yet addressed all the concerns/comments raised on previous versions
of QEMU. I'm sending this QEMU side code to make code available to test other
devices with latest kernel UAPI. I will revisit comments on previous QEMU
patches soon.

This Patch set adds migration support for VFIO devices in QEMU.

This Patch set include patches as below:
Patch 1:
- Define KABI for VFIO device for migration support for device state and newly
  added ioctl definations to get dirty pages bitmap. This is a placeholder
  patch.

Patch 2-4:
- Few code refactor
- Added save and restore functions for PCI configuration space

Patch 5-10:
- Generic migration functionality for VFIO device.
  * This patch set adds functionality only for PCI devices, but can be
    extended to other VFIO devices.
  * Added all the basic functions required for pre-copy, stop-and-copy and
    resume phases of migration.
  * Added state change notifier and from that notifier function, VFIO
    device's state changed is conveyed to VFIO device driver.
  * During save setup phase and resume/load setup phase, migration region
    is queried and is used to read/write VFIO device data.
  * .save_live_pending and .save_live_iterate are implemented to use QEMU's
    functionality of iteration during pre-copy phase.
  * In .save_live_complete_precopy, that is in stop-and-copy phase,
    iteration to read data from VFIO device driver is implemented till pending
    bytes returned by driver are not zero.

Patch 11-12
- Add helper function for migration with vIOMMU enabled to get address limit
  IOMMU supports.
- Set DIRTY_MEMORY_MIGRATION flag in dirty log mask for migration with vIOMMU
  enabled.

Patch 13-14:
- Add function to start and stop dirty pages tracking.
- Add vfio_listerner_log_sync to mark dirty pages. Dirty pages bitmap is queried
  per container. All pages pinned by vendor driver through vfio_pin_pages
  external API has to be marked as dirty during  migration.
  When there are CPU writes, CPU dirty page tracking can identify dirtied
  pages, but any page pinned by vendor driver can also be written by
  device. As of now there is no device which has hardware support for
  dirty page tracking. So all pages which are pinned by vendor driver
  should be considered as dirty.
  In Qemu, marking pages dirty is only done when device is in stop-and-copy
  phase because if pages are marked dirty during pre-copy phase and content is
  transfered from source to distination, there is no way to know newly dirtied
  pages from the point they were copied earlier until device stops. To avoid
  repeated copy of same content, pinned pages are marked dirty only during
  stop-and-copy phase.

Patch 15:
- Get migration capability from kernel module.

Patch 16:
- With vIOMMU, IO virtual address range can get unmapped while in pre-copy
  phase of migration. In that case, unmap ioctl should return pages pinned
  in that range and QEMU should report corresponding guest physical pages
  dirty.

Patch 17:
- Make VFIO PCI device migration capable. If migration region is not provided by
  driver, migration is blocked.

Patch 18:
- Added VFIO device stats to MigrationInfo

Yet TODO:
Since there is no device which has hardware support for system memmory
dirty bitmap tracking, right now there is no other API from vendor driver
to VFIO IOMMU module to report dirty pages. In future, when such hardware
support will be implemented, an API will be required in kernel such that
vendor driver could report dirty pages to VFIO module during migration phases.

Below is the flow of state change for live migration where states in brackets
represent VM state, migration state and VFIO device state as:
    (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)

Live migration save path:
        QEMU normal running state
        (RUNNING, _NONE, _RUNNING)
                        |
    migrate_init spawns migration_thread.
    (RUNNING, _SETUP, _RUNNING|_SAVING)
    Migration thread then calls each device's .save_setup()
                        |
    (RUNNING, _ACTIVE, _RUNNING|_SAVING)
    If device is active, get pending bytes by .save_live_pending()
    if pending bytes >= threshold_size,  call save_live_iterate()
    Data of VFIO device for pre-copy phase is copied.
    Iterate till pending bytes converge and are less than threshold
                        |
    On migration completion, vCPUs stops and calls .save_live_complete_precopy
    for each active device. VFIO device is then transitioned in
     _SAVING state.
    (FINISH_MIGRATE, _DEVICE, _SAVING)
    For VFIO device, iterate in  .save_live_complete_precopy  until
    pending data is 0.
    (FINISH_MIGRATE, _DEVICE, _STOPPED)
                        |
    (FINISH_MIGRATE, _COMPLETED, STOPPED)
    Migraton thread schedule cleanup bottom half and exit

Live migration resume path:
    Incomming migration calls .load_setup for each device
    (RESTORE_VM, _ACTIVE, STOPPED)
                        |
    For each device, .load_state is called for that device section data
                        |
    At the end, called .load_cleanup for each device and vCPUs are started.
                        |
        (RUNNING, _NONE, _RUNNING)

Note that:
- Migration post copy is not supported.

v22 -> v23
-- Fixed issue reported by Yan
https://lore.kernel.org/kvm/97977ede-3c5b-c5a5-7858-7eecd7dd531c@nvidia.com/
- Sending this version to test v23 kernel version patches:
https://lore.kernel.org/kvm/1589998088-3250-1-git-send-email-kwankhede@nvidia.com/

v18 -> v22
- Few fixes from v18 review. But not yet fixed all concerns. I'll address those
  concerns in subsequent iterations.
- Sending this version to test v22 kernel version patches:
https://lore.kernel.org/kvm/1589781397-28368-1-git-send-email-kwankhede@nvidia.com/

v16 -> v18
- Nit fixes
- Get migration capability flags from container
- Added VFIO stats to MigrationInfo
- Fixed bug reported by Yan
    https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg00004.html

v9 -> v16
- KABI almost finalised on kernel patches.
- Added support for migration with vIOMMU enabled.

v8 -> v9:
- Split patch set in 2 sets, Kernel and QEMU sets.
- Dirty pages bitmap is queried from IOMMU container rather than from
  vendor driver for per device. Added 2 ioctls to achieve this.

v7 -> v8:
- Updated comments for KABI
- Added BAR address validation check during PCI device's config space load as
  suggested by Dr. David Alan Gilbert.
- Changed vfio_migration_set_state() to set or clear device state flags.
- Some nit fixes.

v6 -> v7:
- Fix build failures.

v5 -> v6:
- Fix build failure.

v4 -> v5:
- Added decriptive comment about the sequence of access of members of structure
  vfio_device_migration_info to be followed based on Alex's suggestion
- Updated get dirty pages sequence.
- As per Cornelia Huck's suggestion, added callbacks to VFIODeviceOps to
  get_object, save_config and load_config.
- Fixed multiple nit picks.
- Tested live migration with multiple vfio device assigned to a VM.

v3 -> v4:
- Added one more bit for _RESUMING flag to be set explicitly.
- data_offset field is read-only for user space application.
- data_size is read for every iteration before reading data from migration, that
  is removed assumption that data will be till end of migration region.
- If vendor driver supports mappable sparsed region, map those region during
  setup state of save/load, similarly unmap those from cleanup routines.
- Handles race condition that causes data corruption in migration region during
  save device state by adding mutex and serialiaing save_buffer and
  get_dirty_pages routines.
- Skip called get_dirty_pages routine for mapped MMIO region of device.
- Added trace events.
- Splitted into multiple functional patches.

v2 -> v3:
- Removed enum of VFIO device states. Defined VFIO device state with 2 bits.
- Re-structured vfio_device_migration_info to keep it minimal and defined action
  on read and write access on its members.

v1 -> v2:
- Defined MIGRATION region type and sub-type which should be used with region
  type capability.
- Re-structured vfio_device_migration_info. This structure will be placed at 0th
  offset of migration region.
- Replaced ioctl with read/write for trapped part of migration region.
- Added both type of access support, trapped or mmapped, for data section of the
  region.
- Moved PCI device functions to pci file.
- Added iteration to get dirty page bitmap until bitmap for all requested pages
  are copied.

Thanks,
Kirti




Kirti Wankhede (18):
  vfio: KABI for migration interface - Kernel header placeholder
  vfio: Add function to unmap VFIO region
  vfio: Add vfio_get_object callback to VFIODeviceOps
  vfio: Add save and load functions for VFIO PCI devices
  vfio: Add migration region initialization and finalize function
  vfio: Add VM state change handler to know state of VM
  vfio: Add migration state change notifier
  vfio: Register SaveVMHandlers for VFIO device
  vfio: Add save state functions to SaveVMHandlers
  vfio: Add load state functions to SaveVMHandlers
  iommu: add callback to get address limit IOMMU supports
  memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled
  vfio: Get migration capability flags for container
  vfio: Add function to start and stop dirty pages tracking
  vfio: Add vfio_listener_log_sync to mark dirty pages
  vfio: Add ioctl to get dirty pages bitmap during dma unmap.
  vfio: Make vfio-pci device migration capable
  qapi: Add VFIO devices migration stats in Migration stats

 hw/i386/intel_iommu.c         |   9 +
 hw/vfio/Makefile.objs         |   2 +-
 hw/vfio/common.c              | 411 +++++++++++++++++++--
 hw/vfio/migration.c           | 804 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                 | 203 +++++++++--
 hw/vfio/pci.h                 |   1 -
 hw/vfio/trace-events          |  19 +
 include/exec/memory.h         |  18 +
 include/hw/vfio/vfio-common.h |  22 ++
 include/qemu/vfio-helpers.h   |   3 +
 linux-headers/linux/vfio.h    | 318 ++++++++++++++++-
 memory.c                      |  13 +-
 migration/migration.c         |  18 +
 monitor/hmp-cmds.c            |   6 +
 qapi/migration.json           |  19 +-
 15 files changed, 1818 insertions(+), 48 deletions(-)
 create mode 100644 hw/vfio/migration.c

-- 
2.7.0



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH QEMU v23 01/18] vfio: KABI for migration interface - Kernel header placeholder
  2020-05-20 18:24 [PATCH QEMU v23 00/18] Add migration support for VFIO devices Kirti Wankhede
@ 2020-05-20 18:24 ` Kirti Wankhede
  2020-05-20 18:24 ` [PATCH QEMU v23 02/18] vfio: Add function to unmap VFIO region Kirti Wankhede
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 39+ messages in thread
From: Kirti Wankhede @ 2020-05-20 18:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Kernel header patches are being reviewed along with kernel side changes.
This patch is only for place holder.

This patch include all changes in vfio.h from above patch set

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 linux-headers/linux/vfio.h | 318 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 316 insertions(+), 2 deletions(-)

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index a41c45286511..7076d48e1ec0 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
 #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
 #define VFIO_REGION_TYPE_GFX                    (1)
 #define VFIO_REGION_TYPE_CCW			(2)
+#define VFIO_REGION_TYPE_MIGRATION              (3)
 
 /* sub-types for VFIO_REGION_TYPE_PCI_* */
 
@@ -379,6 +380,232 @@ struct vfio_region_gfx_edid {
 /* sub-types for VFIO_REGION_TYPE_CCW */
 #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
 
+/* sub-types for VFIO_REGION_TYPE_MIGRATION */
+#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
+
+/*
+ * The structure vfio_device_migration_info is placed at the 0th offset of
+ * the VFIO_REGION_SUBTYPE_MIGRATION region to get and set VFIO device related
+ * migration information. Field accesses from this structure are only supported
+ * at their native width and alignment. Otherwise, the result is undefined and
+ * vendor drivers should return an error.
+ *
+ * device_state: (read/write)
+ *      - The user application writes to this field to inform the vendor driver
+ *        about the device state to be transitioned to.
+ *      - The vendor driver should take the necessary actions to change the
+ *        device state. After successful transition to a given state, the
+ *        vendor driver should return success on write(device_state, state)
+ *        system call. If the device state transition fails, the vendor driver
+ *        should return an appropriate -errno for the fault condition.
+ *      - On the user application side, if the device state transition fails,
+ *	  that is, if write(device_state, state) returns an error, read
+ *	  device_state again to determine the current state of the device from
+ *	  the vendor driver.
+ *      - The vendor driver should return previous state of the device unless
+ *        the vendor driver has encountered an internal error, in which case
+ *        the vendor driver may report the device_state VFIO_DEVICE_STATE_ERROR.
+ *      - The user application must use the device reset ioctl to recover the
+ *        device from VFIO_DEVICE_STATE_ERROR state. If the device is
+ *        indicated to be in a valid device state by reading device_state, the
+ *        user application may attempt to transition the device to any valid
+ *        state reachable from the current state or terminate itself.
+ *
+ *      device_state consists of 3 bits:
+ *      - If bit 0 is set, it indicates the _RUNNING state. If bit 0 is clear,
+ *        it indicates the _STOP state. When the device state is changed to
+ *        _STOP, driver should stop the device before write() returns.
+ *      - If bit 1 is set, it indicates the _SAVING state, which means that the
+ *        driver should start gathering device state information that will be
+ *        provided to the VFIO user application to save the device's state.
+ *      - If bit 2 is set, it indicates the _RESUMING state, which means that
+ *        the driver should prepare to resume the device. Data provided through
+ *        the migration region should be used to resume the device.
+ *      Bits 3 - 31 are reserved for future use. To preserve them, the user
+ *      application should perform a read-modify-write operation on this
+ *      field when modifying the specified bits.
+ *
+ *  +------- _RESUMING
+ *  |+------ _SAVING
+ *  ||+----- _RUNNING
+ *  |||
+ *  000b => Device Stopped, not saving or resuming
+ *  001b => Device running, which is the default state
+ *  010b => Stop the device & save the device state, stop-and-copy state
+ *  011b => Device running and save the device state, pre-copy state
+ *  100b => Device stopped and the device state is resuming
+ *  101b => Invalid state
+ *  110b => Error state
+ *  111b => Invalid state
+ *
+ * State transitions:
+ *
+ *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
+ *                (100b)     (001b)     (011b)        (010b)       (000b)
+ * 0. Running or default state
+ *                             |
+ *
+ * 1. Normal Shutdown (optional)
+ *                             |------------------------------------->|
+ *
+ * 2. Save the state or suspend
+ *                             |------------------------->|---------->|
+ *
+ * 3. Save the state during live migration
+ *                             |----------->|------------>|---------->|
+ *
+ * 4. Resuming
+ *                  |<---------|
+ *
+ * 5. Resumed
+ *                  |--------->|
+ *
+ * 0. Default state of VFIO device is _RUNNNG when the user application starts.
+ * 1. During normal shutdown of the user application, the user application may
+ *    optionally change the VFIO device state from _RUNNING to _STOP. This
+ *    transition is optional. The vendor driver must support this transition but
+ *    must not require it.
+ * 2. When the user application saves state or suspends the application, the
+ *    device state transitions from _RUNNING to stop-and-copy and then to _STOP.
+ *    On state transition from _RUNNING to stop-and-copy, driver must stop the
+ *    device, save the device state and send it to the application through the
+ *    migration region. The sequence to be followed for such transition is given
+ *    below.
+ * 3. In live migration of user application, the state transitions from _RUNNING
+ *    to pre-copy, to stop-and-copy, and to _STOP.
+ *    On state transition from _RUNNING to pre-copy, the driver should start
+ *    gathering the device state while the application is still running and send
+ *    the device state data to application through the migration region.
+ *    On state transition from pre-copy to stop-and-copy, the driver must stop
+ *    the device, save the device state and send it to the user application
+ *    through the migration region.
+ *    Vendor drivers must support the pre-copy state even for implementations
+ *    where no data is provided to the user before the stop-and-copy state. The
+ *    user must not be required to consume all migration data before the device
+ *    transitions to a new state, including the stop-and-copy state.
+ *    The sequence to be followed for above two transitions is given below.
+ * 4. To start the resuming phase, the device state should be transitioned from
+ *    the _RUNNING to the _RESUMING state.
+ *    In the _RESUMING state, the driver should use the device state data
+ *    received through the migration region to resume the device.
+ * 5. After providing saved device data to the driver, the application should
+ *    change the state from _RESUMING to _RUNNING.
+ *
+ * reserved:
+ *      Reads on this field return zero and writes are ignored.
+ *
+ * pending_bytes: (read only)
+ *      The number of pending bytes still to be migrated from the vendor driver.
+ *
+ * data_offset: (read only)
+ *      The user application should read data_offset in the migration region
+ *      from where the user application should read the device data during the
+ *      _SAVING state or write the device data during the _RESUMING state. See
+ *      below for details of sequence to be followed.
+ *
+ * data_size: (read/write)
+ *      The user application should read data_size to get the size in bytes of
+ *      the data copied in the migration region during the _SAVING state and
+ *      write the size in bytes of the data copied in the migration region
+ *      during the _RESUMING state.
+ *
+ * The format of the migration region is as follows:
+ *  ------------------------------------------------------------------
+ * |vfio_device_migration_info|    data section                      |
+ * |                          |     ///////////////////////////////  |
+ * ------------------------------------------------------------------
+ *   ^                              ^
+ *  offset 0-trapped part        data_offset
+ *
+ * The structure vfio_device_migration_info is always followed by the data
+ * section in the region, so data_offset will always be nonzero. The offset
+ * from where the data is copied is decided by the kernel driver. The data
+ * section can be trapped, mapped, or partitioned, depending on how the kernel
+ * driver defines the data section. The data section partition can be defined
+ * as mapped by the sparse mmap capability. If mmapped, data_offset should be
+ * page aligned, whereas initial section which contains the
+ * vfio_device_migration_info structure, might not end at the offset, which is
+ * page aligned. The user is not required to access through mmap regardless
+ * of the capabilities of the region mmap.
+ * The vendor driver should determine whether and how to partition the data
+ * section. The vendor driver should return data_offset accordingly.
+ *
+ * The sequence to be followed for the _SAVING|_RUNNING device state or
+ * pre-copy phase and for the _SAVING device state or stop-and-copy phase is as
+ * follows:
+ * a. Read pending_bytes, indicating the start of a new iteration to get device
+ *    data. Repeated read on pending_bytes at this stage should have no side
+ *    effects.
+ *    If pending_bytes == 0, the user application should not iterate to get data
+ *    for that device.
+ *    If pending_bytes > 0, perform the following steps.
+ * b. Read data_offset, indicating that the vendor driver should make data
+ *    available through the data section. The vendor driver should return this
+ *    read operation only after data is available from (region + data_offset)
+ *    to (region + data_offset + data_size).
+ * c. Read data_size, which is the amount of data in bytes available through
+ *    the migration region.
+ *    Read on data_offset and data_size should return the offset and size of
+ *    the current buffer if the user application reads data_offset and
+ *    data_size more than once here.
+ * d. Read data_size bytes of data from (region + data_offset) from the
+ *    migration region.
+ * e. Process the data.
+ * f. Read pending_bytes, which indicates that the data from the previous
+ *    iteration has been read. If pending_bytes > 0, go to step b.
+ *
+ * If an error occurs during the above sequence, the vendor driver can return
+ * an error code for next read() or write() operation, which will terminate the
+ * loop. The user application should then take the next necessary action, for
+ * example, failing migration or terminating the user application.
+ *
+ * The user application can transition from the _SAVING|_RUNNING
+ * (pre-copy state) to the _SAVING (stop-and-copy) state regardless of the
+ * number of pending bytes. The user application should iterate in _SAVING
+ * (stop-and-copy) until pending_bytes is 0.
+ *
+ * The sequence to be followed while _RESUMING device state is as follows:
+ * While data for this device is available, repeat the following steps:
+ * a. Read data_offset from where the user application should write data.
+ * b. Write migration data starting at the migration region + data_offset for
+ *    the length determined by data_size from the migration source.
+ * c. Write data_size, which indicates to the vendor driver that data is
+ *    written in the migration region. Vendor driver should apply the
+ *    user-provided migration region data to the device resume state.
+ *
+ * For the user application, data is opaque. The user application should write
+ * data in the same order as the data is received and the data should be of
+ * same transaction size at the source.
+ */
+
+struct vfio_device_migration_info {
+	__u32 device_state;         /* VFIO device state */
+#define VFIO_DEVICE_STATE_STOP      (0)
+#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
+#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
+#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
+#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
+				     VFIO_DEVICE_STATE_SAVING |  \
+				     VFIO_DEVICE_STATE_RESUMING)
+
+#define VFIO_DEVICE_STATE_VALID(state) \
+	(state & VFIO_DEVICE_STATE_RESUMING ? \
+	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
+
+#define VFIO_DEVICE_STATE_IS_ERROR(state) \
+	((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \
+					      VFIO_DEVICE_STATE_RESUMING))
+
+#define VFIO_DEVICE_STATE_SET_ERROR(state) \
+	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_SATE_SAVING | \
+					     VFIO_DEVICE_STATE_RESUMING)
+
+	__u32 reserved;
+	__u64 pending_bytes;
+	__u64 data_offset;
+	__u64 data_size;
+} __attribute__((packed));
+
 /*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within
@@ -757,8 +984,8 @@ struct vfio_device_feature {
 struct vfio_iommu_type1_info {
 	__u32	argsz;
 	__u32	flags;
-#define VFIO_IOMMU_INFO_PGSIZES (1 << 0)	/* supported page sizes info */
-#define VFIO_IOMMU_INFO_CAPS	(1 << 1)	/* Info supports caps */
+#define VFIO_IOMMU_INFO_PGSIZES   (1 << 0) /* supported page sizes info */
+#define VFIO_IOMMU_INFO_CAPS      (1 << 1) /* Info supports caps */
 	__u64	iova_pgsizes;	/* Bitmap of supported page sizes */
 	__u32   cap_offset;	/* Offset within info struct of first cap */
 };
@@ -785,6 +1012,28 @@ struct vfio_iommu_type1_info_cap_iova_range {
 	struct	vfio_iova_range iova_ranges[];
 };
 
+/*
+ * The migration capability allows to report supported features for migration.
+ *
+ * The structures below define version 1 of this capability.
+ *
+ * The existence of this capability indicates IOMMU kernel driver supports
+ * dirty page tracking.
+ *
+ * pgsize_bitmap: Kernel driver returns supported page sizes bitmap for dirty
+ * page tracking.
+ * max_dirty_bitmap_size: Kernel driver returns maximum supported dirty bitmap
+ * size in bytes to be used by user application for ioctls to get dirty bitmap.
+ */
+#define VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION  1
+
+struct vfio_iommu_type1_info_cap_migration {
+        struct  vfio_info_cap_header header;
+        __u32   flags;
+        __u64   pgsize_bitmap;
+        __u64   max_dirty_bitmap_size;          /* in bytes */
+};
+
 #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
 
 /**
@@ -805,6 +1054,12 @@ struct vfio_iommu_type1_dma_map {
 
 #define VFIO_IOMMU_MAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 13)
 
+struct vfio_bitmap {
+    __u64        pgsize;           /* page size for bitmap */
+    __u64        size;             /* in bytes */
+    __u64        *data;     /* one bit per page */
+};
+
 /**
  * VFIO_IOMMU_UNMAP_DMA - _IOWR(VFIO_TYPE, VFIO_BASE + 14,
  *							struct vfio_dma_unmap)
@@ -814,12 +1069,23 @@ struct vfio_iommu_type1_dma_map {
  * field.  No guarantee is made to the user that arbitrary unmaps of iova
  * or size different from those used in the original mapping call will
  * succeed.
+ * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get dirty bitmap
+ * before unmapping IO virtual addresses. When this flag is set, user must
+ * provide data[] as structure vfio_bitmap. User must allocate memory to get
+ * bitmap, clear the bitmap memory by setting zero and must set size of
+ * allocated memory in vfio_bitmap.size field. One bit in bitmap
+ * represents per page, page of user provided page size in 'pgsize',
+ * consecutively starting from iova offset. Bit set indicates page at that
+ * offset from iova is dirty. Bitmap of pages in the range of unmapped size is
+ * returned in vfio_bitmap.data
  */
 struct vfio_iommu_type1_dma_unmap {
 	__u32	argsz;
 	__u32	flags;
+#define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
 	__u64	iova;				/* IO virtual address */
 	__u64	size;				/* Size of mapping (bytes) */
+        __u8    data[];
 };
 
 #define VFIO_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
@@ -831,6 +1097,54 @@ struct vfio_iommu_type1_dma_unmap {
 #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
 #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
 
+/**
+ * VFIO_IOMMU_DIRTY_PAGES - _IOWR(VFIO_TYPE, VFIO_BASE + 17,
+ *                                     struct vfio_iommu_type1_dirty_bitmap)
+ * IOCTL is used for dirty pages tracking. Caller sets argsz, which is size of
+ * struct vfio_iommu_type1_dirty_bitmap. Caller set flag depend on which
+ * operation to perform, details as below:
+ *
+ * When IOCTL is called with VFIO_IOMMU_DIRTY_PAGES_FLAG_START set, indicates
+ * migration is active and IOMMU module should track pages which are pinned and
+ * could be dirtied by device.
+ * Dirty pages are tracked until tracking is stopped by user application by
+ * setting VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP flag.
+ *
+ * When IOCTL is called with VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP set, indicates
+ * IOMMU should stop tracking pinned pages.
+ *
+ * When IOCTL is called with VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP flag set,
+ * IOCTL returns dirty pages bitmap for IOMMU container during migration for
+ * given IOVA range. User must provide data[] as the structure
+ * vfio_iommu_type1_dirty_bitmap_get through which user provides IOVA range and
+ * pgsize. This interface supports to get bitmap of smallest supported pgsize
+ * only and can be modified in future to get bitmap of specified pgsize.
+ * User must allocate memory for bitmap, zero the bitmap memory and set size
+ * of allocated memory in bitmap_size field. One bit is used to represent one
+ * page consecutively starting from iova offset. User should provide page size
+ * in 'pgsize'. Bit set in bitmap indicates page at that offset from iova is
+ * dirty. Caller must set argsz including size of structure
+ * vfio_iommu_type1_dirty_bitmap_get.
+ *
+ * Only one flag should be set at a time.
+ */
+struct vfio_iommu_type1_dirty_bitmap {
+	__u32        argsz;
+	__u32        flags;
+#define VFIO_IOMMU_DIRTY_PAGES_FLAG_START	(1 << 0)
+#define VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP	(1 << 1)
+#define VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP	(1 << 2)
+	__u8         data[];
+};
+
+struct vfio_iommu_type1_dirty_bitmap_get {
+    __u64              iova;	/* IO virtual address */
+    __u64              size;	/* Size of iova range */
+    struct vfio_bitmap bitmap;
+};
+
+#define VFIO_IOMMU_DIRTY_PAGES             _IO(VFIO_TYPE, VFIO_BASE + 17)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH QEMU v23 02/18] vfio: Add function to unmap VFIO region
  2020-05-20 18:24 [PATCH QEMU v23 00/18] Add migration support for VFIO devices Kirti Wankhede
  2020-05-20 18:24 ` [PATCH QEMU v23 01/18] vfio: KABI for migration interface - Kernel header placeholder Kirti Wankhede
@ 2020-05-20 18:24 ` Kirti Wankhede
  2020-05-20 18:24 ` [PATCH QEMU v23 03/18] vfio: Add vfio_get_object callback to VFIODeviceOps Kirti Wankhede
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 39+ messages in thread
From: Kirti Wankhede @ 2020-05-20 18:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

This function will be used for migration region.
Migration region is mmaped when migration starts and will be unmapped when
migration is complete.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
---
 hw/vfio/common.c              | 32 ++++++++++++++++++++++++++++----
 hw/vfio/trace-events          |  1 +
 include/hw/vfio/vfio-common.h |  1 +
 3 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0b3593b3c0c4..90e9a854d82c 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -925,6 +925,18 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
     return 0;
 }
 
+static void vfio_subregion_unmap(VFIORegion *region, int index)
+{
+    trace_vfio_region_unmap(memory_region_name(&region->mmaps[index].mem),
+                            region->mmaps[index].offset,
+                            region->mmaps[index].offset +
+                            region->mmaps[index].size - 1);
+    memory_region_del_subregion(region->mem, &region->mmaps[index].mem);
+    munmap(region->mmaps[index].mmap, region->mmaps[index].size);
+    object_unparent(OBJECT(&region->mmaps[index].mem));
+    region->mmaps[index].mmap = NULL;
+}
+
 int vfio_region_mmap(VFIORegion *region)
 {
     int i, prot = 0;
@@ -955,10 +967,7 @@ int vfio_region_mmap(VFIORegion *region)
             region->mmaps[i].mmap = NULL;
 
             for (i--; i >= 0; i--) {
-                memory_region_del_subregion(region->mem, &region->mmaps[i].mem);
-                munmap(region->mmaps[i].mmap, region->mmaps[i].size);
-                object_unparent(OBJECT(&region->mmaps[i].mem));
-                region->mmaps[i].mmap = NULL;
+                vfio_subregion_unmap(region, i);
             }
 
             return ret;
@@ -983,6 +992,21 @@ int vfio_region_mmap(VFIORegion *region)
     return 0;
 }
 
+void vfio_region_unmap(VFIORegion *region)
+{
+    int i;
+
+    if (!region->mem) {
+        return;
+    }
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        if (region->mmaps[i].mmap) {
+            vfio_subregion_unmap(region, i);
+        }
+    }
+}
+
 void vfio_region_exit(VFIORegion *region)
 {
     int i;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index b1ef55a33ffd..8cdc27946cb8 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -111,6 +111,7 @@ vfio_region_mmap(const char *name, unsigned long offset, unsigned long end) "Reg
 vfio_region_exit(const char *name, int index) "Device %s, region %d"
 vfio_region_finalize(const char *name, int index) "Device %s, region %d"
 vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
+vfio_region_unmap(const char *name, unsigned long offset, unsigned long end) "Region %s unmap [0x%lx - 0x%lx]"
 vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Device %s region %d: %d sparse mmap entries"
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index fd564209ac71..8d7a0fbb1046 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -171,6 +171,7 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
                       int index, const char *name);
 int vfio_region_mmap(VFIORegion *region);
 void vfio_region_mmaps_set_enabled(VFIORegion *region, bool enabled);
+void vfio_region_unmap(VFIORegion *region);
 void vfio_region_exit(VFIORegion *region);
 void vfio_region_finalize(VFIORegion *region);
 void vfio_reset_handler(void *opaque);
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH QEMU v23 03/18] vfio: Add vfio_get_object callback to VFIODeviceOps
  2020-05-20 18:24 [PATCH QEMU v23 00/18] Add migration support for VFIO devices Kirti Wankhede
  2020-05-20 18:24 ` [PATCH QEMU v23 01/18] vfio: KABI for migration interface - Kernel header placeholder Kirti Wankhede
  2020-05-20 18:24 ` [PATCH QEMU v23 02/18] vfio: Add function to unmap VFIO region Kirti Wankhede
@ 2020-05-20 18:24 ` Kirti Wankhede
  2020-05-20 18:24 ` [PATCH QEMU v23 04/18] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 39+ messages in thread
From: Kirti Wankhede @ 2020-05-20 18:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Hook vfio_get_object callback for PCI devices.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Suggested-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
---
 hw/vfio/pci.c                 | 8 ++++++++
 include/hw/vfio/vfio-common.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 342dd6b9129d..0514ba373d1c 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2407,10 +2407,18 @@ static void vfio_pci_compute_needs_reset(VFIODevice *vbasedev)
     }
 }
 
+static Object *vfio_pci_get_object(VFIODevice *vbasedev)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+
+    return OBJECT(vdev);
+}
+
 static VFIODeviceOps vfio_pci_ops = {
     .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
     .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
     .vfio_eoi = vfio_intx_eoi,
+    .vfio_get_object = vfio_pci_get_object,
 };
 
 int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 8d7a0fbb1046..74261feaeac9 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -119,6 +119,7 @@ struct VFIODeviceOps {
     void (*vfio_compute_needs_reset)(VFIODevice *vdev);
     int (*vfio_hot_reset_multi)(VFIODevice *vdev);
     void (*vfio_eoi)(VFIODevice *vdev);
+    Object *(*vfio_get_object)(VFIODevice *vdev);
 };
 
 typedef struct VFIOGroup {
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH QEMU v23 04/18] vfio: Add save and load functions for VFIO PCI devices
  2020-05-20 18:24 [PATCH QEMU v23 00/18] Add migration support for VFIO devices Kirti Wankhede
                   ` (2 preceding siblings ...)
  2020-05-20 18:24 ` [PATCH QEMU v23 03/18] vfio: Add vfio_get_object callback to VFIODeviceOps Kirti Wankhede
@ 2020-05-20 18:24 ` Kirti Wankhede
  2020-05-21  9:50   ` Dr. David Alan Gilbert
  2020-05-20 18:24 ` [PATCH QEMU v23 05/18] vfio: Add migration region initialization and finalize function Kirti Wankhede
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Kirti Wankhede @ 2020-05-20 18:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

These functions save and restore PCI device specific data - config
space of PCI device.
Tested save and restore with MSI and MSIX type.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/pci.c                 | 163 ++++++++++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h |   2 +
 2 files changed, 165 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 0514ba373d1c..94535f0e27cd 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -41,6 +41,7 @@
 #include "trace.h"
 #include "qapi/error.h"
 #include "migration/blocker.h"
+#include "migration/qemu-file.h"
 
 #define TYPE_VFIO_PCI "vfio-pci"
 #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
@@ -1632,6 +1633,50 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
     }
 }
 
+static int vfio_bar_validate(VFIOPCIDevice *vdev, int nr)
+{
+    PCIDevice *pdev = &vdev->pdev;
+    VFIOBAR *bar = &vdev->bars[nr];
+    uint64_t addr;
+    uint32_t addr_lo, addr_hi = 0;
+
+    /* Skip unimplemented BARs and the upper half of 64bit BARS. */
+    if (!bar->size) {
+        return 0;
+    }
+
+    addr_lo = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + nr * 4, 4);
+
+    addr_lo &= (bar->ioport ? PCI_BASE_ADDRESS_IO_MASK :
+                              PCI_BASE_ADDRESS_MEM_MASK);
+    if (bar->type == PCI_BASE_ADDRESS_MEM_TYPE_64) {
+        addr_hi = pci_default_read_config(pdev,
+                                         PCI_BASE_ADDRESS_0 + (nr + 1) * 4, 4);
+    }
+
+    addr = ((uint64_t)addr_hi << 32) | addr_lo;
+
+    if (!QEMU_IS_ALIGNED(addr, bar->size)) {
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
+static int vfio_bars_validate(VFIOPCIDevice *vdev)
+{
+    int i, ret;
+
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        ret = vfio_bar_validate(vdev, i);
+        if (ret) {
+            error_report("vfio: BAR address %d validation failed", i);
+            return ret;
+        }
+    }
+    return 0;
+}
+
 static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
 {
     VFIOBAR *bar = &vdev->bars[nr];
@@ -2414,11 +2459,129 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
     return OBJECT(vdev);
 }
 
+static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+    PCIDevice *pdev = &vdev->pdev;
+    uint16_t pci_cmd;
+    int i;
+
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        uint32_t bar;
+
+        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
+        qemu_put_be32(f, bar);
+    }
+
+    qemu_put_be32(f, vdev->interrupt);
+    if (vdev->interrupt == VFIO_INT_MSI) {
+        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
+        bool msi_64bit;
+
+        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+                                            2);
+        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
+
+        msi_addr_lo = pci_default_read_config(pdev,
+                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
+        qemu_put_be32(f, msi_addr_lo);
+
+        if (msi_64bit) {
+            msi_addr_hi = pci_default_read_config(pdev,
+                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
+                                             4);
+        }
+        qemu_put_be32(f, msi_addr_hi);
+
+        msi_data = pci_default_read_config(pdev,
+                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
+                2);
+        qemu_put_be16(f, msi_data);
+    } else if (vdev->interrupt == VFIO_INT_MSIX) {
+        uint16_t offset;
+
+        /* save enable bit and maskall bit */
+        offset = pci_default_read_config(pdev,
+                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
+        qemu_put_be16(f, offset);
+        msix_save(pdev, f);
+    }
+    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
+    qemu_put_be16(f, pci_cmd);
+}
+
+static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+    PCIDevice *pdev = &vdev->pdev;
+    uint32_t interrupt_type;
+    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
+    uint16_t pci_cmd;
+    bool msi_64bit;
+    int i, ret;
+
+    /* retore pci bar configuration */
+    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
+    vfio_pci_write_config(pdev, PCI_COMMAND,
+                        pci_cmd & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        uint32_t bar = qemu_get_be32(f);
+
+        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
+    }
+
+    ret = vfio_bars_validate(vdev);
+    if (ret) {
+        return ret;
+    }
+
+    interrupt_type = qemu_get_be32(f);
+
+    if (interrupt_type == VFIO_INT_MSI) {
+        /* restore msi configuration */
+        msi_flags = pci_default_read_config(pdev,
+                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
+        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
+
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+                              msi_flags & (!PCI_MSI_FLAGS_ENABLE), 2);
+
+        msi_addr_lo = qemu_get_be32(f);
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
+                              msi_addr_lo, 4);
+
+        msi_addr_hi = qemu_get_be32(f);
+        if (msi_64bit) {
+            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
+                                  msi_addr_hi, 4);
+        }
+        msi_data = qemu_get_be16(f);
+        vfio_pci_write_config(pdev,
+                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
+                msi_data, 2);
+
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
+    } else if (interrupt_type == VFIO_INT_MSIX) {
+        uint16_t offset = qemu_get_be16(f);
+
+        /* load enable bit and maskall bit */
+        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
+                              offset, 2);
+        msix_load(pdev, f);
+    }
+    pci_cmd = qemu_get_be16(f);
+    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
+    return 0;
+}
+
 static VFIODeviceOps vfio_pci_ops = {
     .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
     .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
     .vfio_eoi = vfio_intx_eoi,
     .vfio_get_object = vfio_pci_get_object,
+    .vfio_save_config = vfio_pci_save_config,
+    .vfio_load_config = vfio_pci_load_config,
 };
 
 int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 74261feaeac9..d69a7f3ae31e 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -120,6 +120,8 @@ struct VFIODeviceOps {
     int (*vfio_hot_reset_multi)(VFIODevice *vdev);
     void (*vfio_eoi)(VFIODevice *vdev);
     Object *(*vfio_get_object)(VFIODevice *vdev);
+    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
+    int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
 };
 
 typedef struct VFIOGroup {
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH QEMU v23 05/18] vfio: Add migration region initialization and finalize function
  2020-05-20 18:24 [PATCH QEMU v23 00/18] Add migration support for VFIO devices Kirti Wankhede
                   ` (3 preceding siblings ...)
  2020-05-20 18:24 ` [PATCH QEMU v23 04/18] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
@ 2020-05-20 18:24 ` Kirti Wankhede
  2020-05-21  9:59   ` Dr. David Alan Gilbert
  2020-05-20 18:24 ` [PATCH QEMU v23 06/18] vfio: Add VM state change handler to know state of VM Kirti Wankhede
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Kirti Wankhede @ 2020-05-20 18:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

- Migration functions are implemented for VFIO_DEVICE_TYPE_PCI device in this
  patch series.
- VFIO device supports migration or not is decided based of migration region
  query. If migration region query is successful and migration region
  initialization is successful then migration is supported else migration is
  blocked.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/Makefile.objs         |   2 +-
 hw/vfio/migration.c           | 138 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events          |   3 +
 include/hw/vfio/vfio-common.h |   9 +++
 4 files changed, 151 insertions(+), 1 deletion(-)
 create mode 100644 hw/vfio/migration.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index 9bb1c09e8477..8b296c889ed9 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,4 +1,4 @@
-obj-y += common.o spapr.o
+obj-y += common.o spapr.o migration.o
 obj-$(CONFIG_VFIO_PCI) += pci.o pci-quirks.o display.o
 obj-$(CONFIG_VFIO_CCW) += ccw.o
 obj-$(CONFIG_VFIO_PLATFORM) += platform.o
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
new file mode 100644
index 000000000000..bf9384907ec0
--- /dev/null
+++ b/hw/vfio/migration.c
@@ -0,0 +1,138 @@
+/*
+ * Migration support for VFIO devices
+ *
+ * Copyright NVIDIA, Inc. 2020
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <linux/vfio.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "cpu.h"
+#include "migration/migration.h"
+#include "migration/qemu-file.h"
+#include "migration/register.h"
+#include "migration/blocker.h"
+#include "migration/misc.h"
+#include "qapi/error.h"
+#include "exec/ramlist.h"
+#include "exec/ram_addr.h"
+#include "pci.h"
+#include "trace.h"
+
+static void vfio_migration_region_exit(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+
+    if (!migration) {
+        return;
+    }
+
+    if (migration->region.size) {
+        vfio_region_exit(&migration->region);
+        vfio_region_finalize(&migration->region);
+    }
+}
+
+static int vfio_migration_region_init(VFIODevice *vbasedev, int index)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    Object *obj = NULL;
+    int ret = -EINVAL;
+
+    if (!vbasedev->ops->vfio_get_object) {
+        return ret;
+    }
+
+    obj = vbasedev->ops->vfio_get_object(vbasedev);
+    if (!obj) {
+        return ret;
+    }
+
+    ret = vfio_region_setup(obj, vbasedev, &migration->region, index,
+                            "migration");
+    if (ret) {
+        error_report("%s: Failed to setup VFIO migration region %d: %s",
+                     vbasedev->name, index, strerror(-ret));
+        goto err;
+    }
+
+    if (!migration->region.size) {
+        ret = -EINVAL;
+        error_report("%s: Invalid region size of VFIO migration region %d: %s",
+                     vbasedev->name, index, strerror(-ret));
+        goto err;
+    }
+
+    return 0;
+
+err:
+    vfio_migration_region_exit(vbasedev);
+    return ret;
+}
+
+static int vfio_migration_init(VFIODevice *vbasedev,
+                               struct vfio_region_info *info)
+{
+    int ret;
+
+    vbasedev->migration = g_new0(VFIOMigration, 1);
+
+    ret = vfio_migration_region_init(vbasedev, info->index);
+    if (ret) {
+        error_report("%s: Failed to initialise migration region",
+                     vbasedev->name);
+        g_free(vbasedev->migration);
+        vbasedev->migration = NULL;
+        return ret;
+    }
+
+    return 0;
+}
+
+/* ---------------------------------------------------------------------- */
+
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
+{
+    struct vfio_region_info *info;
+    Error *local_err = NULL;
+    int ret;
+
+    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
+                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
+    if (ret) {
+        goto add_blocker;
+    }
+
+    ret = vfio_migration_init(vbasedev, info);
+    if (ret) {
+        goto add_blocker;
+    }
+
+    trace_vfio_migration_probe(vbasedev->name, info->index);
+    return 0;
+
+add_blocker:
+    error_setg(&vbasedev->migration_blocker,
+               "VFIO device doesn't support migration");
+    ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        error_free(vbasedev->migration_blocker);
+    }
+    return ret;
+}
+
+void vfio_migration_finalize(VFIODevice *vbasedev)
+{
+    if (vbasedev->migration_blocker) {
+        migrate_del_blocker(vbasedev->migration_blocker);
+        error_free(vbasedev->migration_blocker);
+    }
+
+    vfio_migration_region_exit(vbasedev);
+    g_free(vbasedev->migration);
+}
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 8cdc27946cb8..fd034ac53684 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -143,3 +143,6 @@ vfio_display_edid_link_up(void) ""
 vfio_display_edid_link_down(void) ""
 vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
 vfio_display_edid_write_error(void) ""
+
+# migration.c
+vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index d69a7f3ae31e..d4b268641173 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -57,6 +57,10 @@ typedef struct VFIORegion {
     uint8_t nr; /* cache the region number for debug */
 } VFIORegion;
 
+typedef struct VFIOMigration {
+    VFIORegion region;
+} VFIOMigration;
+
 typedef struct VFIOAddressSpace {
     AddressSpace *as;
     QLIST_HEAD(, VFIOContainer) containers;
@@ -113,6 +117,8 @@ typedef struct VFIODevice {
     unsigned int num_irqs;
     unsigned int num_regions;
     unsigned int flags;
+    VFIOMigration *migration;
+    Error *migration_blocker;
 } VFIODevice;
 
 struct VFIODeviceOps {
@@ -204,4 +210,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
 int vfio_spapr_remove_window(VFIOContainer *container,
                              hwaddr offset_within_address_space);
 
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
+void vfio_migration_finalize(VFIODevice *vbasedev);
+
 #endif /* HW_VFIO_VFIO_COMMON_H */
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH QEMU v23 06/18] vfio: Add VM state change handler to know state of VM
  2020-05-20 18:24 [PATCH QEMU v23 00/18] Add migration support for VFIO devices Kirti Wankhede
                   ` (4 preceding siblings ...)
  2020-05-20 18:24 ` [PATCH QEMU v23 05/18] vfio: Add migration region initialization and finalize function Kirti Wankhede
@ 2020-05-20 18:24 ` Kirti Wankhede
  2020-05-21 11:19   ` Dr. David Alan Gilbert
  2020-05-20 18:24 ` [PATCH QEMU v23 07/18] vfio: Add migration state change notifier Kirti Wankhede
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Kirti Wankhede @ 2020-05-20 18:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

VM state change handler gets called on change in VM's state. This is used to set
VFIO device state to _RUNNING.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c           | 87 +++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events          |  2 +
 include/hw/vfio/vfio-common.h |  4 ++
 3 files changed, 93 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index bf9384907ec0..e79b34003079 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -10,6 +10,7 @@
 #include "qemu/osdep.h"
 #include <linux/vfio.h>
 
+#include "sysemu/runstate.h"
 #include "hw/vfio/vfio-common.h"
 #include "cpu.h"
 #include "migration/migration.h"
@@ -74,6 +75,85 @@ err:
     return ret;
 }
 
+static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
+                                    uint32_t value)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region;
+    uint32_t device_state;
+    int ret;
+
+    ret = pread(vbasedev->fd, &device_state, sizeof(device_state),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                              device_state));
+    if (ret < 0) {
+        error_report("%s: Failed to read device state %d %s",
+                     vbasedev->name, ret, strerror(errno));
+        return ret;
+    }
+
+    device_state = (device_state & mask) | value;
+
+    if (!VFIO_DEVICE_STATE_VALID(device_state)) {
+        return -EINVAL;
+    }
+
+    ret = pwrite(vbasedev->fd, &device_state, sizeof(device_state),
+                 region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                              device_state));
+    if (ret < 0) {
+        error_report("%s: Failed to set device state %d %s",
+                     vbasedev->name, ret, strerror(errno));
+
+        ret = pread(vbasedev->fd, &device_state, sizeof(device_state),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                device_state));
+        if (ret < 0) {
+            error_report("%s: On failure, failed to read device state %d %s",
+                    vbasedev->name, ret, strerror(errno));
+            return ret;
+        }
+
+        if (VFIO_DEVICE_STATE_IS_ERROR(device_state)) {
+            error_report("%s: Device is in error state 0x%x",
+                         vbasedev->name, device_state);
+            return -EFAULT;
+        }
+    }
+
+    vbasedev->device_state = device_state;
+    trace_vfio_migration_set_state(vbasedev->name, device_state);
+    return 0;
+}
+
+static void vfio_vmstate_change(void *opaque, int running, RunState state)
+{
+    VFIODevice *vbasedev = opaque;
+
+    if ((vbasedev->vm_running != running)) {
+        int ret;
+        uint32_t value = 0, mask = 0;
+
+        if (running) {
+            value = VFIO_DEVICE_STATE_RUNNING;
+            if (vbasedev->device_state & VFIO_DEVICE_STATE_RESUMING) {
+                mask = ~VFIO_DEVICE_STATE_RESUMING;
+            }
+        } else {
+            mask = ~VFIO_DEVICE_STATE_RUNNING;
+        }
+
+        ret = vfio_migration_set_state(vbasedev, mask, value);
+        if (ret) {
+            error_report("%s: Failed to set device state 0x%x",
+                         vbasedev->name, value & mask);
+        }
+        vbasedev->vm_running = running;
+        trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
+                                  value & mask);
+    }
+}
+
 static int vfio_migration_init(VFIODevice *vbasedev,
                                struct vfio_region_info *info)
 {
@@ -90,6 +170,9 @@ static int vfio_migration_init(VFIODevice *vbasedev,
         return ret;
     }
 
+    vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
+                                                          vbasedev);
+
     return 0;
 }
 
@@ -128,6 +211,10 @@ add_blocker:
 
 void vfio_migration_finalize(VFIODevice *vbasedev)
 {
+    if (vbasedev->vm_state) {
+        qemu_del_vm_change_state_handler(vbasedev->vm_state);
+    }
+
     if (vbasedev->migration_blocker) {
         migrate_del_blocker(vbasedev->migration_blocker);
         error_free(vbasedev->migration_blocker);
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index fd034ac53684..14b0a86c0035 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -146,3 +146,5 @@ vfio_display_edid_write_error(void) ""
 
 # migration.c
 vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
+vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
+vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index d4b268641173..3d18eb146b33 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -29,6 +29,7 @@
 #ifdef CONFIG_LINUX
 #include <linux/vfio.h>
 #endif
+#include "sysemu/sysemu.h"
 
 #define VFIO_MSG_PREFIX "vfio %s: "
 
@@ -119,6 +120,9 @@ typedef struct VFIODevice {
     unsigned int flags;
     VFIOMigration *migration;
     Error *migration_blocker;
+    VMChangeStateEntry *vm_state;
+    uint32_t device_state;
+    int vm_running;
 } VFIODevice;
 
 struct VFIODeviceOps {
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH QEMU v23 07/18] vfio: Add migration state change notifier
  2020-05-20 18:24 [PATCH QEMU v23 00/18] Add migration support for VFIO devices Kirti Wankhede
                   ` (5 preceding siblings ...)
  2020-05-20 18:24 ` [PATCH QEMU v23 06/18] vfio: Add VM state change handler to know state of VM Kirti Wankhede
@ 2020-05-20 18:24 ` Kirti Wankhede
  2020-05-21 11:31   ` Dr. David Alan Gilbert
  2020-05-20 18:24 ` [PATCH QEMU v23 08/18] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Kirti Wankhede @ 2020-05-20 18:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Added migration state change notifier to get notification on migration state
change. These states are translated to VFIO device state and conveyed to vendor
driver.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c           | 30 ++++++++++++++++++++++++++++++
 hw/vfio/trace-events          |  5 +++--
 include/hw/vfio/vfio-common.h |  1 +
 3 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index e79b34003079..c2f5564b51c3 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -154,6 +154,28 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state)
     }
 }
 
+static void vfio_migration_state_notifier(Notifier *notifier, void *data)
+{
+    MigrationState *s = data;
+    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
+    int ret;
+
+    trace_vfio_migration_state_notifier(vbasedev->name,
+                                        MigrationStatus_str(s->state));
+
+    switch (s->state) {
+    case MIGRATION_STATUS_CANCELLING:
+    case MIGRATION_STATUS_CANCELLED:
+    case MIGRATION_STATUS_FAILED:
+        ret = vfio_migration_set_state(vbasedev,
+                      ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING),
+                      VFIO_DEVICE_STATE_RUNNING);
+        if (ret) {
+            error_report("%s: Failed to set state RUNNING", vbasedev->name);
+        }
+    }
+}
+
 static int vfio_migration_init(VFIODevice *vbasedev,
                                struct vfio_region_info *info)
 {
@@ -173,6 +195,9 @@ static int vfio_migration_init(VFIODevice *vbasedev,
     vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
                                                           vbasedev);
 
+    vbasedev->migration_state.notify = vfio_migration_state_notifier;
+    add_migration_state_change_notifier(&vbasedev->migration_state);
+
     return 0;
 }
 
@@ -211,6 +236,11 @@ add_blocker:
 
 void vfio_migration_finalize(VFIODevice *vbasedev)
 {
+
+    if (vbasedev->migration_state.notify) {
+        remove_migration_state_change_notifier(&vbasedev->migration_state);
+    }
+
     if (vbasedev->vm_state) {
         qemu_del_vm_change_state_handler(vbasedev->vm_state);
     }
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 14b0a86c0035..bd3d47b005cb 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -146,5 +146,6 @@ vfio_display_edid_write_error(void) ""
 
 # migration.c
 vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
-vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
-vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
+vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
+vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
+vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 3d18eb146b33..28f55f66d019 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -123,6 +123,7 @@ typedef struct VFIODevice {
     VMChangeStateEntry *vm_state;
     uint32_t device_state;
     int vm_running;
+    Notifier migration_state;
 } VFIODevice;
 
 struct VFIODeviceOps {
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH QEMU v23 08/18] vfio: Register SaveVMHandlers for VFIO device
  2020-05-20 18:24 [PATCH QEMU v23 00/18] Add migration support for VFIO devices Kirti Wankhede
                   ` (6 preceding siblings ...)
  2020-05-20 18:24 ` [PATCH QEMU v23 07/18] vfio: Add migration state change notifier Kirti Wankhede
@ 2020-05-20 18:24 ` Kirti Wankhede
  2020-05-21 14:18   ` Dr. David Alan Gilbert
  2020-05-20 18:24 ` [PATCH QEMU v23 09/18] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Kirti Wankhede @ 2020-05-20 18:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Define flags to be used as delimeter in migration file stream.
Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
region from these functions at source during saving or pre-copy phase.
Set VFIO device state depending on VM's state. During live migration, VM is
running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
device. During save-restore, VM is paused, _SAVING state is set for VFIO device.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c  | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events |  2 ++
 2 files changed, 75 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index c2f5564b51c3..773c8d16b1c1 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -8,12 +8,14 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/main-loop.h"
 #include <linux/vfio.h>
 
 #include "sysemu/runstate.h"
 #include "hw/vfio/vfio-common.h"
 #include "cpu.h"
 #include "migration/migration.h"
+#include "migration/vmstate.h"
 #include "migration/qemu-file.h"
 #include "migration/register.h"
 #include "migration/blocker.h"
@@ -24,6 +26,17 @@
 #include "pci.h"
 #include "trace.h"
 
+/*
+ * Flags used as delimiter:
+ * 0xffffffff => MSB 32-bit all 1s
+ * 0xef10     => emulated (virtual) function IO
+ * 0x0000     => 16-bits reserved for flags
+ */
+#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
+#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
+#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
+#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
+
 static void vfio_migration_region_exit(VFIODevice *vbasedev)
 {
     VFIOMigration *migration = vbasedev->migration;
@@ -126,6 +139,64 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
     return 0;
 }
 
+/* ---------------------------------------------------------------------- */
+
+static int vfio_save_setup(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    trace_vfio_save_setup(vbasedev->name);
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
+
+    if (migration->region.mmaps) {
+        qemu_mutex_lock_iothread();
+        ret = vfio_region_mmap(&migration->region);
+        qemu_mutex_unlock_iothread();
+        if (ret) {
+            error_report("%s: Failed to mmap VFIO migration region %d: %s",
+                         vbasedev->name, migration->region.index,
+                         strerror(-ret));
+            return ret;
+        }
+    }
+
+    ret = vfio_migration_set_state(vbasedev, ~0, VFIO_DEVICE_STATE_SAVING);
+    if (ret) {
+        error_report("%s: Failed to set state SAVING", vbasedev->name);
+        return ret;
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    return 0;
+}
+
+static void vfio_save_cleanup(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    if (migration->region.mmaps) {
+        vfio_region_unmap(&migration->region);
+    }
+    trace_vfio_save_cleanup(vbasedev->name);
+}
+
+static SaveVMHandlers savevm_vfio_handlers = {
+    .save_setup = vfio_save_setup,
+    .save_cleanup = vfio_save_cleanup,
+};
+
+/* ---------------------------------------------------------------------- */
+
 static void vfio_vmstate_change(void *opaque, int running, RunState state)
 {
     VFIODevice *vbasedev = opaque;
@@ -192,6 +263,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
         return ret;
     }
 
+    register_savevm_live("vfio", VMSTATE_INSTANCE_ID_ANY, 1,
+                         &savevm_vfio_handlers, vbasedev);
     vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
                                                           vbasedev);
 
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index bd3d47b005cb..86c18def016e 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -149,3 +149,5 @@ vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
 vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
 vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
 vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
+vfio_save_setup(const char *name) " (%s)"
+vfio_save_cleanup(const char *name) " (%s)"
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH QEMU v23 09/18] vfio: Add save state functions to SaveVMHandlers
  2020-05-20 18:24 [PATCH QEMU v23 00/18] Add migration support for VFIO devices Kirti Wankhede
                   ` (7 preceding siblings ...)
  2020-05-20 18:24 ` [PATCH QEMU v23 08/18] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
@ 2020-05-20 18:24 ` Kirti Wankhede
  2020-05-21 15:37   ` Dr. David Alan Gilbert
  2020-05-21 20:43   ` Peter Xu
  2020-05-20 18:24 ` [PATCH QEMU v23 10/18] vfio: Add load " Kirti Wankhede
                   ` (8 subsequent siblings)
  17 siblings, 2 replies; 39+ messages in thread
From: Kirti Wankhede @ 2020-05-20 18:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
functions. These functions handles pre-copy and stop-and-copy phase.

In _SAVING|_RUNNING device state or pre-copy phase:
- read pending_bytes. If pending_bytes > 0, go through below steps.
- read data_offset - indicates kernel driver to write data to staging
  buffer.
- read data_size - amount of data in bytes written by vendor driver in
  migration region.
- read data_size bytes of data from data_offset in the migration region.
- Write data packet to file stream as below:
{VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
VFIO_MIG_FLAG_END_OF_STATE }

In _SAVING device state or stop-and-copy phase
a. read config space of device and save to migration file stream. This
   doesn't need to be from vendor driver. Any other special config state
   from driver can be saved as data in following iteration.
b. read pending_bytes. If pending_bytes > 0, go through below steps.
c. read data_offset - indicates kernel driver to write data to staging
   buffer.
d. read data_size - amount of data in bytes written by vendor driver in
   migration region.
e. read data_size bytes of data from data_offset in the migration region.
f. Write data packet as below:
   {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
g. iterate through steps b to f while (pending_bytes > 0)
h. Write {VFIO_MIG_FLAG_END_OF_STATE}

When data region is mapped, its user's responsibility to read data from
data_offset of data_size before moving to next steps.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c           | 254 +++++++++++++++++++++++++++++++++++++++++-
 hw/vfio/trace-events          |   6 +
 include/hw/vfio/vfio-common.h |   1 +
 3 files changed, 260 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 773c8d16b1c1..d90bd2296afd 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -139,6 +139,137 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
     return 0;
 }
 
+static void *find_data_region(VFIORegion *region,
+                              uint64_t data_offset,
+                              uint64_t data_size)
+{
+    void *ptr = NULL;
+    int i;
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        if ((data_offset >= region->mmaps[i].offset) &&
+            (data_offset < region->mmaps[i].offset + region->mmaps[i].size) &&
+            (data_size <= region->mmaps[i].size)) {
+            ptr = region->mmaps[i].mmap + (data_offset -
+                                           region->mmaps[i].offset);
+            break;
+        }
+    }
+    return ptr;
+}
+
+static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region;
+    uint64_t data_offset = 0, data_size = 0;
+    int ret;
+
+    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             data_offset));
+    if (ret != sizeof(data_offset)) {
+        error_report("%s: Failed to get migration buffer data offset %d",
+                     vbasedev->name, ret);
+        return -EINVAL;
+    }
+
+    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             data_size));
+    if (ret != sizeof(data_size)) {
+        error_report("%s: Failed to get migration buffer data size %d",
+                     vbasedev->name, ret);
+        return -EINVAL;
+    }
+
+    if (data_size > 0) {
+        void *buf = NULL;
+        bool buffer_mmaped;
+
+        if (region->mmaps) {
+            buf = find_data_region(region, data_offset, data_size);
+        }
+
+        buffer_mmaped = (buf != NULL);
+
+        if (!buffer_mmaped) {
+            buf = g_try_malloc(data_size);
+            if (!buf) {
+                error_report("%s: Error allocating buffer ", __func__);
+                return -ENOMEM;
+            }
+
+            ret = pread(vbasedev->fd, buf, data_size,
+                        region->fd_offset + data_offset);
+            if (ret != data_size) {
+                error_report("%s: Failed to get migration data %d",
+                             vbasedev->name, ret);
+                g_free(buf);
+                return -EINVAL;
+            }
+        }
+
+        qemu_put_be64(f, data_size);
+        qemu_put_buffer(f, buf, data_size);
+
+        if (!buffer_mmaped) {
+            g_free(buf);
+        }
+    } else {
+        qemu_put_be64(f, data_size);
+    }
+
+    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
+                           migration->pending_bytes);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    return data_size;
+}
+
+static int vfio_update_pending(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region;
+    uint64_t pending_bytes = 0;
+    int ret;
+
+    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             pending_bytes));
+    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
+        error_report("%s: Failed to get pending bytes %d",
+                     vbasedev->name, ret);
+        migration->pending_bytes = 0;
+        return (ret < 0) ? ret : -EINVAL;
+    }
+
+    migration->pending_bytes = pending_bytes;
+    trace_vfio_update_pending(vbasedev->name, pending_bytes);
+    return 0;
+}
+
+static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
+
+    if (vbasedev->ops && vbasedev->ops->vfio_save_config) {
+        vbasedev->ops->vfio_save_config(vbasedev, f);
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    trace_vfio_save_device_config_state(vbasedev->name);
+
+    return qemu_file_get_error(f);
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -157,7 +288,7 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
         qemu_mutex_unlock_iothread();
         if (ret) {
             error_report("%s: Failed to mmap VFIO migration region %d: %s",
-                         vbasedev->name, migration->region.index,
+                         vbasedev->name, migration->region.nr,
                          strerror(-ret));
             return ret;
         }
@@ -190,9 +321,130 @@ static void vfio_save_cleanup(void *opaque)
     trace_vfio_save_cleanup(vbasedev->name);
 }
 
+static void vfio_save_pending(QEMUFile *f, void *opaque,
+                              uint64_t threshold_size,
+                              uint64_t *res_precopy_only,
+                              uint64_t *res_compatible,
+                              uint64_t *res_postcopy_only)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    ret = vfio_update_pending(vbasedev);
+    if (ret) {
+        return;
+    }
+
+    *res_precopy_only += migration->pending_bytes;
+
+    trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
+                            *res_postcopy_only, *res_compatible);
+}
+
+static int vfio_save_iterate(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret, data_size;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
+
+    if (migration->pending_bytes == 0) {
+        ret = vfio_update_pending(vbasedev);
+        if (ret) {
+            return ret;
+        }
+
+        if (migration->pending_bytes == 0) {
+            /* indicates data finished, goto complete phase */
+            return 1;
+        }
+    }
+
+    data_size = vfio_save_buffer(f, vbasedev);
+
+    if (data_size < 0) {
+        error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
+                     strerror(errno));
+        return data_size;
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    trace_vfio_save_iterate(vbasedev->name, data_size);
+
+    return 0;
+}
+
+static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_RUNNING,
+                                   VFIO_DEVICE_STATE_SAVING);
+    if (ret) {
+        error_report("%s: Failed to set state STOP and SAVING",
+                     vbasedev->name);
+        return ret;
+    }
+
+    ret = vfio_save_device_config_state(f, opaque);
+    if (ret) {
+        return ret;
+    }
+
+    ret = vfio_update_pending(vbasedev);
+    if (ret) {
+        return ret;
+    }
+
+    while (migration->pending_bytes > 0) {
+        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
+        ret = vfio_save_buffer(f, vbasedev);
+        if (ret < 0) {
+            error_report("%s: Failed to save buffer", vbasedev->name);
+            return ret;
+        } else if (ret == 0) {
+            break;
+        }
+
+        ret = vfio_update_pending(vbasedev);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_SAVING, 0);
+    if (ret) {
+        error_report("%s: Failed to set state STOPPED", vbasedev->name);
+        return ret;
+    }
+
+    trace_vfio_save_complete_precopy(vbasedev->name);
+    return ret;
+}
+
 static SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
+    .save_live_pending = vfio_save_pending,
+    .save_live_iterate = vfio_save_iterate,
+    .save_live_complete_precopy = vfio_save_complete_precopy,
 };
 
 /* ---------------------------------------------------------------------- */
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 86c18def016e..9a1c5e17d97f 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -151,3 +151,9 @@ vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t
 vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
 vfio_save_setup(const char *name) " (%s)"
 vfio_save_cleanup(const char *name) " (%s)"
+vfio_save_buffer(const char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
+vfio_update_pending(const char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
+vfio_save_device_config_state(const char *name) " (%s)"
+vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
+vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
+vfio_save_complete_precopy(const char *name) " (%s)"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 28f55f66d019..c78033e4149d 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -60,6 +60,7 @@ typedef struct VFIORegion {
 
 typedef struct VFIOMigration {
     VFIORegion region;
+    uint64_t pending_bytes;
 } VFIOMigration;
 
 typedef struct VFIOAddressSpace {
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH QEMU v23 10/18] vfio: Add load state functions to SaveVMHandlers
  2020-05-20 18:24 [PATCH QEMU v23 00/18] Add migration support for VFIO devices Kirti Wankhede
                   ` (8 preceding siblings ...)
  2020-05-20 18:24 ` [PATCH QEMU v23 09/18] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
@ 2020-05-20 18:24 ` Kirti Wankhede
  2020-05-21 15:53   ` Dr. David Alan Gilbert
  2020-05-20 18:24 ` [PATCH QEMU v23 11/18] iommu: add callback to get address limit IOMMU supports Kirti Wankhede
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Kirti Wankhede @ 2020-05-20 18:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Sequence  during _RESUMING device state:
While data for this device is available, repeat below steps:
a. read data_offset from where user application should write data.
b. write data of data_size to migration region from data_offset.
c. write data_size which indicates vendor driver that data is written in
   staging buffer.

For user, data is opaque. User should write data in the same order as
received.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c  | 180 +++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events |   3 +
 2 files changed, 183 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index d90bd2296afd..b9bbe38e539c 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -270,6 +270,33 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
     return qemu_file_get_error(f);
 }
 
+static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    uint64_t data;
+
+    if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
+        int ret;
+
+        ret = vbasedev->ops->vfio_load_config(vbasedev, f);
+        if (ret) {
+            error_report("%s: Failed to load device config space",
+                         vbasedev->name);
+            return ret;
+        }
+    }
+
+    data = qemu_get_be64(f);
+    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
+        error_report("%s: Failed loading device config space, "
+                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
+        return -EINVAL;
+    }
+
+    trace_vfio_load_device_config_state(vbasedev->name);
+    return qemu_file_get_error(f);
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -439,12 +466,165 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     return ret;
 }
 
+static int vfio_load_setup(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret = 0;
+
+    if (migration->region.mmaps) {
+        ret = vfio_region_mmap(&migration->region);
+        if (ret) {
+            error_report("%s: Failed to mmap VFIO migration region %d: %s",
+                         vbasedev->name, migration->region.nr,
+                         strerror(-ret));
+            return ret;
+        }
+    }
+
+    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_MASK,
+                                   VFIO_DEVICE_STATE_RESUMING);
+    if (ret) {
+        error_report("%s: Failed to set state RESUMING", vbasedev->name);
+    }
+    return ret;
+}
+
+static int vfio_load_cleanup(void *opaque)
+{
+    vfio_save_cleanup(opaque);
+    return 0;
+}
+
+static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret = 0;
+    uint64_t data, data_size;
+
+    data = qemu_get_be64(f);
+    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
+
+        trace_vfio_load_state(vbasedev->name, data);
+
+        switch (data) {
+        case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
+        {
+            ret = vfio_load_device_config_state(f, opaque);
+            if (ret) {
+                return ret;
+            }
+            break;
+        }
+        case VFIO_MIG_FLAG_DEV_SETUP_STATE:
+        {
+            data = qemu_get_be64(f);
+            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
+                return ret;
+            } else {
+                error_report("%s: SETUP STATE: EOS not found 0x%"PRIx64,
+                             vbasedev->name, data);
+                return -EINVAL;
+            }
+            break;
+        }
+        case VFIO_MIG_FLAG_DEV_DATA_STATE:
+        {
+            VFIORegion *region = &migration->region;
+            void *buf = NULL;
+            bool buffer_mmaped = false;
+            uint64_t data_offset = 0;
+
+            data_size = qemu_get_be64(f);
+            if (data_size == 0) {
+                break;
+            }
+
+            ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
+                        region->fd_offset +
+                        offsetof(struct vfio_device_migration_info,
+                        data_offset));
+            if (ret != sizeof(data_offset)) {
+                error_report("%s:Failed to get migration buffer data offset %d",
+                             vbasedev->name, ret);
+                return -EINVAL;
+            }
+
+            if (region->mmaps) {
+                buf = find_data_region(region, data_offset, data_size);
+            }
+
+            buffer_mmaped = (buf != NULL) ? true : false;
+
+            if (!buffer_mmaped) {
+                buf = g_try_malloc0(data_size);
+                if (!buf) {
+                    error_report("%s: Error allocating buffer ", __func__);
+                    return -ENOMEM;
+                }
+            }
+
+            qemu_get_buffer(f, buf, data_size);
+
+            if (!buffer_mmaped) {
+                ret = pwrite(vbasedev->fd, buf, data_size,
+                             region->fd_offset + data_offset);
+                g_free(buf);
+
+                if (ret != data_size) {
+                    error_report("%s: Failed to set migration buffer %d",
+                                 vbasedev->name, ret);
+                    return -EINVAL;
+                }
+            }
+
+            ret = pwrite(vbasedev->fd, &data_size, sizeof(data_size),
+                         region->fd_offset +
+                       offsetof(struct vfio_device_migration_info, data_size));
+            if (ret != sizeof(data_size)) {
+                error_report("%s: Failed to set migration buffer data size %d",
+                             vbasedev->name, ret);
+                if (!buffer_mmaped) {
+                    g_free(buf);
+                }
+                return -EINVAL;
+            }
+
+            trace_vfio_load_state_device_data(vbasedev->name, data_offset,
+                                              data_size);
+            break;
+        }
+
+        default:
+            error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data);
+            return -EINVAL;
+        }
+
+        ret = qemu_file_get_error(f);
+        if (ret) {
+            return ret;
+        }
+
+        data = qemu_get_be64(f);
+        ret = qemu_file_get_error(f);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    return ret;
+}
+
 static SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
     .save_live_pending = vfio_save_pending,
     .save_live_iterate = vfio_save_iterate,
     .save_live_complete_precopy = vfio_save_complete_precopy,
+    .load_setup = vfio_load_setup,
+    .load_cleanup = vfio_load_cleanup,
+    .load_state = vfio_load_state,
 };
 
 /* ---------------------------------------------------------------------- */
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 9a1c5e17d97f..4a4bd3ba9a2a 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -157,3 +157,6 @@ vfio_save_device_config_state(const char *name) " (%s)"
 vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
 vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
 vfio_save_complete_precopy(const char *name) " (%s)"
+vfio_load_device_config_state(const char *name) " (%s)"
+vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
+vfio_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH QEMU v23 11/18] iommu: add callback to get address limit IOMMU supports
  2020-05-20 18:24 [PATCH QEMU v23 00/18] Add migration support for VFIO devices Kirti Wankhede
                   ` (9 preceding siblings ...)
  2020-05-20 18:24 ` [PATCH QEMU v23 10/18] vfio: Add load " Kirti Wankhede
@ 2020-05-20 18:24 ` Kirti Wankhede
  2020-05-21 16:13   ` Peter Xu
  2020-05-20 18:24 ` [PATCH QEMU v23 12/18] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled Kirti Wankhede
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Kirti Wankhede @ 2020-05-20 18:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Add optional method to get address limit IOMMU supports

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
---
 hw/i386/intel_iommu.c |  9 +++++++++
 include/exec/memory.h | 18 ++++++++++++++++++
 memory.c              | 11 +++++++++++
 3 files changed, 38 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index df7ad254ac15..d0b88c20c31e 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3577,6 +3577,14 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
     return;
 }
 
+static hwaddr vtd_iommu_get_address_limit(IOMMUMemoryRegion *iommu_mr)
+{
+    VTDAddressSpace *vtd_as = container_of(iommu_mr, VTDAddressSpace, iommu);
+    IntelIOMMUState *s = vtd_as->iommu_state;
+
+    return VTD_ADDRESS_SIZE(s->aw_bits) - 1;
+}
+
 /* Do the initialization. It will also be called when reset, so pay
  * attention when adding new initialization stuff.
  */
@@ -3878,6 +3886,7 @@ static void vtd_iommu_memory_region_class_init(ObjectClass *klass,
     imrc->translate = vtd_iommu_translate;
     imrc->notify_flag_changed = vtd_iommu_notify_flag_changed;
     imrc->replay = vtd_iommu_replay;
+    imrc->get_address_limit = vtd_iommu_get_address_limit;
 }
 
 static const TypeInfo vtd_iommu_memory_region_info = {
diff --git a/include/exec/memory.h b/include/exec/memory.h
index e000bd2f97b2..2d0cbd46d2a6 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -355,6 +355,16 @@ typedef struct IOMMUMemoryRegionClass {
      * @iommu: the IOMMUMemoryRegion
      */
     int (*num_indexes)(IOMMUMemoryRegion *iommu);
+
+    /*
+     * Return address limit this IOMMU supports.
+     *
+     * Optional method: if this method is not provided, then
+     * memory_region_iommu_get_address_limit() will return 0.
+     *
+     * @iommu: the IOMMUMemoryRegion
+     */
+    hwaddr (*get_address_limit)(IOMMUMemoryRegion *iommu);
 } IOMMUMemoryRegionClass;
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
@@ -1364,6 +1374,14 @@ int memory_region_iommu_attrs_to_index(IOMMUMemoryRegion *iommu_mr,
 int memory_region_iommu_num_indexes(IOMMUMemoryRegion *iommu_mr);
 
 /**
+ * memory_region_iommu_get_address_limit : return the maximum address limit
+ * that this IOMMU supports.
+ *
+ * @iommu_mr: the memory region
+ */
+hwaddr memory_region_iommu_get_address_limit(IOMMUMemoryRegion *iommu_mr);
+
+/**
  * memory_region_name: get a memory region's name
  *
  * Returns the string that was used to initialize the memory region.
diff --git a/memory.c b/memory.c
index fd6f3d6aca18..a6c69e1391ba 100644
--- a/memory.c
+++ b/memory.c
@@ -1887,6 +1887,17 @@ void memory_region_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
     }
 }
 
+hwaddr memory_region_iommu_get_address_limit(IOMMUMemoryRegion *iommu_mr)
+{
+    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr);
+
+    if (imrc->get_address_limit) {
+        return imrc->get_address_limit(iommu_mr);
+    }
+
+    return 0;
+}
+
 void memory_region_unregister_iommu_notifier(MemoryRegion *mr,
                                              IOMMUNotifier *n)
 {
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH QEMU v23 12/18] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled
  2020-05-20 18:24 [PATCH QEMU v23 00/18] Add migration support for VFIO devices Kirti Wankhede
                   ` (10 preceding siblings ...)
  2020-05-20 18:24 ` [PATCH QEMU v23 11/18] iommu: add callback to get address limit IOMMU supports Kirti Wankhede
@ 2020-05-20 18:24 ` Kirti Wankhede
  2020-05-21 16:20   ` Dr. David Alan Gilbert
  2020-05-20 18:24 ` [PATCH QEMU v23 13/18] vfio: Get migration capability flags for container Kirti Wankhede
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Kirti Wankhede @ 2020-05-20 18:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
---
 memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/memory.c b/memory.c
index a6c69e1391ba..27f22f1a9974 100644
--- a/memory.c
+++ b/memory.c
@@ -1788,7 +1788,7 @@ bool memory_region_is_ram_device(MemoryRegion *mr)
 uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
 {
     uint8_t mask = mr->dirty_log_mask;
-    if (global_dirty_log && mr->ram_block) {
+    if (global_dirty_log && (mr->ram_block || memory_region_is_iommu(mr))) {
         mask |= (1 << DIRTY_MEMORY_MIGRATION);
     }
     return mask;
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH QEMU v23 13/18] vfio: Get migration capability flags for container
  2020-05-20 18:24 [PATCH QEMU v23 00/18] Add migration support for VFIO devices Kirti Wankhede
                   ` (11 preceding siblings ...)
  2020-05-20 18:24 ` [PATCH QEMU v23 12/18] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled Kirti Wankhede
@ 2020-05-20 18:24 ` Kirti Wankhede
  2020-05-20 18:24 ` [PATCH QEMU v23 14/18] vfio: Add function to start and stop dirty pages tracking Kirti Wankhede
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 39+ messages in thread
From: Kirti Wankhede @ 2020-05-20 18:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, Eric Auger, changpeng.liu, eskultet, Shameer Kolothum,
	Ken.Xue, jonathan.davies, pbonzini

Added helper functions to get IOMMU info capability chain.
Added function to get migration capability information from that
capability chain for IOMMU container.

Similar change was proposed earlier:
https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg03759.html

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Cc: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Cc: Eric Auger <eric.auger@redhat.com>
---
 hw/vfio/common.c              | 91 +++++++++++++++++++++++++++++++++++++++----
 include/hw/vfio/vfio-common.h |  3 ++
 2 files changed, 86 insertions(+), 8 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 90e9a854d82c..e0d3d4585a65 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1229,6 +1229,75 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
     return 0;
 }
 
+static int vfio_get_iommu_info(VFIOContainer *container,
+                               struct vfio_iommu_type1_info **info)
+{
+
+    size_t argsz = sizeof(struct vfio_iommu_type1_info);
+
+    *info = g_new0(struct vfio_iommu_type1_info, 1);
+again:
+    (*info)->argsz = argsz;
+
+    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
+        g_free(*info);
+        *info = NULL;
+        return -errno;
+    }
+
+    if (((*info)->argsz > argsz)) {
+        argsz = (*info)->argsz;
+        *info = g_realloc(*info, argsz);
+        goto again;
+    }
+
+    return 0;
+}
+
+static struct vfio_info_cap_header *
+vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
+{
+    struct vfio_info_cap_header *hdr;
+    void *ptr = info;
+
+    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
+        return NULL;
+    }
+
+    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
+        if (hdr->id == id) {
+            return hdr;
+        }
+    }
+
+    return NULL;
+}
+
+static void vfio_get_iommu_info_migration(VFIOContainer *container,
+                                         struct vfio_iommu_type1_info *info)
+{
+    struct vfio_info_cap_header *hdr;
+    struct vfio_iommu_type1_info_cap_migration *cap_mig;
+
+    hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
+    if (!hdr) {
+        return;
+    }
+
+    cap_mig = container_of(hdr, struct vfio_iommu_type1_info_cap_migration,
+                            header);
+
+    container->dirty_pages_supported = true;
+    container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
+    container->dirty_pgsizes = cap_mig->pgsize_bitmap;
+
+    /*
+     * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
+     * TARGET_PAGE_SIZE to mark those dirty.
+     */
+    assert(container->dirty_pgsizes & TARGET_PAGE_SIZE);
+}
+
 static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
                                   Error **errp)
 {
@@ -1293,6 +1362,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     container->space = space;
     container->fd = fd;
     container->error = NULL;
+    container->dirty_pages_supported = false;
     QLIST_INIT(&container->giommu_list);
     QLIST_INIT(&container->hostwin_list);
 
@@ -1305,7 +1375,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     case VFIO_TYPE1v2_IOMMU:
     case VFIO_TYPE1_IOMMU:
     {
-        struct vfio_iommu_type1_info info;
+        struct vfio_iommu_type1_info *info;
 
         /*
          * FIXME: This assumes that a Type1 IOMMU can map any 64-bit
@@ -1314,15 +1384,20 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
          * existing Type1 IOMMUs generally support any IOVA we're
          * going to actually try in practice.
          */
-        info.argsz = sizeof(info);
-        ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
-        /* Ignore errors */
-        if (ret || !(info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
+        ret = vfio_get_iommu_info(container, &info);
+        if (ret) {
+                goto free_container_exit;
+        }
+
+        if (!(info->flags & VFIO_IOMMU_INFO_PGSIZES)) {
             /* Assume 4k IOVA page size */
-            info.iova_pgsizes = 4096;
+            info->iova_pgsizes = 4096;
         }
-        vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
-        container->pgsizes = info.iova_pgsizes;
+        vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes);
+        container->pgsizes = info->iova_pgsizes;
+
+        vfio_get_iommu_info_migration(container, info);
+        g_free(info);
         break;
     }
     case VFIO_SPAPR_TCE_v2_IOMMU:
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index c78033e4149d..5a57a78ec517 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -79,6 +79,9 @@ typedef struct VFIOContainer {
     unsigned iommu_type;
     Error *error;
     bool initialized;
+    bool dirty_pages_supported;
+    uint64_t dirty_pgsizes;
+    uint64_t max_dirty_bitmap_size;
     unsigned long pgsizes;
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
     QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH QEMU v23 14/18] vfio: Add function to start and stop dirty pages tracking
  2020-05-20 18:24 [PATCH QEMU v23 00/18] Add migration support for VFIO devices Kirti Wankhede
                   ` (12 preceding siblings ...)
  2020-05-20 18:24 ` [PATCH QEMU v23 13/18] vfio: Get migration capability flags for container Kirti Wankhede
@ 2020-05-20 18:24 ` Kirti Wankhede
  2020-05-21 16:50   ` Dr. David Alan Gilbert
  2020-05-20 18:24 ` [PATCH QEMU v23 15/18] vfio: Add vfio_listener_log_sync to mark dirty pages Kirti Wankhede
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Kirti Wankhede @ 2020-05-20 18:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Call VFIO_IOMMU_DIRTY_PAGES ioctl to start and stop dirty pages tracking
for VFIO devices.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
---
 hw/vfio/migration.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index b9bbe38e539c..7729c90782bd 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -9,6 +9,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu/main-loop.h"
+#include <sys/ioctl.h>
 #include <linux/vfio.h>
 
 #include "sysemu/runstate.h"
@@ -297,6 +298,32 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
     return qemu_file_get_error(f);
 }
 
+static int vfio_start_dirty_page_tracking(VFIODevice *vbasedev, bool start)
+{
+    int ret;
+    VFIOContainer *container = vbasedev->group->container;
+    struct vfio_iommu_type1_dirty_bitmap dirty = {
+        .argsz = sizeof(dirty),
+    };
+
+    if (start) {
+        if (vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) {
+            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
+        } else {
+            return -EINVAL;
+        }
+    } else {
+            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
+    }
+
+    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
+    if (ret) {
+        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
+                     dirty.flags, errno);
+    }
+    return ret;
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -327,6 +354,11 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
         return ret;
     }
 
+    ret = vfio_start_dirty_page_tracking(vbasedev, true);
+    if (ret) {
+        return ret;
+    }
+
     qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
 
     ret = qemu_file_get_error(f);
@@ -342,6 +374,8 @@ static void vfio_save_cleanup(void *opaque)
     VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
 
+    vfio_start_dirty_page_tracking(vbasedev, false);
+
     if (migration->region.mmaps) {
         vfio_region_unmap(&migration->region);
     }
@@ -676,6 +710,8 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
         if (ret) {
             error_report("%s: Failed to set state RUNNING", vbasedev->name);
         }
+
+        vfio_start_dirty_page_tracking(vbasedev, false);
     }
 }
 
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH QEMU v23 15/18] vfio: Add vfio_listener_log_sync to mark dirty pages
  2020-05-20 18:24 [PATCH QEMU v23 00/18] Add migration support for VFIO devices Kirti Wankhede
                   ` (13 preceding siblings ...)
  2020-05-20 18:24 ` [PATCH QEMU v23 14/18] vfio: Add function to start and stop dirty pages tracking Kirti Wankhede
@ 2020-05-20 18:24 ` Kirti Wankhede
  2020-05-21 18:52   ` Dr. David Alan Gilbert
  2020-05-20 18:24 ` [PATCH QEMU v23 16/18] vfio: Add ioctl to get dirty pages bitmap during dma unmap Kirti Wankhede
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Kirti Wankhede @ 2020-05-20 18:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

vfio_listener_log_sync gets list of dirty pages from container using
VFIO_IOMMU_GET_DIRTY_BITMAP ioctl and mark those pages dirty when all
devices are stopped and saving state.
Return early for the RAM block section of mapped MMIO region.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/common.c     | 184 +++++++++++++++++++++++++++++++++++++++++++++++++--
 hw/vfio/trace-events |   1 +
 2 files changed, 180 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index e0d3d4585a65..3f7049fbd1b0 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -29,6 +29,7 @@
 #include "hw/vfio/vfio.h"
 #include "exec/address-spaces.h"
 #include "exec/memory.h"
+#include "exec/ram_addr.h"
 #include "hw/hw.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
@@ -38,6 +39,7 @@
 #include "sysemu/reset.h"
 #include "trace.h"
 #include "qapi/error.h"
+#include "migration/migration.h"
 
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -288,6 +290,28 @@ const MemoryRegionOps vfio_region_ops = {
 };
 
 /*
+ * Device state interfaces
+ */
+
+static bool vfio_devices_are_stopped_and_saving(void)
+{
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            if ((vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) &&
+                !(vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+                continue;
+            } else {
+                return false;
+            }
+        }
+    }
+    return true;
+}
+
+/*
  * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
  */
 static int vfio_dma_unmap(VFIOContainer *container,
@@ -408,8 +432,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
 }
 
 /* Called with rcu_read_lock held.  */
-static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
-                           bool *read_only)
+static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
+                               ram_addr_t *ram_addr, bool *read_only)
 {
     MemoryRegion *mr;
     hwaddr xlat;
@@ -440,9 +464,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
         return false;
     }
 
-    *vaddr = memory_region_get_ram_ptr(mr) + xlat;
-    *read_only = !writable || mr->readonly;
+    if (vaddr) {
+        *vaddr = memory_region_get_ram_ptr(mr) + xlat;
+    }
 
+    if (ram_addr) {
+        *ram_addr = memory_region_get_ram_addr(mr) + xlat;
+    }
+
+    if (read_only) {
+        *read_only = !writable || mr->readonly;
+    }
     return true;
 }
 
@@ -467,7 +499,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     rcu_read_lock();
 
     if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
-        if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
+        if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) {
             goto out;
         }
         /*
@@ -813,9 +845,151 @@ static void vfio_listener_region_del(MemoryListener *listener,
     }
 }
 
+static int vfio_get_dirty_bitmap(MemoryListener *listener,
+                                 MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+    VFIOGuestIOMMU *giommu = NULL;
+    IOMMUTLBEntry iotlb;
+    hwaddr granularity, iova, iova_end;
+    int ret;
+
+    if (memory_region_is_iommu(section->mr)) {
+        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
+            if (MEMORY_REGION(giommu->iommu) == section->mr &&
+                giommu->n.start == section->offset_within_region) {
+                break;
+            }
+        }
+
+        if (!giommu) {
+            return -EINVAL;
+        }
+    }
+
+    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+
+    if (memory_region_is_iommu(section->mr)) {
+        hwaddr iova_size;
+
+        granularity = memory_region_iommu_get_min_page_size(giommu->iommu);
+        iova_end = memory_region_iommu_get_address_limit(giommu->iommu);
+
+        if (iova_end) {
+            iova_size = MIN(int128_get64(section->size), iova_end - iova + 1);
+        } else {
+            iova_size = int128_get64(section->size);
+        }
+
+        iova_end = iova + iova_size - 1;
+    } else {
+        granularity = memory_region_size(section->mr);
+        iova_end = iova + int128_get64(section->size) - 1;
+    }
+
+    RCU_READ_LOCK_GUARD();
+
+    while (iova < iova_end) {
+        struct vfio_iommu_type1_dirty_bitmap *dbitmap;
+        struct vfio_iommu_type1_dirty_bitmap_get *range;
+        ram_addr_t start, pages;
+        uint64_t iova_xlat, size;
+
+        if (memory_region_is_iommu(section->mr)) {
+            iotlb = address_space_get_iotlb_entry(container->space->as, iova,
+                                                 true, MEMTXATTRS_UNSPECIFIED);
+            if ((iotlb.target_as == NULL) || (iotlb.addr_mask == 0)) {
+                if ((iova + granularity) < iova) {
+                    break;
+                }
+                iova += granularity;
+                continue;
+            }
+            iova_xlat = iotlb.iova + giommu->iommu_offset;
+            size = iotlb.addr_mask + 1;
+        } else {
+            iova_xlat = iova;
+            size = iova_end - iova + 1;
+        }
+
+        dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
+
+        dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
+        dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
+        range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data;
+        range->iova = iova_xlat;
+        range->size = size;
+
+        /*
+         * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
+         * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap's pgsize to
+         * TARGET_PAGE_SIZE.
+         */
+        range->bitmap.pgsize = TARGET_PAGE_SIZE;
+
+        pages = TARGET_PAGE_ALIGN(range->size) >> TARGET_PAGE_BITS;
+        range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
+                             BITS_PER_BYTE;
+        range->bitmap.data = g_malloc0(range->bitmap.size);
+
+        ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
+        if (ret) {
+            error_report("Failed to get dirty bitmap for iova: 0x%llx "
+                         "size: 0x%llx err: %d",
+                         range->iova, range->size, errno);
+            goto err_out;
+        }
+
+        if (memory_region_is_iommu(section->mr)) {
+            if (!vfio_get_xlat_addr(&iotlb, NULL, &start, NULL)) {
+                ret = -EINVAL;
+                goto err_out;
+            }
+        } else {
+            start = memory_region_get_ram_addr(section->mr) +
+                    section->offset_within_region + iova -
+                    TARGET_PAGE_ALIGN(section->offset_within_address_space);
+        }
+
+        cpu_physical_memory_set_dirty_lebitmap((uint64_t *)range->bitmap.data,
+                                               start, pages);
+
+        trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
+                                    range->bitmap.size, start);
+err_out:
+        g_free(range->bitmap.data);
+        g_free(dbitmap);
+
+        if (ret) {
+            return ret;
+        }
+
+        if ((iova + size) < iova) {
+            break;
+        }
+
+        iova += size;
+    }
+
+    return 0;
+}
+
+static void vfio_listerner_log_sync(MemoryListener *listener,
+        MemoryRegionSection *section)
+{
+    if (vfio_listener_skipped_section(section)) {
+        return;
+    }
+
+    if (vfio_devices_are_stopped_and_saving()) {
+        vfio_get_dirty_bitmap(listener, section);
+    }
+}
+
 static const MemoryListener vfio_memory_listener = {
     .region_add = vfio_listener_region_add,
     .region_del = vfio_listener_region_del,
+    .log_sync = vfio_listerner_log_sync,
 };
 
 static void vfio_listener_release(VFIOContainer *container)
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 4a4bd3ba9a2a..c61ae4f3ead8 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -160,3 +160,4 @@ vfio_save_complete_precopy(const char *name) " (%s)"
 vfio_load_device_config_state(const char *name) " (%s)"
 vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
 vfio_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
+vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH QEMU v23 16/18] vfio: Add ioctl to get dirty pages bitmap during dma unmap.
  2020-05-20 18:24 [PATCH QEMU v23 00/18] Add migration support for VFIO devices Kirti Wankhede
                   ` (14 preceding siblings ...)
  2020-05-20 18:24 ` [PATCH QEMU v23 15/18] vfio: Add vfio_listener_log_sync to mark dirty pages Kirti Wankhede
@ 2020-05-20 18:24 ` Kirti Wankhede
  2020-05-21 19:05   ` Dr. David Alan Gilbert
  2020-05-20 18:24 ` [PATCH QEMU v23 17/18] vfio: Make vfio-pci device migration capable Kirti Wankhede
  2020-05-20 18:24 ` [PATCH QEMU v23 18/18] qapi: Add VFIO devices migration stats in Migration stats Kirti Wankhede
  17 siblings, 1 reply; 39+ messages in thread
From: Kirti Wankhede @ 2020-05-20 18:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

With vIOMMU, IO virtual address range can get unmapped while in pre-copy
phase of migration. In that case, unmap ioctl should return pages pinned
in that range and QEMU should find its correcponding guest physical
addresses and report those dirty.

Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/common.c | 84 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 80 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 3f7049fbd1b0..2dd9e8b24788 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -311,11 +311,82 @@ static bool vfio_devices_are_stopped_and_saving(void)
     return true;
 }
 
+static bool vfio_devices_are_running_and_saving(void)
+{
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            if ((vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) &&
+                (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+                continue;
+            } else {
+                return false;
+            }
+        }
+    }
+    return true;
+}
+
+static int vfio_dma_unmap_bitmap(VFIOContainer *container,
+                                 hwaddr iova, ram_addr_t size,
+                                 IOMMUTLBEntry *iotlb)
+{
+    struct vfio_iommu_type1_dma_unmap *unmap;
+    struct vfio_bitmap *bitmap;
+    uint64_t pages = TARGET_PAGE_ALIGN(size) >> TARGET_PAGE_BITS;
+    int ret;
+
+    unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
+    if (!unmap) {
+        return -ENOMEM;
+    }
+
+    unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
+    unmap->iova = iova;
+    unmap->size = size;
+    unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
+    bitmap = (struct vfio_bitmap *)&unmap->data;
+
+    /*
+     * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
+     * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap_pgsize to
+     * TARGET_PAGE_SIZE.
+     */
+
+    bitmap->pgsize = TARGET_PAGE_SIZE;
+    bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
+                   BITS_PER_BYTE;
+
+    if (bitmap->size > container->max_dirty_bitmap_size) {
+        error_report("UNMAP: Size of bitmap too big 0x%llx", bitmap->size);
+        ret = -E2BIG;
+        goto unmap_exit;
+    }
+
+    bitmap->data = g_malloc0(bitmap->size);
+
+    ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
+    if (!ret) {
+        cpu_physical_memory_set_dirty_lebitmap((uint64_t *)bitmap->data,
+                iotlb->translated_addr, pages);
+    } else {
+        error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %d", -errno);
+    }
+
+    g_free(bitmap->data);
+unmap_exit:
+    g_free(unmap);
+    return ret;
+}
+
 /*
  * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
  */
 static int vfio_dma_unmap(VFIOContainer *container,
-                          hwaddr iova, ram_addr_t size)
+                          hwaddr iova, ram_addr_t size,
+                          IOMMUTLBEntry *iotlb)
 {
     struct vfio_iommu_type1_dma_unmap unmap = {
         .argsz = sizeof(unmap),
@@ -324,6 +395,11 @@ static int vfio_dma_unmap(VFIOContainer *container,
         .size = size,
     };
 
+    if (iotlb && container->dirty_pages_supported &&
+        vfio_devices_are_running_and_saving()) {
+        return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
+    }
+
     while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
         /*
          * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
@@ -371,7 +447,7 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
      * the VGA ROM space.
      */
     if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
-        (errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
+        (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
          ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
         return 0;
     }
@@ -519,7 +595,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
                          iotlb->addr_mask + 1, vaddr, ret);
         }
     } else {
-        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
+        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
         if (ret) {
             error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx") = %d (%m)",
@@ -822,7 +898,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
     }
 
     if (try_unmap) {
-        ret = vfio_dma_unmap(container, iova, int128_get64(llsize));
+        ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
         if (ret) {
             error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx") = %d (%m)",
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH QEMU v23 17/18] vfio: Make vfio-pci device migration capable
  2020-05-20 18:24 [PATCH QEMU v23 00/18] Add migration support for VFIO devices Kirti Wankhede
                   ` (15 preceding siblings ...)
  2020-05-20 18:24 ` [PATCH QEMU v23 16/18] vfio: Add ioctl to get dirty pages bitmap during dma unmap Kirti Wankhede
@ 2020-05-20 18:24 ` Kirti Wankhede
  2020-05-21 19:16   ` Dr. David Alan Gilbert
  2020-05-20 18:24 ` [PATCH QEMU v23 18/18] qapi: Add VFIO devices migration stats in Migration stats Kirti Wankhede
  17 siblings, 1 reply; 39+ messages in thread
From: Kirti Wankhede @ 2020-05-20 18:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

If device is not failover primary device call vfio_migration_probe()
and vfio_migration_finalize() functions for vfio-pci device to enable
migration for vfio PCI device which support migration.
Removed vfio_pci_vmstate structure.
Removed migration blocker from VFIO PCI device specific structure and use
migration blocker from generic structure of  VFIO device.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/pci.c | 32 +++++++++++---------------------
 hw/vfio/pci.h |  1 -
 2 files changed, 11 insertions(+), 22 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 94535f0e27cd..573b28753bfd 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2916,22 +2916,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         return;
     }
 
-    if (!pdev->failover_pair_id) {
-        error_setg(&vdev->migration_blocker,
-                "VFIO device doesn't support migration");
-        ret = migrate_add_blocker(vdev->migration_blocker, &err);
-        if (ret) {
-            error_propagate(errp, err);
-            error_free(vdev->migration_blocker);
-            vdev->migration_blocker = NULL;
-            return;
-        }
-    }
-
     vdev->vbasedev.name = g_path_get_basename(vdev->vbasedev.sysfsdev);
     vdev->vbasedev.ops = &vfio_pci_ops;
     vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
     vdev->vbasedev.dev = DEVICE(vdev);
+    vdev->vbasedev.device_state = 0;
 
     tmp = g_strdup_printf("%s/iommu_group", vdev->vbasedev.sysfsdev);
     len = readlink(tmp, group_path, sizeof(group_path));
@@ -3195,6 +3184,14 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         }
     }
 
+    if (!pdev->failover_pair_id) {
+        ret = vfio_migration_probe(&vdev->vbasedev, errp);
+        if (ret) {
+            error_report("%s: Failed to setup for migration",
+                         vdev->vbasedev.name);
+        }
+    }
+
     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);
@@ -3209,11 +3206,6 @@ out_teardown:
     vfio_bars_exit(vdev);
 error:
     error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name);
-    if (vdev->migration_blocker) {
-        migrate_del_blocker(vdev->migration_blocker);
-        error_free(vdev->migration_blocker);
-        vdev->migration_blocker = NULL;
-    }
 }
 
 static void vfio_instance_finalize(Object *obj)
@@ -3225,10 +3217,7 @@ static void vfio_instance_finalize(Object *obj)
     vfio_bars_finalize(vdev);
     g_free(vdev->emulated_config_bits);
     g_free(vdev->rom);
-    if (vdev->migration_blocker) {
-        migrate_del_blocker(vdev->migration_blocker);
-        error_free(vdev->migration_blocker);
-    }
+
     /*
      * XXX Leaking igd_opregion is not an oversight, we can't remove the
      * fw_cfg entry therefore leaking this allocation seems like the safest
@@ -3256,6 +3245,7 @@ static void vfio_exitfn(PCIDevice *pdev)
     }
     vfio_teardown_msi(vdev);
     vfio_bars_exit(vdev);
+    vfio_migration_finalize(&vdev->vbasedev);
 }
 
 static void vfio_pci_reset(DeviceState *dev)
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 0da7a20a7ec2..b148c937ef72 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -168,7 +168,6 @@ typedef struct VFIOPCIDevice {
     bool no_vfio_ioeventfd;
     bool enable_ramfb;
     VFIODisplay *dpy;
-    Error *migration_blocker;
     Notifier irqchip_change_notifier;
 } VFIOPCIDevice;
 
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH QEMU v23 18/18] qapi: Add VFIO devices migration stats in Migration stats
  2020-05-20 18:24 [PATCH QEMU v23 00/18] Add migration support for VFIO devices Kirti Wankhede
                   ` (16 preceding siblings ...)
  2020-05-20 18:24 ` [PATCH QEMU v23 17/18] vfio: Make vfio-pci device migration capable Kirti Wankhede
@ 2020-05-20 18:24 ` Kirti Wankhede
  2020-05-21 19:23   ` Dr. David Alan Gilbert
  2020-05-25 14:34   ` Markus Armbruster
  17 siblings, 2 replies; 39+ messages in thread
From: Kirti Wankhede @ 2020-05-20 18:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Added amount of bytes transferred to the target VM by all VFIO devices

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
---
 hw/vfio/common.c            | 20 ++++++++++++++++++++
 hw/vfio/migration.c         | 10 +++++++++-
 include/qemu/vfio-helpers.h |  3 +++
 migration/migration.c       | 18 ++++++++++++++++++
 monitor/hmp-cmds.c          |  6 ++++++
 qapi/migration.json         | 19 ++++++++++++++++++-
 6 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 2dd9e8b24788..e1ad426a6d78 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -40,6 +40,7 @@
 #include "trace.h"
 #include "qapi/error.h"
 #include "migration/migration.h"
+#include "qemu/vfio-helpers.h"
 
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -293,6 +294,25 @@ const MemoryRegionOps vfio_region_ops = {
  * Device state interfaces
  */
 
+bool vfio_mig_active(void)
+{
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+
+    if (QLIST_EMPTY(&vfio_group_list)) {
+        return false;
+    }
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            if (vbasedev->migration_blocker) {
+                return false;
+            }
+        }
+    }
+    return true;
+}
+
 static bool vfio_devices_are_stopped_and_saving(void)
 {
     VFIOGroup *group;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 7729c90782bd..2eef38fe5c65 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -26,7 +26,7 @@
 #include "exec/ram_addr.h"
 #include "pci.h"
 #include "trace.h"
-
+#include "qemu/vfio-helpers.h"
 /*
  * Flags used as delimiter:
  * 0xffffffff => MSB 32-bit all 1s
@@ -38,6 +38,8 @@
 #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
 #define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
 
+static int64_t bytes_transferred;
+
 static void vfio_migration_region_exit(VFIODevice *vbasedev)
 {
     VFIOMigration *migration = vbasedev->migration;
@@ -229,6 +231,7 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
         return ret;
     }
 
+    bytes_transferred += data_size;
     return data_size;
 }
 
@@ -744,6 +747,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
 
 /* ---------------------------------------------------------------------- */
 
+int64_t vfio_mig_bytes_transferred(void)
+{
+    return bytes_transferred;
+}
+
 int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
 {
     struct vfio_region_info *info;
diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
index 1f057c2b9e40..26a7df0767b1 100644
--- a/include/qemu/vfio-helpers.h
+++ b/include/qemu/vfio-helpers.h
@@ -29,4 +29,7 @@ void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar,
 int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
                            int irq_type, Error **errp);
 
+bool vfio_mig_active(void);
+int64_t vfio_mig_bytes_transferred(void);
+
 #endif
diff --git a/migration/migration.c b/migration/migration.c
index 0bb042a0f78b..bcdc880e9a76 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -54,6 +54,7 @@
 #include "net/announce.h"
 #include "qemu/queue.h"
 #include "multifd.h"
+#include "qemu/vfio-helpers.h"
 
 #define MAX_THROTTLE  (32 << 20)      /* Migration transfer speed throttling */
 
@@ -970,6 +971,17 @@ static void populate_disk_info(MigrationInfo *info)
     }
 }
 
+#ifdef CONFIG_LINUX
+static void populate_vfio_info(MigrationInfo *info)
+{
+    if (vfio_mig_active()) {
+        info->has_vfio = true;
+        info->vfio = g_malloc0(sizeof(*info->vfio));
+        info->vfio->bytes = vfio_mig_bytes_transferred();
+    }
+}
+#endif
+
 static void fill_source_migration_info(MigrationInfo *info)
 {
     MigrationState *s = migrate_get_current();
@@ -995,6 +1007,9 @@ static void fill_source_migration_info(MigrationInfo *info)
         populate_time_info(info, s);
         populate_ram_info(info, s);
         populate_disk_info(info);
+#ifdef CONFIG_LINUX
+        populate_vfio_info(info);
+#endif
         break;
     case MIGRATION_STATUS_COLO:
         info->has_status = true;
@@ -1003,6 +1018,9 @@ static void fill_source_migration_info(MigrationInfo *info)
     case MIGRATION_STATUS_COMPLETED:
         populate_time_info(info, s);
         populate_ram_info(info, s);
+#ifdef CONFIG_LINUX
+        populate_vfio_info(info);
+#endif
         break;
     case MIGRATION_STATUS_FAILED:
         info->has_status = true;
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index 9c61e769ca9a..396c195263c2 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -355,6 +355,12 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
         }
         monitor_printf(mon, "]\n");
     }
+
+    if (info->has_vfio) {
+        monitor_printf(mon, "vfio device bytes: %" PRIu64 " kbytes\n",
+                       info->vfio->bytes >> 10);
+    }
+
     qapi_free_MigrationInfo(info);
 }
 
diff --git a/qapi/migration.json b/qapi/migration.json
index d5000558c6c9..acbc42a1efe8 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -146,6 +146,18 @@
             'active', 'postcopy-active', 'postcopy-paused',
             'postcopy-recover', 'completed', 'failed', 'colo',
             'pre-switchover', 'device', 'wait-unplug' ] }
+##
+# @VfioStats:
+#
+# Detailed VFIO devices migration statistics
+#
+# @bytes: amount of bytes transferred to the target VM by VFIO devices
+#
+# Since: 5.1
+#
+##
+{ 'struct': 'VfioStats',
+  'data': {'bytes': 'int' } }
 
 ##
 # @MigrationInfo:
@@ -207,6 +219,10 @@
 #
 # @socket-address: Only used for tcp, to know what the real port is (Since 4.0)
 #
+# @vfio: @VfioStats containing detailed VFIO devices migration statistics,
+#        only returned if VFIO device is present, migration is supported by all
+#         VFIO devices and status is 'active' or 'completed' (since 5.1)
+#
 # Since: 0.14.0
 ##
 { 'struct': 'MigrationInfo',
@@ -222,7 +238,8 @@
            '*postcopy-blocktime' : 'uint32',
            '*postcopy-vcpu-blocktime': ['uint32'],
            '*compression': 'CompressionStats',
-           '*socket-address': ['SocketAddress'] } }
+           '*socket-address': ['SocketAddress'],
+           '*vfio': 'VfioStats' } }
 
 ##
 # @query-migrate:
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH QEMU v23 04/18] vfio: Add save and load functions for VFIO PCI devices
  2020-05-20 18:24 ` [PATCH QEMU v23 04/18] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
@ 2020-05-21  9:50   ` Dr. David Alan Gilbert
  2020-05-21 12:12     ` Kirti Wankhede
  0 siblings, 1 reply; 39+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-21  9:50 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> These functions save and restore PCI device specific data - config
> space of PCI device.
> Tested save and restore with MSI and MSIX type.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>

So I'm OK with this from the migration side, but I'd like Alex to check it from
the PCI side of things.

I still think that you'd be better:
  a) Using a VMStateDescription to encode the structure
  b) or at least adding a flag at the end so you can add more data later

Experience with every other device shows you're shooting yourself in the
foot by hard coding the layout and not giving yourself a chance to
expand it.

but for now,


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  hw/vfio/pci.c                 | 163 ++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |   2 +
>  2 files changed, 165 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 0514ba373d1c..94535f0e27cd 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -41,6 +41,7 @@
>  #include "trace.h"
>  #include "qapi/error.h"
>  #include "migration/blocker.h"
> +#include "migration/qemu-file.h"
>  
>  #define TYPE_VFIO_PCI "vfio-pci"
>  #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> @@ -1632,6 +1633,50 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
>      }
>  }
>  
> +static int vfio_bar_validate(VFIOPCIDevice *vdev, int nr)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +    VFIOBAR *bar = &vdev->bars[nr];
> +    uint64_t addr;
> +    uint32_t addr_lo, addr_hi = 0;
> +
> +    /* Skip unimplemented BARs and the upper half of 64bit BARS. */
> +    if (!bar->size) {
> +        return 0;
> +    }
> +
> +    addr_lo = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + nr * 4, 4);
> +
> +    addr_lo &= (bar->ioport ? PCI_BASE_ADDRESS_IO_MASK :
> +                              PCI_BASE_ADDRESS_MEM_MASK);
> +    if (bar->type == PCI_BASE_ADDRESS_MEM_TYPE_64) {
> +        addr_hi = pci_default_read_config(pdev,
> +                                         PCI_BASE_ADDRESS_0 + (nr + 1) * 4, 4);
> +    }
> +
> +    addr = ((uint64_t)addr_hi << 32) | addr_lo;
> +
> +    if (!QEMU_IS_ALIGNED(addr, bar->size)) {
> +        return -EINVAL;
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_bars_validate(VFIOPCIDevice *vdev)
> +{
> +    int i, ret;
> +
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        ret = vfio_bar_validate(vdev, i);
> +        if (ret) {
> +            error_report("vfio: BAR address %d validation failed", i);
> +            return ret;
> +        }
> +    }
> +    return 0;
> +}
> +
>  static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
>  {
>      VFIOBAR *bar = &vdev->bars[nr];
> @@ -2414,11 +2459,129 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
>      return OBJECT(vdev);
>  }
>  
> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> +{
> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint16_t pci_cmd;
> +    int i;
> +
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        uint32_t bar;
> +
> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> +        qemu_put_be32(f, bar);
> +    }
> +
> +    qemu_put_be32(f, vdev->interrupt);
> +    if (vdev->interrupt == VFIO_INT_MSI) {
> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> +        bool msi_64bit;
> +
> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                                            2);
> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> +
> +        msi_addr_lo = pci_default_read_config(pdev,
> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> +        qemu_put_be32(f, msi_addr_lo);
> +
> +        if (msi_64bit) {
> +            msi_addr_hi = pci_default_read_config(pdev,
> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                                             4);
> +        }
> +        qemu_put_be32(f, msi_addr_hi);
> +
> +        msi_data = pci_default_read_config(pdev,
> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +                2);
> +        qemu_put_be16(f, msi_data);
> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
> +        uint16_t offset;
> +
> +        /* save enable bit and maskall bit */
> +        offset = pci_default_read_config(pdev,
> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> +        qemu_put_be16(f, offset);
> +        msix_save(pdev, f);
> +    }
> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> +    qemu_put_be16(f, pci_cmd);
> +}
> +
> +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> +{
> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint32_t interrupt_type;
> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> +    uint16_t pci_cmd;
> +    bool msi_64bit;
> +    int i, ret;
> +
> +    /* retore pci bar configuration */
> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> +                        pci_cmd & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        uint32_t bar = qemu_get_be32(f);
> +
> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> +    }
> +
> +    ret = vfio_bars_validate(vdev);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    interrupt_type = qemu_get_be32(f);
> +
> +    if (interrupt_type == VFIO_INT_MSI) {
> +        /* restore msi configuration */
> +        msi_flags = pci_default_read_config(pdev,
> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> +
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                              msi_flags & (!PCI_MSI_FLAGS_ENABLE), 2);
> +
> +        msi_addr_lo = qemu_get_be32(f);
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
> +                              msi_addr_lo, 4);
> +
> +        msi_addr_hi = qemu_get_be32(f);
> +        if (msi_64bit) {
> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                                  msi_addr_hi, 4);
> +        }
> +        msi_data = qemu_get_be16(f);
> +        vfio_pci_write_config(pdev,
> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +                msi_data, 2);
> +
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
> +    } else if (interrupt_type == VFIO_INT_MSIX) {
> +        uint16_t offset = qemu_get_be16(f);
> +
> +        /* load enable bit and maskall bit */
> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> +                              offset, 2);
> +        msix_load(pdev, f);
> +    }
> +    pci_cmd = qemu_get_be16(f);
> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
> +    return 0;
> +}
> +
>  static VFIODeviceOps vfio_pci_ops = {
>      .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
>      .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
>      .vfio_eoi = vfio_intx_eoi,
>      .vfio_get_object = vfio_pci_get_object,
> +    .vfio_save_config = vfio_pci_save_config,
> +    .vfio_load_config = vfio_pci_load_config,
>  };
>  
>  int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 74261feaeac9..d69a7f3ae31e 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -120,6 +120,8 @@ struct VFIODeviceOps {
>      int (*vfio_hot_reset_multi)(VFIODevice *vdev);
>      void (*vfio_eoi)(VFIODevice *vdev);
>      Object *(*vfio_get_object)(VFIODevice *vdev);
> +    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
> +    int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
>  };
>  
>  typedef struct VFIOGroup {
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH QEMU v23 05/18] vfio: Add migration region initialization and finalize function
  2020-05-20 18:24 ` [PATCH QEMU v23 05/18] vfio: Add migration region initialization and finalize function Kirti Wankhede
@ 2020-05-21  9:59   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 39+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-21  9:59 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> - Migration functions are implemented for VFIO_DEVICE_TYPE_PCI device in this
>   patch series.
> - VFIO device supports migration or not is decided based of migration region
>   query. If migration region query is successful and migration region
>   initialization is successful then migration is supported else migration is
>   blocked.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>

For migration:


Acked-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/migration.c           | 138 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |   3 +
>  include/hw/vfio/vfio-common.h |   9 +++
>  4 files changed, 151 insertions(+), 1 deletion(-)
>  create mode 100644 hw/vfio/migration.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index 9bb1c09e8477..8b296c889ed9 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -1,4 +1,4 @@
> -obj-y += common.o spapr.o
> +obj-y += common.o spapr.o migration.o
>  obj-$(CONFIG_VFIO_PCI) += pci.o pci-quirks.o display.o
>  obj-$(CONFIG_VFIO_CCW) += ccw.o
>  obj-$(CONFIG_VFIO_PLATFORM) += platform.o
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> new file mode 100644
> index 000000000000..bf9384907ec0
> --- /dev/null
> +++ b/hw/vfio/migration.c
> @@ -0,0 +1,138 @@
> +/*
> + * Migration support for VFIO devices
> + *
> + * Copyright NVIDIA, Inc. 2020
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include <linux/vfio.h>
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "cpu.h"
> +#include "migration/migration.h"
> +#include "migration/qemu-file.h"
> +#include "migration/register.h"
> +#include "migration/blocker.h"
> +#include "migration/misc.h"
> +#include "qapi/error.h"
> +#include "exec/ramlist.h"
> +#include "exec/ram_addr.h"
> +#include "pci.h"
> +#include "trace.h"
> +
> +static void vfio_migration_region_exit(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    if (!migration) {
> +        return;
> +    }
> +
> +    if (migration->region.size) {
> +        vfio_region_exit(&migration->region);
> +        vfio_region_finalize(&migration->region);
> +    }
> +}
> +
> +static int vfio_migration_region_init(VFIODevice *vbasedev, int index)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    Object *obj = NULL;
> +    int ret = -EINVAL;
> +
> +    if (!vbasedev->ops->vfio_get_object) {
> +        return ret;
> +    }
> +
> +    obj = vbasedev->ops->vfio_get_object(vbasedev);
> +    if (!obj) {
> +        return ret;
> +    }
> +
> +    ret = vfio_region_setup(obj, vbasedev, &migration->region, index,
> +                            "migration");
> +    if (ret) {
> +        error_report("%s: Failed to setup VFIO migration region %d: %s",
> +                     vbasedev->name, index, strerror(-ret));
> +        goto err;
> +    }
> +
> +    if (!migration->region.size) {
> +        ret = -EINVAL;
> +        error_report("%s: Invalid region size of VFIO migration region %d: %s",
> +                     vbasedev->name, index, strerror(-ret));
> +        goto err;
> +    }
> +
> +    return 0;
> +
> +err:
> +    vfio_migration_region_exit(vbasedev);
> +    return ret;
> +}
> +
> +static int vfio_migration_init(VFIODevice *vbasedev,
> +                               struct vfio_region_info *info)
> +{
> +    int ret;
> +
> +    vbasedev->migration = g_new0(VFIOMigration, 1);
> +
> +    ret = vfio_migration_region_init(vbasedev, info->index);
> +    if (ret) {
> +        error_report("%s: Failed to initialise migration region",
> +                     vbasedev->name);
> +        g_free(vbasedev->migration);
> +        vbasedev->migration = NULL;
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
> +/* ---------------------------------------------------------------------- */
> +
> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
> +{
> +    struct vfio_region_info *info;
> +    Error *local_err = NULL;
> +    int ret;
> +
> +    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
> +                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
> +    if (ret) {
> +        goto add_blocker;
> +    }
> +
> +    ret = vfio_migration_init(vbasedev, info);
> +    if (ret) {
> +        goto add_blocker;
> +    }
> +
> +    trace_vfio_migration_probe(vbasedev->name, info->index);
> +    return 0;
> +
> +add_blocker:
> +    error_setg(&vbasedev->migration_blocker,
> +               "VFIO device doesn't support migration");
> +    ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        error_free(vbasedev->migration_blocker);
> +    }
> +    return ret;
> +}
> +
> +void vfio_migration_finalize(VFIODevice *vbasedev)
> +{
> +    if (vbasedev->migration_blocker) {
> +        migrate_del_blocker(vbasedev->migration_blocker);
> +        error_free(vbasedev->migration_blocker);
> +    }
> +
> +    vfio_migration_region_exit(vbasedev);
> +    g_free(vbasedev->migration);
> +}
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 8cdc27946cb8..fd034ac53684 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -143,3 +143,6 @@ vfio_display_edid_link_up(void) ""
>  vfio_display_edid_link_down(void) ""
>  vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
>  vfio_display_edid_write_error(void) ""
> +
> +# migration.c
> +vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index d69a7f3ae31e..d4b268641173 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -57,6 +57,10 @@ typedef struct VFIORegion {
>      uint8_t nr; /* cache the region number for debug */
>  } VFIORegion;
>  
> +typedef struct VFIOMigration {
> +    VFIORegion region;
> +} VFIOMigration;
> +
>  typedef struct VFIOAddressSpace {
>      AddressSpace *as;
>      QLIST_HEAD(, VFIOContainer) containers;
> @@ -113,6 +117,8 @@ typedef struct VFIODevice {
>      unsigned int num_irqs;
>      unsigned int num_regions;
>      unsigned int flags;
> +    VFIOMigration *migration;
> +    Error *migration_blocker;
>  } VFIODevice;
>  
>  struct VFIODeviceOps {
> @@ -204,4 +210,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
>  int vfio_spapr_remove_window(VFIOContainer *container,
>                               hwaddr offset_within_address_space);
>  
> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
> +void vfio_migration_finalize(VFIODevice *vbasedev);
> +
>  #endif /* HW_VFIO_VFIO_COMMON_H */
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH QEMU v23 06/18] vfio: Add VM state change handler to know state of VM
  2020-05-20 18:24 ` [PATCH QEMU v23 06/18] vfio: Add VM state change handler to know state of VM Kirti Wankhede
@ 2020-05-21 11:19   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 39+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-21 11:19 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> VM state change handler gets called on change in VM's state. This is used to set
> VFIO device state to _RUNNING.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  hw/vfio/migration.c           | 87 +++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |  2 +
>  include/hw/vfio/vfio-common.h |  4 ++
>  3 files changed, 93 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index bf9384907ec0..e79b34003079 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -10,6 +10,7 @@
>  #include "qemu/osdep.h"
>  #include <linux/vfio.h>
>  
> +#include "sysemu/runstate.h"
>  #include "hw/vfio/vfio-common.h"
>  #include "cpu.h"
>  #include "migration/migration.h"
> @@ -74,6 +75,85 @@ err:
>      return ret;
>  }
>  
> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
> +                                    uint32_t value)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region;
> +    uint32_t device_state;
> +    int ret;
> +
> +    ret = pread(vbasedev->fd, &device_state, sizeof(device_state),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              device_state));
> +    if (ret < 0) {
> +        error_report("%s: Failed to read device state %d %s",
> +                     vbasedev->name, ret, strerror(errno));
> +        return ret;
> +    }
> +
> +    device_state = (device_state & mask) | value;
> +
> +    if (!VFIO_DEVICE_STATE_VALID(device_state)) {
> +        return -EINVAL;
> +    }
> +
> +    ret = pwrite(vbasedev->fd, &device_state, sizeof(device_state),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              device_state));
> +    if (ret < 0) {
> +        error_report("%s: Failed to set device state %d %s",
> +                     vbasedev->name, ret, strerror(errno));
> +
> +        ret = pread(vbasedev->fd, &device_state, sizeof(device_state),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                device_state));
> +        if (ret < 0) {
> +            error_report("%s: On failure, failed to read device state %d %s",
> +                    vbasedev->name, ret, strerror(errno));
> +            return ret;
> +        }
> +
> +        if (VFIO_DEVICE_STATE_IS_ERROR(device_state)) {
> +            error_report("%s: Device is in error state 0x%x",
> +                         vbasedev->name, device_state);
> +            return -EFAULT;
> +        }
> +    }
> +
> +    vbasedev->device_state = device_state;
> +    trace_vfio_migration_set_state(vbasedev->name, device_state);
> +    return 0;
> +}
> +
> +static void vfio_vmstate_change(void *opaque, int running, RunState state)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    if ((vbasedev->vm_running != running)) {
> +        int ret;
> +        uint32_t value = 0, mask = 0;
> +
> +        if (running) {
> +            value = VFIO_DEVICE_STATE_RUNNING;
> +            if (vbasedev->device_state & VFIO_DEVICE_STATE_RESUMING) {
> +                mask = ~VFIO_DEVICE_STATE_RESUMING;
> +            }
> +        } else {
> +            mask = ~VFIO_DEVICE_STATE_RUNNING;
> +        }
> +
> +        ret = vfio_migration_set_state(vbasedev, mask, value);
> +        if (ret) {
> +            error_report("%s: Failed to set device state 0x%x",
> +                         vbasedev->name, value & mask);
> +        }
> +        vbasedev->vm_running = running;
> +        trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
> +                                  value & mask);
> +    }
> +}
> +
>  static int vfio_migration_init(VFIODevice *vbasedev,
>                                 struct vfio_region_info *info)
>  {
> @@ -90,6 +170,9 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>          return ret;
>      }
>  
> +    vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
> +                                                          vbasedev);
> +
>      return 0;
>  }
>  
> @@ -128,6 +211,10 @@ add_blocker:
>  
>  void vfio_migration_finalize(VFIODevice *vbasedev)
>  {
> +    if (vbasedev->vm_state) {
> +        qemu_del_vm_change_state_handler(vbasedev->vm_state);
> +    }
> +
>      if (vbasedev->migration_blocker) {
>          migrate_del_blocker(vbasedev->migration_blocker);
>          error_free(vbasedev->migration_blocker);
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index fd034ac53684..14b0a86c0035 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -146,3 +146,5 @@ vfio_display_edid_write_error(void) ""
>  
>  # migration.c
>  vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
> +vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
> +vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index d4b268641173..3d18eb146b33 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -29,6 +29,7 @@
>  #ifdef CONFIG_LINUX
>  #include <linux/vfio.h>
>  #endif
> +#include "sysemu/sysemu.h"
>  
>  #define VFIO_MSG_PREFIX "vfio %s: "
>  
> @@ -119,6 +120,9 @@ typedef struct VFIODevice {
>      unsigned int flags;
>      VFIOMigration *migration;
>      Error *migration_blocker;
> +    VMChangeStateEntry *vm_state;
> +    uint32_t device_state;
> +    int vm_running;
>  } VFIODevice;
>  
>  struct VFIODeviceOps {
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH QEMU v23 07/18] vfio: Add migration state change notifier
  2020-05-20 18:24 ` [PATCH QEMU v23 07/18] vfio: Add migration state change notifier Kirti Wankhede
@ 2020-05-21 11:31   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 39+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-21 11:31 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> Added migration state change notifier to get notification on migration state
> change. These states are translated to VFIO device state and conveyed to vendor
> driver.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  hw/vfio/migration.c           | 30 ++++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |  5 +++--
>  include/hw/vfio/vfio-common.h |  1 +
>  3 files changed, 34 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index e79b34003079..c2f5564b51c3 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -154,6 +154,28 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state)
>      }
>  }
>  
> +static void vfio_migration_state_notifier(Notifier *notifier, void *data)
> +{
> +    MigrationState *s = data;
> +    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
> +    int ret;
> +
> +    trace_vfio_migration_state_notifier(vbasedev->name,
> +                                        MigrationStatus_str(s->state));
> +
> +    switch (s->state) {
> +    case MIGRATION_STATUS_CANCELLING:
> +    case MIGRATION_STATUS_CANCELLED:
> +    case MIGRATION_STATUS_FAILED:
> +        ret = vfio_migration_set_state(vbasedev,
> +                      ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING),
> +                      VFIO_DEVICE_STATE_RUNNING);
> +        if (ret) {
> +            error_report("%s: Failed to set state RUNNING", vbasedev->name);
> +        }
> +    }
> +}
> +
>  static int vfio_migration_init(VFIODevice *vbasedev,
>                                 struct vfio_region_info *info)
>  {
> @@ -173,6 +195,9 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>      vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>                                                            vbasedev);
>  
> +    vbasedev->migration_state.notify = vfio_migration_state_notifier;
> +    add_migration_state_change_notifier(&vbasedev->migration_state);
> +
>      return 0;
>  }
>  
> @@ -211,6 +236,11 @@ add_blocker:
>  
>  void vfio_migration_finalize(VFIODevice *vbasedev)
>  {
> +
> +    if (vbasedev->migration_state.notify) {
> +        remove_migration_state_change_notifier(&vbasedev->migration_state);
> +    }
> +
>      if (vbasedev->vm_state) {
>          qemu_del_vm_change_state_handler(vbasedev->vm_state);
>      }
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 14b0a86c0035..bd3d47b005cb 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -146,5 +146,6 @@ vfio_display_edid_write_error(void) ""
>  
>  # migration.c
>  vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
> -vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
> -vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> +vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
> +vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> +vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 3d18eb146b33..28f55f66d019 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -123,6 +123,7 @@ typedef struct VFIODevice {
>      VMChangeStateEntry *vm_state;
>      uint32_t device_state;
>      int vm_running;
> +    Notifier migration_state;
>  } VFIODevice;
>  
>  struct VFIODeviceOps {
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH QEMU v23 04/18] vfio: Add save and load functions for VFIO PCI devices
  2020-05-21  9:50   ` Dr. David Alan Gilbert
@ 2020-05-21 12:12     ` Kirti Wankhede
  2020-05-21 19:28       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 39+ messages in thread
From: Kirti Wankhede @ 2020-05-21 12:12 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini



On 5/21/2020 3:20 PM, Dr. David Alan Gilbert wrote:
> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
>> These functions save and restore PCI device specific data - config
>> space of PCI device.
>> Tested save and restore with MSI and MSIX type.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> 
> So I'm OK with this from the migration side, but I'd like Alex to check it from
> the PCI side of things.
> 
> I still think that you'd be better:
>    a) Using a VMStateDescription to encode the structure

Can we use VMStateDescription and SaveVMHandlers at a same time for 
migration?

>    b) or at least adding a flag at the end so you can add more data later
> 

Sure, I'm thinking of this option.

> Experience with every other device shows you're shooting yourself in the
> foot by hard coding the layout and not giving yourself a chance to
> expand it.
> 
> but for now,
> 
 >
 > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Thanks, this version of QEMU patches was posted without addressing 
comments on previous version so that others can test kernel interface 
for now. We need to freeze kernel interface and get those patches in 
v5.8 kernel.
I'll revisit all the remaining comments on v16 and v18 QEMU patches 
again and reiterate the patches. So this is not the final version.

Thanks,
Kirti

> 
>> ---
>>   hw/vfio/pci.c                 | 163 ++++++++++++++++++++++++++++++++++++++++++
>>   include/hw/vfio/vfio-common.h |   2 +
>>   2 files changed, 165 insertions(+)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 0514ba373d1c..94535f0e27cd 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -41,6 +41,7 @@
>>   #include "trace.h"
>>   #include "qapi/error.h"
>>   #include "migration/blocker.h"
>> +#include "migration/qemu-file.h"
>>   
>>   #define TYPE_VFIO_PCI "vfio-pci"
>>   #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
>> @@ -1632,6 +1633,50 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
>>       }
>>   }
>>   
>> +static int vfio_bar_validate(VFIOPCIDevice *vdev, int nr)
>> +{
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    VFIOBAR *bar = &vdev->bars[nr];
>> +    uint64_t addr;
>> +    uint32_t addr_lo, addr_hi = 0;
>> +
>> +    /* Skip unimplemented BARs and the upper half of 64bit BARS. */
>> +    if (!bar->size) {
>> +        return 0;
>> +    }
>> +
>> +    addr_lo = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + nr * 4, 4);
>> +
>> +    addr_lo &= (bar->ioport ? PCI_BASE_ADDRESS_IO_MASK :
>> +                              PCI_BASE_ADDRESS_MEM_MASK);
>> +    if (bar->type == PCI_BASE_ADDRESS_MEM_TYPE_64) {
>> +        addr_hi = pci_default_read_config(pdev,
>> +                                         PCI_BASE_ADDRESS_0 + (nr + 1) * 4, 4);
>> +    }
>> +
>> +    addr = ((uint64_t)addr_hi << 32) | addr_lo;
>> +
>> +    if (!QEMU_IS_ALIGNED(addr, bar->size)) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static int vfio_bars_validate(VFIOPCIDevice *vdev)
>> +{
>> +    int i, ret;
>> +
>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
>> +        ret = vfio_bar_validate(vdev, i);
>> +        if (ret) {
>> +            error_report("vfio: BAR address %d validation failed", i);
>> +            return ret;
>> +        }
>> +    }
>> +    return 0;
>> +}
>> +
>>   static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
>>   {
>>       VFIOBAR *bar = &vdev->bars[nr];
>> @@ -2414,11 +2459,129 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
>>       return OBJECT(vdev);
>>   }
>>   
>> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
>> +{
>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    uint16_t pci_cmd;
>> +    int i;
>> +
>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
>> +        uint32_t bar;
>> +
>> +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
>> +        qemu_put_be32(f, bar);
>> +    }
>> +
>> +    qemu_put_be32(f, vdev->interrupt);
>> +    if (vdev->interrupt == VFIO_INT_MSI) {
>> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
>> +        bool msi_64bit;
>> +
>> +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>> +                                            2);
>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
>> +
>> +        msi_addr_lo = pci_default_read_config(pdev,
>> +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
>> +        qemu_put_be32(f, msi_addr_lo);
>> +
>> +        if (msi_64bit) {
>> +            msi_addr_hi = pci_default_read_config(pdev,
>> +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
>> +                                             4);
>> +        }
>> +        qemu_put_be32(f, msi_addr_hi);
>> +
>> +        msi_data = pci_default_read_config(pdev,
>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
>> +                2);
>> +        qemu_put_be16(f, msi_data);
>> +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
>> +        uint16_t offset;
>> +
>> +        /* save enable bit and maskall bit */
>> +        offset = pci_default_read_config(pdev,
>> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
>> +        qemu_put_be16(f, offset);
>> +        msix_save(pdev, f);
>> +    }
>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
>> +    qemu_put_be16(f, pci_cmd);
>> +}
>> +
>> +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
>> +{
>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    uint32_t interrupt_type;
>> +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
>> +    uint16_t pci_cmd;
>> +    bool msi_64bit;
>> +    int i, ret;
>> +
>> +    /* retore pci bar configuration */
>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
>> +    vfio_pci_write_config(pdev, PCI_COMMAND,
>> +                        pci_cmd & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
>> +        uint32_t bar = qemu_get_be32(f);
>> +
>> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
>> +    }
>> +
>> +    ret = vfio_bars_validate(vdev);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    interrupt_type = qemu_get_be32(f);
>> +
>> +    if (interrupt_type == VFIO_INT_MSI) {
>> +        /* restore msi configuration */
>> +        msi_flags = pci_default_read_config(pdev,
>> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
>> +
>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>> +                              msi_flags & (!PCI_MSI_FLAGS_ENABLE), 2);
>> +
>> +        msi_addr_lo = qemu_get_be32(f);
>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
>> +                              msi_addr_lo, 4);
>> +
>> +        msi_addr_hi = qemu_get_be32(f);
>> +        if (msi_64bit) {
>> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
>> +                                  msi_addr_hi, 4);
>> +        }
>> +        msi_data = qemu_get_be16(f);
>> +        vfio_pci_write_config(pdev,
>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
>> +                msi_data, 2);
>> +
>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
>> +    } else if (interrupt_type == VFIO_INT_MSIX) {
>> +        uint16_t offset = qemu_get_be16(f);
>> +
>> +        /* load enable bit and maskall bit */
>> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
>> +                              offset, 2);
>> +        msix_load(pdev, f);
>> +    }
>> +    pci_cmd = qemu_get_be16(f);
>> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
>> +    return 0;
>> +}
>> +
>>   static VFIODeviceOps vfio_pci_ops = {
>>       .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
>>       .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
>>       .vfio_eoi = vfio_intx_eoi,
>>       .vfio_get_object = vfio_pci_get_object,
>> +    .vfio_save_config = vfio_pci_save_config,
>> +    .vfio_load_config = vfio_pci_load_config,
>>   };
>>   
>>   int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 74261feaeac9..d69a7f3ae31e 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -120,6 +120,8 @@ struct VFIODeviceOps {
>>       int (*vfio_hot_reset_multi)(VFIODevice *vdev);
>>       void (*vfio_eoi)(VFIODevice *vdev);
>>       Object *(*vfio_get_object)(VFIODevice *vdev);
>> +    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
>> +    int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
>>   };
>>   
>>   typedef struct VFIOGroup {
>> -- 
>> 2.7.0
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH QEMU v23 08/18] vfio: Register SaveVMHandlers for VFIO device
  2020-05-20 18:24 ` [PATCH QEMU v23 08/18] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
@ 2020-05-21 14:18   ` Dr. David Alan Gilbert
  2020-05-21 18:00     ` Kirti Wankhede
  0 siblings, 1 reply; 39+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-21 14:18 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> Define flags to be used as delimeter in migration file stream.
> Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
> region from these functions at source during saving or pre-copy phase.
> Set VFIO device state depending on VM's state. During live migration, VM is
> running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
> device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c  | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events |  2 ++
>  2 files changed, 75 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index c2f5564b51c3..773c8d16b1c1 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -8,12 +8,14 @@
>   */
>  
>  #include "qemu/osdep.h"
> +#include "qemu/main-loop.h"
>  #include <linux/vfio.h>
>  
>  #include "sysemu/runstate.h"
>  #include "hw/vfio/vfio-common.h"
>  #include "cpu.h"
>  #include "migration/migration.h"
> +#include "migration/vmstate.h"
>  #include "migration/qemu-file.h"
>  #include "migration/register.h"
>  #include "migration/blocker.h"
> @@ -24,6 +26,17 @@
>  #include "pci.h"
>  #include "trace.h"
>  
> +/*
> + * Flags used as delimiter:
> + * 0xffffffff => MSB 32-bit all 1s
> + * 0xef10     => emulated (virtual) function IO
> + * 0x0000     => 16-bits reserved for flags
> + */
> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> +
>  static void vfio_migration_region_exit(VFIODevice *vbasedev)
>  {
>      VFIOMigration *migration = vbasedev->migration;
> @@ -126,6 +139,64 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>      return 0;
>  }
>  
> +/* ---------------------------------------------------------------------- */
> +
> +static int vfio_save_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    trace_vfio_save_setup(vbasedev->name);
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> +
> +    if (migration->region.mmaps) {
> +        qemu_mutex_lock_iothread();
> +        ret = vfio_region_mmap(&migration->region);
> +        qemu_mutex_unlock_iothread();
> +        if (ret) {
> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> +                         vbasedev->name, migration->region.index,
> +                         strerror(-ret));
> +            return ret;
> +        }
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, ~0, VFIO_DEVICE_STATE_SAVING);
> +    if (ret) {
> +        error_report("%s: Failed to set state SAVING", vbasedev->name);
> +        return ret;
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
> +static void vfio_save_cleanup(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    if (migration->region.mmaps) {
> +        vfio_region_unmap(&migration->region);
> +    }
> +    trace_vfio_save_cleanup(vbasedev->name);
> +}
> +
> +static SaveVMHandlers savevm_vfio_handlers = {
> +    .save_setup = vfio_save_setup,
> +    .save_cleanup = vfio_save_cleanup,
> +};
> +
> +/* ---------------------------------------------------------------------- */
> +
>  static void vfio_vmstate_change(void *opaque, int running, RunState state)
>  {
>      VFIODevice *vbasedev = opaque;
> @@ -192,6 +263,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>          return ret;
>      }
>  
> +    register_savevm_live("vfio", VMSTATE_INSTANCE_ID_ANY, 1,
> +                         &savevm_vfio_handlers, vbasedev);

Hi,
  This is still the only bit which worries me, and I saw your note
saying you'd tested it; to calm my nerves, can you run with the
'qemu_loadvm_state_section_startfull' trace enabled with 2 devices
and show me the output and qemu command line?
I'm trying to figure out how they end up represented in the stream.

Dave


>      vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>                                                            vbasedev);
>  
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index bd3d47b005cb..86c18def016e 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -149,3 +149,5 @@ vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
>  vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
>  vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>  vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
> +vfio_save_setup(const char *name) " (%s)"
> +vfio_save_cleanup(const char *name) " (%s)"
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH QEMU v23 09/18] vfio: Add save state functions to SaveVMHandlers
  2020-05-20 18:24 ` [PATCH QEMU v23 09/18] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
@ 2020-05-21 15:37   ` Dr. David Alan Gilbert
  2020-05-21 20:43   ` Peter Xu
  1 sibling, 0 replies; 39+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-21 15:37 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
> functions. These functions handles pre-copy and stop-and-copy phase.
> 
> In _SAVING|_RUNNING device state or pre-copy phase:
> - read pending_bytes. If pending_bytes > 0, go through below steps.
> - read data_offset - indicates kernel driver to write data to staging
>   buffer.
> - read data_size - amount of data in bytes written by vendor driver in
>   migration region.
> - read data_size bytes of data from data_offset in the migration region.
> - Write data packet to file stream as below:
> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> VFIO_MIG_FLAG_END_OF_STATE }
> 
> In _SAVING device state or stop-and-copy phase
> a. read config space of device and save to migration file stream. This
>    doesn't need to be from vendor driver. Any other special config state
>    from driver can be saved as data in following iteration.
> b. read pending_bytes. If pending_bytes > 0, go through below steps.
> c. read data_offset - indicates kernel driver to write data to staging
>    buffer.
> d. read data_size - amount of data in bytes written by vendor driver in
>    migration region.
> e. read data_size bytes of data from data_offset in the migration region.
> f. Write data packet as below:
>    {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> g. iterate through steps b to f while (pending_bytes > 0)
> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
> 
> When data region is mapped, its user's responsibility to read data from
> data_offset of data_size before moving to next steps.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c           | 254 +++++++++++++++++++++++++++++++++++++++++-
>  hw/vfio/trace-events          |   6 +
>  include/hw/vfio/vfio-common.h |   1 +
>  3 files changed, 260 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 773c8d16b1c1..d90bd2296afd 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -139,6 +139,137 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>      return 0;
>  }
>  
> +static void *find_data_region(VFIORegion *region,
> +                              uint64_t data_offset,
> +                              uint64_t data_size)
> +{
> +    void *ptr = NULL;
> +    int i;
> +
> +    for (i = 0; i < region->nr_mmaps; i++) {
> +        if ((data_offset >= region->mmaps[i].offset) &&

that checks the bit we want to map starts at or after this region

> +            (data_offset < region->mmaps[i].offset + region->mmaps[i].size) &&

and the bit we want to map starts before the end of this region

> +            (data_size <= region->mmaps[i].size)) {

Shouldn't that check take into account the offset?  If 
data_offset-mmaps[i].offset is almost mmaps[i].size then we would be
accessing data_size after the end of the region?

> +            ptr = region->mmaps[i].mmap + (data_offset -
> +                                           region->mmaps[i].offset);
> +            break;
> +        }
> +    }
> +    return ptr;
> +}
> +
> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)

          ^^^ do you need to make this int64_t or ssize_t - internally
to this function it's uint64_t but you're also returning errors.
(and then fix up the trace_ and callers below)

> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region;
> +    uint64_t data_offset = 0, data_size = 0;
> +    int ret;
> +
> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data_offset));
> +    if (ret != sizeof(data_offset)) {
> +        error_report("%s: Failed to get migration buffer data offset %d",
> +                     vbasedev->name, ret);
> +        return -EINVAL;
> +    }
> +
> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data_size));
> +    if (ret != sizeof(data_size)) {
> +        error_report("%s: Failed to get migration buffer data size %d",
> +                     vbasedev->name, ret);
> +        return -EINVAL;
> +    }
> +
> +    if (data_size > 0) {
> +        void *buf = NULL;
> +        bool buffer_mmaped;
> +
> +        if (region->mmaps) {
> +            buf = find_data_region(region, data_offset, data_size);
> +        }
> +
> +        buffer_mmaped = (buf != NULL);
> +
> +        if (!buffer_mmaped) {
> +            buf = g_try_malloc(data_size);
> +            if (!buf) {
> +                error_report("%s: Error allocating buffer ", __func__);
> +                return -ENOMEM;
> +            }
> +
> +            ret = pread(vbasedev->fd, buf, data_size,
> +                        region->fd_offset + data_offset);
> +            if (ret != data_size) {
> +                error_report("%s: Failed to get migration data %d",
> +                             vbasedev->name, ret);
> +                g_free(buf);
> +                return -EINVAL;
> +            }
> +        }
> +
> +        qemu_put_be64(f, data_size);
> +        qemu_put_buffer(f, buf, data_size);
> +
> +        if (!buffer_mmaped) {
> +            g_free(buf);
> +        }
> +    } else {
> +        qemu_put_be64(f, data_size);
> +    }
> +
> +    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
> +                           migration->pending_bytes);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return data_size;
> +}
> +
> +static int vfio_update_pending(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region;
> +    uint64_t pending_bytes = 0;
> +    int ret;
> +
> +    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             pending_bytes));
> +    if ((ret < 0) || (ret != sizeof(pending_bytes))) {

You read values from the vfio buffer in a few places and don't check the
sizeof() in ret; this either needs wrapping in a
function/macro/something or doing consistently.

> +        error_report("%s: Failed to get pending bytes %d",
> +                     vbasedev->name, ret);
> +        migration->pending_bytes = 0;
> +        return (ret < 0) ? ret : -EINVAL;
> +    }
> +
> +    migration->pending_bytes = pending_bytes;
> +    trace_vfio_update_pending(vbasedev->name, pending_bytes);
> +    return 0;
> +}
> +
> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
> +
> +    if (vbasedev->ops && vbasedev->ops->vfio_save_config) {
> +        vbasedev->ops->vfio_save_config(vbasedev, f);
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    trace_vfio_save_device_config_state(vbasedev->name);
> +
> +    return qemu_file_get_error(f);
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -157,7 +288,7 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>          qemu_mutex_unlock_iothread();
>          if (ret) {
>              error_report("%s: Failed to mmap VFIO migration region %d: %s",
> -                         vbasedev->name, migration->region.index,
> +                         vbasedev->name, migration->region.nr,
>                           strerror(-ret));
>              return ret;
>          }
> @@ -190,9 +321,130 @@ static void vfio_save_cleanup(void *opaque)
>      trace_vfio_save_cleanup(vbasedev->name);
>  }
>  
> +static void vfio_save_pending(QEMUFile *f, void *opaque,
> +                              uint64_t threshold_size,
> +                              uint64_t *res_precopy_only,
> +                              uint64_t *res_compatible,
> +                              uint64_t *res_postcopy_only)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    ret = vfio_update_pending(vbasedev);
> +    if (ret) {
> +        return;
> +    }
> +
> +    *res_precopy_only += migration->pending_bytes;
> +
> +    trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
> +                            *res_postcopy_only, *res_compatible);
> +}
> +
> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret, data_size;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +
> +    if (migration->pending_bytes == 0) {
> +        ret = vfio_update_pending(vbasedev);
> +        if (ret) {
> +            return ret;
> +        }
> +
> +        if (migration->pending_bytes == 0) {
> +            /* indicates data finished, goto complete phase */

Is that right that you've sent a DEV_DATA_STATE but not an END_OF_STATE?

> +            return 1;
> +        }
> +    }
> +
> +    data_size = vfio_save_buffer(f, vbasedev);
> +
> +    if (data_size < 0) {
> +        error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
> +                     strerror(errno));
> +        return data_size;
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    trace_vfio_save_iterate(vbasedev->name, data_size);
> +
> +    return 0;
> +}
> +
> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_RUNNING,
> +                                   VFIO_DEVICE_STATE_SAVING);
> +    if (ret) {
> +        error_report("%s: Failed to set state STOP and SAVING",
> +                     vbasedev->name);
> +        return ret;
> +    }
> +
> +    ret = vfio_save_device_config_state(f, opaque);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_update_pending(vbasedev);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    while (migration->pending_bytes > 0) {
> +        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +        ret = vfio_save_buffer(f, vbasedev);
> +        if (ret < 0) {
> +            error_report("%s: Failed to save buffer", vbasedev->name);
> +            return ret;
> +        } else if (ret == 0) {
> +            break;
> +        }
> +
> +        ret = vfio_update_pending(vbasedev);
> +        if (ret) {
> +            return ret;
> +        }
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_SAVING, 0);
> +    if (ret) {
> +        error_report("%s: Failed to set state STOPPED", vbasedev->name);
> +        return ret;
> +    }
> +
> +    trace_vfio_save_complete_precopy(vbasedev->name);
> +    return ret;
> +}
> +
>  static SaveVMHandlers savevm_vfio_handlers = {
>      .save_setup = vfio_save_setup,
>      .save_cleanup = vfio_save_cleanup,
> +    .save_live_pending = vfio_save_pending,
> +    .save_live_iterate = vfio_save_iterate,
> +    .save_live_complete_precopy = vfio_save_complete_precopy,
>  };
>  
>  /* ---------------------------------------------------------------------- */
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 86c18def016e..9a1c5e17d97f 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -151,3 +151,9 @@ vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t
>  vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
>  vfio_save_setup(const char *name) " (%s)"
>  vfio_save_cleanup(const char *name) " (%s)"
> +vfio_save_buffer(const char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
> +vfio_update_pending(const char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
> +vfio_save_device_config_state(const char *name) " (%s)"
> +vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
> +vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
> +vfio_save_complete_precopy(const char *name) " (%s)"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 28f55f66d019..c78033e4149d 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -60,6 +60,7 @@ typedef struct VFIORegion {
>  
>  typedef struct VFIOMigration {
>      VFIORegion region;
> +    uint64_t pending_bytes;
>  } VFIOMigration;
>  
>  typedef struct VFIOAddressSpace {
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH QEMU v23 10/18] vfio: Add load state functions to SaveVMHandlers
  2020-05-20 18:24 ` [PATCH QEMU v23 10/18] vfio: Add load " Kirti Wankhede
@ 2020-05-21 15:53   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 39+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-21 15:53 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> Sequence  during _RESUMING device state:
> While data for this device is available, repeat below steps:
> a. read data_offset from where user application should write data.
> b. write data of data_size to migration region from data_offset.
> c. write data_size which indicates vendor driver that data is written in
>    staging buffer.
> 
> For user, data is opaque. User should write data in the same order as
> received.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c  | 180 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events |   3 +
>  2 files changed, 183 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index d90bd2296afd..b9bbe38e539c 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -270,6 +270,33 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
>      return qemu_file_get_error(f);
>  }
>  
> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    uint64_t data;
> +
> +    if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
> +        int ret;
> +
> +        ret = vbasedev->ops->vfio_load_config(vbasedev, f);
> +        if (ret) {
> +            error_report("%s: Failed to load device config space",
> +                         vbasedev->name);
> +            return ret;
> +        }
> +    }
> +
> +    data = qemu_get_be64(f);
> +    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
> +        error_report("%s: Failed loading device config space, "
> +                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
> +        return -EINVAL;
> +    }
> +
> +    trace_vfio_load_device_config_state(vbasedev->name);
> +    return qemu_file_get_error(f);
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -439,12 +466,165 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>      return ret;
>  }
>  
> +static int vfio_load_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret = 0;
> +
> +    if (migration->region.mmaps) {
> +        ret = vfio_region_mmap(&migration->region);
> +        if (ret) {
> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> +                         vbasedev->name, migration->region.nr,
> +                         strerror(-ret));
> +            return ret;
> +        }
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_MASK,
> +                                   VFIO_DEVICE_STATE_RESUMING);
> +    if (ret) {
> +        error_report("%s: Failed to set state RESUMING", vbasedev->name);
> +    }
> +    return ret;
> +}
> +
> +static int vfio_load_cleanup(void *opaque)
> +{
> +    vfio_save_cleanup(opaque);
> +    return 0;
> +}
> +
> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret = 0;
> +    uint64_t data, data_size;
> +
> +    data = qemu_get_be64(f);
> +    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
> +
> +        trace_vfio_load_state(vbasedev->name, data);
> +
> +        switch (data) {
> +        case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
> +        {
> +            ret = vfio_load_device_config_state(f, opaque);
> +            if (ret) {
> +                return ret;
> +            }
> +            break;
> +        }
> +        case VFIO_MIG_FLAG_DEV_SETUP_STATE:
> +        {
> +            data = qemu_get_be64(f);
> +            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
> +                return ret;
> +            } else {
> +                error_report("%s: SETUP STATE: EOS not found 0x%"PRIx64,
> +                             vbasedev->name, data);
> +                return -EINVAL;
> +            }
> +            break;
> +        }
> +        case VFIO_MIG_FLAG_DEV_DATA_STATE:
> +        {
> +            VFIORegion *region = &migration->region;
> +            void *buf = NULL;
> +            bool buffer_mmaped = false;
> +            uint64_t data_offset = 0;
> +
> +            data_size = qemu_get_be64(f);
> +            if (data_size == 0) {
> +                break;
> +            }
> +
> +            ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> +                        region->fd_offset +
> +                        offsetof(struct vfio_device_migration_info,
> +                        data_offset));
> +            if (ret != sizeof(data_offset)) {
> +                error_report("%s:Failed to get migration buffer data offset %d",
> +                             vbasedev->name, ret);
> +                return -EINVAL;
> +            }
> +
> +            if (region->mmaps) {
> +                buf = find_data_region(region, data_offset, data_size);
> +            }
> +
> +            buffer_mmaped = (buf != NULL) ? true : false;
> +
> +            if (!buffer_mmaped) {
> +                buf = g_try_malloc0(data_size);
> +                if (!buf) {
> +                    error_report("%s: Error allocating buffer ", __func__);
> +                    return -ENOMEM;
> +                }
> +            }
> +
> +            qemu_get_buffer(f, buf, data_size);
> +
> +            if (!buffer_mmaped) {
> +                ret = pwrite(vbasedev->fd, buf, data_size,
> +                             region->fd_offset + data_offset);
> +                g_free(buf);
> +
> +                if (ret != data_size) {
> +                    error_report("%s: Failed to set migration buffer %d",
> +                                 vbasedev->name, ret);
> +                    return -EINVAL;
> +                }
> +            }
> +
> +            ret = pwrite(vbasedev->fd, &data_size, sizeof(data_size),
> +                         region->fd_offset +
> +                       offsetof(struct vfio_device_migration_info, data_size));
> +            if (ret != sizeof(data_size)) {
> +                error_report("%s: Failed to set migration buffer data size %d",
> +                             vbasedev->name, ret);
> +                if (!buffer_mmaped) {
> +                    g_free(buf);
> +                }
> +                return -EINVAL;
> +            }
> +
> +            trace_vfio_load_state_device_data(vbasedev->name, data_offset,
> +                                              data_size);
> +            break;
> +        }
> +
> +        default:
> +            error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data);
> +            return -EINVAL;
> +        }
> +
> +        ret = qemu_file_get_error(f);
> +        if (ret) {
> +            return ret;
> +        }

You can probably omit this check and just do the one below; other than
that:


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> +        data = qemu_get_be64(f);
> +        ret = qemu_file_get_error(f);
> +        if (ret) {
> +            return ret;
> +        }
> +    }
> +
> +    return ret;
> +}
> +
>  static SaveVMHandlers savevm_vfio_handlers = {
>      .save_setup = vfio_save_setup,
>      .save_cleanup = vfio_save_cleanup,
>      .save_live_pending = vfio_save_pending,
>      .save_live_iterate = vfio_save_iterate,
>      .save_live_complete_precopy = vfio_save_complete_precopy,
> +    .load_setup = vfio_load_setup,
> +    .load_cleanup = vfio_load_cleanup,
> +    .load_state = vfio_load_state,
>  };
>  
>  /* ---------------------------------------------------------------------- */
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 9a1c5e17d97f..4a4bd3ba9a2a 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -157,3 +157,6 @@ vfio_save_device_config_state(const char *name) " (%s)"
>  vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
>  vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
>  vfio_save_complete_precopy(const char *name) " (%s)"
> +vfio_load_device_config_state(const char *name) " (%s)"
> +vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
> +vfio_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH QEMU v23 11/18] iommu: add callback to get address limit IOMMU supports
  2020-05-20 18:24 ` [PATCH QEMU v23 11/18] iommu: add callback to get address limit IOMMU supports Kirti Wankhede
@ 2020-05-21 16:13   ` Peter Xu
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Xu @ 2020-05-21 16:13 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

On Wed, May 20, 2020 at 11:54:41PM +0530, Kirti Wankhede wrote:
> +hwaddr memory_region_iommu_get_address_limit(IOMMUMemoryRegion *iommu_mr)
> +{
> +    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr);
> +
> +    if (imrc->get_address_limit) {
> +        return imrc->get_address_limit(iommu_mr);
> +    }
> +
> +    return 0;

LGTM, besides.. Would it be better to return (hwaddr)-1 as default here?
Because if an IOMMU type didn't provide this information, IMHO we'd better
cover the whole possible address range.

Thanks,

> +}

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH QEMU v23 12/18] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled
  2020-05-20 18:24 ` [PATCH QEMU v23 12/18] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled Kirti Wankhede
@ 2020-05-21 16:20   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 39+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-21 16:20 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>

can you add a more detailed commit message here please; I don't quite
get the purpose; Alex previously speculated this is about adding it to
dirty tracking; but please explain it in the commit message.

Dave

> ---
>  memory.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/memory.c b/memory.c
> index a6c69e1391ba..27f22f1a9974 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1788,7 +1788,7 @@ bool memory_region_is_ram_device(MemoryRegion *mr)
>  uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
>  {
>      uint8_t mask = mr->dirty_log_mask;
> -    if (global_dirty_log && mr->ram_block) {
> +    if (global_dirty_log && (mr->ram_block || memory_region_is_iommu(mr))) {
>          mask |= (1 << DIRTY_MEMORY_MIGRATION);
>      }
>      return mask;
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH QEMU v23 14/18] vfio: Add function to start and stop dirty pages tracking
  2020-05-20 18:24 ` [PATCH QEMU v23 14/18] vfio: Add function to start and stop dirty pages tracking Kirti Wankhede
@ 2020-05-21 16:50   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 39+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-21 16:50 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> Call VFIO_IOMMU_DIRTY_PAGES ioctl to start and stop dirty pages tracking
> for VFIO devices.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>

This does seem to match my reading of your 4/8 ioctl sefinition patch,
so:

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

I must admit I don't see why 'argsz' is needed.

Dave

> ---
>  hw/vfio/migration.c | 36 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index b9bbe38e539c..7729c90782bd 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -9,6 +9,7 @@
>  
>  #include "qemu/osdep.h"
>  #include "qemu/main-loop.h"
> +#include <sys/ioctl.h>
>  #include <linux/vfio.h>
>  
>  #include "sysemu/runstate.h"
> @@ -297,6 +298,32 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>      return qemu_file_get_error(f);
>  }
>  
> +static int vfio_start_dirty_page_tracking(VFIODevice *vbasedev, bool start)
> +{
> +    int ret;
> +    VFIOContainer *container = vbasedev->group->container;
> +    struct vfio_iommu_type1_dirty_bitmap dirty = {
> +        .argsz = sizeof(dirty),
> +    };
> +
> +    if (start) {
> +        if (vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) {
> +            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
> +        } else {
> +            return -EINVAL;
> +        }
> +    } else {
> +            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
> +    }
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
> +    if (ret) {
> +        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
> +                     dirty.flags, errno);
> +    }
> +    return ret;
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -327,6 +354,11 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>          return ret;
>      }
>  
> +    ret = vfio_start_dirty_page_tracking(vbasedev, true);
> +    if (ret) {
> +        return ret;
> +    }
> +
>      qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>  
>      ret = qemu_file_get_error(f);
> @@ -342,6 +374,8 @@ static void vfio_save_cleanup(void *opaque)
>      VFIODevice *vbasedev = opaque;
>      VFIOMigration *migration = vbasedev->migration;
>  
> +    vfio_start_dirty_page_tracking(vbasedev, false);
> +
>      if (migration->region.mmaps) {
>          vfio_region_unmap(&migration->region);
>      }
> @@ -676,6 +710,8 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>          if (ret) {
>              error_report("%s: Failed to set state RUNNING", vbasedev->name);
>          }
> +
> +        vfio_start_dirty_page_tracking(vbasedev, false);
>      }
>  }
>  
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH QEMU v23 08/18] vfio: Register SaveVMHandlers for VFIO device
  2020-05-21 14:18   ` Dr. David Alan Gilbert
@ 2020-05-21 18:00     ` Kirti Wankhede
  2020-05-21 19:35       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 39+ messages in thread
From: Kirti Wankhede @ 2020-05-21 18:00 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, eskultet, ziye.yang, armbru, mlevitsk,
	pasic, felipe, Ken.Xue, kevin.tian, yan.y.zhao, alex.williamson,
	changpeng.liu, quintela, zhi.a.wang, jonathan.davies, pbonzini



On 5/21/2020 7:48 PM, Dr. David Alan Gilbert wrote:
> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
>> Define flags to be used as delimeter in migration file stream.
>> Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
>> region from these functions at source during saving or pre-copy phase.
>> Set VFIO device state depending on VM's state. During live migration, VM is
>> running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
>> device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   hw/vfio/migration.c  | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/trace-events |  2 ++
>>   2 files changed, 75 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index c2f5564b51c3..773c8d16b1c1 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -8,12 +8,14 @@
>>    */
>>   
>>   #include "qemu/osdep.h"
>> +#include "qemu/main-loop.h"
>>   #include <linux/vfio.h>
>>   
>>   #include "sysemu/runstate.h"
>>   #include "hw/vfio/vfio-common.h"
>>   #include "cpu.h"
>>   #include "migration/migration.h"
>> +#include "migration/vmstate.h"
>>   #include "migration/qemu-file.h"
>>   #include "migration/register.h"
>>   #include "migration/blocker.h"
>> @@ -24,6 +26,17 @@
>>   #include "pci.h"
>>   #include "trace.h"
>>   
>> +/*
>> + * Flags used as delimiter:
>> + * 0xffffffff => MSB 32-bit all 1s
>> + * 0xef10     => emulated (virtual) function IO
>> + * 0x0000     => 16-bits reserved for flags
>> + */
>> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
>> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
>> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
>> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
>> +
>>   static void vfio_migration_region_exit(VFIODevice *vbasedev)
>>   {
>>       VFIOMigration *migration = vbasedev->migration;
>> @@ -126,6 +139,64 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>>       return 0;
>>   }
>>   
>> +/* ---------------------------------------------------------------------- */
>> +
>> +static int vfio_save_setup(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret;
>> +
>> +    trace_vfio_save_setup(vbasedev->name);
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>> +
>> +    if (migration->region.mmaps) {
>> +        qemu_mutex_lock_iothread();
>> +        ret = vfio_region_mmap(&migration->region);
>> +        qemu_mutex_unlock_iothread();
>> +        if (ret) {
>> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
>> +                         vbasedev->name, migration->region.index,
>> +                         strerror(-ret));
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    ret = vfio_migration_set_state(vbasedev, ~0, VFIO_DEVICE_STATE_SAVING);
>> +    if (ret) {
>> +        error_report("%s: Failed to set state SAVING", vbasedev->name);
>> +        return ret;
>> +    }
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static void vfio_save_cleanup(void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    if (migration->region.mmaps) {
>> +        vfio_region_unmap(&migration->region);
>> +    }
>> +    trace_vfio_save_cleanup(vbasedev->name);
>> +}
>> +
>> +static SaveVMHandlers savevm_vfio_handlers = {
>> +    .save_setup = vfio_save_setup,
>> +    .save_cleanup = vfio_save_cleanup,
>> +};
>> +
>> +/* ---------------------------------------------------------------------- */
>> +
>>   static void vfio_vmstate_change(void *opaque, int running, RunState state)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> @@ -192,6 +263,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>>           return ret;
>>       }
>>   
>> +    register_savevm_live("vfio", VMSTATE_INSTANCE_ID_ANY, 1,
>> +                         &savevm_vfio_handlers, vbasedev);
> 
> Hi,
>    This is still the only bit which worries me, and I saw your note
> saying you'd tested it; to calm my nerves, can you run with the
> 'qemu_loadvm_state_section_startfull' trace enabled with 2 devices
> and show me the output and qemu command line?
> I'm trying to figure out how they end up represented in the stream.
> 

Created mtty devices for source VM:
echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1233" > 
/sys/class/mdev_bus/mtty/mdev_supported_types/mtty-2/create
echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1234" > 
/sys/class/mdev_bus/mtty/mdev_supported_types/mtty-2/create

for destination VM:
echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1235" > 
/sys/class/mdev_bus/mtty/mdev_supported_types/mtty-2/create
echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1236" > 
/sys/class/mdev_bus/mtty/mdev_supported_types/mtty-2/create

Source qemu-cmdline:
/usr/libexec/qemu-kvm \
  -name guest=rhel75-mig,debug-threads=on \
  -machine pc-i440fx-3.1,accel=kvm,usb=off,dump-guest-core=off \
  -cpu SandyBridge,vme=on,hypervisor=on,arat=on,xsaveopt=on \
  -m 2048 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 \
  -uuid eefb718c-137c-d416-e573-dd74ecd3490d \
  -drive 
file=/home/vm/rhel-75.qcow2,format=qcow2,if=none,id=drive-ide0-0-0,cache=none 
\
  -device 
ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1,write-cache=on 
\
  -vnc 127.0.0.1:0 \
  -device rtl8139,netdev=net0,mac=52:54:b2:88:86:2a,bus=pci.0,addr=0x3 
-netdev tap,id=net0,script=/root/qemu-ifup,downscript=no \
  -device 
vfio-pci,sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1233 
\
  -device 
vfio-pci,sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1234 
\
  --trace events=/root/vfio_events \
  -monitor unix:/tmp/qmp_socket1,server,nowait \
  -serial stdio \
  -msg timestamp=on

Destination qemu-cmdline:
/usr/libexec/qemu-kvm \
  -name guest=rhel75-mig,debug-threads=on \
  -machine pc-i440fx-3.1,accel=kvm,usb=off,dump-guest-core=off \
  -cpu SandyBridge,vme=on,hypervisor=on,arat=on,xsaveopt=on \
  -m 2048 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 \
  -uuid eefb718c-137c-d416-e573-dd74ecd3490d \
  -drive 
file=/home/vm/rhel-75.qcow2,format=qcow2,if=none,id=drive-ide0-0-0,cache=none 
\
  -device 
ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1,write-cache=on 
\
  -vnc 127.0.0.1:1 \
  -device rtl8139,netdev=net0,mac=52:54:b2:88:86:2a,bus=pci.0,addr=0x3 
-netdev tap,id=net0,script=/root/qemu-ifup,downscript=no \
  -device 
vfio-pci,sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1235 
\
  -device 
vfio-pci,sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1236 
\
  -incoming unix:/tmp/mig_socket \
  --trace events=/root/vfio_events \
  -monitor unix:/tmp/qmp_socket2,server,nowait \
  -serial stdio \
  -msg timestamp=on

Migrate:
echo "migrate_set_speed 0" | sudo nc -U /tmp/qmp_socket1
echo "migrate -d unix:/tmp/mig_socket" | sudo nc -U $/tmp/qmp_socket1

After migration, 'qemu_loadvm_state_section_startfull' traces:

qemu_loadvm_state_section_startfull 0.000 pid=1457 section_id=0x2 
idstr=b'ram' instance_id=0x0 version_id=0x4
qemu_loadvm_state_section_startfull 515.606 pid=1457 section_id=0x2e 
idstr=b'vfio' instance_id=0x0 version_id=0x1
qemu_loadvm_state_section_startfull 10.661 pid=1457 section_id=0x2f 
idstr=b'vfio' instance_id=0x1 version_id=0x1
qemu_loadvm_state_section_startfull 1120000.237 pid=1457 section_id=0x0 
idstr=b'timer' instance_id=0x0 version_id=0x2
qemu_loadvm_state_section_startfull 9.058 pid=1457 section_id=0x4 
idstr=b'cpu_common' instance_id=0x0 version_id=0x1
qemu_loadvm_state_section_startfull 26.453 pid=1457 section_id=0x5 
idstr=b'cpu' instance_id=0x0 version_id=0xc
qemu_loadvm_state_section_startfull 105.173 pid=1457 section_id=0x6 
idstr=b'kvm-tpr-opt' instance_id=0x0 version_id=0x1
qemu_loadvm_state_section_startfull 940.028 pid=1457 section_id=0x7 
idstr=b'apic' instance_id=0x0 version_id=0x3
qemu_loadvm_state_section_startfull 69.939 pid=1457 section_id=0x8 
idstr=b'cpu_common' instance_id=0x1 version_id=0x1
qemu_loadvm_state_section_startfull 14.319 pid=1457 section_id=0x9 
idstr=b'cpu' instance_id=0x1 version_id=0xc
qemu_loadvm_state_section_startfull 102.986 pid=1457 section_id=0xa 
idstr=b'apic' instance_id=0x1 version_id=0x3
qemu_loadvm_state_section_startfull 107.910 pid=1457 section_id=0xb 
idstr=b'kvmclock' instance_id=0x0 version_id=0x1
qemu_loadvm_state_section_startfull 8.349 pid=1457 section_id=0xc 
idstr=b'fw_cfg' instance_id=0x0 version_id=0x2
qemu_loadvm_state_section_startfull 8.603 pid=1457 section_id=0xd 
idstr=b'PCIBUS' instance_id=0x0 version_id=0x1
qemu_loadvm_state_section_startfull 6.557 pid=1457 section_id=0xe 
idstr=b'0000:00:00.0/I440FX' instance_id=0x0 version_id=0x3
qemu_loadvm_state_section_startfull 633.727 pid=1457 section_id=0xf 
idstr=b'0000:00:01.0/PIIX3' instance_id=0x0 version_id=0x3
qemu_loadvm_state_section_startfull 14.907 pid=1457 section_id=0x10 
idstr=b'i8259' instance_id=0x0 version_id=0x1
qemu_loadvm_state_section_startfull 11.465 pid=1457 section_id=0x11 
idstr=b'i8259' instance_id=0x1 version_id=0x1
qemu_loadvm_state_section_startfull 5.663 pid=1457 section_id=0x12 
idstr=b'ioapic' instance_id=0x0 version_id=0x3
qemu_loadvm_state_section_startfull 11.787 pid=1457 section_id=0x13 
idstr=b'0000:00:02.0/vga' instance_id=0x0 version_id=0x2
qemu_loadvm_state_section_startfull 1718.618 pid=1457 section_id=0x14 
idstr=b'hpet' instance_id=0x0 version_id=0x2
qemu_loadvm_state_section_startfull 16.212 pid=1457 section_id=0x15 
idstr=b'mc146818rtc' instance_id=0x0 version_id=0x3
qemu_loadvm_state_section_startfull 9.946 pid=1457 section_id=0x16 
idstr=b'i8254' instance_id=0x0 version_id=0x3
qemu_loadvm_state_section_startfull 12.879 pid=1457 section_id=0x17 
idstr=b'pcspk' instance_id=0x0 version_id=0x1
qemu_loadvm_state_section_startfull 3.115 pid=1457 section_id=0x18 
idstr=b'dma' instance_id=0x0 version_id=0x1
qemu_loadvm_state_section_startfull 10.432 pid=1457 section_id=0x19 
idstr=b'dma' instance_id=0x1 version_id=0x1
qemu_loadvm_state_section_startfull 12.263 pid=1457 section_id=0x1a 
idstr=b'serial' instance_id=0x0 version_id=0x3
qemu_loadvm_state_section_startfull 7.299 pid=1457 section_id=0x1b 
idstr=b'parallel_isa' instance_id=0x0 version_id=0x1
qemu_loadvm_state_section_startfull 3.399 pid=1457 section_id=0x1c 
idstr=b'fdc' instance_id=0x0 version_id=0x2
qemu_loadvm_state_section_startfull 33.307 pid=1457 section_id=0x1d 
idstr=b'ps2kbd' instance_id=0x0 version_id=0x3
qemu_loadvm_state_section_startfull 6.961 pid=1457 section_id=0x1e 
idstr=b'ps2mouse' instance_id=0x0 version_id=0x2
qemu_loadvm_state_section_startfull 5.485 pid=1457 section_id=0x1f 
idstr=b'pckbd' instance_id=0x0 version_id=0x3
qemu_loadvm_state_section_startfull 3.984 pid=1457 section_id=0x20 
idstr=b'vmmouse' instance_id=0x0 version_id=0x0
qemu_loadvm_state_section_startfull 105.948 pid=1457 section_id=0x21 
idstr=b'port92' instance_id=0x0 version_id=0x1
qemu_loadvm_state_section_startfull 2.443 pid=1457 section_id=0x22 
idstr=b'0000:00:01.1/ide' instance_id=0x0 version_id=0x3
qemu_loadvm_state_section_startfull 1094.861 pid=1457 section_id=0x23 
idstr=b'i2c_bus' instance_id=0x0 version_id=0x1
qemu_loadvm_state_section_startfull 3.416 pid=1457 section_id=0x24 
idstr=b'0000:00:01.3/piix4_pm' instance_id=0x0 version_id=0x3
qemu_loadvm_state_section_startfull 2266.518 pid=1457 section_id=0x2d 
idstr=b'0000:00:03.0/rtl8139' instance_id=0x0 version_id=0x5
qemu_loadvm_state_section_startfull 1619.840 pid=1457 section_id=0x30 
idstr=b'acpi_build' instance_id=0x0 version_id=0x1
qemu_loadvm_state_section_startfull 4.200 pid=1457 section_id=0x31 
idstr=b'globalstate' instance_id=0x0 version_id=0x1

Thanks,
Kirti


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH QEMU v23 15/18] vfio: Add vfio_listener_log_sync to mark dirty pages
  2020-05-20 18:24 ` [PATCH QEMU v23 15/18] vfio: Add vfio_listener_log_sync to mark dirty pages Kirti Wankhede
@ 2020-05-21 18:52   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 39+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-21 18:52 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> vfio_listener_log_sync gets list of dirty pages from container using
> VFIO_IOMMU_GET_DIRTY_BITMAP ioctl and mark those pages dirty when all
> devices are stopped and saving state.
> Return early for the RAM block section of mapped MMIO region.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/common.c     | 184 +++++++++++++++++++++++++++++++++++++++++++++++++--
>  hw/vfio/trace-events |   1 +
>  2 files changed, 180 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index e0d3d4585a65..3f7049fbd1b0 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -29,6 +29,7 @@
>  #include "hw/vfio/vfio.h"
>  #include "exec/address-spaces.h"
>  #include "exec/memory.h"
> +#include "exec/ram_addr.h"
>  #include "hw/hw.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
> @@ -38,6 +39,7 @@
>  #include "sysemu/reset.h"
>  #include "trace.h"
>  #include "qapi/error.h"
> +#include "migration/migration.h"
>  
>  VFIOGroupList vfio_group_list =
>      QLIST_HEAD_INITIALIZER(vfio_group_list);
> @@ -288,6 +290,28 @@ const MemoryRegionOps vfio_region_ops = {
>  };
>  
>  /*
> + * Device state interfaces
> + */
> +
> +static bool vfio_devices_are_stopped_and_saving(void)
> +{
> +    VFIOGroup *group;
> +    VFIODevice *vbasedev;
> +
> +    QLIST_FOREACH(group, &vfio_group_list, next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            if ((vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) &&
> +                !(vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
> +                continue;
> +            } else {
> +                return false;
> +            }
> +        }
> +    }
> +    return true;
> +}
> +
> +/*
>   * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
>   */
>  static int vfio_dma_unmap(VFIOContainer *container,
> @@ -408,8 +432,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>  }
>  
>  /* Called with rcu_read_lock held.  */
> -static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
> -                           bool *read_only)
> +static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> +                               ram_addr_t *ram_addr, bool *read_only)
>  {
>      MemoryRegion *mr;
>      hwaddr xlat;
> @@ -440,9 +464,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
>          return false;
>      }
>  
> -    *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> -    *read_only = !writable || mr->readonly;
> +    if (vaddr) {
> +        *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> +    }
>  
> +    if (ram_addr) {
> +        *ram_addr = memory_region_get_ram_addr(mr) + xlat;
> +    }
> +
> +    if (read_only) {
> +        *read_only = !writable || mr->readonly;
> +    }
>      return true;
>  }
>  
> @@ -467,7 +499,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>      rcu_read_lock();
>  
>      if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> -        if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
> +        if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) {
>              goto out;
>          }
>          /*
> @@ -813,9 +845,151 @@ static void vfio_listener_region_del(MemoryListener *listener,
>      }
>  }
>  
> +static int vfio_get_dirty_bitmap(MemoryListener *listener,
> +                                 MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> +    VFIOGuestIOMMU *giommu = NULL;
> +    IOMMUTLBEntry iotlb;
> +    hwaddr granularity, iova, iova_end;
> +    int ret;
> +
> +    if (memory_region_is_iommu(section->mr)) {
> +        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> +            if (MEMORY_REGION(giommu->iommu) == section->mr &&
> +                giommu->n.start == section->offset_within_region) {
> +                break;
> +            }
> +        }
> +
> +        if (!giommu) {
> +            return -EINVAL;
> +        }
> +    }
> +
> +    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> +
> +    if (memory_region_is_iommu(section->mr)) {
> +        hwaddr iova_size;
> +
> +        granularity = memory_region_iommu_get_min_page_size(giommu->iommu);
> +        iova_end = memory_region_iommu_get_address_limit(giommu->iommu);
> +
> +        if (iova_end) {
> +            iova_size = MIN(int128_get64(section->size), iova_end - iova + 1);
> +        } else {
> +            iova_size = int128_get64(section->size);
> +        }
> +
> +        iova_end = iova + iova_size - 1;
> +    } else {
> +        granularity = memory_region_size(section->mr);
> +        iova_end = iova + int128_get64(section->size) - 1;
> +    }
> +
> +    RCU_READ_LOCK_GUARD();
> +
> +    while (iova < iova_end) {
> +        struct vfio_iommu_type1_dirty_bitmap *dbitmap;
> +        struct vfio_iommu_type1_dirty_bitmap_get *range;
> +        ram_addr_t start, pages;
> +        uint64_t iova_xlat, size;
> +
> +        if (memory_region_is_iommu(section->mr)) {
> +            iotlb = address_space_get_iotlb_entry(container->space->as, iova,
> +                                                 true, MEMTXATTRS_UNSPECIFIED);
> +            if ((iotlb.target_as == NULL) || (iotlb.addr_mask == 0)) {
> +                if ((iova + granularity) < iova) {
> +                    break;
> +                }
> +                iova += granularity;

What's ging on here?  Is this trying to find the start of a valid area -
why would it happen?  It seems pretty expensive if it's going to walk a
page at a time.

> +                continue;
> +            }
> +            iova_xlat = iotlb.iova + giommu->iommu_offset;
> +            size = iotlb.addr_mask + 1;
> +        } else {
> +            iova_xlat = iova;
> +            size = iova_end - iova + 1;
> +        }
> +
> +        dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));

This is a constant size, it can be allocated outside the loop; or just
on the stack given it's also small.

> +        dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
> +        dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> +        range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data;
> +        range->iova = iova_xlat;
> +        range->size = size;
> +
> +        /*
> +         * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
> +         * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap's pgsize to
> +         * TARGET_PAGE_SIZE.
> +         */
> +        range->bitmap.pgsize = TARGET_PAGE_SIZE;
> +
> +        pages = TARGET_PAGE_ALIGN(range->size) >> TARGET_PAGE_BITS;
> +        range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
> +                             BITS_PER_BYTE;

A little long winded; if sizeof(__u64) or BITS_PER_BYTE ever change
we're really in trouble!
(A lot of the rest of the dirtymap code unfortunately uses long's - this
is really __u64?)

> +        range->bitmap.data = g_malloc0(range->bitmap.size);

Given that could be large, you should use g_try_malloc0 and check for
failure.

> +
> +        ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
> +        if (ret) {
> +            error_report("Failed to get dirty bitmap for iova: 0x%llx "
> +                         "size: 0x%llx err: %d",
> +                         range->iova, range->size, errno);
> +            goto err_out;
> +        }
> +
> +        if (memory_region_is_iommu(section->mr)) {
> +            if (!vfio_get_xlat_addr(&iotlb, NULL, &start, NULL)) {
> +                ret = -EINVAL;
> +                goto err_out;
> +            }
> +        } else {
> +            start = memory_region_get_ram_addr(section->mr) +
> +                    section->offset_within_region + iova -
> +                    TARGET_PAGE_ALIGN(section->offset_within_address_space);
> +        }
> +
> +        cpu_physical_memory_set_dirty_lebitmap((uint64_t *)range->bitmap.data,
> +                                               start, pages);
> +
> +        trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
> +                                    range->bitmap.size, start);
> +err_out:
> +        g_free(range->bitmap.data);
> +        g_free(dbitmap);
> +
> +        if (ret) {
> +            return ret;
> +        }
> +
> +        if ((iova + size) < iova) {
> +            break;
> +        }
> +
> +        iova += size;
> +    }
> +
> +    return 0;
> +}
> +
> +static void vfio_listerner_log_sync(MemoryListener *listener,
> +        MemoryRegionSection *section)
> +{
> +    if (vfio_listener_skipped_section(section)) {
> +        return;
> +    }
> +
> +    if (vfio_devices_are_stopped_and_saving()) {
> +        vfio_get_dirty_bitmap(listener, section);
> +    }
> +}
> +
>  static const MemoryListener vfio_memory_listener = {
>      .region_add = vfio_listener_region_add,
>      .region_del = vfio_listener_region_del,
> +    .log_sync = vfio_listerner_log_sync,
>  };
>  
>  static void vfio_listener_release(VFIOContainer *container)
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 4a4bd3ba9a2a..c61ae4f3ead8 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -160,3 +160,4 @@ vfio_save_complete_precopy(const char *name) " (%s)"
>  vfio_load_device_config_state(const char *name) " (%s)"
>  vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
>  vfio_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
> +vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH QEMU v23 16/18] vfio: Add ioctl to get dirty pages bitmap during dma unmap.
  2020-05-20 18:24 ` [PATCH QEMU v23 16/18] vfio: Add ioctl to get dirty pages bitmap during dma unmap Kirti Wankhede
@ 2020-05-21 19:05   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 39+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-21 19:05 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> With vIOMMU, IO virtual address range can get unmapped while in pre-copy
> phase of migration. In that case, unmap ioctl should return pages pinned
> in that range and QEMU should find its correcponding guest physical
> addresses and report those dirty.
> 
> Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/common.c | 84 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 80 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 3f7049fbd1b0..2dd9e8b24788 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -311,11 +311,82 @@ static bool vfio_devices_are_stopped_and_saving(void)
>      return true;
>  }
>  
> +static bool vfio_devices_are_running_and_saving(void)
> +{
> +    VFIOGroup *group;
> +    VFIODevice *vbasedev;
> +
> +    QLIST_FOREACH(group, &vfio_group_list, next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            if ((vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) &&
> +                (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
> +                continue;
> +            } else {
> +                return false;
> +            }
> +        }
> +    }
> +    return true;
> +}
> +
> +static int vfio_dma_unmap_bitmap(VFIOContainer *container,
> +                                 hwaddr iova, ram_addr_t size,
> +                                 IOMMUTLBEntry *iotlb)
> +{
> +    struct vfio_iommu_type1_dma_unmap *unmap;
> +    struct vfio_bitmap *bitmap;
> +    uint64_t pages = TARGET_PAGE_ALIGN(size) >> TARGET_PAGE_BITS;
> +    int ret;
> +
> +    unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
> +    if (!unmap) {
> +        return -ENOMEM;
> +    }

Again that's a fixed size thing so you don't really need to malloc it;
g_malloc0 won't return NULL; it'll just assert (which it shouldn't
do anyway because those structures are small).  Personally I'd just put
it on the stack.

> +    unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
> +    unmap->iova = iova;
> +    unmap->size = size;
> +    unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
> +    bitmap = (struct vfio_bitmap *)&unmap->data;
> +
> +    /*
> +     * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
> +     * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap_pgsize to
> +     * TARGET_PAGE_SIZE.
> +     */
> +
> +    bitmap->pgsize = TARGET_PAGE_SIZE;
> +    bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
> +                   BITS_PER_BYTE;
> +
> +    if (bitmap->size > container->max_dirty_bitmap_size) {
> +        error_report("UNMAP: Size of bitmap too big 0x%llx", bitmap->size);
> +        ret = -E2BIG;
> +        goto unmap_exit;
> +    }
> +
> +    bitmap->data = g_malloc0(bitmap->size);

That one should be a g_try_malloc0 and check it's non-null.

> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
> +    if (!ret) {
> +        cpu_physical_memory_set_dirty_lebitmap((uint64_t *)bitmap->data,
> +                iotlb->translated_addr, pages);
> +    } else {
> +        error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %d", -errno);

You're allowed touse %m in hw/vfio for printing the error nicely.

> +    }
> +
> +    g_free(bitmap->data);
> +unmap_exit:
> +    g_free(unmap);
> +    return ret;
> +}
> +
>  /*
>   * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
>   */
>  static int vfio_dma_unmap(VFIOContainer *container,
> -                          hwaddr iova, ram_addr_t size)
> +                          hwaddr iova, ram_addr_t size,
> +                          IOMMUTLBEntry *iotlb)
>  {
>      struct vfio_iommu_type1_dma_unmap unmap = {
>          .argsz = sizeof(unmap),
> @@ -324,6 +395,11 @@ static int vfio_dma_unmap(VFIOContainer *container,
>          .size = size,
>      };
>  
> +    if (iotlb && container->dirty_pages_supported &&
> +        vfio_devices_are_running_and_saving()) {
> +        return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
> +    }
> +
>      while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
>          /*
>           * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
> @@ -371,7 +447,7 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>       * the VGA ROM space.
>       */
>      if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
> -        (errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
> +        (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
>           ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
>          return 0;
>      }
> @@ -519,7 +595,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>                           iotlb->addr_mask + 1, vaddr, ret);
>          }
>      } else {
> -        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
> +        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
>          if (ret) {
>              error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>                           "0x%"HWADDR_PRIx") = %d (%m)",
> @@ -822,7 +898,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
>      }
>  
>      if (try_unmap) {
> -        ret = vfio_dma_unmap(container, iova, int128_get64(llsize));
> +        ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
>          if (ret) {
>              error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>                           "0x%"HWADDR_PRIx") = %d (%m)",
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH QEMU v23 17/18] vfio: Make vfio-pci device migration capable
  2020-05-20 18:24 ` [PATCH QEMU v23 17/18] vfio: Make vfio-pci device migration capable Kirti Wankhede
@ 2020-05-21 19:16   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 39+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-21 19:16 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> If device is not failover primary device call vfio_migration_probe()
> and vfio_migration_finalize() functions for vfio-pci device to enable
> migration for vfio PCI device which support migration.
> Removed vfio_pci_vmstate structure.
> Removed migration blocker from VFIO PCI device specific structure and use
> migration blocker from generic structure of  VFIO device.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/pci.c | 32 +++++++++++---------------------
>  hw/vfio/pci.h |  1 -
>  2 files changed, 11 insertions(+), 22 deletions(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 94535f0e27cd..573b28753bfd 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2916,22 +2916,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>          return;
>      }
>  
> -    if (!pdev->failover_pair_id) {
> -        error_setg(&vdev->migration_blocker,
> -                "VFIO device doesn't support migration");
> -        ret = migrate_add_blocker(vdev->migration_blocker, &err);
> -        if (ret) {
> -            error_propagate(errp, err);
> -            error_free(vdev->migration_blocker);
> -            vdev->migration_blocker = NULL;
> -            return;
> -        }
> -    }
> -
>      vdev->vbasedev.name = g_path_get_basename(vdev->vbasedev.sysfsdev);
>      vdev->vbasedev.ops = &vfio_pci_ops;
>      vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
>      vdev->vbasedev.dev = DEVICE(vdev);
> +    vdev->vbasedev.device_state = 0;
>  
>      tmp = g_strdup_printf("%s/iommu_group", vdev->vbasedev.sysfsdev);
>      len = readlink(tmp, group_path, sizeof(group_path));
> @@ -3195,6 +3184,14 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>          }
>      }
>  
> +    if (!pdev->failover_pair_id) {
> +        ret = vfio_migration_probe(&vdev->vbasedev, errp);

OK, and I can see vfio_migration_probe adds a blocker if the device
doesn't support migration.

> +        if (ret) {
> +            error_report("%s: Failed to setup for migration",
> +                         vdev->vbasedev.name);
> +        }
> +    }
> +
>      vfio_register_err_notifier(vdev);
>      vfio_register_req_notifier(vdev);
>      vfio_setup_resetfn_quirk(vdev);
> @@ -3209,11 +3206,6 @@ out_teardown:
>      vfio_bars_exit(vdev);
>  error:
>      error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name);
> -    if (vdev->migration_blocker) {
> -        migrate_del_blocker(vdev->migration_blocker);
> -        error_free(vdev->migration_blocker);
> -        vdev->migration_blocker = NULL;
> -    }
>  }
>  
>  static void vfio_instance_finalize(Object *obj)
> @@ -3225,10 +3217,7 @@ static void vfio_instance_finalize(Object *obj)
>      vfio_bars_finalize(vdev);
>      g_free(vdev->emulated_config_bits);
>      g_free(vdev->rom);
> -    if (vdev->migration_blocker) {
> -        migrate_del_blocker(vdev->migration_blocker);
> -        error_free(vdev->migration_blocker);
> -    }
> +
>      /*
>       * XXX Leaking igd_opregion is not an oversight, we can't remove the
>       * fw_cfg entry therefore leaking this allocation seems like the safest
> @@ -3256,6 +3245,7 @@ static void vfio_exitfn(PCIDevice *pdev)
>      }
>      vfio_teardown_msi(vdev);
>      vfio_bars_exit(vdev);
> +    vfio_migration_finalize(&vdev->vbasedev);

Which this removes at the end.
So,


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

>  }
>  
>  static void vfio_pci_reset(DeviceState *dev)
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 0da7a20a7ec2..b148c937ef72 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -168,7 +168,6 @@ typedef struct VFIOPCIDevice {
>      bool no_vfio_ioeventfd;
>      bool enable_ramfb;
>      VFIODisplay *dpy;
> -    Error *migration_blocker;
>      Notifier irqchip_change_notifier;
>  } VFIOPCIDevice;
>  
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH QEMU v23 18/18] qapi: Add VFIO devices migration stats in Migration stats
  2020-05-20 18:24 ` [PATCH QEMU v23 18/18] qapi: Add VFIO devices migration stats in Migration stats Kirti Wankhede
@ 2020-05-21 19:23   ` Dr. David Alan Gilbert
  2020-05-25 14:34   ` Markus Armbruster
  1 sibling, 0 replies; 39+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-21 19:23 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> Added amount of bytes transferred to the target VM by all VFIO devices
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> ---
>  hw/vfio/common.c            | 20 ++++++++++++++++++++
>  hw/vfio/migration.c         | 10 +++++++++-
>  include/qemu/vfio-helpers.h |  3 +++
>  migration/migration.c       | 18 ++++++++++++++++++
>  monitor/hmp-cmds.c          |  6 ++++++
>  qapi/migration.json         | 19 ++++++++++++++++++-
>  6 files changed, 74 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 2dd9e8b24788..e1ad426a6d78 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -40,6 +40,7 @@
>  #include "trace.h"
>  #include "qapi/error.h"
>  #include "migration/migration.h"
> +#include "qemu/vfio-helpers.h"
>  
>  VFIOGroupList vfio_group_list =
>      QLIST_HEAD_INITIALIZER(vfio_group_list);
> @@ -293,6 +294,25 @@ const MemoryRegionOps vfio_region_ops = {
>   * Device state interfaces
>   */
>  
> +bool vfio_mig_active(void)
> +{
> +    VFIOGroup *group;
> +    VFIODevice *vbasedev;
> +
> +    if (QLIST_EMPTY(&vfio_group_list)) {
> +        return false;
> +    }
> +
> +    QLIST_FOREACH(group, &vfio_group_list, next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            if (vbasedev->migration_blocker) {
> +                return false;
> +            }
> +        }
> +    }
> +    return true;
> +}
> +
>  static bool vfio_devices_are_stopped_and_saving(void)
>  {
>      VFIOGroup *group;
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 7729c90782bd..2eef38fe5c65 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -26,7 +26,7 @@
>  #include "exec/ram_addr.h"
>  #include "pci.h"
>  #include "trace.h"
> -
> +#include "qemu/vfio-helpers.h"
>  /*
>   * Flags used as delimiter:
>   * 0xffffffff => MSB 32-bit all 1s
> @@ -38,6 +38,8 @@
>  #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
>  #define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
>  
> +static int64_t bytes_transferred;
> +

eww global; but if there's no where else OK.
I think you need to initialise this somewhere at the start of each
migration;  you can get a migration that fails or is cancelled and then
start a second one, if you don't zero it then the second time will be
wrong.

>  static void vfio_migration_region_exit(VFIODevice *vbasedev)
>  {
>      VFIOMigration *migration = vbasedev->migration;
> @@ -229,6 +231,7 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
>          return ret;
>      }
>  
> +    bytes_transferred += data_size;
>      return data_size;
>  }
>  
> @@ -744,6 +747,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>  
>  /* ---------------------------------------------------------------------- */
>  
> +int64_t vfio_mig_bytes_transferred(void)
> +{
> +    return bytes_transferred;
> +}
> +
>  int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>  {
>      struct vfio_region_info *info;
> diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
> index 1f057c2b9e40..26a7df0767b1 100644
> --- a/include/qemu/vfio-helpers.h
> +++ b/include/qemu/vfio-helpers.h
> @@ -29,4 +29,7 @@ void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar,
>  int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
>                             int irq_type, Error **errp);
>  
> +bool vfio_mig_active(void);
> +int64_t vfio_mig_bytes_transferred(void);
> +
>  #endif
> diff --git a/migration/migration.c b/migration/migration.c
> index 0bb042a0f78b..bcdc880e9a76 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -54,6 +54,7 @@
>  #include "net/announce.h"
>  #include "qemu/queue.h"
>  #include "multifd.h"
> +#include "qemu/vfio-helpers.h"
>  
>  #define MAX_THROTTLE  (32 << 20)      /* Migration transfer speed throttling */
>  
> @@ -970,6 +971,17 @@ static void populate_disk_info(MigrationInfo *info)
>      }
>  }
>  
> +#ifdef CONFIG_LINUX

If you put the ifdef inside this function then you can avoid all the
others below.

> +static void populate_vfio_info(MigrationInfo *info)
> +{
> +    if (vfio_mig_active()) {
> +        info->has_vfio = true;
> +        info->vfio = g_malloc0(sizeof(*info->vfio));
> +        info->vfio->bytes = vfio_mig_bytes_transferred();
> +    }
> +}
> +#endif
> +
>  static void fill_source_migration_info(MigrationInfo *info)
>  {
>      MigrationState *s = migrate_get_current();
> @@ -995,6 +1007,9 @@ static void fill_source_migration_info(MigrationInfo *info)
>          populate_time_info(info, s);
>          populate_ram_info(info, s);
>          populate_disk_info(info);
> +#ifdef CONFIG_LINUX
> +        populate_vfio_info(info);
> +#endif
>          break;
>      case MIGRATION_STATUS_COLO:
>          info->has_status = true;
> @@ -1003,6 +1018,9 @@ static void fill_source_migration_info(MigrationInfo *info)
>      case MIGRATION_STATUS_COMPLETED:
>          populate_time_info(info, s);
>          populate_ram_info(info, s);
> +#ifdef CONFIG_LINUX
> +        populate_vfio_info(info);
> +#endif
>          break;
>      case MIGRATION_STATUS_FAILED:
>          info->has_status = true;
> diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
> index 9c61e769ca9a..396c195263c2 100644
> --- a/monitor/hmp-cmds.c
> +++ b/monitor/hmp-cmds.c
> @@ -355,6 +355,12 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
>          }
>          monitor_printf(mon, "]\n");
>      }
> +
> +    if (info->has_vfio) {
> +        monitor_printf(mon, "vfio device bytes: %" PRIu64 " kbytes\n",
> +                       info->vfio->bytes >> 10);
> +    }
> +
>      qapi_free_MigrationInfo(info);
>  }
>  
> diff --git a/qapi/migration.json b/qapi/migration.json
> index d5000558c6c9..acbc42a1efe8 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -146,6 +146,18 @@
>              'active', 'postcopy-active', 'postcopy-paused',
>              'postcopy-recover', 'completed', 'failed', 'colo',
>              'pre-switchover', 'device', 'wait-unplug' ] }
> +##
> +# @VfioStats:
> +#
> +# Detailed VFIO devices migration statistics
> +#
> +# @bytes: amount of bytes transferred to the target VM by VFIO devices
> +#
> +# Since: 5.1
> +#
> +##
> +{ 'struct': 'VfioStats',
> +  'data': {'bytes': 'int' } }
>  
>  ##
>  # @MigrationInfo:
> @@ -207,6 +219,10 @@
>  #
>  # @socket-address: Only used for tcp, to know what the real port is (Since 4.0)
>  #
> +# @vfio: @VfioStats containing detailed VFIO devices migration statistics,
> +#        only returned if VFIO device is present, migration is supported by all
> +#         VFIO devices and status is 'active' or 'completed' (since 5.1)
> +#
>  # Since: 0.14.0
>  ##
>  { 'struct': 'MigrationInfo',
> @@ -222,7 +238,8 @@
>             '*postcopy-blocktime' : 'uint32',
>             '*postcopy-vcpu-blocktime': ['uint32'],
>             '*compression': 'CompressionStats',
> -           '*socket-address': ['SocketAddress'] } }
> +           '*socket-address': ['SocketAddress'],
> +           '*vfio': 'VfioStats' } }

Is there anyway you can provide stats on the amount of main memory
that's been marked as dirty because it's mapped by the device?

Dave

>  ##
>  # @query-migrate:
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH QEMU v23 04/18] vfio: Add save and load functions for VFIO PCI devices
  2020-05-21 12:12     ` Kirti Wankhede
@ 2020-05-21 19:28       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 39+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-21 19:28 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> 
> 
> On 5/21/2020 3:20 PM, Dr. David Alan Gilbert wrote:
> > * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> > > These functions save and restore PCI device specific data - config
> > > space of PCI device.
> > > Tested save and restore with MSI and MSIX type.
> > > 
> > > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > 
> > So I'm OK with this from the migration side, but I'd like Alex to check it from
> > the PCI side of things.
> > 
> > I still think that you'd be better:
> >    a) Using a VMStateDescription to encode the structure
> 
> Can we use VMStateDescription and SaveVMHandlers at a same time for
> migration?

Yes, one trick you can do is to call vmstate_save_state from your save
function, passing it as a vmsd; e.g. hw/virtio/virtio-pci.c virtio_pci_save_extra_state
does that.

> >    b) or at least adding a flag at the end so you can add more data later
> > 
> 
> Sure, I'm thinking of this option.

Great.

> > Experience with every other device shows you're shooting yourself in the
> > foot by hard coding the layout and not giving yourself a chance to
> > expand it.
> > 
> > but for now,
> > 
> >
> > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> Thanks, this version of QEMU patches was posted without addressing comments
> on previous version so that others can test kernel interface for now. We
> need to freeze kernel interface and get those patches in v5.8 kernel.

OK, I'm more on the qemu side; and I think I've reviewed all the patches
I understand (left a few deep vfio ones for Alex).  The kernel interface
looks like it fits qemu's use here OK.

> I'll revisit all the remaining comments on v16 and v18 QEMU patches again
> and reiterate the patches. So this is not the final version.

Great.

Dave

> Thanks,
> Kirti
> 
> > 
> > > ---
> > >   hw/vfio/pci.c                 | 163 ++++++++++++++++++++++++++++++++++++++++++
> > >   include/hw/vfio/vfio-common.h |   2 +
> > >   2 files changed, 165 insertions(+)
> > > 
> > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > > index 0514ba373d1c..94535f0e27cd 100644
> > > --- a/hw/vfio/pci.c
> > > +++ b/hw/vfio/pci.c
> > > @@ -41,6 +41,7 @@
> > >   #include "trace.h"
> > >   #include "qapi/error.h"
> > >   #include "migration/blocker.h"
> > > +#include "migration/qemu-file.h"
> > >   #define TYPE_VFIO_PCI "vfio-pci"
> > >   #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> > > @@ -1632,6 +1633,50 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
> > >       }
> > >   }
> > > +static int vfio_bar_validate(VFIOPCIDevice *vdev, int nr)
> > > +{
> > > +    PCIDevice *pdev = &vdev->pdev;
> > > +    VFIOBAR *bar = &vdev->bars[nr];
> > > +    uint64_t addr;
> > > +    uint32_t addr_lo, addr_hi = 0;
> > > +
> > > +    /* Skip unimplemented BARs and the upper half of 64bit BARS. */
> > > +    if (!bar->size) {
> > > +        return 0;
> > > +    }
> > > +
> > > +    addr_lo = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + nr * 4, 4);
> > > +
> > > +    addr_lo &= (bar->ioport ? PCI_BASE_ADDRESS_IO_MASK :
> > > +                              PCI_BASE_ADDRESS_MEM_MASK);
> > > +    if (bar->type == PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > > +        addr_hi = pci_default_read_config(pdev,
> > > +                                         PCI_BASE_ADDRESS_0 + (nr + 1) * 4, 4);
> > > +    }
> > > +
> > > +    addr = ((uint64_t)addr_hi << 32) | addr_lo;
> > > +
> > > +    if (!QEMU_IS_ALIGNED(addr, bar->size)) {
> > > +        return -EINVAL;
> > > +    }
> > > +
> > > +    return 0;
> > > +}
> > > +
> > > +static int vfio_bars_validate(VFIOPCIDevice *vdev)
> > > +{
> > > +    int i, ret;
> > > +
> > > +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > > +        ret = vfio_bar_validate(vdev, i);
> > > +        if (ret) {
> > > +            error_report("vfio: BAR address %d validation failed", i);
> > > +            return ret;
> > > +        }
> > > +    }
> > > +    return 0;
> > > +}
> > > +
> > >   static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
> > >   {
> > >       VFIOBAR *bar = &vdev->bars[nr];
> > > @@ -2414,11 +2459,129 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
> > >       return OBJECT(vdev);
> > >   }
> > > +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> > > +{
> > > +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> > > +    PCIDevice *pdev = &vdev->pdev;
> > > +    uint16_t pci_cmd;
> > > +    int i;
> > > +
> > > +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > > +        uint32_t bar;
> > > +
> > > +        bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> > > +        qemu_put_be32(f, bar);
> > > +    }
> > > +
> > > +    qemu_put_be32(f, vdev->interrupt);
> > > +    if (vdev->interrupt == VFIO_INT_MSI) {
> > > +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> > > +        bool msi_64bit;
> > > +
> > > +        msi_flags = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > > +                                            2);
> > > +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> > > +
> > > +        msi_addr_lo = pci_default_read_config(pdev,
> > > +                                         pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> > > +        qemu_put_be32(f, msi_addr_lo);
> > > +
> > > +        if (msi_64bit) {
> > > +            msi_addr_hi = pci_default_read_config(pdev,
> > > +                                             pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > > +                                             4);
> > > +        }
> > > +        qemu_put_be32(f, msi_addr_hi);
> > > +
> > > +        msi_data = pci_default_read_config(pdev,
> > > +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > > +                2);
> > > +        qemu_put_be16(f, msi_data);
> > > +    } else if (vdev->interrupt == VFIO_INT_MSIX) {
> > > +        uint16_t offset;
> > > +
> > > +        /* save enable bit and maskall bit */
> > > +        offset = pci_default_read_config(pdev,
> > > +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> > > +        qemu_put_be16(f, offset);
> > > +        msix_save(pdev, f);
> > > +    }
> > > +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > > +    qemu_put_be16(f, pci_cmd);
> > > +}
> > > +
> > > +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> > > +{
> > > +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> > > +    PCIDevice *pdev = &vdev->pdev;
> > > +    uint32_t interrupt_type;
> > > +    uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> > > +    uint16_t pci_cmd;
> > > +    bool msi_64bit;
> > > +    int i, ret;
> > > +
> > > +    /* retore pci bar configuration */
> > > +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > > +    vfio_pci_write_config(pdev, PCI_COMMAND,
> > > +                        pci_cmd & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
> > > +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > > +        uint32_t bar = qemu_get_be32(f);
> > > +
> > > +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> > > +    }
> > > +
> > > +    ret = vfio_bars_validate(vdev);
> > > +    if (ret) {
> > > +        return ret;
> > > +    }
> > > +
> > > +    interrupt_type = qemu_get_be32(f);
> > > +
> > > +    if (interrupt_type == VFIO_INT_MSI) {
> > > +        /* restore msi configuration */
> > > +        msi_flags = pci_default_read_config(pdev,
> > > +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> > > +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> > > +
> > > +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > > +                              msi_flags & (!PCI_MSI_FLAGS_ENABLE), 2);
> > > +
> > > +        msi_addr_lo = qemu_get_be32(f);
> > > +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
> > > +                              msi_addr_lo, 4);
> > > +
> > > +        msi_addr_hi = qemu_get_be32(f);
> > > +        if (msi_64bit) {
> > > +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > > +                                  msi_addr_hi, 4);
> > > +        }
> > > +        msi_data = qemu_get_be16(f);
> > > +        vfio_pci_write_config(pdev,
> > > +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > > +                msi_data, 2);
> > > +
> > > +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > > +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
> > > +    } else if (interrupt_type == VFIO_INT_MSIX) {
> > > +        uint16_t offset = qemu_get_be16(f);
> > > +
> > > +        /* load enable bit and maskall bit */
> > > +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> > > +                              offset, 2);
> > > +        msix_load(pdev, f);
> > > +    }
> > > +    pci_cmd = qemu_get_be16(f);
> > > +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
> > > +    return 0;
> > > +}
> > > +
> > >   static VFIODeviceOps vfio_pci_ops = {
> > >       .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
> > >       .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
> > >       .vfio_eoi = vfio_intx_eoi,
> > >       .vfio_get_object = vfio_pci_get_object,
> > > +    .vfio_save_config = vfio_pci_save_config,
> > > +    .vfio_load_config = vfio_pci_load_config,
> > >   };
> > >   int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
> > > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> > > index 74261feaeac9..d69a7f3ae31e 100644
> > > --- a/include/hw/vfio/vfio-common.h
> > > +++ b/include/hw/vfio/vfio-common.h
> > > @@ -120,6 +120,8 @@ struct VFIODeviceOps {
> > >       int (*vfio_hot_reset_multi)(VFIODevice *vdev);
> > >       void (*vfio_eoi)(VFIODevice *vdev);
> > >       Object *(*vfio_get_object)(VFIODevice *vdev);
> > > +    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
> > > +    int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
> > >   };
> > >   typedef struct VFIOGroup {
> > > -- 
> > > 2.7.0
> > > 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH QEMU v23 08/18] vfio: Register SaveVMHandlers for VFIO device
  2020-05-21 18:00     ` Kirti Wankhede
@ 2020-05-21 19:35       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 39+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-21 19:35 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, eskultet, ziye.yang, armbru, mlevitsk,
	pasic, felipe, Ken.Xue, kevin.tian, yan.y.zhao, alex.williamson,
	changpeng.liu, quintela, zhi.a.wang, jonathan.davies, pbonzini

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> 
> 
> On 5/21/2020 7:48 PM, Dr. David Alan Gilbert wrote:
> > * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> > > Define flags to be used as delimeter in migration file stream.
> > > Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
> > > region from these functions at source during saving or pre-copy phase.
> > > Set VFIO device state depending on VM's state. During live migration, VM is
> > > running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
> > > device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
> > > 
> > > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > Reviewed-by: Neo Jia <cjia@nvidia.com>

<snip>

> > > +    register_savevm_live("vfio", VMSTATE_INSTANCE_ID_ANY, 1,
> > > +                         &savevm_vfio_handlers, vbasedev);
> > 
> > Hi,
> >    This is still the only bit which worries me, and I saw your note
> > saying you'd tested it; to calm my nerves, can you run with the
> > 'qemu_loadvm_state_section_startfull' trace enabled with 2 devices
> > and show me the output and qemu command line?
> > I'm trying to figure out how they end up represented in the stream.
> > 
> 
> Created mtty devices for source VM:
> echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1233" >
> /sys/class/mdev_bus/mtty/mdev_supported_types/mtty-2/create
> echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1234" >
> /sys/class/mdev_bus/mtty/mdev_supported_types/mtty-2/create
> 
> for destination VM:
> echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1235" >
> /sys/class/mdev_bus/mtty/mdev_supported_types/mtty-2/create
> echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1236" >
> /sys/class/mdev_bus/mtty/mdev_supported_types/mtty-2/create
> 
> Source qemu-cmdline:
> /usr/libexec/qemu-kvm \
>  -name guest=rhel75-mig,debug-threads=on \
>  -machine pc-i440fx-3.1,accel=kvm,usb=off,dump-guest-core=off \
>  -cpu SandyBridge,vme=on,hypervisor=on,arat=on,xsaveopt=on \
>  -m 2048 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 \
>  -uuid eefb718c-137c-d416-e573-dd74ecd3490d \
>  -drive
> file=/home/vm/rhel-75.qcow2,format=qcow2,if=none,id=drive-ide0-0-0,cache=none
> \
>  -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1,write-cache=on
> \
>  -vnc 127.0.0.1:0 \
>  -device rtl8139,netdev=net0,mac=52:54:b2:88:86:2a,bus=pci.0,addr=0x3
> -netdev tap,id=net0,script=/root/qemu-ifup,downscript=no \
>  -device
> vfio-pci,sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1233
> \
>  -device
> vfio-pci,sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1234
> \
>  --trace events=/root/vfio_events \
>  -monitor unix:/tmp/qmp_socket1,server,nowait \
>  -serial stdio \
>  -msg timestamp=on
> 
> Destination qemu-cmdline:
> /usr/libexec/qemu-kvm \
>  -name guest=rhel75-mig,debug-threads=on \
>  -machine pc-i440fx-3.1,accel=kvm,usb=off,dump-guest-core=off \
>  -cpu SandyBridge,vme=on,hypervisor=on,arat=on,xsaveopt=on \
>  -m 2048 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 \
>  -uuid eefb718c-137c-d416-e573-dd74ecd3490d \
>  -drive
> file=/home/vm/rhel-75.qcow2,format=qcow2,if=none,id=drive-ide0-0-0,cache=none
> \
>  -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1,write-cache=on
> \
>  -vnc 127.0.0.1:1 \
>  -device rtl8139,netdev=net0,mac=52:54:b2:88:86:2a,bus=pci.0,addr=0x3
> -netdev tap,id=net0,script=/root/qemu-ifup,downscript=no \
>  -device
> vfio-pci,sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1235
> \
>  -device
> vfio-pci,sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1236
> \
>  -incoming unix:/tmp/mig_socket \
>  --trace events=/root/vfio_events \
>  -monitor unix:/tmp/qmp_socket2,server,nowait \
>  -serial stdio \
>  -msg timestamp=on
> 
> Migrate:
> echo "migrate_set_speed 0" | sudo nc -U /tmp/qmp_socket1
> echo "migrate -d unix:/tmp/mig_socket" | sudo nc -U $/tmp/qmp_socket1
> 
> After migration, 'qemu_loadvm_state_section_startfull' traces:
> 
> qemu_loadvm_state_section_startfull 0.000 pid=1457 section_id=0x2
> idstr=b'ram' instance_id=0x0 version_id=0x4
> qemu_loadvm_state_section_startfull 515.606 pid=1457 section_id=0x2e
> idstr=b'vfio' instance_id=0x0 version_id=0x1
> qemu_loadvm_state_section_startfull 10.661 pid=1457 section_id=0x2f
> idstr=b'vfio' instance_id=0x1 version_id=0x1

Right, so this is my worry - we have two devices in the stream called
'vfio' with I think sequential id's - what makes each of your source
vfio devices go to the correct destination vfio device?  If the two
devices were different vfio devices, how would you ensure that they
ended up in the write place?  There's no requirement for
the order of the qemu command line on the source and the destination
to be the same, or for qemu to maintain semantics based on the order -
but I bet that's the ordering were getting here.

> idstr=b'0000:00:03.0/rtl8139' instance_id=0x0 version_id=0x5

Now you see that PCI NIC has a nice PCI address as it's name in the
stream; if you have two NICs defined then they end up getting loaded
into the destination device with the same guest PCI address - so it's
nice and repeatable (especially if you specify the PCI devices address
on the command line).

Dave



> qemu_loadvm_state_section_startfull 1619.840 pid=1457 section_id=0x30
> idstr=b'acpi_build' instance_id=0x0 version_id=0x1
> qemu_loadvm_state_section_startfull 4.200 pid=1457 section_id=0x31
> idstr=b'globalstate' instance_id=0x0 version_id=0x1
> 
> Thanks,
> Kirti
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH QEMU v23 09/18] vfio: Add save state functions to SaveVMHandlers
  2020-05-20 18:24 ` [PATCH QEMU v23 09/18] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
  2020-05-21 15:37   ` Dr. David Alan Gilbert
@ 2020-05-21 20:43   ` Peter Xu
  1 sibling, 0 replies; 39+ messages in thread
From: Peter Xu @ 2020-05-21 20:43 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

On Wed, May 20, 2020 at 11:54:39PM +0530, Kirti Wankhede wrote:
> In _SAVING|_RUNNING device state or pre-copy phase:
> - read pending_bytes. If pending_bytes > 0, go through below steps.
> - read data_offset - indicates kernel driver to write data to staging
>   buffer.
> - read data_size - amount of data in bytes written by vendor driver in
>   migration region.
> - read data_size bytes of data from data_offset in the migration region.
> - Write data packet to file stream as below:
> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> VFIO_MIG_FLAG_END_OF_STATE }

Hi, Kirti,

(Sorry if this question is silly, since I didn't follow up with all these
 threads...)

Can I understand this commit as it only tracks dirty device BAR memory but not
guest system memory?  How did you track device writes to system memory when
without the vIOMMU?

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH QEMU v23 18/18] qapi: Add VFIO devices migration stats in Migration stats
  2020-05-20 18:24 ` [PATCH QEMU v23 18/18] qapi: Add VFIO devices migration stats in Migration stats Kirti Wankhede
  2020-05-21 19:23   ` Dr. David Alan Gilbert
@ 2020-05-25 14:34   ` Markus Armbruster
  1 sibling, 0 replies; 39+ messages in thread
From: Markus Armbruster @ 2020-05-25 14:34 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk, pasic,
	felipe, Ken.Xue, kevin.tian, yan.y.zhao, dgilbert,
	alex.williamson, changpeng.liu, quintela, zhi.a.wang,
	jonathan.davies, pbonzini

Kirti Wankhede <kwankhede@nvidia.com> writes:

> Added amount of bytes transferred to the target VM by all VFIO devices
>
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>

QAPI review only.

> diff --git a/qapi/migration.json b/qapi/migration.json
> index d5000558c6c9..acbc42a1efe8 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -146,6 +146,18 @@
>              'active', 'postcopy-active', 'postcopy-paused',
>              'postcopy-recover', 'completed', 'failed', 'colo',
>              'pre-switchover', 'device', 'wait-unplug' ] }
> +##
> +# @VfioStats:
> +#
> +# Detailed VFIO devices migration statistics
> +#
> +# @bytes: amount of bytes transferred to the target VM by VFIO devices
> +#
> +# Since: 5.1
> +#
> +##
> +{ 'struct': 'VfioStats',
> +  'data': {'bytes': 'int' } }

Pardon my ignorance...  What exactly do VFIO devices transfer to the
target VM?  How is that related to MigrationInfo member @ram?

For what it's worth, MigrationStats has

   # @transferred: amount of bytes already transferred to the target VM

Name yours @transferred as well, for consistency?

MigrationStats has much more information, and some of it is pretty
useful to track how migration is doing, in particular whether it
converges, and how fast.  Absent in VfioStats due to "not implemented",
or due to "can't be done"?

Byte counts should use QAPI type 'size'.  Many existing ones don't.
Since MigrationStats uses 'int', I'll let the migration maintainers
decide whether they want 'int' or 'size' here.

>  
>  ##
>  # @MigrationInfo:
> @@ -207,6 +219,10 @@
>  #
>  # @socket-address: Only used for tcp, to know what the real port is (Since 4.0)
>  #
> +# @vfio: @VfioStats containing detailed VFIO devices migration statistics,
> +#        only returned if VFIO device is present, migration is supported by all
> +#         VFIO devices and status is 'active' or 'completed' (since 5.1)
> +#
>  # Since: 0.14.0
>  ##
>  { 'struct': 'MigrationInfo',
> @@ -222,7 +238,8 @@
     'data': {'*status': 'MigrationStatus', '*ram': 'MigrationStats',
              '*disk': 'MigrationStats',
              '*xbzrle-cache': 'XBZRLECacheStats',
              '*total-time': 'int',
              '*expected-downtime': 'int',
              '*downtime': 'int',
              '*setup-time': 'int',
              '*cpu-throttle-percentage': 'int',
              '*error-desc': 'str',
>             '*postcopy-blocktime' : 'uint32',
>             '*postcopy-vcpu-blocktime': ['uint32'],
>             '*compression': 'CompressionStats',
> -           '*socket-address': ['SocketAddress'] } }
> +           '*socket-address': ['SocketAddress'],
> +           '*vfio': 'VfioStats' } }
>  
>  ##
>  # @query-migrate:

If @vfio and @ram are related, add @vfio next to @ram and @disk, please.



^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2020-05-25 14:35 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-20 18:24 [PATCH QEMU v23 00/18] Add migration support for VFIO devices Kirti Wankhede
2020-05-20 18:24 ` [PATCH QEMU v23 01/18] vfio: KABI for migration interface - Kernel header placeholder Kirti Wankhede
2020-05-20 18:24 ` [PATCH QEMU v23 02/18] vfio: Add function to unmap VFIO region Kirti Wankhede
2020-05-20 18:24 ` [PATCH QEMU v23 03/18] vfio: Add vfio_get_object callback to VFIODeviceOps Kirti Wankhede
2020-05-20 18:24 ` [PATCH QEMU v23 04/18] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
2020-05-21  9:50   ` Dr. David Alan Gilbert
2020-05-21 12:12     ` Kirti Wankhede
2020-05-21 19:28       ` Dr. David Alan Gilbert
2020-05-20 18:24 ` [PATCH QEMU v23 05/18] vfio: Add migration region initialization and finalize function Kirti Wankhede
2020-05-21  9:59   ` Dr. David Alan Gilbert
2020-05-20 18:24 ` [PATCH QEMU v23 06/18] vfio: Add VM state change handler to know state of VM Kirti Wankhede
2020-05-21 11:19   ` Dr. David Alan Gilbert
2020-05-20 18:24 ` [PATCH QEMU v23 07/18] vfio: Add migration state change notifier Kirti Wankhede
2020-05-21 11:31   ` Dr. David Alan Gilbert
2020-05-20 18:24 ` [PATCH QEMU v23 08/18] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
2020-05-21 14:18   ` Dr. David Alan Gilbert
2020-05-21 18:00     ` Kirti Wankhede
2020-05-21 19:35       ` Dr. David Alan Gilbert
2020-05-20 18:24 ` [PATCH QEMU v23 09/18] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
2020-05-21 15:37   ` Dr. David Alan Gilbert
2020-05-21 20:43   ` Peter Xu
2020-05-20 18:24 ` [PATCH QEMU v23 10/18] vfio: Add load " Kirti Wankhede
2020-05-21 15:53   ` Dr. David Alan Gilbert
2020-05-20 18:24 ` [PATCH QEMU v23 11/18] iommu: add callback to get address limit IOMMU supports Kirti Wankhede
2020-05-21 16:13   ` Peter Xu
2020-05-20 18:24 ` [PATCH QEMU v23 12/18] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled Kirti Wankhede
2020-05-21 16:20   ` Dr. David Alan Gilbert
2020-05-20 18:24 ` [PATCH QEMU v23 13/18] vfio: Get migration capability flags for container Kirti Wankhede
2020-05-20 18:24 ` [PATCH QEMU v23 14/18] vfio: Add function to start and stop dirty pages tracking Kirti Wankhede
2020-05-21 16:50   ` Dr. David Alan Gilbert
2020-05-20 18:24 ` [PATCH QEMU v23 15/18] vfio: Add vfio_listener_log_sync to mark dirty pages Kirti Wankhede
2020-05-21 18:52   ` Dr. David Alan Gilbert
2020-05-20 18:24 ` [PATCH QEMU v23 16/18] vfio: Add ioctl to get dirty pages bitmap during dma unmap Kirti Wankhede
2020-05-21 19:05   ` Dr. David Alan Gilbert
2020-05-20 18:24 ` [PATCH QEMU v23 17/18] vfio: Make vfio-pci device migration capable Kirti Wankhede
2020-05-21 19:16   ` Dr. David Alan Gilbert
2020-05-20 18:24 ` [PATCH QEMU v23 18/18] qapi: Add VFIO devices migration stats in Migration stats Kirti Wankhede
2020-05-21 19:23   ` Dr. David Alan Gilbert
2020-05-25 14:34   ` Markus Armbruster

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).