* [PATCH v10 Kernel 0/5] KABIs to support migration for VFIO devices
@ 2019-12-16 20:21 Kirti Wankhede
  2019-12-16 20:21 ` [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state Kirti Wankhede
                   ` (4 more replies)
  0 siblings, 5 replies; 44+ messages in thread
From: Kirti Wankhede @ 2019-12-16 20:21 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

Hi,

This patch set adds:
* New IOCTL VFIO_IOMMU_DIRTY_PAGES to get the dirty pages bitmap with
  respect to the IOMMU container rather than per device. All pages pinned
  by the vendor driver through the vfio_pin_pages external API have to be
  marked as dirty during migration. When an IOMMU capable device is present
  in the container and all pages are pinned and mapped, then all pages are
  marked dirty.
  CPU dirty page tracking can identify pages dirtied by CPU writes, but any
  page pinned by the vendor driver can also be written by the device. As of
  now there is no device with hardware support for dirty page tracking, so
  all pinned pages should be considered dirty.
  This ioctl is also used to start dirty pages tracking for pages that are
  unpinned while migration is active and the device is running. The tracked
  unpinned pages information is cleared when the dirty bitmap is read by the
  VFIO application, or when migration fails or is cancelled and unpinned
  pages tracking is stopped. (A usage sketch follows after this list.)
* Updated IOCTL VFIO_IOMMU_UNMAP_DMA to get the dirty pages bitmap before
  unmapping an IO virtual address range.
  With a vIOMMU, during the pre-copy phase of migration, while CPUs are
  still running, an IO virtual address unmap can happen while the device
  still keeps references to guest pfns. Those pages should be reported as
  dirty before the unmap, so that the VFIO user space application can copy
  the content of those pages from source to destination.
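
As an illustration, a minimal user-space sketch of the intended flow for
the new ioctl, assuming the uAPI as proposed in patch 3 of this series;
error handling and the capability check added in patch 2 are omitted:

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Start tracking of pages unpinned while migration is active. */
static int start_dirty_tracking(int container_fd)
{
        struct vfio_iommu_type1_dirty_bitmap db = {
                .argsz = sizeof(db),
                .flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START,
        };

        return ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, &db);
}

/* Query the dirty bitmap for an IOVA range: one bit per 'pgsize' page,
 * bitmap allocated and zeroed by user space as required by the uAPI. */
static void *get_dirty_bitmap(int container_fd, uint64_t iova,
                              uint64_t size, uint64_t pgsize)
{
        uint64_t npages = size / pgsize;
        struct vfio_iommu_type1_dirty_bitmap db = {
                .argsz = sizeof(db),
                .flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP,
                .iova = iova,
                .size = size,
                .pgsize = pgsize,
                /* bits rounded up to 64-bit words, expressed in bytes */
                .bitmap_size = ((npages + 63) / 64) * 8,
        };

        db.bitmap = calloc(1, db.bitmap_size);
        if (!db.bitmap)
                return NULL;

        if (ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, &db)) {
                free(db.bitmap);
                return NULL;
        }
        return db.bitmap;
}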

Yet TODO:
Since there is no device with hardware support for system memory dirty
bitmap tracking, right now there is no other API from the vendor driver to
the VFIO IOMMU module to report dirty pages. In future, when such hardware
support is implemented, an API will be required so that the vendor driver
can report dirty pages to the VFIO module during migration phases.

If an IOMMU capable device is present in the container, then all pages are
marked dirty. We need a smarter way to know whether an IOMMU capable
device's driver reports pages to be marked dirty by pinning those pages
externally.

Adding the revision history from the previous QEMU patch set to show the
KABI changes made so far.

v9 -> v10:
- Updated the existing VFIO_IOMMU_UNMAP_DMA ioctl to get the dirty pages
  bitmap during unmap while migration is active.
- Added a flag in VFIO_IOMMU_GET_INFO to indicate that the driver supports
  dirty page tracking.
- If iommu_mapped, mark all pages dirty.
- Added unpinned pages tracking while migration is active.
- Updated comments for migration device state structure with bit
  combination table and state transition details.

v8 -> v9:
- Split the patch set into 2 sets, kernel and QEMU.
- The dirty pages bitmap is queried from the IOMMU container rather than
  per device from the vendor driver. Added 2 ioctls to achieve this.

v7 -> v8:
- Updated comments for KABI
- Added BAR address validation check during PCI device's config space load
  as suggested by Dr. David Alan Gilbert.
- Changed vfio_migration_set_state() to set or clear device state flags.
- Some nit fixes.

v6 -> v7:
- Fix build failures.

v5 -> v6:
- Fix build failure.

v4 -> v5:
- Added a descriptive comment about the sequence of access of members of
  structure vfio_device_migration_info to be followed, based on Alex's
  suggestion.
- Updated get dirty pages sequence.
- As per Cornelia Huck's suggestion, added callbacks to VFIODeviceOps to
  get_object, save_config and load_config.
- Fixed multiple nitpicks.
- Tested live migration with multiple vfio devices assigned to a VM.

v3 -> v4:
- Added one more bit for _RESUMING flag to be set explicitly.
- data_offset field is read-only for user space application.
- data_size is read on every iteration before reading data from the
  migration region; this removes the assumption that data extends to the
  end of the migration region.
- If the vendor driver supports mappable sparse regions, map those regions
  during the setup state of save/load, and similarly unmap them from the
  cleanup routines.
- Handle the race condition that causes data corruption in the migration
  region during save of device state by adding a mutex and serializing the
  save_buffer and get_dirty_pages routines.
- Skip calling the get_dirty_pages routine for mapped MMIO regions of the
  device.
- Added trace events.
- Split into multiple functional patches.

v2 -> v3:
- Removed enum of VFIO device states. Defined VFIO device state with 2
  bits.
- Re-structured vfio_device_migration_info to keep it minimal and defined
  action on read and write access on its members.

v1 -> v2:
- Defined MIGRATION region type and sub-type which should be used with
  region type capability.
- Re-structured vfio_device_migration_info. This structure will be placed
  at 0th offset of migration region.
- Replaced ioctl with read/write for trapped part of migration region.
- Added support for both types of access, trapped or mmapped, for the data
  section of the region.
- Moved PCI device functions to the pci file.
- Added iteration to get the dirty page bitmap until the bitmap for all
  requested pages is copied.

Thanks,
Kirti



Kirti Wankhede (5):
  vfio: KABI for migration interface for device state
  vfio iommu: Adds flag to indicate dirty pages tracking capability
    support
  vfio iommu: Add ioctl defination for dirty pages tracking.
  vfio iommu: Implementation of ioctl to for dirty pages tracking.
  vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap

 drivers/vfio/vfio_iommu_type1.c | 276 +++++++++++++++++++++++++++++++++++++---
 include/uapi/linux/vfio.h       | 240 +++++++++++++++++++++++++++++++++-
 2 files changed, 499 insertions(+), 17 deletions(-)

-- 
2.7.0



* [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
  2019-12-16 20:21 [PATCH v10 Kernel 0/5] KABIs to support migration for VFIO devices Kirti Wankhede
@ 2019-12-16 20:21 ` Kirti Wankhede
  2019-12-16 22:44   ` Alex Williamson
  2019-12-16 20:21 ` [PATCH v10 Kernel 2/5] vfio iommu: Adds flag to indicate dirty pages tracking capability support Kirti Wankhede
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 44+ messages in thread
From: Kirti Wankhede @ 2019-12-16 20:21 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

- Defined MIGRATION region type and sub-type.

- Defined vfio_device_migration_info structure which will be placed at 0th
  offset of migration region to get/set VFIO device related information.
  Defined members of structure and usage on read/write access.

- Defined device states and added state transition details in the comment.

- Added the sequence to be followed while saving and resuming VFIO device
  state; a brief user-space sketch of the save side is included below.
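
As an illustration only (not part of the uAPI), a rough user-space sketch
of one iteration of the save-side sequence documented in the new comment;
region_off would come from VFIO_DEVICE_GET_REGION_INFO and all error
handling is omitted:

#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/vfio.h>

#define MIG_FIELD(f) offsetof(struct vfio_device_migration_info, f)

/* One a..e iteration of the pre-copy/stop-and-copy loop; returns 0 when
 * pending_bytes is 0, i.e. no more device data in this phase. */
static int save_iteration(int device_fd, off_t region_off, int out_fd)
{
        uint64_t pending, data_offset, data_size;
        void *buf;

        /* a. read pending_bytes, starting (or finishing) an iteration */
        pread(device_fd, &pending, sizeof(pending),
              region_off + MIG_FIELD(pending_bytes));
        if (!pending)
                return 0;

        /* b. read data_offset, after which the data is available */
        pread(device_fd, &data_offset, sizeof(data_offset),
              region_off + MIG_FIELD(data_offset));

        /* c. read data_size, the amount of data available this round */
        pread(device_fd, &data_size, sizeof(data_size),
              region_off + MIG_FIELD(data_size));

        /* d./e. read and process (here: just store) the opaque data */
        buf = malloc(data_size);
        pread(device_fd, buf, data_size, region_off + data_offset);
        write(out_fd, buf, data_size);
        free(buf);

        return 1;       /* f. caller loops back to a. */
}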

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 include/uapi/linux/vfio.h | 180 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 180 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9e843a147ead..a0817ba267c1 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
 #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
 #define VFIO_REGION_TYPE_GFX                    (1)
 #define VFIO_REGION_TYPE_CCW			(2)
+#define VFIO_REGION_TYPE_MIGRATION              (3)
 
 /* sub-types for VFIO_REGION_TYPE_PCI_* */
 
@@ -379,6 +380,185 @@ struct vfio_region_gfx_edid {
 /* sub-types for VFIO_REGION_TYPE_CCW */
 #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
 
+/* sub-types for VFIO_REGION_TYPE_MIGRATION */
+#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
+
+/*
+ * Structure vfio_device_migration_info is placed at 0th offset of
+ * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
+ * information. Field accesses from this structure are only supported at their
+ * native width and alignment, otherwise the result is undefined and vendor
+ * drivers should return an error.
+ *
+ * device_state: (read/write)
+ *      To indicate vendor driver the state VFIO device should be transitioned
+ *      to. If device state transition fails, write on this field return error.
+ *      It consists of 3 bits:
+ *      - If bit 0 set, indicates _RUNNING state. When its clear, that indicates
+ *        _STOP state. When device is changed to _STOP, driver should stop
+ *        device before write() returns.
+ *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
+ *        should start gathering device state information which will be provided
+ *        to VFIO user space application to save device's state.
+ *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
+ *        prepare to resume device, data provided through migration region
+ *        should be used to resume device.
+ *      Bits 3 - 31 are reserved for future use. User should perform
+ *      read-modify-write operation on this field.
+ *
+ *  +------- _RESUMING
+ *  |+------ _SAVING
+ *  ||+----- _RUNNING
+ *  |||
+ *  000b => Device Stopped, not saving or resuming
+ *  001b => Device running state, default state
+ *  010b => Stop Device & save device state, stop-and-copy state
+ *  011b => Device running and save device state, pre-copy state
+ *  100b => Device stopped and device state is resuming
+ *  101b => Invalid state
+ *  110b => Invalid state
+ *  111b => Invalid state
+ *
+ * State transitions:
+ *
+ *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
+ *                (100b)     (001b)     (011b)        (010b)       (000b)
+ * 0. Running or Default state
+ *                             |
+ *
+ * 1. Normal Shutdown
+ *                             |------------------------------------->|
+ *
+ * 2. Save state or Suspend
+ *                             |------------------------->|---------->|
+ *
+ * 3. Save state during live migration
+ *                             |----------->|------------>|---------->|
+ *
+ * 4. Resuming
+ *                  |<---------|
+ *
+ * 5. Resumed
+ *                  |--------->|
+ *
+ * 0. Default state of VFIO device is _RUNNNG when VFIO application starts.
+ * 1. During normal VFIO application shutdown, vfio device state changes
+ *    from _RUNNING to _STOP.
+ * 2. When VFIO application save state or suspend application, VFIO device
+ *    state transition is from _RUNNING to stop-and-copy state and then to
+ *    _STOP.
+ *    On state transition from _RUNNING to stop-and-copy, driver must
+ *    stop device, save device state and send it to application through
+ *    migration region.
+ *    On _RUNNING to stop-and-copy state transition failure, application should
+ *    set VFIO device state to _RUNNING.
+ * 3. In VFIO application live migration, state transition is from _RUNNING
+ *    to pre-copy to stop-and-copy to _STOP.
+ *    On state transition from _RUNNING to pre-copy, driver should start
+ *    gathering device state while application is still running and send device
+ *    state data to application through migration region.
+ *    On state transition from pre-copy to stop-and-copy, driver must stop
+ *    device, save device state and send it to application through migration
+ *    region.
+ *    On any failure during any of these state transition, VFIO device state
+ *    should be set to _RUNNING.
+ * 4. To start resuming phase, VFIO device state should be transitioned from
+ *    _RUNNING to _RESUMING state.
+ *    In _RESUMING state, driver should use received device state data through
+ *    migration region to resume device.
+ *    On failure during this state transition, application should set _RUNNING
+ *    state.
+ * 5. On providing saved device data to driver, appliation should change state
+ *    from _RESUMING to _RUNNING.
+ *    On failure to transition to _RUNNING state, VFIO application should reset
+ *    the device and set _RUNNING state so that device doesn't remain in unknown
+ *    or bad state. On reset, driver must reset device and device should be
+ *    available in default usable state.
+ *
+ * pending bytes: (read only)
+ *      Number of pending bytes yet to be migrated from vendor driver
+ *
+ * data_offset: (read only)
+ *      User application should read data_offset in migration region from where
+ *      user application should read device data during _SAVING state or write
+ *      device data during _RESUMING state. See below for detail of sequence to
+ *      be followed.
+ *
+ * data_size: (read/write)
+ *      User application should read data_size to get size of data copied in
+ *      bytes in migration region during _SAVING state and write size of data
+ *      copied in bytes in migration region during _RESUMING state.
+ *
+ * Migration region looks like:
+ *  ------------------------------------------------------------------
+ * |vfio_device_migration_info|    data section                      |
+ * |                          |     ///////////////////////////////  |
+ * ------------------------------------------------------------------
+ *   ^                              ^
+ *  offset 0-trapped part        data_offset
+ *
+ * Structure vfio_device_migration_info is always followed by data section in
+ * the region, so data_offset will always be non-0. Offset from where data is
+ * copied is decided by kernel driver, data section can be trapped or mapped
+ * or partitioned, depending on how kernel driver defines data section.
+ * Data section partition can be defined as mapped by sparse mmap capability.
+ * If mmapped, then data_offset should be page aligned, where as initial section
+ * which contain vfio_device_migration_info structure might not end at offset
+ * which is page aligned. The user is not required to access via mmap regardless
+ * of the region mmap capabilities.
+ * Vendor driver should decide whether to partition data section and how to
+ * partition the data section. Vendor driver should return data_offset
+ * accordingly.
+ *
+ * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
+ * and for _SAVING device state or stop-and-copy phase:
+ * a. read pending_bytes, indicates start of new iteration to get device data.
+ *    If there was previous iteration, then this read operation indicates
+ *    previous iteration is done. If pending_bytes > 0, go through below steps.
+ * b. read data_offset, indicates kernel driver to make data available through
+ *    data section. Kernel driver should return this read operation only after
+ *    data is available from (region + data_offset) to (region + data_offset +
+ *    data_size).
+ * c. read data_size, amount of data in bytes available through migration
+ *    region.
+ * d. read data of data_size bytes from (region + data_offset) from migration
+ *    region.
+ * e. process data.
+ * f. Loop through a to e.
+ *
+ * Sequence to be followed while _RESUMING device state:
+ * While data for this device is available, repeat below steps:
+ * a. read data_offset from where user application should write data.
+ * b. write data of data_size to migration region from data_offset.
+ * c. write data_size which indicates vendor driver that data is written in
+ *    staging buffer. Vendor driver should read this data from migration
+ *    region and resume device's state.
+ *
+ * For user application, data is opaque. User should write data in the same
+ * order as received.
+ */
+
+struct vfio_device_migration_info {
+	__u32 device_state;         /* VFIO device state */
+#define VFIO_DEVICE_STATE_STOP      (1 << 0)
+#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
+#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
+#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
+#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
+				     VFIO_DEVICE_STATE_SAVING |  \
+				     VFIO_DEVICE_STATE_RESUMING)
+
+#define VFIO_DEVICE_STATE_INVALID_CASE1    (VFIO_DEVICE_STATE_SAVING | \
+					    VFIO_DEVICE_STATE_RESUMING)
+
+#define VFIO_DEVICE_STATE_INVALID_CASE2    (VFIO_DEVICE_STATE_RUNNING | \
+					    VFIO_DEVICE_STATE_RESUMING)
+	__u32 reserved;
+	__u64 pending_bytes;
+	__u64 data_offset;
+	__u64 data_size;
+} __attribute__((packed));
+
 /*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within
-- 
2.7.0



* [PATCH v10 Kernel 2/5] vfio iommu: Adds flag to indicate dirty pages tracking capability support
  2019-12-16 20:21 [PATCH v10 Kernel 0/5] KABIs to support migration for VFIO devices Kirti Wankhede
  2019-12-16 20:21 ` [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state Kirti Wankhede
@ 2019-12-16 20:21 ` Kirti Wankhede
  2019-12-16 23:16   ` Alex Williamson
  2019-12-16 20:21 ` [PATCH v10 Kernel 3/5] vfio iommu: Add ioctl defination for dirty pages tracking Kirti Wankhede
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 44+ messages in thread
From: Kirti Wankhede @ 2019-12-16 20:21 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

The flag VFIO_IOMMU_INFO_DIRTY_PGS in VFIO_IOMMU_GET_INFO indicates that
the driver supports dirty pages tracking.
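
A user-space application would be expected to probe this flag before using
the dirty page tracking interface; a minimal sketch (error handling
omitted):

#include <sys/ioctl.h>
#include <linux/vfio.h>

static int supports_dirty_tracking(int container_fd)
{
        struct vfio_iommu_type1_info info = { .argsz = sizeof(info) };

        ioctl(container_fd, VFIO_IOMMU_GET_INFO, &info);
        return !!(info.flags & VFIO_IOMMU_INFO_DIRTY_PGS);
}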

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 drivers/vfio/vfio_iommu_type1.c | 3 ++-
 include/uapi/linux/vfio.h       | 5 +++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..3f6b04f2334f 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2234,7 +2234,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 			info.cap_offset = 0; /* output, no-recopy necessary */
 		}
 
-		info.flags = VFIO_IOMMU_INFO_PGSIZES;
+		info.flags = VFIO_IOMMU_INFO_PGSIZES |
+			     VFIO_IOMMU_INFO_DIRTY_PGS;
 
 		info.iova_pgsizes = vfio_pgsize_bitmap(iommu);
 
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index a0817ba267c1..81847ed54eb7 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -900,8 +900,9 @@ struct vfio_device_ioeventfd {
 struct vfio_iommu_type1_info {
 	__u32	argsz;
 	__u32	flags;
-#define VFIO_IOMMU_INFO_PGSIZES (1 << 0)	/* supported page sizes info */
-#define VFIO_IOMMU_INFO_CAPS	(1 << 1)	/* Info supports caps */
+#define VFIO_IOMMU_INFO_PGSIZES   (1 << 0) /* supported page sizes info */
+#define VFIO_IOMMU_INFO_CAPS      (1 << 1) /* Info supports caps */
+#define VFIO_IOMMU_INFO_DIRTY_PGS (1 << 2) /* supports dirty page tracking */
 	__u64	iova_pgsizes;	/* Bitmap of supported page sizes */
 	__u32   cap_offset;	/* Offset within info struct of first cap */
 };
-- 
2.7.0



* [PATCH v10 Kernel 3/5] vfio iommu: Add ioctl defination for dirty pages tracking.
  2019-12-16 20:21 [PATCH v10 Kernel 0/5] KABIs to support migration for VFIO devices Kirti Wankhede
  2019-12-16 20:21 ` [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state Kirti Wankhede
  2019-12-16 20:21 ` [PATCH v10 Kernel 2/5] vfio iommu: Adds flag to indicate dirty pages tracking capability support Kirti Wankhede
@ 2019-12-16 20:21 ` Kirti Wankhede
  2019-12-16 20:21 ` [PATCH v10 Kernel 4/5] vfio iommu: Implementation of ioctl to " Kirti Wankhede
  2019-12-16 20:21 ` [PATCH v10 Kernel 5/5] vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap Kirti Wankhede
  4 siblings, 0 replies; 44+ messages in thread
From: Kirti Wankhede @ 2019-12-16 20:21 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

The IOMMU container maintains a list of all pages pinned by the
vfio_pin_pages API. All pages pinned by the vendor driver through this API
should be considered dirty during migration. When the container contains an
IOMMU capable device and all pages are pinned and mapped, then all pages
are marked dirty.
Added support to start/stop unpinned pages tracking and to get the bitmap
of all dirtied pages for a requested IO virtual address range. Unpinned
page tracking is cleared either when the bitmap is read by the user
application or when unpinned page tracking is stopped.
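
For clarity, a small sketch of how user space would interpret the returned
bitmap, assuming 64-bit longs in the kernel bitmap layout; the helper name
is illustrative only:

#include <stdint.h>

/* Is the page containing 'addr' dirty?  'iova' and 'pgsize' are the values
 * passed in the ioctl, 'bitmap' is the buffer it filled in. */
static int iova_page_is_dirty(const uint64_t *bitmap, uint64_t iova,
                              uint64_t pgsize, uint64_t addr)
{
        uint64_t bit = (addr - iova) / pgsize;

        return !!(bitmap[bit / 64] & (1ULL << (bit % 64)));
}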

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 include/uapi/linux/vfio.h | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 81847ed54eb7..4ad54fbb4698 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -975,6 +975,49 @@ struct vfio_iommu_type1_dma_unmap {
 #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
 #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
 
+/**
+ * VFIO_IOMMU_DIRTY_PAGES - _IOWR(VFIO_TYPE, VFIO_BASE + 17,
+ *                                     struct vfio_iommu_type1_dirty_bitmap)
+ * IOCTL is used for dirty pages tracking. Caller sets argsz, which is size of
+ * struct vfio_iommu_type1_dirty_bitmap. Caller set flag depend on which
+ * operation to perform, details as below:
+ *
+ * When IOCTL is called with VFIO_IOMMU_DIRTY_PAGES_FLAG_START set, indicates
+ * migration is active and IOMMU module should track pages which are being
+ * unpinned. Unpinned pages are tracked until bitmap for that range is queried
+ * or tracking is stopped by user application by setting
+ * VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP flag.
+ *
+ * When IOCTL is called with VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP set, indicates
+ * IOMMU should stop tracking unpinned pages and also free previously tracked
+ * unpinned pages data.
+ *
+ * When IOCTL is called with VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP flag set,
+ * IOCTL returns dirty pages bitmap for IOMMU container during migration for
+ * given IOVA range. User must allocate memory to get bitmap, zero the bitmap
+ * memory and set size of allocated memory in bitmap_size field. One bit is
+ * used to represent one page consecutively starting from iova offset. User
+ * should provide page size in 'pgsize'. Bit set in bitmap indicates page at
+ * that offset from iova is dirty.
+ *
+ * Only one flag should be set at a time.
+ *
+ */
+struct vfio_iommu_type1_dirty_bitmap {
+	__u32        argsz;
+	__u32        flags;
+#define VFIO_IOMMU_DIRTY_PAGES_FLAG_START	(1 << 0)
+#define VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP	(1 << 1)
+#define VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP	(1 << 2)
+	__u64        iova;		/* IO virtual address */
+	__u64        size;		/* Size of iova range */
+	__u64	     pgsize;		/* page size for bitmap */
+	__u64        bitmap_size;	/* in bytes */
+	void __user *bitmap;		/* one bit per page */
+};
+
+#define VFIO_IOMMU_DIRTY_PAGES             _IO(VFIO_TYPE, VFIO_BASE + 17)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.7.0



* [PATCH v10 Kernel 4/5] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2019-12-16 20:21 [PATCH v10 Kernel 0/5] KABIs to support migration for VFIO devices Kirti Wankhede
                   ` (2 preceding siblings ...)
  2019-12-16 20:21 ` [PATCH v10 Kernel 3/5] vfio iommu: Add ioctl defination for dirty pages tracking Kirti Wankhede
@ 2019-12-16 20:21 ` Kirti Wankhede
  2019-12-17  5:15   ` Yan Zhao
  2019-12-16 20:21 ` [PATCH v10 Kernel 5/5] vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap Kirti Wankhede
  4 siblings, 1 reply; 44+ messages in thread
From: Kirti Wankhede @ 2019-12-16 20:21 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

The VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
- Start dirty pages tracking of unpinned pages while migration is active
  and the device is running, i.e. during the pre-copy phase.
- Stop dirty pages tracking of unpinned pages. This is required to stop
  tracking if migration failed or was cancelled during the pre-copy phase.
  The unpinned pages tracking data is cleared.
- Get the dirty pages bitmap. This stops unpinned dirty pages tracking and
  clears the unpinned pages information on bitmap read. The ioctl returns
  the bitmap of dirty pages; it is the user space application's
  responsibility to copy the content of dirty pages from source to
  destination during migration.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 drivers/vfio/vfio_iommu_type1.c | 210 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 203 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 3f6b04f2334f..264449654d3f 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -70,6 +70,7 @@ struct vfio_iommu {
 	unsigned int		dma_avail;
 	bool			v2;
 	bool			nesting;
+	bool			dirty_page_tracking;
 };
 
 struct vfio_domain {
@@ -112,6 +113,7 @@ struct vfio_pfn {
 	dma_addr_t		iova;		/* Device address */
 	unsigned long		pfn;		/* Host pfn */
 	atomic_t		ref_count;
+	bool			unpinned;
 };
 
 struct vfio_regions {
@@ -244,6 +246,32 @@ static void vfio_remove_from_pfn_list(struct vfio_dma *dma,
 	kfree(vpfn);
 }
 
+static void vfio_remove_unpinned_from_pfn_list(struct vfio_dma *dma, bool warn)
+{
+	struct rb_node *n = rb_first(&dma->pfn_list);
+
+	for (; n; n = rb_next(n)) {
+		struct vfio_pfn *vpfn = rb_entry(n, struct vfio_pfn, node);
+
+		if (warn)
+			WARN_ON_ONCE(vpfn->unpinned);
+
+		if (vpfn->unpinned)
+			vfio_remove_from_pfn_list(dma, vpfn);
+	}
+}
+
+static void vfio_remove_unpinned_from_dma_list(struct vfio_iommu *iommu)
+{
+	struct rb_node *n = rb_first(&iommu->dma_list);
+
+	for (; n; n = rb_next(n)) {
+		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
+
+		vfio_remove_unpinned_from_pfn_list(dma, false);
+	}
+}
+
 static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
 					       unsigned long iova)
 {
@@ -254,13 +282,17 @@ static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
 	return vpfn;
 }
 
-static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
+static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn,
+				  bool dirty_tracking)
 {
 	int ret = 0;
 
 	if (atomic_dec_and_test(&vpfn->ref_count)) {
 		ret = put_pfn(vpfn->pfn, dma->prot);
-		vfio_remove_from_pfn_list(dma, vpfn);
+		if (dirty_tracking)
+			vpfn->unpinned = true;
+		else
+			vfio_remove_from_pfn_list(dma, vpfn);
 	}
 	return ret;
 }
@@ -504,7 +536,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
 }
 
 static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
-				    bool do_accounting)
+				    bool do_accounting, bool dirty_tracking)
 {
 	int unlocked;
 	struct vfio_pfn *vpfn = vfio_find_vpfn(dma, iova);
@@ -512,7 +544,10 @@ static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
 	if (!vpfn)
 		return 0;
 
-	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn);
+	if (vpfn->unpinned)
+		return 0;
+
+	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn, dirty_tracking);
 
 	if (do_accounting)
 		vfio_lock_acct(dma, -unlocked, true);
@@ -583,7 +618,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 
 		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
 		if (ret) {
-			vfio_unpin_page_external(dma, iova, do_accounting);
+			vfio_unpin_page_external(dma, iova, do_accounting,
+						 false);
 			goto pin_unwind;
 		}
 	}
@@ -598,7 +634,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 
 		iova = user_pfn[j] << PAGE_SHIFT;
 		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
-		vfio_unpin_page_external(dma, iova, do_accounting);
+		vfio_unpin_page_external(dma, iova, do_accounting, false);
 		phys_pfn[j] = 0;
 	}
 pin_done:
@@ -632,7 +668,8 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
 		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
 		if (!dma)
 			goto unpin_exit;
-		vfio_unpin_page_external(dma, iova, do_accounting);
+		vfio_unpin_page_external(dma, iova, do_accounting,
+					 iommu->dirty_page_tracking);
 	}
 
 unpin_exit:
@@ -850,6 +887,88 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
 	return bitmap;
 }
 
+/*
+ * start_iova is the reference from where bitmaping started. This is called
+ * from DMA_UNMAP where start_iova can be different than iova
+ */
+
+static void vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
+				  size_t size, uint64_t pgsize,
+				  dma_addr_t start_iova, unsigned long *bitmap)
+{
+	struct vfio_dma *dma;
+	dma_addr_t i = iova;
+	unsigned long pgshift = __ffs(pgsize);
+
+	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
+		/* mark all pages dirty if all pages are pinned and mapped. */
+		if (dma->iommu_mapped) {
+			dma_addr_t iova_limit;
+
+			iova_limit = (dma->iova + dma->size) < (iova + size) ?
+				     (dma->iova + dma->size) : (iova + size);
+
+			for (; i < iova_limit; i += pgsize) {
+				unsigned int start;
+
+				start = (i - start_iova) >> pgshift;
+
+				__bitmap_set(bitmap, start, 1);
+			}
+			if (i >= iova + size)
+				return;
+		} else {
+			struct rb_node *n = rb_first(&dma->pfn_list);
+			bool found = false;
+
+			for (; n; n = rb_next(n)) {
+				struct vfio_pfn *vpfn = rb_entry(n,
+							struct vfio_pfn, node);
+				if (vpfn->iova >= i) {
+					found = true;
+					break;
+				}
+			}
+
+			if (!found) {
+				i += dma->size;
+				continue;
+			}
+
+			for (; n; n = rb_next(n)) {
+				unsigned int start;
+				struct vfio_pfn *vpfn = rb_entry(n,
+							struct vfio_pfn, node);
+
+				if (vpfn->iova >= iova + size)
+					return;
+
+				start = (vpfn->iova - start_iova) >> pgshift;
+
+				__bitmap_set(bitmap, start, 1);
+
+				i = vpfn->iova + pgsize;
+			}
+		}
+		vfio_remove_unpinned_from_pfn_list(dma, false);
+	}
+}
+
+static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
+{
+	long bsize;
+
+	if (!bitmap_size || bitmap_size > SIZE_MAX)
+		return -EINVAL;
+
+	bsize = ALIGN(npages, BITS_PER_LONG) / sizeof(unsigned long);
+
+	if (bitmap_size < bsize)
+		return -EINVAL;
+
+	return bsize;
+}
+
 static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 			     struct vfio_iommu_type1_dma_unmap *unmap)
 {
@@ -2298,6 +2417,83 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 
 		return copy_to_user((void __user *)arg, &unmap, minsz) ?
 			-EFAULT : 0;
+	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
+		struct vfio_iommu_type1_dirty_bitmap range;
+		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
+				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
+				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
+		int ret;
+
+		if (!iommu->v2)
+			return -EACCES;
+
+		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
+				    bitmap);
+
+		if (copy_from_user(&range, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (range.argsz < minsz || range.flags & ~mask)
+			return -EINVAL;
+
+		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
+			iommu->dirty_page_tracking = true;
+			return 0;
+		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
+			iommu->dirty_page_tracking = false;
+
+			mutex_lock(&iommu->lock);
+			vfio_remove_unpinned_from_dma_list(iommu);
+			mutex_unlock(&iommu->lock);
+			return 0;
+
+		} else if (range.flags &
+				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
+			uint64_t iommu_pgmask;
+			unsigned long pgshift = __ffs(range.pgsize);
+			unsigned long *bitmap;
+			long bsize;
+
+			iommu_pgmask =
+			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
+
+			if (((range.pgsize - 1) & iommu_pgmask) !=
+			    (range.pgsize - 1))
+				return -EINVAL;
+
+			if (range.iova & iommu_pgmask)
+				return -EINVAL;
+			if (!range.size || range.size > SIZE_MAX)
+				return -EINVAL;
+			if (range.iova + range.size < range.iova)
+				return -EINVAL;
+
+			bsize = verify_bitmap_size(range.size >> pgshift,
+						   range.bitmap_size);
+			if (bsize)
+				return ret;
+
+			bitmap = kmalloc(bsize, GFP_KERNEL);
+			if (!bitmap)
+				return -ENOMEM;
+
+			ret = copy_from_user(bitmap,
+			     (void __user *)range.bitmap, bsize) ? -EFAULT : 0;
+			if (ret)
+				goto bitmap_exit;
+
+			iommu->dirty_page_tracking = false;
+			mutex_lock(&iommu->lock);
+			vfio_iova_dirty_bitmap(iommu, range.iova, range.size,
+					     range.pgsize, range.iova, bitmap);
+			mutex_unlock(&iommu->lock);
+
+			ret = copy_to_user((void __user *)range.bitmap, bitmap,
+					   range.bitmap_size) ? -EFAULT : 0;
+bitmap_exit:
+			kfree(bitmap);
+			return ret;
+		}
 	}
 
 	return -ENOTTY;
-- 
2.7.0



* [PATCH v10 Kernel 5/5] vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap
  2019-12-16 20:21 [PATCH v10 Kernel 0/5] KABIs to support migration for VFIO devices Kirti Wankhede
                   ` (3 preceding siblings ...)
  2019-12-16 20:21 ` [PATCH v10 Kernel 4/5] vfio iommu: Implementation of ioctl to " Kirti Wankhede
@ 2019-12-16 20:21 ` Kirti Wankhede
  4 siblings, 0 replies; 44+ messages in thread
From: Kirti Wankhede @ 2019-12-16 20:21 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

Pages pinned by the external interface for a requested IO virtual address
range might get unpinned and unmapped while migration is active and the
device is still running, that is, in the pre-copy phase while the guest
driver could still access those pages. The host device can write to these
pages while they were mapped. Such pages should be marked dirty so that
after migration the guest driver is still able to complete the operation.

To get the bitmap during unmap, the user should set the flag
VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP; the bitmap memory should be allocated
and zeroed by the user space application. The bitmap size and page size
should be set by the user application, as in the sketch below.
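
A minimal sketch of the resulting call, assuming the uAPI as proposed here;
the bitmap is allocated and zeroed by the caller as for
VFIO_IOMMU_DIRTY_PAGES, and error handling is omitted:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Unmap an IOVA range and collect the dirty bitmap for it.  'bitmap' must
 * be pre-allocated and zeroed, sized for size/pgsize bits. */
static int unmap_and_get_dirty(int container_fd, uint64_t iova,
                               uint64_t size, uint64_t pgsize,
                               void *bitmap, uint64_t bitmap_bytes)
{
        struct vfio_iommu_type1_dma_unmap unmap = {
                .argsz = sizeof(unmap),
                .flags = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP,
                .iova = iova,
                .size = size,
                .bitmap_pgsize = pgsize,
                .bitmap_size = bitmap_bytes,
                .bitmap = bitmap,
        };

        return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
}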

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 drivers/vfio/vfio_iommu_type1.c | 63 ++++++++++++++++++++++++++++++++++++-----
 include/uapi/linux/vfio.h       | 12 ++++++++
 2 files changed, 68 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 264449654d3f..6bd02a13903b 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -970,7 +970,8 @@ static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
 }
 
 static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
-			     struct vfio_iommu_type1_dma_unmap *unmap)
+			     struct vfio_iommu_type1_dma_unmap *unmap,
+			     unsigned long *bitmap)
 {
 	uint64_t mask;
 	struct vfio_dma *dma, *dma_last = NULL;
@@ -1045,6 +1046,15 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 		if (dma->task->mm != current->mm)
 			break;
 
+		if ((unmap->flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
+		    (dma_last != dma))
+			vfio_iova_dirty_bitmap(iommu, dma->iova, dma->size,
+					     unmap->bitmap_pgsize, unmap->iova,
+					     bitmap);
+		else
+			vfio_remove_unpinned_from_pfn_list(dma, true);
+
+
 		if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
 			struct vfio_iommu_type1_dma_unmap nb_unmap;
 
@@ -1070,6 +1080,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 						    &nb_unmap);
 			goto again;
 		}
+
 		unmapped += dma->size;
 		vfio_remove_dma(iommu, dma);
 	}
@@ -2401,22 +2412,60 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 
 	} else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
 		struct vfio_iommu_type1_dma_unmap unmap;
-		long ret;
+		unsigned long *bitmap = NULL;
+		long ret, bsize;
 
 		minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
 
-		if (copy_from_user(&unmap, (void __user *)arg, minsz))
+		if (copy_from_user(&unmap, (void __user *)arg, sizeof(unmap)))
 			return -EFAULT;
 
-		if (unmap.argsz < minsz || unmap.flags)
+		if (unmap.argsz < minsz ||
+		    unmap.flags & ~VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)
 			return -EINVAL;
 
-		ret = vfio_dma_do_unmap(iommu, &unmap);
+		if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
+			unsigned long pgshift = __ffs(unmap.bitmap_pgsize);
+			uint64_t iommu_pgmask =
+			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
+
+			if (((unmap.bitmap_pgsize - 1) & iommu_pgmask) !=
+			     (unmap.bitmap_pgsize - 1))
+				return -EINVAL;
+
+			bsize = verify_bitmap_size(unmap.size >> pgshift,
+						   unmap.bitmap_size);
+			if (bsize < 0)
+				return bsize;
+
+			bitmap = kmalloc(bsize, GFP_KERNEL);
+			if (!bitmap)
+				return -ENOMEM;
+
+			if (copy_from_user(bitmap, (void __user *)unmap.bitmap,
+					   bsize)) {
+				ret = -EFAULT;
+				goto unmap_exit;
+			}
+		}
+
+		ret = vfio_dma_do_unmap(iommu, &unmap, bitmap);
 		if (ret)
-			return ret;
+			goto unmap_exit;
 
-		return copy_to_user((void __user *)arg, &unmap, minsz) ?
+		if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
+			if (copy_to_user((void __user *)unmap.bitmap, bitmap,
+					  bsize)) {
+				ret = -EFAULT;
+				goto unmap_exit;
+			}
+		}
+
+		ret = copy_to_user((void __user *)arg, &unmap, minsz) ?
 			-EFAULT : 0;
+unmap_exit:
+		kfree(bitmap);
+		return ret;
 	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
 		struct vfio_iommu_type1_dirty_bitmap range;
 		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 4ad54fbb4698..7705aea7bdaf 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -958,12 +958,24 @@ struct vfio_iommu_type1_dma_map {
  * field.  No guarantee is made to the user that arbitrary unmaps of iova
  * or size different from those used in the original mapping call will
  * succeed.
+ * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get dirty bitmap
+ * before unmapping IO virtual addresses. When this flag is set, user should
+ * allocate memory to get bitmap, clear the bitmap memory by setting zero and
+ * should set size of allocated memory in bitmap_size field. One bit in bitmap
+ * represents per page , page of user provided page size in 'bitmap_pgsize',
+ * consecutively starting from iova offset. Bit set indicates page at that
+ * offset from iova is dirty. Bitmap of pages in the range of unmapped size is
+ * returned in bitmap.
  */
 struct vfio_iommu_type1_dma_unmap {
 	__u32	argsz;
 	__u32	flags;
+#define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
 	__u64	iova;				/* IO virtual address */
 	__u64	size;				/* Size of mapping (bytes) */
+	__u64        bitmap_pgsize;		/* page size for bitmap */
+	__u64        bitmap_size;               /* in bytes */
+	void __user *bitmap;                    /* one bit per page */
 };
 
 #define VFIO_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
-- 
2.7.0



* Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
  2019-12-16 20:21 ` [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state Kirti Wankhede
@ 2019-12-16 22:44   ` Alex Williamson
  2019-12-17  6:28     ` Kirti Wankhede
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2019-12-16 22:44 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Tue, 17 Dec 2019 01:51:36 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> - Defined MIGRATION region type and sub-type.
> 
> - Defined vfio_device_migration_info structure which will be placed at 0th
>   offset of migration region to get/set VFIO device related information.
>   Defined members of structure and usage on read/write access.
> 
> - Defined device states and added state transition details in the comment.
> 
> - Added sequence to be followed while saving and resuming VFIO device state
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  include/uapi/linux/vfio.h | 180 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 180 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9e843a147ead..a0817ba267c1 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
>  #define VFIO_REGION_TYPE_GFX                    (1)
>  #define VFIO_REGION_TYPE_CCW			(2)
> +#define VFIO_REGION_TYPE_MIGRATION              (3)
>  
>  /* sub-types for VFIO_REGION_TYPE_PCI_* */
>  
> @@ -379,6 +380,185 @@ struct vfio_region_gfx_edid {
>  /* sub-types for VFIO_REGION_TYPE_CCW */
>  #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
>  
> +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
> +#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
> +
> +/*
> + * Structure vfio_device_migration_info is placed at 0th offset of
> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> + * information. Field accesses from this structure are only supported at their
> + * native width and alignment, otherwise the result is undefined and vendor
> + * drivers should return an error.
> + *
> + * device_state: (read/write)
> + *      To indicate vendor driver the state VFIO device should be transitioned
> + *      to. If device state transition fails, write on this field return error.
> + *      It consists of 3 bits:
> + *      - If bit 0 set, indicates _RUNNING state. When its clear, that indicates

s/its/it's/

> + *        _STOP state. When device is changed to _STOP, driver should stop
> + *        device before write() returns.
> + *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
> + *        should start gathering device state information which will be provided
> + *        to VFIO user space application to save device's state.
> + *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
> + *        prepare to resume device, data provided through migration region
> + *        should be used to resume device.
> + *      Bits 3 - 31 are reserved for future use. User should perform
> + *      read-modify-write operation on this field.
> + *
> + *  +------- _RESUMING
> + *  |+------ _SAVING
> + *  ||+----- _RUNNING
> + *  |||
> + *  000b => Device Stopped, not saving or resuming
> + *  001b => Device running state, default state
> + *  010b => Stop Device & save device state, stop-and-copy state
> + *  011b => Device running and save device state, pre-copy state
> + *  100b => Device stopped and device state is resuming
> + *  101b => Invalid state

Eventually this would be intended for post-copy, if supported by the
device, right?

> + *  110b => Invalid state
> + *  111b => Invalid state
> + *
> + * State transitions:
> + *
> + *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
> + *                (100b)     (001b)     (011b)        (010b)       (000b)
> + * 0. Running or Default state
> + *                             |
> + *
> + * 1. Normal Shutdown

Optional, userspace is under no obligation.

> + *                             |------------------------------------->|
> + *
> + * 2. Save state or Suspend
> + *                             |------------------------->|---------->|
> + *
> + * 3. Save state during live migration
> + *                             |----------->|------------>|---------->|
> + *
> + * 4. Resuming
> + *                  |<---------|
> + *
> + * 5. Resumed
> + *                  |--------->|
> + *
> + * 0. Default state of VFIO device is _RUNNNG when VFIO application starts.
> + * 1. During normal VFIO application shutdown, vfio device state changes
> + *    from _RUNNING to _STOP.

We cannot impose this requirement on existing userspace.  Userspace may
perform this action, but they are not required to and the vendor driver
must not require it.

> + * 2. When VFIO application save state or suspend application, VFIO device
> + *    state transition is from _RUNNING to stop-and-copy state and then to
> + *    _STOP.
> + *    On state transition from _RUNNING to stop-and-copy, driver must
> + *    stop device, save device state and send it to application through
> + *    migration region.
> + *    On _RUNNING to stop-and-copy state transition failure, application should
> + *    set VFIO device state to _RUNNING.

A state transition failure means that the user's write to device_state
failed, so is it the user's responsibility to set the next state?  Why
is it necessarily _RUNNING vs _STOP?

> + * 3. In VFIO application live migration, state transition is from _RUNNING
> + *    to pre-copy to stop-and-copy to _STOP.
> + *    On state transition from _RUNNING to pre-copy, driver should start
> + *    gathering device state while application is still running and send device
> + *    state data to application through migration region.
> + *    On state transition from pre-copy to stop-and-copy, driver must stop
> + *    device, save device state and send it to application through migration
> + *    region.
> + *    On any failure during any of these state transition, VFIO device state
> + *    should be set to _RUNNING.

Same comment as above regarding next state on failure.

Also, it seems like it's the vendor driver's discretion to actually
provide data during the pre-copy phase.  As we've defined it, the
vendor driver needs to participate in the migration region regardless,
they might just always report no pending_bytes until we enter
stop-and-copy.

> + * 4. To start resuming phase, VFIO device state should be transitioned from
> + *    _RUNNING to _RESUMING state.
> + *    In _RESUMING state, driver should use received device state data through
> + *    migration region to resume device.
> + *    On failure during this state transition, application should set _RUNNING
> + *    state.

Same comment regarding setting next state after failure.

> + * 5. On providing saved device data to driver, appliation should change state
> + *    from _RESUMING to _RUNNING.
> + *    On failure to transition to _RUNNING state, VFIO application should reset
> + *    the device and set _RUNNING state so that device doesn't remain in unknown
> + *    or bad state. On reset, driver must reset device and device should be
> + *    available in default usable state.

Didn't we discuss that the reset ioctl should return the device to the
initial state, including the transition to _RUNNING?  Also, as above,
it's the user write that triggers the failure, this register is listed
as read-write, so what value does the vendor driver report for the
state when read after a transition failure?  Is it reported as _RESUMING
as it was prior to the attempted transition, or may the invalid states
be used by the vendor driver to indicate the device is broken?

> + *
> + * pending bytes: (read only)
> + *      Number of pending bytes yet to be migrated from vendor driver
> + *
> + * data_offset: (read only)
> + *      User application should read data_offset in migration region from where
> + *      user application should read device data during _SAVING state or write
> + *      device data during _RESUMING state. See below for detail of sequence to
> + *      be followed.
> + *
> + * data_size: (read/write)
> + *      User application should read data_size to get size of data copied in
> + *      bytes in migration region during _SAVING state and write size of data
> + *      copied in bytes in migration region during _RESUMING state.
> + *
> + * Migration region looks like:
> + *  ------------------------------------------------------------------
> + * |vfio_device_migration_info|    data section                      |
> + * |                          |     ///////////////////////////////  |
> + * ------------------------------------------------------------------
> + *   ^                              ^
> + *  offset 0-trapped part        data_offset
> + *
> + * Structure vfio_device_migration_info is always followed by data section in
> + * the region, so data_offset will always be non-0. Offset from where data is
> + * copied is decided by kernel driver, data section can be trapped or mapped
> + * or partitioned, depending on how kernel driver defines data section.
> + * Data section partition can be defined as mapped by sparse mmap capability.
> + * If mmapped, then data_offset should be page aligned, where as initial section
> + * which contain vfio_device_migration_info structure might not end at offset
> + * which is page aligned. The user is not required to access via mmap regardless
> + * of the region mmap capabilities.
> + * Vendor driver should decide whether to partition data section and how to
> + * partition the data section. Vendor driver should return data_offset
> + * accordingly.
> + *
> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> + * and for _SAVING device state or stop-and-copy phase:
> + * a. read pending_bytes, indicates start of new iteration to get device data.
> + *    If there was previous iteration, then this read operation indicates
> + *    previous iteration is done. If pending_bytes > 0, go through below steps.
> + * b. read data_offset, indicates kernel driver to make data available through
> + *    data section. Kernel driver should return this read operation only after
> + *    data is available from (region + data_offset) to (region + data_offset +
> + *    data_size).
> + * c. read data_size, amount of data in bytes available through migration
> + *    region.
> + * d. read data of data_size bytes from (region + data_offset) from migration
> + *    region.
> + * e. process data.
> + * f. Loop through a to e.

It seems we always need to end an iteration by reading pending_bytes to
signal to the vendor driver to release resources, so should the end of
the loop be:

e. Read pending_bytes
f. Goto b. or optionally restart next iteration at a.

I think this is defined such that reading data_offset commits resources
and reading pending_bytes frees them, allowing userspace to restart at
reading pending_bytes with no side-effects.  Therefore reading
pending_bytes repeatedly is supported.  Is the same true for
data_offset and data_size?  It seems reasonable that the vendor driver
can simply return offset and size for the current buffer if the user
reads these more than once.

How is a protocol or device error signaled?  For example, we can have a
user error where they read data_size before data_offset.  Should the
vendor driver generate a fault reading data_size in this case?  We can
also have internal errors in the vendor driver, should the vendor
driver use a special errno or update device_state autonomously to
indicate such an error?

I believe it's also part of the intended protocol that the user can
transition from _SAVING|_RUNNING to _SAVING at any point, regardless of
pending_bytes.  This should be noted.

> + *
> + * Sequence to be followed while _RESUMING device state:
> + * While data for this device is available, repeat below steps:
> + * a. read data_offset from where user application should write data.
> + * b. write data of data_size to migration region from data_offset.

Whose data_size, the _SAVING end or the _RESUMING end?  I think this
is intended to be the transaction size from the _SAVING source, but it
could easily be misinterpreted as reading data_size on the _RESUMING
end.

> + * c. write data_size which indicates vendor driver that data is written in
> + *    staging buffer. Vendor driver should read this data from migration
> + *    region and resume device's state.

I think we also need to define the error protocol.  The user could
mis-order transactions or there could be an internal error in the
vendor driver or device.  Are all read(2)/write(2) operations
susceptible to defined errnos to signal this?  Is it reflected in
device_state?  What's the recovery protocol?

> + *
> + * For user application, data is opaque. User should write data in the same
> + * order as received.

Order and transaction size, ie. each data_size chunk is indivisible by
the user.

> + */
> +
> +struct vfio_device_migration_info {
> +	__u32 device_state;         /* VFIO device state */
> +#define VFIO_DEVICE_STATE_STOP      (1 << 0)
> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)

Huh?  We should probably just refer to it consistently, ie. _RUNNING
and !_RUNNING, otherwise we have the incongruity that setting the _STOP
value is actually the opposite of the necessary logic value (_STOP = 1
is _RUNNING, _STOP = 0 is !_RUNNING).

> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> +				     VFIO_DEVICE_STATE_SAVING |  \
> +				     VFIO_DEVICE_STATE_RESUMING)
> +
> +#define VFIO_DEVICE_STATE_INVALID_CASE1    (VFIO_DEVICE_STATE_SAVING | \
> +					    VFIO_DEVICE_STATE_RESUMING)
> +
> +#define VFIO_DEVICE_STATE_INVALID_CASE2    (VFIO_DEVICE_STATE_RUNNING | \
> +					    VFIO_DEVICE_STATE_RESUMING)

Gack, we fixed these in the last iteration!

> +	__u32 reserved;
> +	__u64 pending_bytes;
> +	__u64 data_offset;
> +	__u64 data_size;
> +} __attribute__((packed));
> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within

Thanks,
Alex



* Re: [PATCH v10 Kernel 2/5] vfio iommu: Adds flag to indicate dirty pages tracking capability support
  2019-12-16 20:21 ` [PATCH v10 Kernel 2/5] vfio iommu: Adds flag to indicate dirty pages tracking capability support Kirti Wankhede
@ 2019-12-16 23:16   ` Alex Williamson
  2019-12-17  6:32     ` Kirti Wankhede
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2019-12-16 23:16 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Tue, 17 Dec 2019 01:51:37 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Flag VFIO_IOMMU_INFO_DIRTY_PGS in VFIO_IOMMU_GET_INFO indicates that driver
> support dirty pages tracking.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 3 ++-
>  include/uapi/linux/vfio.h       | 5 +++--
>  2 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 2ada8e6cdb88..3f6b04f2334f 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2234,7 +2234,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  			info.cap_offset = 0; /* output, no-recopy necessary */
>  		}
>  
> -		info.flags = VFIO_IOMMU_INFO_PGSIZES;
> +		info.flags = VFIO_IOMMU_INFO_PGSIZES |
> +			     VFIO_IOMMU_INFO_DIRTY_PGS;

Type1 shouldn't advertise it until it's supported though, right?
Thanks,

Alex

>  
>  		info.iova_pgsizes = vfio_pgsize_bitmap(iommu);
>  
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index a0817ba267c1..81847ed54eb7 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -900,8 +900,9 @@ struct vfio_device_ioeventfd {
>  struct vfio_iommu_type1_info {
>  	__u32	argsz;
>  	__u32	flags;
> -#define VFIO_IOMMU_INFO_PGSIZES (1 << 0)	/* supported page sizes info */
> -#define VFIO_IOMMU_INFO_CAPS	(1 << 1)	/* Info supports caps */
> +#define VFIO_IOMMU_INFO_PGSIZES   (1 << 0) /* supported page sizes info */
> +#define VFIO_IOMMU_INFO_CAPS      (1 << 1) /* Info supports caps */
> +#define VFIO_IOMMU_INFO_DIRTY_PGS (1 << 2) /* supports dirty page tracking */
>  	__u64	iova_pgsizes;	/* Bitmap of supported page sizes */
>  	__u32   cap_offset;	/* Offset within info struct of first cap */
>  };


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 4/5] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2019-12-16 20:21 ` [PATCH v10 Kernel 4/5] vfio iommu: Implementation of ioctl to " Kirti Wankhede
@ 2019-12-17  5:15   ` Yan Zhao
  2019-12-17  9:24     ` Kirti Wankhede
  0 siblings, 1 reply; 44+ messages in thread
From: Yan Zhao @ 2019-12-17  5:15 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng,
	Liu, Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

On Tue, Dec 17, 2019 at 04:21:39AM +0800, Kirti Wankhede wrote:
> VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
> - Start unpinned pages dirty pages tracking while migration is active and
>   device is running, i.e. during pre-copy phase.
> - Stop unpinned pages dirty pages tracking. This is required to stop
>   unpinned dirty pages tracking if migration failed or cancelled during
>   pre-copy phase. Unpinned pages tracking is clear.
> - Get dirty pages bitmap. Stop unpinned dirty pages tracking and clear
>   unpinned pages information on bitmap read. This ioctl returns bitmap of
>   dirty pages, its user space application responsibility to copy content
>   of dirty pages from source to destination during migration.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 210 ++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 203 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 3f6b04f2334f..264449654d3f 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -70,6 +70,7 @@ struct vfio_iommu {
>  	unsigned int		dma_avail;
>  	bool			v2;
>  	bool			nesting;
> +	bool			dirty_page_tracking;
>  };
>  
>  struct vfio_domain {
> @@ -112,6 +113,7 @@ struct vfio_pfn {
>  	dma_addr_t		iova;		/* Device address */
>  	unsigned long		pfn;		/* Host pfn */
>  	atomic_t		ref_count;
> +	bool			unpinned;
>  };
>  
>  struct vfio_regions {
> @@ -244,6 +246,32 @@ static void vfio_remove_from_pfn_list(struct vfio_dma *dma,
>  	kfree(vpfn);
>  }
>  
> +static void vfio_remove_unpinned_from_pfn_list(struct vfio_dma *dma, bool warn)
> +{
> +	struct rb_node *n = rb_first(&dma->pfn_list);
> +
> +	for (; n; n = rb_next(n)) {
> +		struct vfio_pfn *vpfn = rb_entry(n, struct vfio_pfn, node);
> +
> +		if (warn)
> +			WARN_ON_ONCE(vpfn->unpinned);
> +
> +		if (vpfn->unpinned)
> +			vfio_remove_from_pfn_list(dma, vpfn);
> +	}
> +}
> +
> +static void vfio_remove_unpinned_from_dma_list(struct vfio_iommu *iommu)
> +{
> +	struct rb_node *n = rb_first(&iommu->dma_list);
> +
> +	for (; n; n = rb_next(n)) {
> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> +
> +		vfio_remove_unpinned_from_pfn_list(dma, false);
> +	}
> +}
> +
>  static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
>  					       unsigned long iova)
>  {
> @@ -254,13 +282,17 @@ static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
>  	return vpfn;
>  }
>  
> -static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
> +static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn,
> +				  bool dirty_tracking)
>  {
>  	int ret = 0;
>  
>  	if (atomic_dec_and_test(&vpfn->ref_count)) {
>  		ret = put_pfn(vpfn->pfn, dma->prot);
if physical page here is put, it may cause problem when pin this iova
next time:
vfio_iommu_type1_pin_pages {
    ...
    vpfn = vfio_iova_get_vfio_pfn(dma, iova);
    if (vpfn) {
        phys_pfn[i] = vpfn->pfn;
        continue;
    }
    ...
}

> -		vfio_remove_from_pfn_list(dma, vpfn);
> +		if (dirty_tracking)
> +			vpfn->unpinned = true;
> +		else
> +			vfio_remove_from_pfn_list(dma, vpfn);
so the unpinned pages before dirty page tracking is not treated as
dirty?

>  	}
>  	return ret;
>  }
> @@ -504,7 +536,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
>  }
>  
>  static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
> -				    bool do_accounting)
> +				    bool do_accounting, bool dirty_tracking)
>  {
>  	int unlocked;
>  	struct vfio_pfn *vpfn = vfio_find_vpfn(dma, iova);
> @@ -512,7 +544,10 @@ static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
>  	if (!vpfn)
>  		return 0;
>  
> -	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn);
> +	if (vpfn->unpinned)
> +		return 0;
> +
> +	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn, dirty_tracking);
>  
>  	if (do_accounting)
>  		vfio_lock_acct(dma, -unlocked, true);
> @@ -583,7 +618,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  
>  		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
>  		if (ret) {
> -			vfio_unpin_page_external(dma, iova, do_accounting);
> +			vfio_unpin_page_external(dma, iova, do_accounting,
> +						 false);
>  			goto pin_unwind;
>  		}
>  	}
> @@ -598,7 +634,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  
>  		iova = user_pfn[j] << PAGE_SHIFT;
>  		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
> -		vfio_unpin_page_external(dma, iova, do_accounting);
> +		vfio_unpin_page_external(dma, iova, do_accounting, false);
>  		phys_pfn[j] = 0;
>  	}
>  pin_done:
> @@ -632,7 +668,8 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
>  		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
>  		if (!dma)
>  			goto unpin_exit;
> -		vfio_unpin_page_external(dma, iova, do_accounting);
> +		vfio_unpin_page_external(dma, iova, do_accounting,
> +					 iommu->dirty_page_tracking);
>  	}
>  
>  unpin_exit:
> @@ -850,6 +887,88 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
>  	return bitmap;
>  }
>  
> +/*
> + * start_iova is the reference from where bitmaping started. This is called
> + * from DMA_UNMAP where start_iova can be different than iova
> + */
> +
> +static void vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> +				  size_t size, uint64_t pgsize,
> +				  dma_addr_t start_iova, unsigned long *bitmap)
> +{
> +	struct vfio_dma *dma;
> +	dma_addr_t i = iova;
> +	unsigned long pgshift = __ffs(pgsize);
> +
> +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
> +		/* mark all pages dirty if all pages are pinned and mapped. */
> +		if (dma->iommu_mapped) {
This prevents pass-through devices from calling vfio_pin_pages to do
fine grained log dirty.
> +			dma_addr_t iova_limit;
> +
> +			iova_limit = (dma->iova + dma->size) < (iova + size) ?
> +				     (dma->iova + dma->size) : (iova + size);
> +
> +			for (; i < iova_limit; i += pgsize) {
> +				unsigned int start;
> +
> +				start = (i - start_iova) >> pgshift;
> +
> +				__bitmap_set(bitmap, start, 1);
> +			}
> +			if (i >= iova + size)
> +				return;
> +		} else {
> +			struct rb_node *n = rb_first(&dma->pfn_list);
> +			bool found = false;
> +
> +			for (; n; n = rb_next(n)) {
> +				struct vfio_pfn *vpfn = rb_entry(n,
> +							struct vfio_pfn, node);
> +				if (vpfn->iova >= i) {
> +					found = true;
> +					break;
> +				}
> +			}
> +
> +			if (!found) {
> +				i += dma->size;
> +				continue;
> +			}
> +
> +			for (; n; n = rb_next(n)) {
> +				unsigned int start;
> +				struct vfio_pfn *vpfn = rb_entry(n,
> +							struct vfio_pfn, node);
> +
> +				if (vpfn->iova >= iova + size)
> +					return;
> +
> +				start = (vpfn->iova - start_iova) >> pgshift;
> +
> +				__bitmap_set(bitmap, start, 1);
> +
> +				i = vpfn->iova + pgsize;
> +			}
> +		}
> +		vfio_remove_unpinned_from_pfn_list(dma, false);
> +	}
> +}
> +
> +static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
> +{
> +	long bsize;
> +
> +	if (!bitmap_size || bitmap_size > SIZE_MAX)
> +		return -EINVAL;
> +
> +	bsize = ALIGN(npages, BITS_PER_LONG) / sizeof(unsigned long);
> +
> +	if (bitmap_size < bsize)
> +		return -EINVAL;
> +
> +	return bsize;
> +}
> +
>  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  			     struct vfio_iommu_type1_dma_unmap *unmap)
>  {
> @@ -2298,6 +2417,83 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  
>  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>  			-EFAULT : 0;
> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
> +		struct vfio_iommu_type1_dirty_bitmap range;
> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> +		int ret;
> +
> +		if (!iommu->v2)
> +			return -EACCES;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
> +				    bitmap);
> +
> +		if (copy_from_user(&range, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (range.argsz < minsz || range.flags & ~mask)
> +			return -EINVAL;
> +
> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
> +			iommu->dirty_page_tracking = true;
> +			return 0;
> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
> +			iommu->dirty_page_tracking = false;
> +
> +			mutex_lock(&iommu->lock);
> +			vfio_remove_unpinned_from_dma_list(iommu);
> +			mutex_unlock(&iommu->lock);
> +			return 0;
> +
> +		} else if (range.flags &
> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
> +			uint64_t iommu_pgmask;
> +			unsigned long pgshift = __ffs(range.pgsize);
> +			unsigned long *bitmap;
> +			long bsize;
> +
> +			iommu_pgmask =
> +			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> +
> +			if (((range.pgsize - 1) & iommu_pgmask) !=
> +			    (range.pgsize - 1))
> +				return -EINVAL;
> +
> +			if (range.iova & iommu_pgmask)
> +				return -EINVAL;
> +			if (!range.size || range.size > SIZE_MAX)
> +				return -EINVAL;
> +			if (range.iova + range.size < range.iova)
> +				return -EINVAL;
> +
> +			bsize = verify_bitmap_size(range.size >> pgshift,
> +						   range.bitmap_size);
> +			if (bsize)
> +				return ret;
> +
> +			bitmap = kmalloc(bsize, GFP_KERNEL);
> +			if (!bitmap)
> +				return -ENOMEM;
> +
> +			ret = copy_from_user(bitmap,
> +			     (void __user *)range.bitmap, bsize) ? -EFAULT : 0;
> +			if (ret)
> +				goto bitmap_exit;
> +
> +			iommu->dirty_page_tracking = false;
why is iommu->dirty_page_tracking set to false here?
Presumably this ioctl can be called several times.

Thanks
Yan
> +			mutex_lock(&iommu->lock);
> +			vfio_iova_dirty_bitmap(iommu, range.iova, range.size,
> +					     range.pgsize, range.iova, bitmap);
> +			mutex_unlock(&iommu->lock);
> +
> +			ret = copy_to_user((void __user *)range.bitmap, bitmap,
> +					   range.bitmap_size) ? -EFAULT : 0;
> +bitmap_exit:
> +			kfree(bitmap);
> +			return ret;
> +		}
>  	}
>  
>  	return -ENOTTY;
> -- 
> 2.7.0
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
  2019-12-16 22:44   ` Alex Williamson
@ 2019-12-17  6:28     ` Kirti Wankhede
  2019-12-17  7:12       ` Yan Zhao
  2019-12-17 18:43       ` Alex Williamson
  0 siblings, 2 replies; 44+ messages in thread
From: Kirti Wankhede @ 2019-12-17  6:28 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm



On 12/17/2019 4:14 AM, Alex Williamson wrote:
> On Tue, 17 Dec 2019 01:51:36 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> - Defined MIGRATION region type and sub-type.
>>
>> - Defined vfio_device_migration_info structure which will be placed at 0th
>>    offset of migration region to get/set VFIO device related information.
>>    Defined members of structure and usage on read/write access.
>>
>> - Defined device states and added state transition details in the comment.
>>
>> - Added sequence to be followed while saving and resuming VFIO device state
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   include/uapi/linux/vfio.h | 180 ++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 180 insertions(+)
>>
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 9e843a147ead..a0817ba267c1 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
>>   #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
>>   #define VFIO_REGION_TYPE_GFX                    (1)
>>   #define VFIO_REGION_TYPE_CCW			(2)
>> +#define VFIO_REGION_TYPE_MIGRATION              (3)
>>   
>>   /* sub-types for VFIO_REGION_TYPE_PCI_* */
>>   
>> @@ -379,6 +380,185 @@ struct vfio_region_gfx_edid {
>>   /* sub-types for VFIO_REGION_TYPE_CCW */
>>   #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
>>   
>> +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
>> +#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
>> +
>> +/*
>> + * Structure vfio_device_migration_info is placed at 0th offset of
>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
>> + * information. Field accesses from this structure are only supported at their
>> + * native width and alignment, otherwise the result is undefined and vendor
>> + * drivers should return an error.
>> + *
>> + * device_state: (read/write)
>> + *      To indicate vendor driver the state VFIO device should be transitioned
>> + *      to. If device state transition fails, write on this field return error.
>> + *      It consists of 3 bits:
>> + *      - If bit 0 set, indicates _RUNNING state. When its clear, that indicates
> 
> s/its/it's/
> 
>> + *        _STOP state. When device is changed to _STOP, driver should stop
>> + *        device before write() returns.
>> + *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
>> + *        should start gathering device state information which will be provided
>> + *        to VFIO user space application to save device's state.
>> + *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
>> + *        prepare to resume device, data provided through migration region
>> + *        should be used to resume device.
>> + *      Bits 3 - 31 are reserved for future use. User should perform
>> + *      read-modify-write operation on this field.
>> + *
>> + *  +------- _RESUMING
>> + *  |+------ _SAVING
>> + *  ||+----- _RUNNING
>> + *  |||
>> + *  000b => Device Stopped, not saving or resuming
>> + *  001b => Device running state, default state
>> + *  010b => Stop Device & save device state, stop-and-copy state
>> + *  011b => Device running and save device state, pre-copy state
>> + *  100b => Device stopped and device state is resuming
>> + *  101b => Invalid state
> 
> Eventually this would be intended for post-copy, if supported by the
> device, right?
> 

No, as Yan mentioned in an earlier version, _RESUMING + _RUNNING can't
be used for post-copy. A new flag will be required for post-copy.

https://www.mail-archive.com/qemu-devel@nongnu.org/msg658768.html

>> + *  110b => Invalid state
>> + *  111b => Invalid state
>> + *
>> + * State transitions:
>> + *
>> + *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
>> + *                (100b)     (001b)     (011b)        (010b)       (000b)
>> + * 0. Running or Default state
>> + *                             |
>> + *
>> + * 1. Normal Shutdown
> 
> Optional, userspace is under no obligation.
> 
>> + *                             |------------------------------------->|
>> + *
>> + * 2. Save state or Suspend
>> + *                             |------------------------->|---------->|
>> + *
>> + * 3. Save state during live migration
>> + *                             |----------->|------------>|---------->|
>> + *
>> + * 4. Resuming
>> + *                  |<---------|
>> + *
>> + * 5. Resumed
>> + *                  |--------->|
>> + *
>> + * 0. Default state of VFIO device is _RUNNNG when VFIO application starts.
>> + * 1. During normal VFIO application shutdown, vfio device state changes
>> + *    from _RUNNING to _STOP.
> 
> We cannot impose this requirement on existing userspace.  Userspace may
> perform this action, but they are not required to and the vendor driver
> must not require it.

Updated comment.

> 
>> + * 2. When VFIO application save state or suspend application, VFIO device
>> + *    state transition is from _RUNNING to stop-and-copy state and then to
>> + *    _STOP.
>> + *    On state transition from _RUNNING to stop-and-copy, driver must
>> + *    stop device, save device state and send it to application through
>> + *    migration region.
>> + *    On _RUNNING to stop-and-copy state transition failure, application should
>> + *    set VFIO device state to _RUNNING.
> 
> A state transition failure means that the user's write to device_state
> failed, so is it the user's responsibility to set the next state?

Right.

>  Why
> is it necessarily _RUNNING vs _STOP?
>

During the pre-copy to stop-and-copy transition, the device is still
running; only saving of the device state has started. If the transition
to stop-and-copy fails, then from the user's point of view the
application or VM is still running, so the device state should be set
back to _RUNNING and whatever the application/VM is doing can continue
at the source.


>> + * 3. In VFIO application live migration, state transition is from _RUNNING
>> + *    to pre-copy to stop-and-copy to _STOP.
>> + *    On state transition from _RUNNING to pre-copy, driver should start
>> + *    gathering device state while application is still running and send device
>> + *    state data to application through migration region.
>> + *    On state transition from pre-copy to stop-and-copy, driver must stop
>> + *    device, save device state and send it to application through migration
>> + *    region.
>> + *    On any failure during any of these state transition, VFIO device state
>> + *    should be set to _RUNNING.
> 
> Same comment as above regarding next state on failure.
> 

If the application or VM migration fails, it should continue to run at
the source. In the case of a VM, the guest user isn't aware of the
migration, and from their point of view the VM should still be running.

> Also, it seems like it's the vendor driver's discretion to actually
> provide data during the pre-copy phase.  As we've defined it, the
> vendor driver needs to participate in the migration region regardless,
> they might just always report no pending_bytes until we enter
> stop-and-copy.
> 

Yes. And if pending_bytes is reported as 0 in pre-copy by the vendor
driver, then QEMU doesn't iterate again for that device.

>> + * 4. To start resuming phase, VFIO device state should be transitioned from
>> + *    _RUNNING to _RESUMING state.
>> + *    In _RESUMING state, driver should use received device state data through
>> + *    migration region to resume device.
>> + *    On failure during this state transition, application should set _RUNNING
>> + *    state.
> 
> Same comment regarding setting next state after failure.

If the device couldn't be transitioned to _RESUMING, then it should be
set to the default state, that is, _RUNNING.

> 
>> + * 5. On providing saved device data to driver, appliation should change state
>> + *    from _RESUMING to _RUNNING.
>> + *    On failure to transition to _RUNNING state, VFIO application should reset
>> + *    the device and set _RUNNING state so that device doesn't remain in unknown
>> + *    or bad state. On reset, driver must reset device and device should be
>> + *    available in default usable state.
> 
> Didn't we discuss that the reset ioctl should return the device to the
> initial state, including the transition to _RUNNING?

Yes, that's the default usable state; rewording it to 'initial state'.

>  Also, as above,
> it's the user write that triggers the failure, this register is listed
> as read-write, so what value does the vendor driver report for the
> state when read after a transition failure?  Is it reported as _RESUMING
> as it was prior to the attempted transition, or may the invalid states
> be used by the vendor driver to indicate the device is broken?
> 

If the transition has failed, the device should report its previous
state, and resetting the device should bring it back to the usable
_RUNNING state.

>> + *
>> + * pending bytes: (read only)
>> + *      Number of pending bytes yet to be migrated from vendor driver
>> + *
>> + * data_offset: (read only)
>> + *      User application should read data_offset in migration region from where
>> + *      user application should read device data during _SAVING state or write
>> + *      device data during _RESUMING state. See below for detail of sequence to
>> + *      be followed.
>> + *
>> + * data_size: (read/write)
>> + *      User application should read data_size to get size of data copied in
>> + *      bytes in migration region during _SAVING state and write size of data
>> + *      copied in bytes in migration region during _RESUMING state.
>> + *
>> + * Migration region looks like:
>> + *  ------------------------------------------------------------------
>> + * |vfio_device_migration_info|    data section                      |
>> + * |                          |     ///////////////////////////////  |
>> + * ------------------------------------------------------------------
>> + *   ^                              ^
>> + *  offset 0-trapped part        data_offset
>> + *
>> + * Structure vfio_device_migration_info is always followed by data section in
>> + * the region, so data_offset will always be non-0. Offset from where data is
>> + * copied is decided by kernel driver, data section can be trapped or mapped
>> + * or partitioned, depending on how kernel driver defines data section.
>> + * Data section partition can be defined as mapped by sparse mmap capability.
>> + * If mmapped, then data_offset should be page aligned, where as initial section
>> + * which contain vfio_device_migration_info structure might not end at offset
>> + * which is page aligned. The user is not required to access via mmap regardless
>> + * of the region mmap capabilities.
>> + * Vendor driver should decide whether to partition data section and how to
>> + * partition the data section. Vendor driver should return data_offset
>> + * accordingly.
>> + *
>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
>> + * and for _SAVING device state or stop-and-copy phase:
>> + * a. read pending_bytes, indicates start of new iteration to get device data.
>> + *    If there was previous iteration, then this read operation indicates
>> + *    previous iteration is done. If pending_bytes > 0, go through below steps.
>> + * b. read data_offset, indicates kernel driver to make data available through
>> + *    data section. Kernel driver should return this read operation only after
>> + *    data is available from (region + data_offset) to (region + data_offset +
>> + *    data_size).
>> + * c. read data_size, amount of data in bytes available through migration
>> + *    region.
>> + * d. read data of data_size bytes from (region + data_offset) from migration
>> + *    region.
>> + * e. process data.
>> + * f. Loop through a to e.
> 
> It seems we always need to end an iteration by reading pending_bytes to
> signal to the vendor driver to release resources, so should the end of
> the loop be:
> 
> e. Read pending_bytes
> f. Goto b. or optionally restart next iteration at a.
> 
> I think this is defined such that reading data_offset commits resources
> and reading pending_bytes frees them, allowing userspace to restart at
> reading pending_bytes with no side-effects.  Therefore reading
> pending_bytes repeatedly is supported.  Is the same true for
> data_offset and data_size?  It seems reasonable that the vendor driver
> can simply return offset and size for the current buffer if the user
> reads these more than once.
>

Right.
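
To make that loop concrete, here is a minimal userspace sketch of the
save sequence (steps a. to f. above, with each iteration ended by the
next pending_bytes read; the helper name, the stream fd and the chunked
region reads are illustrative only):

#include <stddef.h>
#include <unistd.h>
#include <linux/vfio.h>    /* with the proposed migration additions */

static int save_device_data(int device_fd, off_t region_off, int stream_fd)
{
        __u64 pending, data_offset, data_size, done;
        char buf[4096];

        for (;;) {
                /* a. reading pending_bytes starts a new iteration and
                 *    completes the previous one */
                if (pread(device_fd, &pending, sizeof(pending),
                          region_off +
                          offsetof(struct vfio_device_migration_info,
                                   pending_bytes)) != sizeof(pending))
                        return -1;
                if (!pending)
                        break;

                /* b. data_offset: driver commits this iteration's data */
                pread(device_fd, &data_offset, sizeof(data_offset),
                      region_off +
                      offsetof(struct vfio_device_migration_info,
                               data_offset));
                /* c. data_size: amount of data available this iteration */
                pread(device_fd, &data_size, sizeof(data_size),
                      region_off +
                      offsetof(struct vfio_device_migration_info,
                               data_size));

                /* d./e. read the opaque data and forward it to the
                 *       destination stream in order */
                for (done = 0; done < data_size; ) {
                        size_t chunk = data_size - done;
                        ssize_t n;

                        if (chunk > sizeof(buf))
                                chunk = sizeof(buf);
                        n = pread(device_fd, buf, chunk,
                                  region_off + data_offset + done);
                        if (n <= 0)
                                return -1;  /* errors come back as errno */
                        if (write(stream_fd, buf, n) != n)
                                return -1;
                        done += n;
                }
                /* f. loop; the next pending_bytes read ends this iteration */
        }
        return 0;
}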


> How is a protocol or device error signaled?  For example, we can have a
> user error where they read data_size before data_offset.  Should the
> vendor driver generate a fault reading data_size in this case.  We can
> also have internal errors in the vendor driver, should the vendor
> driver use a special errno or update device_state autonomously to
> indicate such an error?

If there is any error during the sequence, the vendor driver can return
an error code for the next read/write operation; that will terminate the
loop and the migration will fail.

> 
> I believe it's also part of the intended protocol that the user can
> transition from _SAVING|_RUNNING to _SAVING at any point, regardless of
> pending_bytes.  This should be noted.
> 

Ok. Updating comment.

>> + *
>> + * Sequence to be followed while _RESUMING device state:
>> + * While data for this device is available, repeat below steps:
>> + * a. read data_offset from where user application should write data.
>> + * b. write data of data_size to migration region from data_offset.
> 
> Whose's data_size, the _SAVING end or the _RESUMING end?  I think this
> is intended to be the transaction size from the _SAVING source, 

Not necessarily. data_size could be MIN(transaction size at the source,
size of the migration data section). If the migration data section is
smaller than the data packet size at the source, then the packet has to
be broken up and sent iteratively.
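
For illustration, a sketch of that split on the resume side ('buf'/'len'
is one transaction as received from the source, 'section_size' is the
size of the destination's data section; all names are illustrative):

#include <stddef.h>
#include <unistd.h>
#include <linux/vfio.h>    /* with the proposed migration additions */

static int resume_write_one(int device_fd, off_t region_off,
                            size_t section_size, const char *buf, size_t len)
{
        size_t done = 0;

        while (done < len) {
                __u64 data_offset, data_size;

                /* a. ask the driver where to stage the next chunk */
                if (pread(device_fd, &data_offset, sizeof(data_offset),
                          region_off +
                          offsetof(struct vfio_device_migration_info,
                                   data_offset)) != sizeof(data_offset))
                        return -1;

                data_size = len - done;
                if (data_size > section_size)
                        data_size = section_size;

                /* b. copy the chunk into the staging area */
                if (pwrite(device_fd, buf + done, data_size,
                           region_off + data_offset) != (ssize_t)data_size)
                        return -1;

                /* c. writing data_size tells the driver to consume it */
                if (pwrite(device_fd, &data_size, sizeof(data_size),
                           region_off +
                           offsetof(struct vfio_device_migration_info,
                                    data_size)) != sizeof(data_size))
                        return -1;

                done += data_size;
        }
        return 0;
}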

> but it
> could easily be misinterpreted as reading data_size on the _RESUMING
> end.
> 
>> + * c. write data_size which indicates vendor driver that data is written in
>> + *    staging buffer. Vendor driver should read this data from migration
>> + *    region and resume device's state.
> 
> I think we also need to define the error protocol.  The user could
> mis-order transactions or there could be an internal error in the
> vendor driver or device.  Are all read(2)/write(2) operations
> susceptible to defined errnos to signal this?

Yes.

>  Is it reflected in
> device_state?  

No.

> What's the recovery protocol?
> 

On read()/write() failure, the user should take the necessary action.


>> + *
>> + * For user application, data is opaque. User should write data in the same
>> + * order as received.
> 
> Order and transaction size, ie. each data_size chunk is indivisible by
> the user.

Transaction size can differ, but the order should remain the same.

> 
>> + */
>> +
>> +struct vfio_device_migration_info {
>> +	__u32 device_state;         /* VFIO device state */
>> +#define VFIO_DEVICE_STATE_STOP      (1 << 0)
>> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> 
> Huh?  We should probably just refer to it consistently, ie. _RUNNING
> and !_RUNNING, otherwise we have the incongruity that setting the _STOP
> value is actually the opposite of the necessary logic value (_STOP = 1
> is _RUNNING, _STOP = 0 is !_RUNNING).

Oops, my mistake, I forgot to update it to
#define VFIO_DEVICE_STATE_STOP      (0)

> 
>> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
>> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
>> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
>> +				     VFIO_DEVICE_STATE_SAVING |  \
>> +				     VFIO_DEVICE_STATE_RESUMING)
>> +
>> +#define VFIO_DEVICE_STATE_INVALID_CASE1    (VFIO_DEVICE_STATE_SAVING | \
>> +					    VFIO_DEVICE_STATE_RESUMING)
>> +
>> +#define VFIO_DEVICE_STATE_INVALID_CASE2    (VFIO_DEVICE_STATE_RUNNING | \
>> +					    VFIO_DEVICE_STATE_RESUMING)
> 
> Gack, we fixed these in the last iteration!
> 

That solution doesn't scale when new flags are added. I still prefer
to define them as above.

Thanks,
Kirti

>> +	__u32 reserved;
>> +	__u64 pending_bytes;
>> +	__u64 data_offset;
>> +	__u64 data_size;
>> +} __attribute__((packed));
>> +
>>   /*
>>    * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>>    * which allows direct access to non-MSIX registers which happened to be within
> 
> Thanks,
> Alex
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 2/5] vfio iommu: Adds flag to indicate dirty pages tracking capability support
  2019-12-16 23:16   ` Alex Williamson
@ 2019-12-17  6:32     ` Kirti Wankhede
  0 siblings, 0 replies; 44+ messages in thread
From: Kirti Wankhede @ 2019-12-17  6:32 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm



On 12/17/2019 4:46 AM, Alex Williamson wrote:
> On Tue, 17 Dec 2019 01:51:37 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Flag VFIO_IOMMU_INFO_DIRTY_PGS in VFIO_IOMMU_GET_INFO indicates that driver
>> support dirty pages tracking.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   drivers/vfio/vfio_iommu_type1.c | 3 ++-
>>   include/uapi/linux/vfio.h       | 5 +++--
>>   2 files changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index 2ada8e6cdb88..3f6b04f2334f 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -2234,7 +2234,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>>   			info.cap_offset = 0; /* output, no-recopy necessary */
>>   		}
>>   
>> -		info.flags = VFIO_IOMMU_INFO_PGSIZES;
>> +		info.flags = VFIO_IOMMU_INFO_PGSIZES |
>> +			     VFIO_IOMMU_INFO_DIRTY_PGS;
> 
> Type1 shouldn't advertise it until it's supported though, right?
> Thanks,
> 

Should this be merged with the last patch, where the VFIO_IOMMU_UNMAP_DMA
ioctl is updated?

Thanks,
Kirti

> Alex
> 
>>   
>>   		info.iova_pgsizes = vfio_pgsize_bitmap(iommu);
>>   
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index a0817ba267c1..81847ed54eb7 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -900,8 +900,9 @@ struct vfio_device_ioeventfd {
>>   struct vfio_iommu_type1_info {
>>   	__u32	argsz;
>>   	__u32	flags;
>> -#define VFIO_IOMMU_INFO_PGSIZES (1 << 0)	/* supported page sizes info */
>> -#define VFIO_IOMMU_INFO_CAPS	(1 << 1)	/* Info supports caps */
>> +#define VFIO_IOMMU_INFO_PGSIZES   (1 << 0) /* supported page sizes info */
>> +#define VFIO_IOMMU_INFO_CAPS      (1 << 1) /* Info supports caps */
>> +#define VFIO_IOMMU_INFO_DIRTY_PGS (1 << 2) /* supports dirty page tracking */
>>   	__u64	iova_pgsizes;	/* Bitmap of supported page sizes */
>>   	__u32   cap_offset;	/* Offset within info struct of first cap */
>>   };
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
  2019-12-17  6:28     ` Kirti Wankhede
@ 2019-12-17  7:12       ` Yan Zhao
  2019-12-17 18:43       ` Alex Williamson
  1 sibling, 0 replies; 44+ messages in thread
From: Yan Zhao @ 2019-12-17  7:12 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Alex Williamson, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng,
	Liu, Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

On Tue, Dec 17, 2019 at 02:28:44PM +0800, Kirti Wankhede wrote:
> 
> 
> On 12/17/2019 4:14 AM, Alex Williamson wrote:
> > On Tue, 17 Dec 2019 01:51:36 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > 
> >> - Defined MIGRATION region type and sub-type.
> >>
> >> - Defined vfio_device_migration_info structure which will be placed at 0th
> >>    offset of migration region to get/set VFIO device related information.
> >>    Defined members of structure and usage on read/write access.
> >>
> >> - Defined device states and added state transition details in the comment.
> >>
> >> - Added sequence to be followed while saving and resuming VFIO device state
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>   include/uapi/linux/vfio.h | 180 ++++++++++++++++++++++++++++++++++++++++++++++
> >>   1 file changed, 180 insertions(+)
> >>
> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index 9e843a147ead..a0817ba267c1 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h
> >> @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
> >>   #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
> >>   #define VFIO_REGION_TYPE_GFX                    (1)
> >>   #define VFIO_REGION_TYPE_CCW			(2)
> >> +#define VFIO_REGION_TYPE_MIGRATION              (3)
> >>   
> >>   /* sub-types for VFIO_REGION_TYPE_PCI_* */
> >>   
> >> @@ -379,6 +380,185 @@ struct vfio_region_gfx_edid {
> >>   /* sub-types for VFIO_REGION_TYPE_CCW */
> >>   #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
> >>   
> >> +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
> >> +#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
> >> +
> >> +/*
> >> + * Structure vfio_device_migration_info is placed at 0th offset of
> >> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> >> + * information. Field accesses from this structure are only supported at their
> >> + * native width and alignment, otherwise the result is undefined and vendor
> >> + * drivers should return an error.
> >> + *
> >> + * device_state: (read/write)
> >> + *      To indicate vendor driver the state VFIO device should be transitioned
> >> + *      to. If device state transition fails, write on this field return error.
> >> + *      It consists of 3 bits:
> >> + *      - If bit 0 set, indicates _RUNNING state. When its clear, that indicates
> > 
> > s/its/it's/
> > 
> >> + *        _STOP state. When device is changed to _STOP, driver should stop
> >> + *        device before write() returns.
> >> + *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
> >> + *        should start gathering device state information which will be provided
> >> + *        to VFIO user space application to save device's state.
> >> + *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
> >> + *        prepare to resume device, data provided through migration region
> >> + *        should be used to resume device.
> >> + *      Bits 3 - 31 are reserved for future use. User should perform
> >> + *      read-modify-write operation on this field.
> >> + *
> >> + *  +------- _RESUMING
> >> + *  |+------ _SAVING
> >> + *  ||+----- _RUNNING
> >> + *  |||
> >> + *  000b => Device Stopped, not saving or resuming
> >> + *  001b => Device running state, default state
> >> + *  010b => Stop Device & save device state, stop-and-copy state
> >> + *  011b => Device running and save device state, pre-copy state
> >> + *  100b => Device stopped and device state is resuming
> >> + *  101b => Invalid state
> > 
> > Eventually this would be intended for post-copy, if supported by the
> > device, right?
> > 
> 
> No, as Yan mentioned in an earlier version, _RESUMING + _RUNNING can't
> be used for post-copy. A new flag will be required for post-copy.
> 
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg658768.html

Sorry, I didn't mean that _RESUMING + _RUNNING can't be used for post-copy.
I just meant that another POSTCOPY state needs to be introduced. But I'm
not sure what the _RESUMING state is for.
Actually, we do nothing in response to the _RESUMING state, for either
pre-copy or post-copy.
If on your side the _RESUMING state means it is allowed to restore device
data, then I think _RESUMING + _RUNNING is a valid state for post-copy.

Thanks
Yan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 4/5] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2019-12-17  5:15   ` Yan Zhao
@ 2019-12-17  9:24     ` Kirti Wankhede
  2019-12-17  9:51       ` Yan Zhao
  2019-12-18 21:39       ` Alex Williamson
  0 siblings, 2 replies; 44+ messages in thread
From: Kirti Wankhede @ 2019-12-17  9:24 UTC (permalink / raw)
  To: Yan Zhao
  Cc: alex.williamson, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng,
	Liu, Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm



On 12/17/2019 10:45 AM, Yan Zhao wrote:
> On Tue, Dec 17, 2019 at 04:21:39AM +0800, Kirti Wankhede wrote:
>> VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
>> - Start unpinned pages dirty pages tracking while migration is active and
>>    device is running, i.e. during pre-copy phase.
>> - Stop unpinned pages dirty pages tracking. This is required to stop
>>    unpinned dirty pages tracking if migration failed or cancelled during
>>    pre-copy phase. Unpinned pages tracking is clear.
>> - Get dirty pages bitmap. Stop unpinned dirty pages tracking and clear
>>    unpinned pages information on bitmap read. This ioctl returns bitmap of
>>    dirty pages, its user space application responsibility to copy content
>>    of dirty pages from source to destination during migration.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   drivers/vfio/vfio_iommu_type1.c | 210 ++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 203 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index 3f6b04f2334f..264449654d3f 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -70,6 +70,7 @@ struct vfio_iommu {
>>   	unsigned int		dma_avail;
>>   	bool			v2;
>>   	bool			nesting;
>> +	bool			dirty_page_tracking;
>>   };
>>   
>>   struct vfio_domain {
>> @@ -112,6 +113,7 @@ struct vfio_pfn {
>>   	dma_addr_t		iova;		/* Device address */
>>   	unsigned long		pfn;		/* Host pfn */
>>   	atomic_t		ref_count;
>> +	bool			unpinned;
>>   };
>>   
>>   struct vfio_regions {
>> @@ -244,6 +246,32 @@ static void vfio_remove_from_pfn_list(struct vfio_dma *dma,
>>   	kfree(vpfn);
>>   }
>>   
>> +static void vfio_remove_unpinned_from_pfn_list(struct vfio_dma *dma, bool warn)
>> +{
>> +	struct rb_node *n = rb_first(&dma->pfn_list);
>> +
>> +	for (; n; n = rb_next(n)) {
>> +		struct vfio_pfn *vpfn = rb_entry(n, struct vfio_pfn, node);
>> +
>> +		if (warn)
>> +			WARN_ON_ONCE(vpfn->unpinned);
>> +
>> +		if (vpfn->unpinned)
>> +			vfio_remove_from_pfn_list(dma, vpfn);
>> +	}
>> +}
>> +
>> +static void vfio_remove_unpinned_from_dma_list(struct vfio_iommu *iommu)
>> +{
>> +	struct rb_node *n = rb_first(&iommu->dma_list);
>> +
>> +	for (; n; n = rb_next(n)) {
>> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
>> +
>> +		vfio_remove_unpinned_from_pfn_list(dma, false);
>> +	}
>> +}
>> +
>>   static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
>>   					       unsigned long iova)
>>   {
>> @@ -254,13 +282,17 @@ static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
>>   	return vpfn;
>>   }
>>   
>> -static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
>> +static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn,
>> +				  bool dirty_tracking)
>>   {
>>   	int ret = 0;
>>   
>>   	if (atomic_dec_and_test(&vpfn->ref_count)) {
>>   		ret = put_pfn(vpfn->pfn, dma->prot);
> if physical page here is put, it may cause problem when pin this iova
> next time:
> vfio_iommu_type1_pin_pages {
>      ...
>      vpfn = vfio_iova_get_vfio_pfn(dma, iova);
>      if (vpfn) {
>          phys_pfn[i] = vpfn->pfn;
>          continue;
>      }
>      ...
> }
> 

Good point. Fixing it as:

                 vpfn = vfio_iova_get_vfio_pfn(dma, iova);
                 if (vpfn) {
-                       phys_pfn[i] = vpfn->pfn;
-                       continue;
+                       if (vpfn->unpinned)
+                               vfio_remove_from_pfn_list(dma, vpfn);
+                       else {
+                               phys_pfn[i] = vpfn->pfn;
+                               continue;
+                       }
                 }



>> -		vfio_remove_from_pfn_list(dma, vpfn);
>> +		if (dirty_tracking)
>> +			vpfn->unpinned = true;
>> +		else
>> +			vfio_remove_from_pfn_list(dma, vpfn);
> so the unpinned pages before dirty page tracking is not treated as
> dirty?
> 

Yes. That's what we agreed on in the previous version:
https://www.mail-archive.com/qemu-devel@nongnu.org/msg663157.html

>>   	}
>>   	return ret;
>>   }
>> @@ -504,7 +536,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
>>   }
>>   
>>   static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
>> -				    bool do_accounting)
>> +				    bool do_accounting, bool dirty_tracking)
>>   {
>>   	int unlocked;
>>   	struct vfio_pfn *vpfn = vfio_find_vpfn(dma, iova);
>> @@ -512,7 +544,10 @@ static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
>>   	if (!vpfn)
>>   		return 0;
>>   
>> -	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn);
>> +	if (vpfn->unpinned)
>> +		return 0;
>> +
>> +	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn, dirty_tracking);
>>   
>>   	if (do_accounting)
>>   		vfio_lock_acct(dma, -unlocked, true);
>> @@ -583,7 +618,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>   
>>   		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
>>   		if (ret) {
>> -			vfio_unpin_page_external(dma, iova, do_accounting);
>> +			vfio_unpin_page_external(dma, iova, do_accounting,
>> +						 false);
>>   			goto pin_unwind;
>>   		}
>>   	}
>> @@ -598,7 +634,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>   
>>   		iova = user_pfn[j] << PAGE_SHIFT;
>>   		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
>> -		vfio_unpin_page_external(dma, iova, do_accounting);
>> +		vfio_unpin_page_external(dma, iova, do_accounting, false);
>>   		phys_pfn[j] = 0;
>>   	}
>>   pin_done:
>> @@ -632,7 +668,8 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
>>   		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
>>   		if (!dma)
>>   			goto unpin_exit;
>> -		vfio_unpin_page_external(dma, iova, do_accounting);
>> +		vfio_unpin_page_external(dma, iova, do_accounting,
>> +					 iommu->dirty_page_tracking);
>>   	}
>>   
>>   unpin_exit:
>> @@ -850,6 +887,88 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
>>   	return bitmap;
>>   }
>>   
>> +/*
>> + * start_iova is the reference from where bitmaping started. This is called
>> + * from DMA_UNMAP where start_iova can be different than iova
>> + */
>> +
>> +static void vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
>> +				  size_t size, uint64_t pgsize,
>> +				  dma_addr_t start_iova, unsigned long *bitmap)
>> +{
>> +	struct vfio_dma *dma;
>> +	dma_addr_t i = iova;
>> +	unsigned long pgshift = __ffs(pgsize);
>> +
>> +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
>> +		/* mark all pages dirty if all pages are pinned and mapped. */
>> +		if (dma->iommu_mapped) {
> This prevents pass-through devices from calling vfio_pin_pages to do
> fine grained log dirty.

Yes, I mentioned that in the 'Yet TODO' item in the cover letter:

"If IOMMU capable device is present in the container, then all pages are
marked dirty. Need to think smart way to know if IOMMU capable device's
driver is smart to report pages to be marked dirty by pinning those 
pages externally."


>> +			dma_addr_t iova_limit;
>> +
>> +			iova_limit = (dma->iova + dma->size) < (iova + size) ?
>> +				     (dma->iova + dma->size) : (iova + size);
>> +
>> +			for (; i < iova_limit; i += pgsize) {
>> +				unsigned int start;
>> +
>> +				start = (i - start_iova) >> pgshift;
>> +
>> +				__bitmap_set(bitmap, start, 1);
>> +			}
>> +			if (i >= iova + size)
>> +				return;
>> +		} else {
>> +			struct rb_node *n = rb_first(&dma->pfn_list);
>> +			bool found = false;
>> +
>> +			for (; n; n = rb_next(n)) {
>> +				struct vfio_pfn *vpfn = rb_entry(n,
>> +							struct vfio_pfn, node);
>> +				if (vpfn->iova >= i) {
>> +					found = true;
>> +					break;
>> +				}
>> +			}
>> +
>> +			if (!found) {
>> +				i += dma->size;
>> +				continue;
>> +			}
>> +
>> +			for (; n; n = rb_next(n)) {
>> +				unsigned int start;
>> +				struct vfio_pfn *vpfn = rb_entry(n,
>> +							struct vfio_pfn, node);
>> +
>> +				if (vpfn->iova >= iova + size)
>> +					return;
>> +
>> +				start = (vpfn->iova - start_iova) >> pgshift;
>> +
>> +				__bitmap_set(bitmap, start, 1);
>> +
>> +				i = vpfn->iova + pgsize;
>> +			}
>> +		}
>> +		vfio_remove_unpinned_from_pfn_list(dma, false);
>> +	}
>> +}
>> +
>> +static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
>> +{
>> +	long bsize;
>> +
>> +	if (!bitmap_size || bitmap_size > SIZE_MAX)
>> +		return -EINVAL;
>> +
>> +	bsize = ALIGN(npages, BITS_PER_LONG) / sizeof(unsigned long);
>> +
>> +	if (bitmap_size < bsize)
>> +		return -EINVAL;
>> +
>> +	return bsize;
>> +}
>> +
>>   static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>   			     struct vfio_iommu_type1_dma_unmap *unmap)
>>   {
>> @@ -2298,6 +2417,83 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>>   
>>   		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>>   			-EFAULT : 0;
>> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
>> +		struct vfio_iommu_type1_dirty_bitmap range;
>> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
>> +		int ret;
>> +
>> +		if (!iommu->v2)
>> +			return -EACCES;
>> +
>> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
>> +				    bitmap);
>> +
>> +		if (copy_from_user(&range, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (range.argsz < minsz || range.flags & ~mask)
>> +			return -EINVAL;
>> +
>> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
>> +			iommu->dirty_page_tracking = true;
>> +			return 0;
>> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
>> +			iommu->dirty_page_tracking = false;
>> +
>> +			mutex_lock(&iommu->lock);
>> +			vfio_remove_unpinned_from_dma_list(iommu);
>> +			mutex_unlock(&iommu->lock);
>> +			return 0;
>> +
>> +		} else if (range.flags &
>> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
>> +			uint64_t iommu_pgmask;
>> +			unsigned long pgshift = __ffs(range.pgsize);
>> +			unsigned long *bitmap;
>> +			long bsize;
>> +
>> +			iommu_pgmask =
>> +			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
>> +
>> +			if (((range.pgsize - 1) & iommu_pgmask) !=
>> +			    (range.pgsize - 1))
>> +				return -EINVAL;
>> +
>> +			if (range.iova & iommu_pgmask)
>> +				return -EINVAL;
>> +			if (!range.size || range.size > SIZE_MAX)
>> +				return -EINVAL;
>> +			if (range.iova + range.size < range.iova)
>> +				return -EINVAL;
>> +
>> +			bsize = verify_bitmap_size(range.size >> pgshift,
>> +						   range.bitmap_size);
>> +			if (bsize)
>> +				return ret;
>> +
>> +			bitmap = kmalloc(bsize, GFP_KERNEL);
>> +			if (!bitmap)
>> +				return -ENOMEM;
>> +
>> +			ret = copy_from_user(bitmap,
>> +			     (void __user *)range.bitmap, bsize) ? -EFAULT : 0;
>> +			if (ret)
>> +				goto bitmap_exit;
>> +
>> +			iommu->dirty_page_tracking = false;
> why is iommu->dirty_page_tracking set to false here?
> Presumably this ioctl can be called several times.
> 

This ioctl can be called several times, but once this ioctl is called it
means the vCPUs are stopped and the VFIO devices are stopped (i.e. in the
stop-and-copy phase) and the dirty pages bitmap is being queried by the
user.
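
For reference, roughly how the userspace side looks in that phase (a
sketch only: the flag and field names follow this patch, the uapi struct
itself is defined in patch 3/5, treating 'bitmap' as a __u64 user pointer
is an assumption, and the size calculation assumes a 64-bit host, matching
verify_bitmap_size() above):

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>    /* with the proposed dirty page additions */

static int query_dirty_bitmap(int container_fd, __u64 iova, __u64 size,
                              __u64 pgsize)
{
        struct vfio_iommu_type1_dirty_bitmap range = { 0 };
        __u64 npages = size / pgsize;
        /* one bit per page, rounded up to a multiple of BITS_PER_LONG */
        __u64 bitmap_size = ((npages + 63) / 64) * 8;
        unsigned long *bitmap = calloc(1, bitmap_size);
        int ret;

        if (!bitmap)
                return -1;

        /* at the start of pre-copy, tracking was enabled with:
         *   range.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
         *   ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, &range);
         */

        range.argsz = sizeof(range);
        range.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
        range.iova = iova;
        range.size = size;
        range.pgsize = pgsize;
        range.bitmap_size = bitmap_size;
        range.bitmap = (__u64)(uintptr_t)bitmap;

        ret = ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, &range);
        /* ... walk 'bitmap' and re-send the dirty pages ... */
        free(bitmap);
        return ret;
}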

Thanks,
Kirti


> Thanks
> Yan
>> +			mutex_lock(&iommu->lock);
>> +			vfio_iova_dirty_bitmap(iommu, range.iova, range.size,
>> +					     range.pgsize, range.iova, bitmap);
>> +			mutex_unlock(&iommu->lock);
>> +
>> +			ret = copy_to_user((void __user *)range.bitmap, bitmap,
>> +					   range.bitmap_size) ? -EFAULT : 0;
>> +bitmap_exit:
>> +			kfree(bitmap);
>> +			return ret;
>> +		}
>>   	}
>>   
>>   	return -ENOTTY;
>> -- 
>> 2.7.0
>>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 4/5] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2019-12-17  9:24     ` Kirti Wankhede
@ 2019-12-17  9:51       ` Yan Zhao
  2019-12-17 11:47         ` Kirti Wankhede
  2019-12-18 21:39       ` Alex Williamson
  1 sibling, 1 reply; 44+ messages in thread
From: Yan Zhao @ 2019-12-17  9:51 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng,
	Liu, Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

On Tue, Dec 17, 2019 at 05:24:14PM +0800, Kirti Wankhede wrote:
> 
> 
> On 12/17/2019 10:45 AM, Yan Zhao wrote:
> > On Tue, Dec 17, 2019 at 04:21:39AM +0800, Kirti Wankhede wrote:
> >> VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
> >> - Start unpinned pages dirty pages tracking while migration is active and
> >>    device is running, i.e. during pre-copy phase.
> >> - Stop unpinned pages dirty pages tracking. This is required to stop
> >>    unpinned dirty pages tracking if migration failed or cancelled during
> >>    pre-copy phase. Unpinned pages tracking is clear.
> >> - Get dirty pages bitmap. Stop unpinned dirty pages tracking and clear
> >>    unpinned pages information on bitmap read. This ioctl returns bitmap of
> >>    dirty pages, its user space application responsibility to copy content
> >>    of dirty pages from source to destination during migration.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>   drivers/vfio/vfio_iommu_type1.c | 210 ++++++++++++++++++++++++++++++++++++++--
> >>   1 file changed, 203 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >> index 3f6b04f2334f..264449654d3f 100644
> >> --- a/drivers/vfio/vfio_iommu_type1.c
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >> @@ -70,6 +70,7 @@ struct vfio_iommu {
> >>   	unsigned int		dma_avail;
> >>   	bool			v2;
> >>   	bool			nesting;
> >> +	bool			dirty_page_tracking;
> >>   };
> >>   
> >>   struct vfio_domain {
> >> @@ -112,6 +113,7 @@ struct vfio_pfn {
> >>   	dma_addr_t		iova;		/* Device address */
> >>   	unsigned long		pfn;		/* Host pfn */
> >>   	atomic_t		ref_count;
> >> +	bool			unpinned;
> >>   };
> >>   
> >>   struct vfio_regions {
> >> @@ -244,6 +246,32 @@ static void vfio_remove_from_pfn_list(struct vfio_dma *dma,
> >>   	kfree(vpfn);
> >>   }
> >>   
> >> +static void vfio_remove_unpinned_from_pfn_list(struct vfio_dma *dma, bool warn)
> >> +{
> >> +	struct rb_node *n = rb_first(&dma->pfn_list);
> >> +
> >> +	for (; n; n = rb_next(n)) {
> >> +		struct vfio_pfn *vpfn = rb_entry(n, struct vfio_pfn, node);
> >> +
> >> +		if (warn)
> >> +			WARN_ON_ONCE(vpfn->unpinned);
> >> +
> >> +		if (vpfn->unpinned)
> >> +			vfio_remove_from_pfn_list(dma, vpfn);
> >> +	}
> >> +}
> >> +
> >> +static void vfio_remove_unpinned_from_dma_list(struct vfio_iommu *iommu)
> >> +{
> >> +	struct rb_node *n = rb_first(&iommu->dma_list);
> >> +
> >> +	for (; n; n = rb_next(n)) {
> >> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> >> +
> >> +		vfio_remove_unpinned_from_pfn_list(dma, false);
> >> +	}
> >> +}
> >> +
> >>   static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
> >>   					       unsigned long iova)
> >>   {
> >> @@ -254,13 +282,17 @@ static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
> >>   	return vpfn;
> >>   }
> >>   
> >> -static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
> >> +static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn,
> >> +				  bool dirty_tracking)
> >>   {
> >>   	int ret = 0;
> >>   
> >>   	if (atomic_dec_and_test(&vpfn->ref_count)) {
> >>   		ret = put_pfn(vpfn->pfn, dma->prot);
> > if physical page here is put, it may cause problem when pin this iova
> > next time:
> > vfio_iommu_type1_pin_pages {
> >      ...
> >      vpfn = vfio_iova_get_vfio_pfn(dma, iova);
> >      if (vpfn) {
> >          phys_pfn[i] = vpfn->pfn;
> >          continue;
> >      }
> >      ...
> > }
> > 
> 
> Good point. Fixing it as:
> 
>                  vpfn = vfio_iova_get_vfio_pfn(dma, iova);
>                  if (vpfn) {
> -                       phys_pfn[i] = vpfn->pfn;
> -                       continue;
> +                       if (vpfn->unpinned)
> +                               vfio_remove_from_pfn_list(dma, vpfn);
what about updating vpfn instead?

> +                       else {
> +                               phys_pfn[i] = vpfn->pfn;
> +                               continue;
> +                       }
>                  }
> 
> 
> 
> >> -		vfio_remove_from_pfn_list(dma, vpfn);
> >> +		if (dirty_tracking)
> >> +			vpfn->unpinned = true;
> >> +		else
> >> +			vfio_remove_from_pfn_list(dma, vpfn);
> > so the unpinned pages before dirty page tracking is not treated as
> > dirty?
> > 
> 
> Yes. That's what we agreed on in the previous version:
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg663157.html
> 
> >>   	}
> >>   	return ret;
> >>   }
> >> @@ -504,7 +536,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
> >>   }
> >>   
> >>   static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
> >> -				    bool do_accounting)
> >> +				    bool do_accounting, bool dirty_tracking)
> >>   {
> >>   	int unlocked;
> >>   	struct vfio_pfn *vpfn = vfio_find_vpfn(dma, iova);
> >> @@ -512,7 +544,10 @@ static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
> >>   	if (!vpfn)
> >>   		return 0;
> >>   
> >> -	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn);
> >> +	if (vpfn->unpinned)
> >> +		return 0;
> >> +
> >> +	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn, dirty_tracking);
> >>   
> >>   	if (do_accounting)
> >>   		vfio_lock_acct(dma, -unlocked, true);
> >> @@ -583,7 +618,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>   
> >>   		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> >>   		if (ret) {
> >> -			vfio_unpin_page_external(dma, iova, do_accounting);
> >> +			vfio_unpin_page_external(dma, iova, do_accounting,
> >> +						 false);
> >>   			goto pin_unwind;
> >>   		}
> >>   	}
> >> @@ -598,7 +634,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>   
> >>   		iova = user_pfn[j] << PAGE_SHIFT;
> >>   		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
> >> -		vfio_unpin_page_external(dma, iova, do_accounting);
> >> +		vfio_unpin_page_external(dma, iova, do_accounting, false);
> >>   		phys_pfn[j] = 0;
> >>   	}
> >>   pin_done:
> >> @@ -632,7 +668,8 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
> >>   		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
> >>   		if (!dma)
> >>   			goto unpin_exit;
> >> -		vfio_unpin_page_external(dma, iova, do_accounting);
> >> +		vfio_unpin_page_external(dma, iova, do_accounting,
> >> +					 iommu->dirty_page_tracking);
> >>   	}
> >>   
> >>   unpin_exit:
> >> @@ -850,6 +887,88 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
> >>   	return bitmap;
> >>   }
> >>   
> >> +/*
> >> + * start_iova is the reference from where bitmaping started. This is called
> >> + * from DMA_UNMAP where start_iova can be different than iova
> >> + */
> >> +
> >> +static void vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> >> +				  size_t size, uint64_t pgsize,
> >> +				  dma_addr_t start_iova, unsigned long *bitmap)
> >> +{
> >> +	struct vfio_dma *dma;
> >> +	dma_addr_t i = iova;
> >> +	unsigned long pgshift = __ffs(pgsize);
> >> +
> >> +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
> >> +		/* mark all pages dirty if all pages are pinned and mapped. */
> >> +		if (dma->iommu_mapped) {
> > This prevents pass-through devices from calling vfio_pin_pages to do
> > fine grained log dirty.
> 
> Yes, I mentioned that in the 'Yet TODO' item in the cover letter:
> 
> "If IOMMU capable device is present in the container, then all pages are
> marked dirty. Need to think smart way to know if IOMMU capable device's
> driver is smart to report pages to be marked dirty by pinning those 
> pages externally."
>
why not just check first whether any vpfn is present for IOMMU-capable
devices?
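For example, something like the (untested) helper below, checked in
vfio_iova_dirty_bitmap() before taking the dma->iommu_mapped branch:

/*
 * Rough sketch of the idea: if the vendor driver has pinned any page of
 * this vfio_dma through vfio_pin_pages(), it is reporting dirty pages
 * itself, so we could walk dma->pfn_list instead of marking the whole
 * range dirty.
 */
static bool vfio_dma_has_external_pins(struct vfio_dma *dma)
{
        return !RB_EMPTY_ROOT(&dma->pfn_list);
}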

> 
> >> +			dma_addr_t iova_limit;
> >> +
> >> +			iova_limit = (dma->iova + dma->size) < (iova + size) ?
> >> +				     (dma->iova + dma->size) : (iova + size);
> >> +
> >> +			for (; i < iova_limit; i += pgsize) {
> >> +				unsigned int start;
> >> +
> >> +				start = (i - start_iova) >> pgshift;
> >> +
> >> +				__bitmap_set(bitmap, start, 1);
> >> +			}
> >> +			if (i >= iova + size)
> >> +				return;
> >> +		} else {
> >> +			struct rb_node *n = rb_first(&dma->pfn_list);
> >> +			bool found = false;
> >> +
> >> +			for (; n; n = rb_next(n)) {
> >> +				struct vfio_pfn *vpfn = rb_entry(n,
> >> +							struct vfio_pfn, node);
> >> +				if (vpfn->iova >= i) {
> >> +					found = true;
> >> +					break;
> >> +				}
> >> +			}
> >> +
> >> +			if (!found) {
> >> +				i += dma->size;
> >> +				continue;
> >> +			}
> >> +
> >> +			for (; n; n = rb_next(n)) {
> >> +				unsigned int start;
> >> +				struct vfio_pfn *vpfn = rb_entry(n,
> >> +							struct vfio_pfn, node);
> >> +
> >> +				if (vpfn->iova >= iova + size)
> >> +					return;
> >> +
> >> +				start = (vpfn->iova - start_iova) >> pgshift;
> >> +
> >> +				__bitmap_set(bitmap, start, 1);
> >> +
> >> +				i = vpfn->iova + pgsize;
> >> +			}
> >> +		}
> >> +		vfio_remove_unpinned_from_pfn_list(dma, false);
> >> +	}
> >> +}
> >> +
> >> +static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
> >> +{
> >> +	long bsize;
> >> +
> >> +	if (!bitmap_size || bitmap_size > SIZE_MAX)
> >> +		return -EINVAL;
> >> +
> >> +	bsize = ALIGN(npages, BITS_PER_LONG) / sizeof(unsigned long);
> >> +
> >> +	if (bitmap_size < bsize)
> >> +		return -EINVAL;
> >> +
> >> +	return bsize;
> >> +}
> >> +
> >>   static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> >>   			     struct vfio_iommu_type1_dma_unmap *unmap)
> >>   {
> >> @@ -2298,6 +2417,83 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >>   
> >>   		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> >>   			-EFAULT : 0;
> >> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
> >> +		struct vfio_iommu_type1_dirty_bitmap range;
> >> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
> >> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
> >> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> >> +		int ret;
> >> +
> >> +		if (!iommu->v2)
> >> +			return -EACCES;
> >> +
> >> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
> >> +				    bitmap);
> >> +
> >> +		if (copy_from_user(&range, (void __user *)arg, minsz))
> >> +			return -EFAULT;
> >> +
> >> +		if (range.argsz < minsz || range.flags & ~mask)
> >> +			return -EINVAL;
> >> +
> >> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
> >> +			iommu->dirty_page_tracking = true;
> >> +			return 0;
> >> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
> >> +			iommu->dirty_page_tracking = false;
> >> +
> >> +			mutex_lock(&iommu->lock);
> >> +			vfio_remove_unpinned_from_dma_list(iommu);
> >> +			mutex_unlock(&iommu->lock);
> >> +			return 0;
> >> +
> >> +		} else if (range.flags &
> >> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
> >> +			uint64_t iommu_pgmask;
> >> +			unsigned long pgshift = __ffs(range.pgsize);
> >> +			unsigned long *bitmap;
> >> +			long bsize;
> >> +
> >> +			iommu_pgmask =
> >> +			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> >> +
> >> +			if (((range.pgsize - 1) & iommu_pgmask) !=
> >> +			    (range.pgsize - 1))
> >> +				return -EINVAL;
> >> +
> >> +			if (range.iova & iommu_pgmask)
> >> +				return -EINVAL;
> >> +			if (!range.size || range.size > SIZE_MAX)
> >> +				return -EINVAL;
> >> +			if (range.iova + range.size < range.iova)
> >> +				return -EINVAL;
> >> +
> >> +			bsize = verify_bitmap_size(range.size >> pgshift,
> >> +						   range.bitmap_size);
> >> +			if (bsize)
> >> +				return ret;
> >> +
> >> +			bitmap = kmalloc(bsize, GFP_KERNEL);
> >> +			if (!bitmap)
> >> +				return -ENOMEM;
> >> +
> >> +			ret = copy_from_user(bitmap,
> >> +			     (void __user *)range.bitmap, bsize) ? -EFAULT : 0;
> >> +			if (ret)
> >> +				goto bitmap_exit;
> >> +
> >> +			iommu->dirty_page_tracking = false;
> > why iommu->dirty_page_tracking is false here?
> > suppose this ioctl can be called several times.
> > 
> 
> This ioctl can be called several times, but once this ioctl is called 
> that means vCPUs are stopped and VFIO devices are stopped (i.e. in 
> stop-and-copy phase) and dirty pages bitmap are being queried by user.
> 
I can't agree that VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP can only be
called in the stop-and-copy phase.
As stated in the last version, this will cause QEMU to form a wrong
expectation of VM downtime, and it is also the reason why pages pinned
before log_sync cannot be treated as dirty. If this get-bitmap ioctl
can be called early, in the save_setup phase, then it's no problem even
if all RAM is dirty.

Thanks
Yan
> 
> 
> > Thanks
> > Yan
> >> +			mutex_lock(&iommu->lock);
> >> +			vfio_iova_dirty_bitmap(iommu, range.iova, range.size,
> >> +					     range.pgsize, range.iova, bitmap);
> >> +			mutex_unlock(&iommu->lock);
> >> +
> >> +			ret = copy_to_user((void __user *)range.bitmap, bitmap,
> >> +					   range.bitmap_size) ? -EFAULT : 0;
> >> +bitmap_exit:
> >> +			kfree(bitmap);
> >> +			return ret;
> >> +		}
> >>   	}
> >>   
> >>   	return -ENOTTY;
> >> -- 
> >> 2.7.0
> >>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 4/5] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2019-12-17  9:51       ` Yan Zhao
@ 2019-12-17 11:47         ` Kirti Wankhede
  2019-12-18  1:04           ` Yan Zhao
  0 siblings, 1 reply; 44+ messages in thread
From: Kirti Wankhede @ 2019-12-17 11:47 UTC (permalink / raw)
  To: Yan Zhao
  Cc: alex.williamson, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng,
	Liu, Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm



On 12/17/2019 3:21 PM, Yan Zhao wrote:
> On Tue, Dec 17, 2019 at 05:24:14PM +0800, Kirti Wankhede wrote:
>>
>>
>> On 12/17/2019 10:45 AM, Yan Zhao wrote:
>>> On Tue, Dec 17, 2019 at 04:21:39AM +0800, Kirti Wankhede wrote:
>>>> VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
>>>> - Start unpinned pages dirty pages tracking while migration is active and
>>>>     device is running, i.e. during pre-copy phase.
>>>> - Stop unpinned pages dirty pages tracking. This is required to stop
>>>>     unpinned dirty pages tracking if migration failed or cancelled during
>>>>     pre-copy phase. Unpinned pages tracking is clear.
>>>> - Get dirty pages bitmap. Stop unpinned dirty pages tracking and clear
>>>>     unpinned pages information on bitmap read. This ioctl returns bitmap of
>>>>     dirty pages, its user space application responsibility to copy content
>>>>     of dirty pages from source to destination during migration.
>>>>
>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>>> ---
>>>>    drivers/vfio/vfio_iommu_type1.c | 210 ++++++++++++++++++++++++++++++++++++++--
>>>>    1 file changed, 203 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>> index 3f6b04f2334f..264449654d3f 100644
>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>> @@ -70,6 +70,7 @@ struct vfio_iommu {
>>>>    	unsigned int		dma_avail;
>>>>    	bool			v2;
>>>>    	bool			nesting;
>>>> +	bool			dirty_page_tracking;
>>>>    };
>>>>    
>>>>    struct vfio_domain {
>>>> @@ -112,6 +113,7 @@ struct vfio_pfn {
>>>>    	dma_addr_t		iova;		/* Device address */
>>>>    	unsigned long		pfn;		/* Host pfn */
>>>>    	atomic_t		ref_count;
>>>> +	bool			unpinned;
>>>>    };
>>>>    
>>>>    struct vfio_regions {
>>>> @@ -244,6 +246,32 @@ static void vfio_remove_from_pfn_list(struct vfio_dma *dma,
>>>>    	kfree(vpfn);
>>>>    }
>>>>    
>>>> +static void vfio_remove_unpinned_from_pfn_list(struct vfio_dma *dma, bool warn)
>>>> +{
>>>> +	struct rb_node *n = rb_first(&dma->pfn_list);
>>>> +
>>>> +	for (; n; n = rb_next(n)) {
>>>> +		struct vfio_pfn *vpfn = rb_entry(n, struct vfio_pfn, node);
>>>> +
>>>> +		if (warn)
>>>> +			WARN_ON_ONCE(vpfn->unpinned);
>>>> +
>>>> +		if (vpfn->unpinned)
>>>> +			vfio_remove_from_pfn_list(dma, vpfn);
>>>> +	}
>>>> +}
>>>> +
>>>> +static void vfio_remove_unpinned_from_dma_list(struct vfio_iommu *iommu)
>>>> +{
>>>> +	struct rb_node *n = rb_first(&iommu->dma_list);
>>>> +
>>>> +	for (; n; n = rb_next(n)) {
>>>> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
>>>> +
>>>> +		vfio_remove_unpinned_from_pfn_list(dma, false);
>>>> +	}
>>>> +}
>>>> +
>>>>    static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
>>>>    					       unsigned long iova)
>>>>    {
>>>> @@ -254,13 +282,17 @@ static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
>>>>    	return vpfn;
>>>>    }
>>>>    
>>>> -static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
>>>> +static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn,
>>>> +				  bool dirty_tracking)
>>>>    {
>>>>    	int ret = 0;
>>>>    
>>>>    	if (atomic_dec_and_test(&vpfn->ref_count)) {
>>>>    		ret = put_pfn(vpfn->pfn, dma->prot);
>>> if physical page here is put, it may cause problem when pin this iova
>>> next time:
>>> vfio_iommu_type1_pin_pages {
>>>       ...
>>>       vpfn = vfio_iova_get_vfio_pfn(dma, iova);
>>>       if (vpfn) {
>>>           phys_pfn[i] = vpfn->pfn;
>>>           continue;
>>>       }
>>>       ...
>>> }
>>>
>>
>> Good point. Fixing it as:
>>
>>                   vpfn = vfio_iova_get_vfio_pfn(dma, iova);
>>                   if (vpfn) {
>> -                       phys_pfn[i] = vpfn->pfn;
>> -                       continue;
>> +                       if (vpfn->unpinned)
>> +                               vfio_remove_from_pfn_list(dma, vpfn);
> what about updating vpfn instead?
> 

vfio_pin_page_external() takes care of the verification checks and mem
lock accounting. I prefer to free the existing node and add a new one
using the existing functions.

>> +                       else {
>> +                               phys_pfn[i] = vpfn->pfn;
>> +                               continue;
>> +                       }
>>                   }
>>
>>
>>
>>>> -		vfio_remove_from_pfn_list(dma, vpfn);
>>>> +		if (dirty_tracking)
>>>> +			vpfn->unpinned = true;
>>>> +		else
>>>> +			vfio_remove_from_pfn_list(dma, vpfn);
>>> so the unpinned pages before dirty page tracking is not treated as
>>> dirty?
>>>
>>
>> Yes. That's we agreed on previous version:
>> https://www.mail-archive.com/qemu-devel@nongnu.org/msg663157.html
>>
>>>>    	}
>>>>    	return ret;
>>>>    }
>>>> @@ -504,7 +536,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
>>>>    }
>>>>    
>>>>    static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
>>>> -				    bool do_accounting)
>>>> +				    bool do_accounting, bool dirty_tracking)
>>>>    {
>>>>    	int unlocked;
>>>>    	struct vfio_pfn *vpfn = vfio_find_vpfn(dma, iova);
>>>> @@ -512,7 +544,10 @@ static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
>>>>    	if (!vpfn)
>>>>    		return 0;
>>>>    
>>>> -	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn);
>>>> +	if (vpfn->unpinned)
>>>> +		return 0;
>>>> +
>>>> +	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn, dirty_tracking);
>>>>    
>>>>    	if (do_accounting)
>>>>    		vfio_lock_acct(dma, -unlocked, true);
>>>> @@ -583,7 +618,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>>>    
>>>>    		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
>>>>    		if (ret) {
>>>> -			vfio_unpin_page_external(dma, iova, do_accounting);
>>>> +			vfio_unpin_page_external(dma, iova, do_accounting,
>>>> +						 false);
>>>>    			goto pin_unwind;
>>>>    		}
>>>>    	}
>>>> @@ -598,7 +634,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>>>    
>>>>    		iova = user_pfn[j] << PAGE_SHIFT;
>>>>    		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
>>>> -		vfio_unpin_page_external(dma, iova, do_accounting);
>>>> +		vfio_unpin_page_external(dma, iova, do_accounting, false);
>>>>    		phys_pfn[j] = 0;
>>>>    	}
>>>>    pin_done:
>>>> @@ -632,7 +668,8 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
>>>>    		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
>>>>    		if (!dma)
>>>>    			goto unpin_exit;
>>>> -		vfio_unpin_page_external(dma, iova, do_accounting);
>>>> +		vfio_unpin_page_external(dma, iova, do_accounting,
>>>> +					 iommu->dirty_page_tracking);
>>>>    	}
>>>>    
>>>>    unpin_exit:
>>>> @@ -850,6 +887,88 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
>>>>    	return bitmap;
>>>>    }
>>>>    
>>>> +/*
>>>> + * start_iova is the reference from where bitmaping started. This is called
>>>> + * from DMA_UNMAP where start_iova can be different than iova
>>>> + */
>>>> +
>>>> +static void vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
>>>> +				  size_t size, uint64_t pgsize,
>>>> +				  dma_addr_t start_iova, unsigned long *bitmap)
>>>> +{
>>>> +	struct vfio_dma *dma;
>>>> +	dma_addr_t i = iova;
>>>> +	unsigned long pgshift = __ffs(pgsize);
>>>> +
>>>> +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
>>>> +		/* mark all pages dirty if all pages are pinned and mapped. */
>>>> +		if (dma->iommu_mapped) {
>>> This prevents pass-through devices from calling vfio_pin_pages to do
>>> fine grained log dirty.
>>
>> Yes, I mentioned that in yet TODO item in cover letter:
>>
>> "If IOMMU capable device is present in the container, then all pages are
>> marked dirty. Need to think smart way to know if IOMMU capable device's
>> driver is smart to report pages to be marked dirty by pinning those
>> pages externally."
>>
> why not just check first if any vpfn present for IOMMU capable devices?
> 

vfio_pin_pages(dev, ...) calls driver->ops->pin_pages(iommu, ...)

In the vfio_iommu_type1 module, vfio_iommu_type1_pin_pages() doesn't
know the device. vpfns are tracked against the container's iommu, not
against the device. We need to think of a smart way to know whether all
devices in the container are smart, i.e. report dirty pages by pinning
those pages explicitly.
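For reference, the type1 callback only receives the container's
iommu_data (member of struct vfio_iommu_driver_ops as of this series,
quoted here for illustration):

        /* no struct device argument here, only the container's iommu_data */
        int     (*pin_pages)(void *iommu_data,
                             unsigned long *user_pfn,
                             int npage, int prot,
                             unsigned long *phys_pfn);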


>>
>>>> +			dma_addr_t iova_limit;
>>>> +
>>>> +			iova_limit = (dma->iova + dma->size) < (iova + size) ?
>>>> +				     (dma->iova + dma->size) : (iova + size);
>>>> +
>>>> +			for (; i < iova_limit; i += pgsize) {
>>>> +				unsigned int start;
>>>> +
>>>> +				start = (i - start_iova) >> pgshift;
>>>> +
>>>> +				__bitmap_set(bitmap, start, 1);
>>>> +			}
>>>> +			if (i >= iova + size)
>>>> +				return;
>>>> +		} else {
>>>> +			struct rb_node *n = rb_first(&dma->pfn_list);
>>>> +			bool found = false;
>>>> +
>>>> +			for (; n; n = rb_next(n)) {
>>>> +				struct vfio_pfn *vpfn = rb_entry(n,
>>>> +							struct vfio_pfn, node);
>>>> +				if (vpfn->iova >= i) {
>>>> +					found = true;
>>>> +					break;
>>>> +				}
>>>> +			}
>>>> +
>>>> +			if (!found) {
>>>> +				i += dma->size;
>>>> +				continue;
>>>> +			}
>>>> +
>>>> +			for (; n; n = rb_next(n)) {
>>>> +				unsigned int start;
>>>> +				struct vfio_pfn *vpfn = rb_entry(n,
>>>> +							struct vfio_pfn, node);
>>>> +
>>>> +				if (vpfn->iova >= iova + size)
>>>> +					return;
>>>> +
>>>> +				start = (vpfn->iova - start_iova) >> pgshift;
>>>> +
>>>> +				__bitmap_set(bitmap, start, 1);
>>>> +
>>>> +				i = vpfn->iova + pgsize;
>>>> +			}
>>>> +		}
>>>> +		vfio_remove_unpinned_from_pfn_list(dma, false);
>>>> +	}
>>>> +}
>>>> +
>>>> +static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
>>>> +{
>>>> +	long bsize;
>>>> +
>>>> +	if (!bitmap_size || bitmap_size > SIZE_MAX)
>>>> +		return -EINVAL;
>>>> +
>>>> +	bsize = ALIGN(npages, BITS_PER_LONG) / sizeof(unsigned long);
>>>> +
>>>> +	if (bitmap_size < bsize)
>>>> +		return -EINVAL;
>>>> +
>>>> +	return bsize;
>>>> +}
>>>> +
>>>>    static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>>>    			     struct vfio_iommu_type1_dma_unmap *unmap)
>>>>    {
>>>> @@ -2298,6 +2417,83 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>>>>    
>>>>    		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>>>>    			-EFAULT : 0;
>>>> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
>>>> +		struct vfio_iommu_type1_dirty_bitmap range;
>>>> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
>>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
>>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
>>>> +		int ret;
>>>> +
>>>> +		if (!iommu->v2)
>>>> +			return -EACCES;
>>>> +
>>>> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
>>>> +				    bitmap);
>>>> +
>>>> +		if (copy_from_user(&range, (void __user *)arg, minsz))
>>>> +			return -EFAULT;
>>>> +
>>>> +		if (range.argsz < minsz || range.flags & ~mask)
>>>> +			return -EINVAL;
>>>> +
>>>> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
>>>> +			iommu->dirty_page_tracking = true;
>>>> +			return 0;
>>>> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
>>>> +			iommu->dirty_page_tracking = false;
>>>> +
>>>> +			mutex_lock(&iommu->lock);
>>>> +			vfio_remove_unpinned_from_dma_list(iommu);
>>>> +			mutex_unlock(&iommu->lock);
>>>> +			return 0;
>>>> +
>>>> +		} else if (range.flags &
>>>> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
>>>> +			uint64_t iommu_pgmask;
>>>> +			unsigned long pgshift = __ffs(range.pgsize);
>>>> +			unsigned long *bitmap;
>>>> +			long bsize;
>>>> +
>>>> +			iommu_pgmask =
>>>> +			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
>>>> +
>>>> +			if (((range.pgsize - 1) & iommu_pgmask) !=
>>>> +			    (range.pgsize - 1))
>>>> +				return -EINVAL;
>>>> +
>>>> +			if (range.iova & iommu_pgmask)
>>>> +				return -EINVAL;
>>>> +			if (!range.size || range.size > SIZE_MAX)
>>>> +				return -EINVAL;
>>>> +			if (range.iova + range.size < range.iova)
>>>> +				return -EINVAL;
>>>> +
>>>> +			bsize = verify_bitmap_size(range.size >> pgshift,
>>>> +						   range.bitmap_size);
>>>> +			if (bsize)
>>>> +				return ret;
>>>> +
>>>> +			bitmap = kmalloc(bsize, GFP_KERNEL);
>>>> +			if (!bitmap)
>>>> +				return -ENOMEM;
>>>> +
>>>> +			ret = copy_from_user(bitmap,
>>>> +			     (void __user *)range.bitmap, bsize) ? -EFAULT : 0;
>>>> +			if (ret)
>>>> +				goto bitmap_exit;
>>>> +
>>>> +			iommu->dirty_page_tracking = false;
>>> why iommu->dirty_page_tracking is false here?
>>> suppose this ioctl can be called several times.
>>>
>>
>> This ioctl can be called several times, but once this ioctl is called
>> that means vCPUs are stopped and VFIO devices are stopped (i.e. in
>> stop-and-copy phase) and dirty pages bitmap are being queried by user.
>>
> can't agree that VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP can only be
> called in stop-and-copy phase.
> As stated in last version, this will cause QEMU to get a wrong expectation
> of VM downtime and this is also the reason for previously pinned pages
> before log_sync cannot be treated as dirty. If this get bitmap ioctl can
> be called early in save_setup phase, then it's no problem even all ram
> is dirty.
> 

The device can also write to pages which are pinned, and then there is
no way to know which pages were dirtied by the device during the
pre-copy phase.
Even if the user asks for the dirty bitmap in the pre-copy phase, the
user will still have to query it again in the stop-and-copy phase,
where it will be a superset including all pages reported during
pre-copy. So instead of copying all pages twice, it's better to do it
once, during the stop-and-copy phase.
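To spell out the usage model I'm describing (userspace-side sketch;
container_fd, iova, size, page_size, bitmap and bitmap_size are
placeholders, range.bitmap is assumed to carry the user buffer address
as a __u64, error handling omitted):

        struct vfio_iommu_type1_dirty_bitmap range = { .argsz = sizeof(range) };

        /* when migration starts (pre-copy begins): start tracking
         * unpinned pages
         */
        range.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
        ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, &range);

        /* in stop-and-copy, with vCPUs and VFIO devices stopped:
         * query the bitmap once
         */
        range.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
        range.iova = iova;
        range.size = size;
        range.pgsize = page_size;
        range.bitmap_size = bitmap_size;
        range.bitmap = (__u64)(uintptr_t)bitmap;
        ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, &range);

        /* if migration fails or is cancelled: stop tracking and drop
         * the tracked state
         */
        range.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
        ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, &range);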

Thanks,
Kirti


> Thanks
> Yan
>>
>>
>>> Thanks
>>> Yan
>>>> +			mutex_lock(&iommu->lock);
>>>> +			vfio_iova_dirty_bitmap(iommu, range.iova, range.size,
>>>> +					     range.pgsize, range.iova, bitmap);
>>>> +			mutex_unlock(&iommu->lock);
>>>> +
>>>> +			ret = copy_to_user((void __user *)range.bitmap, bitmap,
>>>> +					   range.bitmap_size) ? -EFAULT : 0;
>>>> +bitmap_exit:
>>>> +			kfree(bitmap);
>>>> +			return ret;
>>>> +		}
>>>>    	}
>>>>    
>>>>    	return -ENOTTY;
>>>> -- 
>>>> 2.7.0
>>>>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
  2019-12-17  6:28     ` Kirti Wankhede
  2019-12-17  7:12       ` Yan Zhao
@ 2019-12-17 18:43       ` Alex Williamson
  2019-12-19 16:08         ` Kirti Wankhede
  1 sibling, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2019-12-17 18:43 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Tue, 17 Dec 2019 11:58:44 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 12/17/2019 4:14 AM, Alex Williamson wrote:
> > On Tue, 17 Dec 2019 01:51:36 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> - Defined MIGRATION region type and sub-type.
> >>
> >> - Defined vfio_device_migration_info structure which will be placed at 0th
> >>    offset of migration region to get/set VFIO device related information.
> >>    Defined members of structure and usage on read/write access.
> >>
> >> - Defined device states and added state transition details in the comment.
> >>
> >> - Added sequence to be followed while saving and resuming VFIO device state
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>   include/uapi/linux/vfio.h | 180 ++++++++++++++++++++++++++++++++++++++++++++++
> >>   1 file changed, 180 insertions(+)
> >>
> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index 9e843a147ead..a0817ba267c1 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h
> >> @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
> >>   #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
> >>   #define VFIO_REGION_TYPE_GFX                    (1)
> >>   #define VFIO_REGION_TYPE_CCW			(2)
> >> +#define VFIO_REGION_TYPE_MIGRATION              (3)
> >>   
> >>   /* sub-types for VFIO_REGION_TYPE_PCI_* */
> >>   
> >> @@ -379,6 +380,185 @@ struct vfio_region_gfx_edid {
> >>   /* sub-types for VFIO_REGION_TYPE_CCW */
> >>   #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
> >>   
> >> +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
> >> +#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
> >> +
> >> +/*
> >> + * Structure vfio_device_migration_info is placed at 0th offset of
> >> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> >> + * information. Field accesses from this structure are only supported at their
> >> + * native width and alignment, otherwise the result is undefined and vendor
> >> + * drivers should return an error.
> >> + *
> >> + * device_state: (read/write)
> >> + *      To indicate vendor driver the state VFIO device should be transitioned
> >> + *      to. If device state transition fails, write on this field return error.
> >> + *      It consists of 3 bits:
> >> + *      - If bit 0 set, indicates _RUNNING state. When its clear, that indicates  
> > 
> > s/its/it's/
> >   
> >> + *        _STOP state. When device is changed to _STOP, driver should stop
> >> + *        device before write() returns.
> >> + *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
> >> + *        should start gathering device state information which will be provided
> >> + *        to VFIO user space application to save device's state.
> >> + *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
> >> + *        prepare to resume device, data provided through migration region
> >> + *        should be used to resume device.
> >> + *      Bits 3 - 31 are reserved for future use. User should perform
> >> + *      read-modify-write operation on this field.
> >> + *
> >> + *  +------- _RESUMING
> >> + *  |+------ _SAVING
> >> + *  ||+----- _RUNNING
> >> + *  |||
> >> + *  000b => Device Stopped, not saving or resuming
> >> + *  001b => Device running state, default state
> >> + *  010b => Stop Device & save device state, stop-and-copy state
> >> + *  011b => Device running and save device state, pre-copy state
> >> + *  100b => Device stopped and device state is resuming
> >> + *  101b => Invalid state  
> > 
> > Eventually this would be intended for post-copy, if supported by the
> > device, right?
> >   
> 
> No, as per Yan mentioned in earlier version, _RESUMING + _RUNNING can't 
> be used for post-copy. New flag will be required for post-copy.
> 
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg658768.html
> 
> >> + *  110b => Invalid state
> >> + *  111b => Invalid state
> >> + *
> >> + * State transitions:
> >> + *
> >> + *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
> >> + *                (100b)     (001b)     (011b)        (010b)       (000b)
> >> + * 0. Running or Default state
> >> + *                             |
> >> + *
> >> + * 1. Normal Shutdown  
> > 
> > Optional, userspace is under no obligation.
> >   
> >> + *                             |------------------------------------->|
> >> + *
> >> + * 2. Save state or Suspend
> >> + *                             |------------------------->|---------->|
> >> + *
> >> + * 3. Save state during live migration
> >> + *                             |----------->|------------>|---------->|
> >> + *
> >> + * 4. Resuming
> >> + *                  |<---------|
> >> + *
> >> + * 5. Resumed
> >> + *                  |--------->|
> >> + *
> >> + * 0. Default state of VFIO device is _RUNNNG when VFIO application starts.
> >> + * 1. During normal VFIO application shutdown, vfio device state changes
> >> + *    from _RUNNING to _STOP.  
> > 
> > We cannot impose this requirement on existing userspace.  Userspace may
> > perform this action, but they are not required to and the vendor driver
> > must not require it.  
> 
> Updated comment.
> 
> >   
> >> + * 2. When VFIO application save state or suspend application, VFIO device
> >> + *    state transition is from _RUNNING to stop-and-copy state and then to
> >> + *    _STOP.
> >> + *    On state transition from _RUNNING to stop-and-copy, driver must
> >> + *    stop device, save device state and send it to application through
> >> + *    migration region.
> >> + *    On _RUNNING to stop-and-copy state transition failure, application should
> >> + *    set VFIO device state to _RUNNING.  
> > 
> > A state transition failure means that the user's write to device_state
> > failed, so is it the user's responsibility to set the next state?  
> 
> Right.

If a transition failure occurs, ie. errno from write(2), what value is
reported by a read(2) of device_state in the interim between the failure
and a next state written by the user?  If this is a valid state,
wouldn't it be reasonable for the user to assume the device is already
operating in that state?  If it's an invalid state, do we need to
define the use cases for those invalid states?  If the user needs to
set the state back to _RUNNING, that suggests the device might be
stopped, which has implications beyond the migration state.

> >  Why
> > is it necessarily _RUNNING vs _STOP?
> >  
> 
> While changing From pre-copy to stop-and-copy transition, device is 
> still running, only saving of device state started. Now if transition to 
> stop-and-copy fails, from user point of view application or VM is still 
> running, device state should be set to _RUNNING so that whatever the 
> application/VM is running should continue at source.

Seems it's the user's discretion whether to consider this continuable
or fatal; the vfio interface specification should support a given usage
model, not prescribe it.

> >> + * 3. In VFIO application live migration, state transition is from _RUNNING
> >> + *    to pre-copy to stop-and-copy to _STOP.
> >> + *    On state transition from _RUNNING to pre-copy, driver should start
> >> + *    gathering device state while application is still running and send device
> >> + *    state data to application through migration region.
> >> + *    On state transition from pre-copy to stop-and-copy, driver must stop
> >> + *    device, save device state and send it to application through migration
> >> + *    region.
> >> + *    On any failure during any of these state transition, VFIO device state
> >> + *    should be set to _RUNNING.  
> > 
> > Same comment as above regarding next state on failure.
> >   
> 
> If application or VM migration fails, it should continue to run at 
> source. In case of VM, guest user isn't aware of migration, and from his 
> point VM should be running.

vfio is not prescribing the migration semantics to userspace; it's
presenting an interface that supports the user's semantics.  Therefore,
while it's useful to understand the expected usage model, I think we
also need a mechanism that the user can always determine the
device_state after a fault and allowable state transitions independent
of the expected usage model.  For example, I think a user should always
be allowed to transition a device to stopped regardless of the expected
migration flow.  An error might have occurred elsewhere and we want to
stop everything for debugging.  I think it's also allowable to switch
directly from running to stop-and-copy, for example to save and resume
a VM offline.
 
> > Also, it seems like it's the vendor driver's discretion to actually
> > provide data during the pre-copy phase.  As we've defined it, the
> > vendor driver needs to participate in the migration region regardless,
> > they might just always report no pending_bytes until we enter
> > stop-and-copy.
> >   
> 
> Yes. And if pending_bytes are reported as 0 in pre-copy by vendor driver 
> then QEMU doesn't reiterate for that device.

Maybe we can state that as the expected mechanism to avoid a vendor
driver trying to invent alternative means, e.g. failing the transition to
pre-copy, requesting new flags, etc.

> >> + * 4. To start resuming phase, VFIO device state should be transitioned from
> >> + *    _RUNNING to _RESUMING state.
> >> + *    In _RESUMING state, driver should use received device state data through
> >> + *    migration region to resume device.
> >> + *    On failure during this state transition, application should set _RUNNING
> >> + *    state.  
> > 
> > Same comment regarding setting next state after failure.  
> 
> If device couldn't be transitioned to _RESUMING, then it should be set 
> to default state, that is _RUNNING.
> 
> >   
> >> + * 5. On providing saved device data to driver, appliation should change state
> >> + *    from _RESUMING to _RUNNING.
> >> + *    On failure to transition to _RUNNING state, VFIO application should reset
> >> + *    the device and set _RUNNING state so that device doesn't remain in unknown
> >> + *    or bad state. On reset, driver must reset device and device should be
> >> + *    available in default usable state.  
> > 
> > Didn't we discuss that the reset ioctl should return the device to the
> > initial state, including the transition to _RUNNING?  
> 
> Yes, that's default usable state, rewording it to initial state.
> 
> >  Also, as above,
> > it's the user write that triggers the failure, this register is listed
> > as read-write, so what value does the vendor driver report for the
> > state when read after a transition failure?  Is it reported as _RESUMING
> > as it was prior to the attempted transition, or may the invalid states
> > be used by the vendor driver to indicate the device is broken?
> >   
> 
> If transition as failed, device should report its previous state and 
> reset device should bring back to usable _RUNNING state.

If device_state reports the previous state, then the user should
reasonably infer that the device is already in that state without a
need for them to set it, IMO.

> >> + *
> >> + * pending bytes: (read only)
> >> + *      Number of pending bytes yet to be migrated from vendor driver
> >> + *
> >> + * data_offset: (read only)
> >> + *      User application should read data_offset in migration region from where
> >> + *      user application should read device data during _SAVING state or write
> >> + *      device data during _RESUMING state. See below for detail of sequence to
> >> + *      be followed.
> >> + *
> >> + * data_size: (read/write)
> >> + *      User application should read data_size to get size of data copied in
> >> + *      bytes in migration region during _SAVING state and write size of data
> >> + *      copied in bytes in migration region during _RESUMING state.
> >> + *
> >> + * Migration region looks like:
> >> + *  ------------------------------------------------------------------
> >> + * |vfio_device_migration_info|    data section                      |
> >> + * |                          |     ///////////////////////////////  |
> >> + * ------------------------------------------------------------------
> >> + *   ^                              ^
> >> + *  offset 0-trapped part        data_offset
> >> + *
> >> + * Structure vfio_device_migration_info is always followed by data section in
> >> + * the region, so data_offset will always be non-0. Offset from where data is
> >> + * copied is decided by kernel driver, data section can be trapped or mapped
> >> + * or partitioned, depending on how kernel driver defines data section.
> >> + * Data section partition can be defined as mapped by sparse mmap capability.
> >> + * If mmapped, then data_offset should be page aligned, where as initial section
> >> + * which contain vfio_device_migration_info structure might not end at offset
> >> + * which is page aligned. The user is not required to access via mmap regardless
> >> + * of the region mmap capabilities.
> >> + * Vendor driver should decide whether to partition data section and how to
> >> + * partition the data section. Vendor driver should return data_offset
> >> + * accordingly.
> >> + *
> >> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> >> + * and for _SAVING device state or stop-and-copy phase:
> >> + * a. read pending_bytes, indicates start of new iteration to get device data.
> >> + *    If there was previous iteration, then this read operation indicates
> >> + *    previous iteration is done. If pending_bytes > 0, go through below steps.
> >> + * b. read data_offset, indicates kernel driver to make data available through
> >> + *    data section. Kernel driver should return this read operation only after
> >> + *    data is available from (region + data_offset) to (region + data_offset +
> >> + *    data_size).
> >> + * c. read data_size, amount of data in bytes available through migration
> >> + *    region.
> >> + * d. read data of data_size bytes from (region + data_offset) from migration
> >> + *    region.
> >> + * e. process data.
> >> + * f. Loop through a to e.  
> > 
> > It seems we always need to end an iteration by reading pending_bytes to
> > signal to the vendor driver to release resources, so should the end of
> > the loop be:
> > 
> > e. Read pending_bytes
> > f. Goto b. or optionally restart next iteration at a.
> > 
> > I think this is defined such that reading data_offset commits resources
> > and reading pending_bytes frees them, allowing userspace to restart at
> > reading pending_bytes with no side-effects.  Therefore reading
> > pending_bytes repeatedly is supported.  Is the same true for
> > data_offset and data_size?  It seems reasonable that the vendor driver
> > can simply return offset and size for the current buffer if the user
> > reads these more than once.
> >  
> 
> Right.

Can we add that to the spec?
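To make sure we're reading the sequence the same way, here's the loop I
have in mind (userspace sketch against the proposed
vfio_device_migration_info fields; device_fd, the migration region
offset 'region', 'buf' and the local variables are placeholders, error
handling omitted):

        pread(device_fd, &pending, sizeof(pending), region +
              offsetof(struct vfio_device_migration_info, pending_bytes));
        while (pending) {
                /* reading data_offset asks the driver to commit a buffer */
                pread(device_fd, &data_offset, sizeof(data_offset), region +
                      offsetof(struct vfio_device_migration_info, data_offset));
                pread(device_fd, &data_size, sizeof(data_size), region +
                      offsetof(struct vfio_device_migration_info, data_size));
                pread(device_fd, buf, data_size, region + data_offset);
                /* ... forward buf to the destination ... */

                /* reading pending_bytes ends the iteration and lets the
                 * driver release the buffer; re-reading it (or data_offset/
                 * data_size) simply returns the current values
                 */
                pread(device_fd, &pending, sizeof(pending), region +
                      offsetof(struct vfio_device_migration_info, pending_bytes));
        }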

> > How is a protocol or device error signaled?  For example, we can have a
> > user error where they read data_size before data_offset.  Should the
> > vendor driver generate a fault reading data_size in this case.  We can
> > also have internal errors in the vendor driver, should the vendor
> > driver use a special errno or update device_state autonomously to
> > indicate such an error?  
> 
> If there is any error during the sequence, vendor driver can return 
> error code for next read/write operation, that will terminate the loop 
> and migration would fail.

Please add to spec.

> > I believe it's also part of the intended protocol that the user can
> > transition from _SAVING|_RUNNING to _SAVING at any point, regardless of
> > pending_bytes.  This should be noted.
> >   
> 
> Ok. Updating comment.
> 
> >> + *
> >> + * Sequence to be followed while _RESUMING device state:
> >> + * While data for this device is available, repeat below steps:
> >> + * a. read data_offset from where user application should write data.
> >> + * b. write data of data_size to migration region from data_offset.  
> > 
> > Whose's data_size, the _SAVING end or the _RESUMING end?  I think this
> > is intended to be the transaction size from the _SAVING source,   
> 
> Not necessarily. data_size could be MIN(transaction size of source, 
> migration data section). If migration data section is smaller than data 
> packet size at source, then it has to be broken and iteratively sent.

So you're saying that a transaction from the source is divisible by the
user under certain conditions.  What other conditions exist?  Can the
user decide arbitrary sizes less than the MIN() stated above?  This
needs to be specified.

> > but it
> > could easily be misinterpreted as reading data_size on the _RESUMING
> > end.
> >   
> >> + * c. write data_size which indicates vendor driver that data is written in
> >> + *    staging buffer. Vendor driver should read this data from migration
> >> + *    region and resume device's state.  
> > 
> > I think we also need to define the error protocol.  The user could
> > mis-order transactions or there could be an internal error in the
> > vendor driver or device.  Are all read(2)/write(2) operations
> > susceptible to defined errnos to signal this?  
> 
> Yes.

And those defined errnos are specified...

> >  Is it reflected in
> > device_state?    
> 
> No.

So a user should do what, just keep trying?
 
> > What's the recovery protocol?
> >   
> 
> On read()/write() failure user should take necessary action.

Where is that necessary action defined?  Can they just try again?  Do
they transition in and out of _RESUMING to try again?  Do they need to
reset the device?

> >> + *
> >> + * For user application, data is opaque. User should write data in the same
> >> + * order as received.  
> > 
> > Order and transaction size, ie. each data_size chunk is indivisible by
> > the user.  
> 
> Transaction size can differ, but order should remain same.

Under what circumstances and to what extent can transaction size
differ?  Is the MIN() algorithm above the absolute lower bound or just
a suggestion?  Is the user allowed to concatenate transactions from the
source together on the target if the region is sufficiently large?  It
seems like quite an imposition on the vendor driver to support this
flexibility.

> >> + */
> >> +
> >> +struct vfio_device_migration_info {
> >> +	__u32 device_state;         /* VFIO device state */
> >> +#define VFIO_DEVICE_STATE_STOP      (1 << 0)
> >> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)  
> > 
> > Huh?  We should probably just refer to it consistently, ie. _RUNNING
> > and !_RUNNING, otherwise we have the incongruity that setting the _STOP
> > value is actually the opposite of the necessary logic value (_STOP = 1
> > is _RUNNING, _STOP = 0 is !_RUNNING).  
> 
> Ops, my mistake, forgot to update to
> #define VFIO_DEVICE_STATE_STOP      (0)
> 
> >   
> >> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> >> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> >> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> >> +				     VFIO_DEVICE_STATE_SAVING |  \
> >> +				     VFIO_DEVICE_STATE_RESUMING)
> >> +
> >> +#define VFIO_DEVICE_STATE_INVALID_CASE1    (VFIO_DEVICE_STATE_SAVING | \
> >> +					    VFIO_DEVICE_STATE_RESUMING)
> >> +
> >> +#define VFIO_DEVICE_STATE_INVALID_CASE2    (VFIO_DEVICE_STATE_RUNNING | \
> >> +					    VFIO_DEVICE_STATE_RESUMING)  
> > 
> > Gack, we fixed these in the last iteration!
> >   
> 
> That solution doesn't scale when new flags will be added. I still prefer 
> to define as above.

I see, the argument was buried in a reply to Yan, sorry if I missed it:

>>> These seem difficult to use, maybe we just need a
>>> VFIO_DEVICE_STATE_VALID macro?
>>>
>>> #define VFIO_DEVICE_STATE_VALID(state) \
>>>    (state & VFIO_DEVICE_STATE_RESUMING ? \
>>>    (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
>>>  
>
> This will not be work when use of other bits gets added in future. 
> That's the reason I preferred to add individual invalid states which 
> user should check.

I would argue that what doesn't scale is having numerous CASE1, CASE2,
CASEn conditions elsewhere in the kernel rather than having a unified,
single macro that defines a valid state.  Why do you worry this will be
a problem when new flags are added?  Can't we just update the macro?
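For example, with a made-up _POSTCOPY bit that is only valid in
combination with _RUNNING (and also added to VFIO_DEVICE_STATE_MASK),
only the macro would need to change (sketch only):

#define VFIO_DEVICE_STATE_VALID(state) \
    ((state & VFIO_DEVICE_STATE_RESUMING ? \
      (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1) && \
     (state & VFIO_DEVICE_STATE_POSTCOPY ? \
      !!(state & VFIO_DEVICE_STATE_RUNNING) : 1))

Callers keep testing VFIO_DEVICE_STATE_VALID(state) and never need to
know about the new invalid combinations.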
Thanks,

Alex


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 4/5] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2019-12-17 11:47         ` Kirti Wankhede
@ 2019-12-18  1:04           ` Yan Zhao
  2019-12-18 20:05             ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 44+ messages in thread
From: Yan Zhao @ 2019-12-18  1:04 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng,
	Liu, Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

On Tue, Dec 17, 2019 at 07:47:05PM +0800, Kirti Wankhede wrote:
> 
> 
> On 12/17/2019 3:21 PM, Yan Zhao wrote:
> > On Tue, Dec 17, 2019 at 05:24:14PM +0800, Kirti Wankhede wrote:
> >>
> >>
> >> On 12/17/2019 10:45 AM, Yan Zhao wrote:
> >>> On Tue, Dec 17, 2019 at 04:21:39AM +0800, Kirti Wankhede wrote:
> >>>> VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
> >>>> - Start unpinned pages dirty pages tracking while migration is active and
> >>>>     device is running, i.e. during pre-copy phase.
> >>>> - Stop unpinned pages dirty pages tracking. This is required to stop
> >>>>     unpinned dirty pages tracking if migration failed or cancelled during
> >>>>     pre-copy phase. Unpinned pages tracking is clear.
> >>>> - Get dirty pages bitmap. Stop unpinned dirty pages tracking and clear
> >>>>     unpinned pages information on bitmap read. This ioctl returns bitmap of
> >>>>     dirty pages, its user space application responsibility to copy content
> >>>>     of dirty pages from source to destination during migration.
> >>>>
> >>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >>>> ---
> >>>>    drivers/vfio/vfio_iommu_type1.c | 210 ++++++++++++++++++++++++++++++++++++++--
> >>>>    1 file changed, 203 insertions(+), 7 deletions(-)
> >>>>
> >>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >>>> index 3f6b04f2334f..264449654d3f 100644
> >>>> --- a/drivers/vfio/vfio_iommu_type1.c
> >>>> +++ b/drivers/vfio/vfio_iommu_type1.c
> >>>> @@ -70,6 +70,7 @@ struct vfio_iommu {
> >>>>    	unsigned int		dma_avail;
> >>>>    	bool			v2;
> >>>>    	bool			nesting;
> >>>> +	bool			dirty_page_tracking;
> >>>>    };
> >>>>    
> >>>>    struct vfio_domain {
> >>>> @@ -112,6 +113,7 @@ struct vfio_pfn {
> >>>>    	dma_addr_t		iova;		/* Device address */
> >>>>    	unsigned long		pfn;		/* Host pfn */
> >>>>    	atomic_t		ref_count;
> >>>> +	bool			unpinned;
> >>>>    };
> >>>>    
> >>>>    struct vfio_regions {
> >>>> @@ -244,6 +246,32 @@ static void vfio_remove_from_pfn_list(struct vfio_dma *dma,
> >>>>    	kfree(vpfn);
> >>>>    }
> >>>>    
> >>>> +static void vfio_remove_unpinned_from_pfn_list(struct vfio_dma *dma, bool warn)
> >>>> +{
> >>>> +	struct rb_node *n = rb_first(&dma->pfn_list);
> >>>> +
> >>>> +	for (; n; n = rb_next(n)) {
> >>>> +		struct vfio_pfn *vpfn = rb_entry(n, struct vfio_pfn, node);
> >>>> +
> >>>> +		if (warn)
> >>>> +			WARN_ON_ONCE(vpfn->unpinned);
> >>>> +
> >>>> +		if (vpfn->unpinned)
> >>>> +			vfio_remove_from_pfn_list(dma, vpfn);
> >>>> +	}
> >>>> +}
> >>>> +
> >>>> +static void vfio_remove_unpinned_from_dma_list(struct vfio_iommu *iommu)
> >>>> +{
> >>>> +	struct rb_node *n = rb_first(&iommu->dma_list);
> >>>> +
> >>>> +	for (; n; n = rb_next(n)) {
> >>>> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> >>>> +
> >>>> +		vfio_remove_unpinned_from_pfn_list(dma, false);
> >>>> +	}
> >>>> +}
> >>>> +
> >>>>    static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
> >>>>    					       unsigned long iova)
> >>>>    {
> >>>> @@ -254,13 +282,17 @@ static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
> >>>>    	return vpfn;
> >>>>    }
> >>>>    
> >>>> -static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
> >>>> +static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn,
> >>>> +				  bool dirty_tracking)
> >>>>    {
> >>>>    	int ret = 0;
> >>>>    
> >>>>    	if (atomic_dec_and_test(&vpfn->ref_count)) {
> >>>>    		ret = put_pfn(vpfn->pfn, dma->prot);
> >>> if physical page here is put, it may cause problem when pin this iova
> >>> next time:
> >>> vfio_iommu_type1_pin_pages {
> >>>       ...
> >>>       vpfn = vfio_iova_get_vfio_pfn(dma, iova);
> >>>       if (vpfn) {
> >>>           phys_pfn[i] = vpfn->pfn;
> >>>           continue;
> >>>       }
> >>>       ...
> >>> }
> >>>
> >>
> >> Good point. Fixing it as:
> >>
> >>                   vpfn = vfio_iova_get_vfio_pfn(dma, iova);
> >>                   if (vpfn) {
> >> -                       phys_pfn[i] = vpfn->pfn;
> >> -                       continue;
> >> +                       if (vpfn->unpinned)
> >> +                               vfio_remove_from_pfn_list(dma, vpfn);
> > what about updating vpfn instead?
> > 
> 
> vfio_pin_page_external() takes care of verification checks and mem lock 
> accounting. I prefer to free existing and add new node with existing 
> functions.
> 
> >> +                       else {
> >> +                               phys_pfn[i] = vpfn->pfn;
> >> +                               continue;
> >> +                       }
> >>                   }
> >>
> >>
> >>
> >>>> -		vfio_remove_from_pfn_list(dma, vpfn);
> >>>> +		if (dirty_tracking)
> >>>> +			vpfn->unpinned = true;
> >>>> +		else
> >>>> +			vfio_remove_from_pfn_list(dma, vpfn);
> >>> so the unpinned pages before dirty page tracking is not treated as
> >>> dirty?
> >>>
> >>
> >> Yes. That's we agreed on previous version:
> >> https://www.mail-archive.com/qemu-devel@nongnu.org/msg663157.html
> >>
> >>>>    	}
> >>>>    	return ret;
> >>>>    }
> >>>> @@ -504,7 +536,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
> >>>>    }
> >>>>    
> >>>>    static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
> >>>> -				    bool do_accounting)
> >>>> +				    bool do_accounting, bool dirty_tracking)
> >>>>    {
> >>>>    	int unlocked;
> >>>>    	struct vfio_pfn *vpfn = vfio_find_vpfn(dma, iova);
> >>>> @@ -512,7 +544,10 @@ static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
> >>>>    	if (!vpfn)
> >>>>    		return 0;
> >>>>    
> >>>> -	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn);
> >>>> +	if (vpfn->unpinned)
> >>>> +		return 0;
> >>>> +
> >>>> +	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn, dirty_tracking);
> >>>>    
> >>>>    	if (do_accounting)
> >>>>    		vfio_lock_acct(dma, -unlocked, true);
> >>>> @@ -583,7 +618,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>>>    
> >>>>    		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> >>>>    		if (ret) {
> >>>> -			vfio_unpin_page_external(dma, iova, do_accounting);
> >>>> +			vfio_unpin_page_external(dma, iova, do_accounting,
> >>>> +						 false);
> >>>>    			goto pin_unwind;
> >>>>    		}
> >>>>    	}
> >>>> @@ -598,7 +634,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>>>    
> >>>>    		iova = user_pfn[j] << PAGE_SHIFT;
> >>>>    		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
> >>>> -		vfio_unpin_page_external(dma, iova, do_accounting);
> >>>> +		vfio_unpin_page_external(dma, iova, do_accounting, false);
> >>>>    		phys_pfn[j] = 0;
> >>>>    	}
> >>>>    pin_done:
> >>>> @@ -632,7 +668,8 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
> >>>>    		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
> >>>>    		if (!dma)
> >>>>    			goto unpin_exit;
> >>>> -		vfio_unpin_page_external(dma, iova, do_accounting);
> >>>> +		vfio_unpin_page_external(dma, iova, do_accounting,
> >>>> +					 iommu->dirty_page_tracking);
> >>>>    	}
> >>>>    
> >>>>    unpin_exit:
> >>>> @@ -850,6 +887,88 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
> >>>>    	return bitmap;
> >>>>    }
> >>>>    
> >>>> +/*
> >>>> + * start_iova is the reference from where bitmaping started. This is called
> >>>> + * from DMA_UNMAP where start_iova can be different than iova
> >>>> + */
> >>>> +
> >>>> +static void vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> >>>> +				  size_t size, uint64_t pgsize,
> >>>> +				  dma_addr_t start_iova, unsigned long *bitmap)
> >>>> +{
> >>>> +	struct vfio_dma *dma;
> >>>> +	dma_addr_t i = iova;
> >>>> +	unsigned long pgshift = __ffs(pgsize);
> >>>> +
> >>>> +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
> >>>> +		/* mark all pages dirty if all pages are pinned and mapped. */
> >>>> +		if (dma->iommu_mapped) {
> >>> This prevents pass-through devices from calling vfio_pin_pages to do
> >>> fine grained log dirty.
> >>
> >> Yes, I mentioned that in yet TODO item in cover letter:
> >>
> >> "If IOMMU capable device is present in the container, then all pages are
> >> marked dirty. Need to think smart way to know if IOMMU capable device's
> >> driver is smart to report pages to be marked dirty by pinning those
> >> pages externally."
> >>
> > why not just check first if any vpfn present for IOMMU capable devices?
> > 
> 
> vfio_pin_pages(dev, ...) calls driver->ops->pin_pages(iommu, ...)
> 
> In vfio_iommu_type1 module, vfio_iommu_type1_pin_pages() doesn't know 
> the device. vpfn are tracked against container->iommu, not against 
> device. Need to think of smart way to know if devices in container are 
> all smart which report pages dirty ny pinning those pages manually.
>
I believe in such a case the mdev on top of the device is in the same
iommu group (i.e. a 1:1 mdev on top of the device).
The device vendor driver calls vfio_pin_pages to notify vfio which
pages are dirty.
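For illustration only (not part of this series), a vendor driver could
report a guest page it is about to dirty roughly like below, through
the existing external API:

        /* pin (and thereby mark dirty) one guest pfn the device will write */
        static int report_dirty_gfn(struct mdev_device *mdev, unsigned long gfn)
        {
                unsigned long hpfn;

                return vfio_pin_pages(mdev_dev(mdev), &gfn, 1,
                                      IOMMU_READ | IOMMU_WRITE, &hpfn);
        }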
> 
> >>
> >>>> +			dma_addr_t iova_limit;
> >>>> +
> >>>> +			iova_limit = (dma->iova + dma->size) < (iova + size) ?
> >>>> +				     (dma->iova + dma->size) : (iova + size);
> >>>> +
> >>>> +			for (; i < iova_limit; i += pgsize) {
> >>>> +				unsigned int start;
> >>>> +
> >>>> +				start = (i - start_iova) >> pgshift;
> >>>> +
> >>>> +				__bitmap_set(bitmap, start, 1);
> >>>> +			}
> >>>> +			if (i >= iova + size)
> >>>> +				return;
> >>>> +		} else {
> >>>> +			struct rb_node *n = rb_first(&dma->pfn_list);
> >>>> +			bool found = false;
> >>>> +
> >>>> +			for (; n; n = rb_next(n)) {
> >>>> +				struct vfio_pfn *vpfn = rb_entry(n,
> >>>> +							struct vfio_pfn, node);
> >>>> +				if (vpfn->iova >= i) {
> >>>> +					found = true;
> >>>> +					break;
> >>>> +				}
> >>>> +			}
> >>>> +
> >>>> +			if (!found) {
> >>>> +				i += dma->size;
> >>>> +				continue;
> >>>> +			}
> >>>> +
> >>>> +			for (; n; n = rb_next(n)) {
> >>>> +				unsigned int start;
> >>>> +				struct vfio_pfn *vpfn = rb_entry(n,
> >>>> +							struct vfio_pfn, node);
> >>>> +
> >>>> +				if (vpfn->iova >= iova + size)
> >>>> +					return;
> >>>> +
> >>>> +				start = (vpfn->iova - start_iova) >> pgshift;
> >>>> +
> >>>> +				__bitmap_set(bitmap, start, 1);
> >>>> +
> >>>> +				i = vpfn->iova + pgsize;
> >>>> +			}
> >>>> +		}
> >>>> +		vfio_remove_unpinned_from_pfn_list(dma, false);
> >>>> +	}
> >>>> +}
> >>>> +
> >>>> +static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
> >>>> +{
> >>>> +	long bsize;
> >>>> +
> >>>> +	if (!bitmap_size || bitmap_size > SIZE_MAX)
> >>>> +		return -EINVAL;
> >>>> +
> >>>> +	bsize = ALIGN(npages, BITS_PER_LONG) / sizeof(unsigned long);
> >>>> +
> >>>> +	if (bitmap_size < bsize)
> >>>> +		return -EINVAL;
> >>>> +
> >>>> +	return bsize;
> >>>> +}
> >>>> +
> >>>>    static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> >>>>    			     struct vfio_iommu_type1_dma_unmap *unmap)
> >>>>    {
> >>>> @@ -2298,6 +2417,83 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >>>>    
> >>>>    		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> >>>>    			-EFAULT : 0;
> >>>> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
> >>>> +		struct vfio_iommu_type1_dirty_bitmap range;
> >>>> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
> >>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
> >>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> >>>> +		int ret;
> >>>> +
> >>>> +		if (!iommu->v2)
> >>>> +			return -EACCES;
> >>>> +
> >>>> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
> >>>> +				    bitmap);
> >>>> +
> >>>> +		if (copy_from_user(&range, (void __user *)arg, minsz))
> >>>> +			return -EFAULT;
> >>>> +
> >>>> +		if (range.argsz < minsz || range.flags & ~mask)
> >>>> +			return -EINVAL;
> >>>> +
> >>>> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
> >>>> +			iommu->dirty_page_tracking = true;
> >>>> +			return 0;
> >>>> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
> >>>> +			iommu->dirty_page_tracking = false;
> >>>> +
> >>>> +			mutex_lock(&iommu->lock);
> >>>> +			vfio_remove_unpinned_from_dma_list(iommu);
> >>>> +			mutex_unlock(&iommu->lock);
> >>>> +			return 0;
> >>>> +
> >>>> +		} else if (range.flags &
> >>>> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
> >>>> +			uint64_t iommu_pgmask;
> >>>> +			unsigned long pgshift = __ffs(range.pgsize);
> >>>> +			unsigned long *bitmap;
> >>>> +			long bsize;
> >>>> +
> >>>> +			iommu_pgmask =
> >>>> +			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> >>>> +
> >>>> +			if (((range.pgsize - 1) & iommu_pgmask) !=
> >>>> +			    (range.pgsize - 1))
> >>>> +				return -EINVAL;
> >>>> +
> >>>> +			if (range.iova & iommu_pgmask)
> >>>> +				return -EINVAL;
> >>>> +			if (!range.size || range.size > SIZE_MAX)
> >>>> +				return -EINVAL;
> >>>> +			if (range.iova + range.size < range.iova)
> >>>> +				return -EINVAL;
> >>>> +
> >>>> +			bsize = verify_bitmap_size(range.size >> pgshift,
> >>>> +						   range.bitmap_size);
> >>>> +			if (bsize)
> >>>> +				return ret;
> >>>> +
> >>>> +			bitmap = kmalloc(bsize, GFP_KERNEL);
> >>>> +			if (!bitmap)
> >>>> +				return -ENOMEM;
> >>>> +
> >>>> +			ret = copy_from_user(bitmap,
> >>>> +			     (void __user *)range.bitmap, bsize) ? -EFAULT : 0;
> >>>> +			if (ret)
> >>>> +				goto bitmap_exit;
> >>>> +
> >>>> +			iommu->dirty_page_tracking = false;
> >>> why iommu->dirty_page_tracking is false here?
> >>> suppose this ioctl can be called several times.
> >>>
> >>
> >> This ioctl can be called several times, but once this ioctl is called
> >> that means vCPUs are stopped and VFIO devices are stopped (i.e. in
> >> stop-and-copy phase) and dirty pages bitmap are being queried by user.
> >>
> > can't agree that VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP can only be
> > called in stop-and-copy phase.
> > As stated in last version, this will cause QEMU to get a wrong expectation
> > of VM downtime and this is also the reason for previously pinned pages
> > before log_sync cannot be treated as dirty. If this get bitmap ioctl can
> > be called early in save_setup phase, then it's no problem even all ram
> > is dirty.
> > 
> 
> Device can also write to pages which are pinned, and then there is no 
> way to know pages dirtied by device during pre-copy phase.
> If user ask dirty bitmap in per-copy phase, even then user will have to 
> query dirty bitmap in stop-and-copy phase where this will be superset 
> including all pages reported during pre-copy. Then instead of copying 
> all pages twice, its better to do it once during stop-and-copy phase.
>
I think the flow should be like this:
1. save_setup --> GET_BITMAP ioctl --> return bitmap for currently +
previously pinned pages and clear all previously-pinned page records

2. save_pending --> GET_BITMAP ioctl --> return bitmap of (currently
pinned pages + previously pinned pages since the last clear) and clear
all previously-pinned page records

3. save_complete_precopy --> GET_BITMAP ioctl --> return bitmap of
(currently pinned pages + previously pinned pages since the last clear)
and clear all previously-pinned page records


Copying pinned pages multiple times is unavoidable because those pinned pages
are always treated as dirty. That's per vendor's implementation.
But if the pinned pages are not reported as dirty before stop-and-copy phase,
QEMU would think dirty pages have converged
and enter blackout phase, making downtime_limit severely incorrect.

Thanks
Yan
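
As a rough sketch of the GET_BITMAP step in the flow above, a userspace query
of the container dirty bitmap could look like the following.  This assumes the
vfio_iommu_type1_dirty_bitmap layout used by the handler quoted in this thread
and the VFIO_IOMMU_DIRTY_PAGES uAPI proposed by this series; the helper name
and its callers are hypothetical.

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>         /* with the uAPI proposed in this series */

/* Ask the container which pages were dirtied since the previous query. */
static int query_dirty_bitmap(int container_fd, uint64_t iova, uint64_t size,
                              uint64_t pgsize, void *bitmap,
                              uint64_t bitmap_size)
{
        struct vfio_iommu_type1_dirty_bitmap range;

        memset(&range, 0, sizeof(range));
        range.argsz = sizeof(range);
        range.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
        range.iova = iova;                      /* start of the range */
        range.size = size;                      /* length in bytes */
        range.pgsize = pgsize;                  /* one bit per pgsize page */
        range.bitmap_size = bitmap_size;        /* bytes in the user buffer */
        range.bitmap = (uintptr_t)bitmap;       /* user-allocated, zeroed */

        return ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, &range);
}

Each of the save_setup/save_pending/save_complete_precopy hooks above would
call such a helper and transfer the pages whose bits are set.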

> >>>> +			mutex_lock(&iommu->lock);
> >>>> +			vfio_iova_dirty_bitmap(iommu, range.iova, range.size,
> >>>> +					     range.pgsize, range.iova, bitmap);
> >>>> +			mutex_unlock(&iommu->lock);
> >>>> +
> >>>> +			ret = copy_to_user((void __user *)range.bitmap, bitmap,
> >>>> +					   range.bitmap_size) ? -EFAULT : 0;
> >>>> +bitmap_exit:
> >>>> +			kfree(bitmap);
> >>>> +			return ret;
> >>>> +		}
> >>>>    	}
> >>>>    
> >>>>    	return -ENOTTY;
> >>>> -- 
> >>>> 2.7.0
> >>>>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 4/5] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2019-12-18  1:04           ` Yan Zhao
@ 2019-12-18 20:05             ` Dr. David Alan Gilbert
  2019-12-19  0:57               ` Yan Zhao
  0 siblings, 1 reply; 44+ messages in thread
From: Dr. David Alan Gilbert @ 2019-12-18 20:05 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Kirti Wankhede, alex.williamson, cjia, Tian, Kevin, Yang, Ziye,
	Liu, Changpeng, Liu, Yi L, mlevitsk, eskultet, cohuck,
	jonathan.davies, eauger, aik, pasic, felipe,
	Zhengxiao.zx@Alibaba-inc.com, shuangtai.tst, Ken.Xue, Wang,
	Zhi A, qemu-devel, kvm

* Yan Zhao (yan.y.zhao@intel.com) wrote:
> On Tue, Dec 17, 2019 at 07:47:05PM +0800, Kirti Wankhede wrote:
> > 
> > 
> > On 12/17/2019 3:21 PM, Yan Zhao wrote:
> > > On Tue, Dec 17, 2019 at 05:24:14PM +0800, Kirti Wankhede wrote:
> > >>
> > >>
> > >> On 12/17/2019 10:45 AM, Yan Zhao wrote:
> > >>> On Tue, Dec 17, 2019 at 04:21:39AM +0800, Kirti Wankhede wrote:
> > >>>> VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
> > >>>> - Start unpinned pages dirty pages tracking while migration is active and
> > >>>>     device is running, i.e. during pre-copy phase.
> > >>>> - Stop unpinned pages dirty pages tracking. This is required to stop
> > >>>>     unpinned dirty pages tracking if migration failed or cancelled during
> > >>>>     pre-copy phase. Unpinned pages tracking is clear.
> > >>>> - Get dirty pages bitmap. Stop unpinned dirty pages tracking and clear
> > >>>>     unpinned pages information on bitmap read. This ioctl returns bitmap of
> > >>>>     dirty pages, its user space application responsibility to copy content
> > >>>>     of dirty pages from source to destination during migration.
> > >>>>
> > >>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > >>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> > >>>> ---
> > >>>>    drivers/vfio/vfio_iommu_type1.c | 210 ++++++++++++++++++++++++++++++++++++++--
> > >>>>    1 file changed, 203 insertions(+), 7 deletions(-)
> > >>>>
> > >>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > >>>> index 3f6b04f2334f..264449654d3f 100644
> > >>>> --- a/drivers/vfio/vfio_iommu_type1.c
> > >>>> +++ b/drivers/vfio/vfio_iommu_type1.c
> > >>>> @@ -70,6 +70,7 @@ struct vfio_iommu {
> > >>>>    	unsigned int		dma_avail;
> > >>>>    	bool			v2;
> > >>>>    	bool			nesting;
> > >>>> +	bool			dirty_page_tracking;
> > >>>>    };
> > >>>>    
> > >>>>    struct vfio_domain {
> > >>>> @@ -112,6 +113,7 @@ struct vfio_pfn {
> > >>>>    	dma_addr_t		iova;		/* Device address */
> > >>>>    	unsigned long		pfn;		/* Host pfn */
> > >>>>    	atomic_t		ref_count;
> > >>>> +	bool			unpinned;
> > >>>>    };
> > >>>>    
> > >>>>    struct vfio_regions {
> > >>>> @@ -244,6 +246,32 @@ static void vfio_remove_from_pfn_list(struct vfio_dma *dma,
> > >>>>    	kfree(vpfn);
> > >>>>    }
> > >>>>    
> > >>>> +static void vfio_remove_unpinned_from_pfn_list(struct vfio_dma *dma, bool warn)
> > >>>> +{
> > >>>> +	struct rb_node *n = rb_first(&dma->pfn_list);
> > >>>> +
> > >>>> +	for (; n; n = rb_next(n)) {
> > >>>> +		struct vfio_pfn *vpfn = rb_entry(n, struct vfio_pfn, node);
> > >>>> +
> > >>>> +		if (warn)
> > >>>> +			WARN_ON_ONCE(vpfn->unpinned);
> > >>>> +
> > >>>> +		if (vpfn->unpinned)
> > >>>> +			vfio_remove_from_pfn_list(dma, vpfn);
> > >>>> +	}
> > >>>> +}
> > >>>> +
> > >>>> +static void vfio_remove_unpinned_from_dma_list(struct vfio_iommu *iommu)
> > >>>> +{
> > >>>> +	struct rb_node *n = rb_first(&iommu->dma_list);
> > >>>> +
> > >>>> +	for (; n; n = rb_next(n)) {
> > >>>> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> > >>>> +
> > >>>> +		vfio_remove_unpinned_from_pfn_list(dma, false);
> > >>>> +	}
> > >>>> +}
> > >>>> +
> > >>>>    static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
> > >>>>    					       unsigned long iova)
> > >>>>    {
> > >>>> @@ -254,13 +282,17 @@ static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
> > >>>>    	return vpfn;
> > >>>>    }
> > >>>>    
> > >>>> -static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
> > >>>> +static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn,
> > >>>> +				  bool dirty_tracking)
> > >>>>    {
> > >>>>    	int ret = 0;
> > >>>>    
> > >>>>    	if (atomic_dec_and_test(&vpfn->ref_count)) {
> > >>>>    		ret = put_pfn(vpfn->pfn, dma->prot);
> > >>> if physical page here is put, it may cause problem when pin this iova
> > >>> next time:
> > >>> vfio_iommu_type1_pin_pages {
> > >>>       ...
> > >>>       vpfn = vfio_iova_get_vfio_pfn(dma, iova);
> > >>>       if (vpfn) {
> > >>>           phys_pfn[i] = vpfn->pfn;
> > >>>           continue;
> > >>>       }
> > >>>       ...
> > >>> }
> > >>>
> > >>
> > >> Good point. Fixing it as:
> > >>
> > >>                   vpfn = vfio_iova_get_vfio_pfn(dma, iova);
> > >>                   if (vpfn) {
> > >> -                       phys_pfn[i] = vpfn->pfn;
> > >> -                       continue;
> > >> +                       if (vpfn->unpinned)
> > >> +                               vfio_remove_from_pfn_list(dma, vpfn);
> > > what about updating vpfn instead?
> > > 
> > 
> > vfio_pin_page_external() takes care of verification checks and mem lock 
> > accounting. I prefer to free existing and add new node with existing 
> > functions.
> > 
> > >> +                       else {
> > >> +                               phys_pfn[i] = vpfn->pfn;
> > >> +                               continue;
> > >> +                       }
> > >>                   }
> > >>
> > >>
> > >>
> > >>>> -		vfio_remove_from_pfn_list(dma, vpfn);
> > >>>> +		if (dirty_tracking)
> > >>>> +			vpfn->unpinned = true;
> > >>>> +		else
> > >>>> +			vfio_remove_from_pfn_list(dma, vpfn);
> > >>> so the unpinned pages before dirty page tracking is not treated as
> > >>> dirty?
> > >>>
> > >>
> >> Yes. That's what we agreed on in the previous version:
> > >> https://www.mail-archive.com/qemu-devel@nongnu.org/msg663157.html
> > >>
> > >>>>    	}
> > >>>>    	return ret;
> > >>>>    }
> > >>>> @@ -504,7 +536,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
> > >>>>    }
> > >>>>    
> > >>>>    static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
> > >>>> -				    bool do_accounting)
> > >>>> +				    bool do_accounting, bool dirty_tracking)
> > >>>>    {
> > >>>>    	int unlocked;
> > >>>>    	struct vfio_pfn *vpfn = vfio_find_vpfn(dma, iova);
> > >>>> @@ -512,7 +544,10 @@ static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
> > >>>>    	if (!vpfn)
> > >>>>    		return 0;
> > >>>>    
> > >>>> -	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn);
> > >>>> +	if (vpfn->unpinned)
> > >>>> +		return 0;
> > >>>> +
> > >>>> +	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn, dirty_tracking);
> > >>>>    
> > >>>>    	if (do_accounting)
> > >>>>    		vfio_lock_acct(dma, -unlocked, true);
> > >>>> @@ -583,7 +618,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> > >>>>    
> > >>>>    		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> > >>>>    		if (ret) {
> > >>>> -			vfio_unpin_page_external(dma, iova, do_accounting);
> > >>>> +			vfio_unpin_page_external(dma, iova, do_accounting,
> > >>>> +						 false);
> > >>>>    			goto pin_unwind;
> > >>>>    		}
> > >>>>    	}
> > >>>> @@ -598,7 +634,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> > >>>>    
> > >>>>    		iova = user_pfn[j] << PAGE_SHIFT;
> > >>>>    		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
> > >>>> -		vfio_unpin_page_external(dma, iova, do_accounting);
> > >>>> +		vfio_unpin_page_external(dma, iova, do_accounting, false);
> > >>>>    		phys_pfn[j] = 0;
> > >>>>    	}
> > >>>>    pin_done:
> > >>>> @@ -632,7 +668,8 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
> > >>>>    		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
> > >>>>    		if (!dma)
> > >>>>    			goto unpin_exit;
> > >>>> -		vfio_unpin_page_external(dma, iova, do_accounting);
> > >>>> +		vfio_unpin_page_external(dma, iova, do_accounting,
> > >>>> +					 iommu->dirty_page_tracking);
> > >>>>    	}
> > >>>>    
> > >>>>    unpin_exit:
> > >>>> @@ -850,6 +887,88 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
> > >>>>    	return bitmap;
> > >>>>    }
> > >>>>    
> > >>>> +/*
> > >>>> + * start_iova is the reference from where bitmaping started. This is called
> > >>>> + * from DMA_UNMAP where start_iova can be different than iova
> > >>>> + */
> > >>>> +
> > >>>> +static void vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> > >>>> +				  size_t size, uint64_t pgsize,
> > >>>> +				  dma_addr_t start_iova, unsigned long *bitmap)
> > >>>> +{
> > >>>> +	struct vfio_dma *dma;
> > >>>> +	dma_addr_t i = iova;
> > >>>> +	unsigned long pgshift = __ffs(pgsize);
> > >>>> +
> > >>>> +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
> > >>>> +		/* mark all pages dirty if all pages are pinned and mapped. */
> > >>>> +		if (dma->iommu_mapped) {
> > >>> This prevents pass-through devices from calling vfio_pin_pages to do
> > >>> fine grained log dirty.
> > >>
> > >> Yes, I mentioned that in yet TODO item in cover letter:
> > >>
> > >> "If IOMMU capable device is present in the container, then all pages are
> > >> marked dirty. Need to think smart way to know if IOMMU capable device's
> > >> driver is smart to report pages to be marked dirty by pinning those
> > >> pages externally."
> > >>
> > > why not just check first if any vpfn present for IOMMU capable devices?
> > > 
> > 
> > vfio_pin_pages(dev, ...) calls driver->ops->pin_pages(iommu, ...)
> > 
> > In vfio_iommu_type1 module, vfio_iommu_type1_pin_pages() doesn't know 
> > the device. vpfn are tracked against container->iommu, not against 
> > device. Need to think of a smart way to know if the devices in the container
> > are all smart enough to report pages dirty by pinning those pages manually.
> >
> I believe in such case, the mdev on top of device is in the same iommu
> group (i.e. 1:1 mdev on top of device).
> device vendor driver calls vfio_pin_pages to notify vfio which pages are dirty. 
> > 
> > >>
> > >>>> +			dma_addr_t iova_limit;
> > >>>> +
> > >>>> +			iova_limit = (dma->iova + dma->size) < (iova + size) ?
> > >>>> +				     (dma->iova + dma->size) : (iova + size);
> > >>>> +
> > >>>> +			for (; i < iova_limit; i += pgsize) {
> > >>>> +				unsigned int start;
> > >>>> +
> > >>>> +				start = (i - start_iova) >> pgshift;
> > >>>> +
> > >>>> +				__bitmap_set(bitmap, start, 1);
> > >>>> +			}
> > >>>> +			if (i >= iova + size)
> > >>>> +				return;
> > >>>> +		} else {
> > >>>> +			struct rb_node *n = rb_first(&dma->pfn_list);
> > >>>> +			bool found = false;
> > >>>> +
> > >>>> +			for (; n; n = rb_next(n)) {
> > >>>> +				struct vfio_pfn *vpfn = rb_entry(n,
> > >>>> +							struct vfio_pfn, node);
> > >>>> +				if (vpfn->iova >= i) {
> > >>>> +					found = true;
> > >>>> +					break;
> > >>>> +				}
> > >>>> +			}
> > >>>> +
> > >>>> +			if (!found) {
> > >>>> +				i += dma->size;
> > >>>> +				continue;
> > >>>> +			}
> > >>>> +
> > >>>> +			for (; n; n = rb_next(n)) {
> > >>>> +				unsigned int start;
> > >>>> +				struct vfio_pfn *vpfn = rb_entry(n,
> > >>>> +							struct vfio_pfn, node);
> > >>>> +
> > >>>> +				if (vpfn->iova >= iova + size)
> > >>>> +					return;
> > >>>> +
> > >>>> +				start = (vpfn->iova - start_iova) >> pgshift;
> > >>>> +
> > >>>> +				__bitmap_set(bitmap, start, 1);
> > >>>> +
> > >>>> +				i = vpfn->iova + pgsize;
> > >>>> +			}
> > >>>> +		}
> > >>>> +		vfio_remove_unpinned_from_pfn_list(dma, false);
> > >>>> +	}
> > >>>> +}
> > >>>> +
> > >>>> +static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
> > >>>> +{
> > >>>> +	long bsize;
> > >>>> +
> > >>>> +	if (!bitmap_size || bitmap_size > SIZE_MAX)
> > >>>> +		return -EINVAL;
> > >>>> +
> > >>>> +	bsize = ALIGN(npages, BITS_PER_LONG) / sizeof(unsigned long);
> > >>>> +
> > >>>> +	if (bitmap_size < bsize)
> > >>>> +		return -EINVAL;
> > >>>> +
> > >>>> +	return bsize;
> > >>>> +}
> > >>>> +
> > >>>>    static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> > >>>>    			     struct vfio_iommu_type1_dma_unmap *unmap)
> > >>>>    {
> > >>>> @@ -2298,6 +2417,83 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> > >>>>    
> > >>>>    		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> > >>>>    			-EFAULT : 0;
> > >>>> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
> > >>>> +		struct vfio_iommu_type1_dirty_bitmap range;
> > >>>> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
> > >>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
> > >>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> > >>>> +		int ret;
> > >>>> +
> > >>>> +		if (!iommu->v2)
> > >>>> +			return -EACCES;
> > >>>> +
> > >>>> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
> > >>>> +				    bitmap);
> > >>>> +
> > >>>> +		if (copy_from_user(&range, (void __user *)arg, minsz))
> > >>>> +			return -EFAULT;
> > >>>> +
> > >>>> +		if (range.argsz < minsz || range.flags & ~mask)
> > >>>> +			return -EINVAL;
> > >>>> +
> > >>>> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
> > >>>> +			iommu->dirty_page_tracking = true;
> > >>>> +			return 0;
> > >>>> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
> > >>>> +			iommu->dirty_page_tracking = false;
> > >>>> +
> > >>>> +			mutex_lock(&iommu->lock);
> > >>>> +			vfio_remove_unpinned_from_dma_list(iommu);
> > >>>> +			mutex_unlock(&iommu->lock);
> > >>>> +			return 0;
> > >>>> +
> > >>>> +		} else if (range.flags &
> > >>>> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
> > >>>> +			uint64_t iommu_pgmask;
> > >>>> +			unsigned long pgshift = __ffs(range.pgsize);
> > >>>> +			unsigned long *bitmap;
> > >>>> +			long bsize;
> > >>>> +
> > >>>> +			iommu_pgmask =
> > >>>> +			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> > >>>> +
> > >>>> +			if (((range.pgsize - 1) & iommu_pgmask) !=
> > >>>> +			    (range.pgsize - 1))
> > >>>> +				return -EINVAL;
> > >>>> +
> > >>>> +			if (range.iova & iommu_pgmask)
> > >>>> +				return -EINVAL;
> > >>>> +			if (!range.size || range.size > SIZE_MAX)
> > >>>> +				return -EINVAL;
> > >>>> +			if (range.iova + range.size < range.iova)
> > >>>> +				return -EINVAL;
> > >>>> +
> > >>>> +			bsize = verify_bitmap_size(range.size >> pgshift,
> > >>>> +						   range.bitmap_size);
> > >>>> +			if (bsize)
> > >>>> +				return ret;
> > >>>> +
> > >>>> +			bitmap = kmalloc(bsize, GFP_KERNEL);
> > >>>> +			if (!bitmap)
> > >>>> +				return -ENOMEM;
> > >>>> +
> > >>>> +			ret = copy_from_user(bitmap,
> > >>>> +			     (void __user *)range.bitmap, bsize) ? -EFAULT : 0;
> > >>>> +			if (ret)
> > >>>> +				goto bitmap_exit;
> > >>>> +
> > >>>> +			iommu->dirty_page_tracking = false;
> > >>> why iommu->dirty_page_tracking is false here?
> > >>> suppose this ioctl can be called several times.
> > >>>
> > >>
> > >> This ioctl can be called several times, but once this ioctl is called
> > >> that means vCPUs are stopped and VFIO devices are stopped (i.e. in
> > >> stop-and-copy phase) and dirty pages bitmap are being queried by user.
> > >>
> > > can't agree that VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP can only be
> > > called in stop-and-copy phase.
> > > As stated in last version, this will cause QEMU to get a wrong expectation
> > > of VM downtime and this is also the reason for previously pinned pages
> > > before log_sync cannot be treated as dirty. If this get bitmap ioctl can
> > > be called early in save_setup phase, then it's no problem even all ram
> > > is dirty.
> > > 
> > 
> > The device can also write to pages which are pinned, and then there is no
> > way to know which pages were dirtied by the device during the pre-copy phase.
> > If the user asks for the dirty bitmap in the pre-copy phase, the user will
> > still have to query the dirty bitmap in the stop-and-copy phase, which will
> > be a superset including all pages reported during pre-copy. Then instead of
> > copying all pages twice, it's better to do it once during the stop-and-copy
> > phase.
> >
> I think the flow should be like this:
> 1. save_setup --> GET_BITMAP ioctl --> return bitmap for currently + previously
> pinned pages and clean all previously pinned pages
> 
> 2. save_pending --> GET_BITMAP ioctl  --> return bitmap of (currently
> pinned pages + previously pinned pages since last clean) and clean all
> previously pinned pages
> 
> 3. save_complete_precopy --> GET_BITMAP ioctl --> return bitmap of (currently
> pinned pages + previously pinned pages since last clean) and clean all
> previously pinned pages
> 
> 
> Copying pinned pages multiple times is unavoidable because those pinned pages
> are always treated as dirty. That's per vendor's implementation.
> But if the pinned pages are not reported as dirty before stop-and-copy phase,
> QEMU would think dirty pages have converged
> and enter blackout phase, making downtime_limit severely incorrect.

I'm not sure it's any worse.
I *think* we do a last sync after we've decided to go to stop-and-copy;
won't that then mark all those pages as dirty again, so it'll have the
same behaviour?
Anyway, it seems wrong to repeatedly send pages that you know are
pointless - but that probably means we need a way to mark those somehow
to avoid it.

Dave

> Thanks
> Yan
> 
> > >>>> +			mutex_lock(&iommu->lock);
> > >>>> +			vfio_iova_dirty_bitmap(iommu, range.iova, range.size,
> > >>>> +					     range.pgsize, range.iova, bitmap);
> > >>>> +			mutex_unlock(&iommu->lock);
> > >>>> +
> > >>>> +			ret = copy_to_user((void __user *)range.bitmap, bitmap,
> > >>>> +					   range.bitmap_size) ? -EFAULT : 0;
> > >>>> +bitmap_exit:
> > >>>> +			kfree(bitmap);
> > >>>> +			return ret;
> > >>>> +		}
> > >>>>    	}
> > >>>>    
> > >>>>    	return -ENOTTY;
> > >>>> -- 
> > >>>> 2.7.0
> > >>>>
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 4/5] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2019-12-17  9:24     ` Kirti Wankhede
  2019-12-17  9:51       ` Yan Zhao
@ 2019-12-18 21:39       ` Alex Williamson
  2019-12-19 18:42         ` Kirti Wankhede
  1 sibling, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2019-12-18 21:39 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Yan Zhao, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng, Liu,
	Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

On Tue, 17 Dec 2019 14:54:14 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 12/17/2019 10:45 AM, Yan Zhao wrote:
> > On Tue, Dec 17, 2019 at 04:21:39AM +0800, Kirti Wankhede wrote:  
> >> +		} else if (range.flags &
> >> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
> >> +			uint64_t iommu_pgmask;
> >> +			unsigned long pgshift = __ffs(range.pgsize);
> >> +			unsigned long *bitmap;
> >> +			long bsize;
> >> +
> >> +			iommu_pgmask =
> >> +			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> >> +
> >> +			if (((range.pgsize - 1) & iommu_pgmask) !=
> >> +			    (range.pgsize - 1))
> >> +				return -EINVAL;
> >> +
> >> +			if (range.iova & iommu_pgmask)
> >> +				return -EINVAL;
> >> +			if (!range.size || range.size > SIZE_MAX)
> >> +				return -EINVAL;
> >> +			if (range.iova + range.size < range.iova)
> >> +				return -EINVAL;
> >> +
> >> +			bsize = verify_bitmap_size(range.size >> pgshift,
> >> +						   range.bitmap_size);
> >> +			if (bsize)
> >> +				return ret;
> >> +
> >> +			bitmap = kmalloc(bsize, GFP_KERNEL);
> >> +			if (!bitmap)
> >> +				return -ENOMEM;
> >> +
> >> +			ret = copy_from_user(bitmap,
> >> +			     (void __user *)range.bitmap, bsize) ? -EFAULT : 0;
> >> +			if (ret)
> >> +				goto bitmap_exit;
> >> +
> >> +			iommu->dirty_page_tracking = false;  
> > why iommu->dirty_page_tracking is false here?
> > suppose this ioctl can be called several times.
> >   
> 
> This ioctl can be called several times, but once this ioctl is called 
> that means vCPUs are stopped and VFIO devices are stopped (i.e. in 
> stop-and-copy phase) and dirty pages bitmap are being queried by user.

Do not assume how userspace works or its intent.  If dirty tracking is
on, it should remain on until the user turns it off.  We cannot assume
userspace uses a one-shot approach.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 4/5] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2019-12-18 20:05             ` Dr. David Alan Gilbert
@ 2019-12-19  0:57               ` Yan Zhao
  2019-12-19 16:21                 ` Kirti Wankhede
  0 siblings, 1 reply; 44+ messages in thread
From: Yan Zhao @ 2019-12-19  0:57 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Kirti Wankhede, alex.williamson, cjia, Tian, Kevin, Yang, Ziye,
	Liu, Changpeng, Liu, Yi L, mlevitsk, eskultet, cohuck,
	jonathan.davies, eauger, aik, pasic, felipe,
	Zhengxiao.zx@Alibaba-inc.com, shuangtai.tst, Ken.Xue, Wang,
	Zhi A, qemu-devel, kvm

On Thu, Dec 19, 2019 at 04:05:52AM +0800, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > On Tue, Dec 17, 2019 at 07:47:05PM +0800, Kirti Wankhede wrote:
> > > 
> > > 
> > > On 12/17/2019 3:21 PM, Yan Zhao wrote:
> > > > On Tue, Dec 17, 2019 at 05:24:14PM +0800, Kirti Wankhede wrote:
> > > >>
> > > >>
> > > >> On 12/17/2019 10:45 AM, Yan Zhao wrote:
> > > >>> On Tue, Dec 17, 2019 at 04:21:39AM +0800, Kirti Wankhede wrote:
> > > >>>> VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
> > > >>>> - Start unpinned pages dirty pages tracking while migration is active and
> > > >>>>     device is running, i.e. during pre-copy phase.
> > > >>>> - Stop unpinned pages dirty pages tracking. This is required to stop
> > > >>>>     unpinned dirty pages tracking if migration failed or cancelled during
> > > >>>>     pre-copy phase. Unpinned pages tracking is clear.
> > > >>>> - Get dirty pages bitmap. Stop unpinned dirty pages tracking and clear
> > > >>>>     unpinned pages information on bitmap read. This ioctl returns bitmap of
> > > >>>>     dirty pages, its user space application responsibility to copy content
> > > >>>>     of dirty pages from source to destination during migration.
> > > >>>>
> > > >>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > >>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> > > >>>> ---
> > > >>>>    drivers/vfio/vfio_iommu_type1.c | 210 ++++++++++++++++++++++++++++++++++++++--
> > > >>>>    1 file changed, 203 insertions(+), 7 deletions(-)
> > > >>>>
> > > >>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > >>>> index 3f6b04f2334f..264449654d3f 100644
> > > >>>> --- a/drivers/vfio/vfio_iommu_type1.c
> > > >>>> +++ b/drivers/vfio/vfio_iommu_type1.c
> > > >>>> @@ -70,6 +70,7 @@ struct vfio_iommu {
> > > >>>>    	unsigned int		dma_avail;
> > > >>>>    	bool			v2;
> > > >>>>    	bool			nesting;
> > > >>>> +	bool			dirty_page_tracking;
> > > >>>>    };
> > > >>>>    
> > > >>>>    struct vfio_domain {
> > > >>>> @@ -112,6 +113,7 @@ struct vfio_pfn {
> > > >>>>    	dma_addr_t		iova;		/* Device address */
> > > >>>>    	unsigned long		pfn;		/* Host pfn */
> > > >>>>    	atomic_t		ref_count;
> > > >>>> +	bool			unpinned;
> > > >>>>    };
> > > >>>>    
> > > >>>>    struct vfio_regions {
> > > >>>> @@ -244,6 +246,32 @@ static void vfio_remove_from_pfn_list(struct vfio_dma *dma,
> > > >>>>    	kfree(vpfn);
> > > >>>>    }
> > > >>>>    
> > > >>>> +static void vfio_remove_unpinned_from_pfn_list(struct vfio_dma *dma, bool warn)
> > > >>>> +{
> > > >>>> +	struct rb_node *n = rb_first(&dma->pfn_list);
> > > >>>> +
> > > >>>> +	for (; n; n = rb_next(n)) {
> > > >>>> +		struct vfio_pfn *vpfn = rb_entry(n, struct vfio_pfn, node);
> > > >>>> +
> > > >>>> +		if (warn)
> > > >>>> +			WARN_ON_ONCE(vpfn->unpinned);
> > > >>>> +
> > > >>>> +		if (vpfn->unpinned)
> > > >>>> +			vfio_remove_from_pfn_list(dma, vpfn);
> > > >>>> +	}
> > > >>>> +}
> > > >>>> +
> > > >>>> +static void vfio_remove_unpinned_from_dma_list(struct vfio_iommu *iommu)
> > > >>>> +{
> > > >>>> +	struct rb_node *n = rb_first(&iommu->dma_list);
> > > >>>> +
> > > >>>> +	for (; n; n = rb_next(n)) {
> > > >>>> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> > > >>>> +
> > > >>>> +		vfio_remove_unpinned_from_pfn_list(dma, false);
> > > >>>> +	}
> > > >>>> +}
> > > >>>> +
> > > >>>>    static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
> > > >>>>    					       unsigned long iova)
> > > >>>>    {
> > > >>>> @@ -254,13 +282,17 @@ static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
> > > >>>>    	return vpfn;
> > > >>>>    }
> > > >>>>    
> > > >>>> -static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
> > > >>>> +static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn,
> > > >>>> +				  bool dirty_tracking)
> > > >>>>    {
> > > >>>>    	int ret = 0;
> > > >>>>    
> > > >>>>    	if (atomic_dec_and_test(&vpfn->ref_count)) {
> > > >>>>    		ret = put_pfn(vpfn->pfn, dma->prot);
> > > >>> if physical page here is put, it may cause problem when pin this iova
> > > >>> next time:
> > > >>> vfio_iommu_type1_pin_pages {
> > > >>>       ...
> > > >>>       vpfn = vfio_iova_get_vfio_pfn(dma, iova);
> > > >>>       if (vpfn) {
> > > >>>           phys_pfn[i] = vpfn->pfn;
> > > >>>           continue;
> > > >>>       }
> > > >>>       ...
> > > >>> }
> > > >>>
> > > >>
> > > >> Good point. Fixing it as:
> > > >>
> > > >>                   vpfn = vfio_iova_get_vfio_pfn(dma, iova);
> > > >>                   if (vpfn) {
> > > >> -                       phys_pfn[i] = vpfn->pfn;
> > > >> -                       continue;
> > > >> +                       if (vpfn->unpinned)
> > > >> +                               vfio_remove_from_pfn_list(dma, vpfn);
> > > > what about updating vpfn instead?
> > > > 
> > > 
> > > vfio_pin_page_external() takes care of verification checks and mem lock 
> > > accounting. I prefer to free existing and add new node with existing 
> > > functions.
> > > 
> > > >> +                       else {
> > > >> +                               phys_pfn[i] = vpfn->pfn;
> > > >> +                               continue;
> > > >> +                       }
> > > >>                   }
> > > >>
> > > >>
> > > >>
> > > >>>> -		vfio_remove_from_pfn_list(dma, vpfn);
> > > >>>> +		if (dirty_tracking)
> > > >>>> +			vpfn->unpinned = true;
> > > >>>> +		else
> > > >>>> +			vfio_remove_from_pfn_list(dma, vpfn);
> > > >>> so the unpinned pages before dirty page tracking is not treated as
> > > >>> dirty?
> > > >>>
> > > >>
> > >> Yes. That's what we agreed on in the previous version:
> > > >> https://www.mail-archive.com/qemu-devel@nongnu.org/msg663157.html
> > > >>
> > > >>>>    	}
> > > >>>>    	return ret;
> > > >>>>    }
> > > >>>> @@ -504,7 +536,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
> > > >>>>    }
> > > >>>>    
> > > >>>>    static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
> > > >>>> -				    bool do_accounting)
> > > >>>> +				    bool do_accounting, bool dirty_tracking)
> > > >>>>    {
> > > >>>>    	int unlocked;
> > > >>>>    	struct vfio_pfn *vpfn = vfio_find_vpfn(dma, iova);
> > > >>>> @@ -512,7 +544,10 @@ static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
> > > >>>>    	if (!vpfn)
> > > >>>>    		return 0;
> > > >>>>    
> > > >>>> -	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn);
> > > >>>> +	if (vpfn->unpinned)
> > > >>>> +		return 0;
> > > >>>> +
> > > >>>> +	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn, dirty_tracking);
> > > >>>>    
> > > >>>>    	if (do_accounting)
> > > >>>>    		vfio_lock_acct(dma, -unlocked, true);
> > > >>>> @@ -583,7 +618,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> > > >>>>    
> > > >>>>    		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> > > >>>>    		if (ret) {
> > > >>>> -			vfio_unpin_page_external(dma, iova, do_accounting);
> > > >>>> +			vfio_unpin_page_external(dma, iova, do_accounting,
> > > >>>> +						 false);
> > > >>>>    			goto pin_unwind;
> > > >>>>    		}
> > > >>>>    	}
> > > >>>> @@ -598,7 +634,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> > > >>>>    
> > > >>>>    		iova = user_pfn[j] << PAGE_SHIFT;
> > > >>>>    		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
> > > >>>> -		vfio_unpin_page_external(dma, iova, do_accounting);
> > > >>>> +		vfio_unpin_page_external(dma, iova, do_accounting, false);
> > > >>>>    		phys_pfn[j] = 0;
> > > >>>>    	}
> > > >>>>    pin_done:
> > > >>>> @@ -632,7 +668,8 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
> > > >>>>    		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
> > > >>>>    		if (!dma)
> > > >>>>    			goto unpin_exit;
> > > >>>> -		vfio_unpin_page_external(dma, iova, do_accounting);
> > > >>>> +		vfio_unpin_page_external(dma, iova, do_accounting,
> > > >>>> +					 iommu->dirty_page_tracking);
> > > >>>>    	}
> > > >>>>    
> > > >>>>    unpin_exit:
> > > >>>> @@ -850,6 +887,88 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
> > > >>>>    	return bitmap;
> > > >>>>    }
> > > >>>>    
> > > >>>> +/*
> > > >>>> + * start_iova is the reference from where bitmaping started. This is called
> > > >>>> + * from DMA_UNMAP where start_iova can be different than iova
> > > >>>> + */
> > > >>>> +
> > > >>>> +static void vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> > > >>>> +				  size_t size, uint64_t pgsize,
> > > >>>> +				  dma_addr_t start_iova, unsigned long *bitmap)
> > > >>>> +{
> > > >>>> +	struct vfio_dma *dma;
> > > >>>> +	dma_addr_t i = iova;
> > > >>>> +	unsigned long pgshift = __ffs(pgsize);
> > > >>>> +
> > > >>>> +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
> > > >>>> +		/* mark all pages dirty if all pages are pinned and mapped. */
> > > >>>> +		if (dma->iommu_mapped) {
> > > >>> This prevents pass-through devices from calling vfio_pin_pages to do
> > > >>> fine grained log dirty.
> > > >>
> > > >> Yes, I mentioned that in yet TODO item in cover letter:
> > > >>
> > > >> "If IOMMU capable device is present in the container, then all pages are
> > > >> marked dirty. Need to think smart way to know if IOMMU capable device's
> > > >> driver is smart to report pages to be marked dirty by pinning those
> > > >> pages externally."
> > > >>
> > > > why not just check first if any vpfn present for IOMMU capable devices?
> > > > 
> > > 
> > > vfio_pin_pages(dev, ...) calls driver->ops->pin_pages(iommu, ...)
> > > 
> > > In vfio_iommu_type1 module, vfio_iommu_type1_pin_pages() doesn't know 
> > > the device. vpfn are tracked against container->iommu, not against 
> > > device. Need to think of a smart way to know if the devices in the container
> > > are all smart enough to report pages dirty by pinning those pages manually.
> > >
> > I believe in such case, the mdev on top of device is in the same iommu
> > group (i.e. 1:1 mdev on top of device).
> > device vendor driver calls vfio_pin_pages to notify vfio which pages are dirty. 
> > > 
> > > >>
> > > >>>> +			dma_addr_t iova_limit;
> > > >>>> +
> > > >>>> +			iova_limit = (dma->iova + dma->size) < (iova + size) ?
> > > >>>> +				     (dma->iova + dma->size) : (iova + size);
> > > >>>> +
> > > >>>> +			for (; i < iova_limit; i += pgsize) {
> > > >>>> +				unsigned int start;
> > > >>>> +
> > > >>>> +				start = (i - start_iova) >> pgshift;
> > > >>>> +
> > > >>>> +				__bitmap_set(bitmap, start, 1);
> > > >>>> +			}
> > > >>>> +			if (i >= iova + size)
> > > >>>> +				return;
> > > >>>> +		} else {
> > > >>>> +			struct rb_node *n = rb_first(&dma->pfn_list);
> > > >>>> +			bool found = false;
> > > >>>> +
> > > >>>> +			for (; n; n = rb_next(n)) {
> > > >>>> +				struct vfio_pfn *vpfn = rb_entry(n,
> > > >>>> +							struct vfio_pfn, node);
> > > >>>> +				if (vpfn->iova >= i) {
> > > >>>> +					found = true;
> > > >>>> +					break;
> > > >>>> +				}
> > > >>>> +			}
> > > >>>> +
> > > >>>> +			if (!found) {
> > > >>>> +				i += dma->size;
> > > >>>> +				continue;
> > > >>>> +			}
> > > >>>> +
> > > >>>> +			for (; n; n = rb_next(n)) {
> > > >>>> +				unsigned int start;
> > > >>>> +				struct vfio_pfn *vpfn = rb_entry(n,
> > > >>>> +							struct vfio_pfn, node);
> > > >>>> +
> > > >>>> +				if (vpfn->iova >= iova + size)
> > > >>>> +					return;
> > > >>>> +
> > > >>>> +				start = (vpfn->iova - start_iova) >> pgshift;
> > > >>>> +
> > > >>>> +				__bitmap_set(bitmap, start, 1);
> > > >>>> +
> > > >>>> +				i = vpfn->iova + pgsize;
> > > >>>> +			}
> > > >>>> +		}
> > > >>>> +		vfio_remove_unpinned_from_pfn_list(dma, false);
> > > >>>> +	}
> > > >>>> +}
> > > >>>> +
> > > >>>> +static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
> > > >>>> +{
> > > >>>> +	long bsize;
> > > >>>> +
> > > >>>> +	if (!bitmap_size || bitmap_size > SIZE_MAX)
> > > >>>> +		return -EINVAL;
> > > >>>> +
> > > >>>> +	bsize = ALIGN(npages, BITS_PER_LONG) / sizeof(unsigned long);
> > > >>>> +
> > > >>>> +	if (bitmap_size < bsize)
> > > >>>> +		return -EINVAL;
> > > >>>> +
> > > >>>> +	return bsize;
> > > >>>> +}
> > > >>>> +
> > > >>>>    static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> > > >>>>    			     struct vfio_iommu_type1_dma_unmap *unmap)
> > > >>>>    {
> > > >>>> @@ -2298,6 +2417,83 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> > > >>>>    
> > > >>>>    		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> > > >>>>    			-EFAULT : 0;
> > > >>>> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
> > > >>>> +		struct vfio_iommu_type1_dirty_bitmap range;
> > > >>>> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
> > > >>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
> > > >>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> > > >>>> +		int ret;
> > > >>>> +
> > > >>>> +		if (!iommu->v2)
> > > >>>> +			return -EACCES;
> > > >>>> +
> > > >>>> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
> > > >>>> +				    bitmap);
> > > >>>> +
> > > >>>> +		if (copy_from_user(&range, (void __user *)arg, minsz))
> > > >>>> +			return -EFAULT;
> > > >>>> +
> > > >>>> +		if (range.argsz < minsz || range.flags & ~mask)
> > > >>>> +			return -EINVAL;
> > > >>>> +
> > > >>>> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
> > > >>>> +			iommu->dirty_page_tracking = true;
> > > >>>> +			return 0;
> > > >>>> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
> > > >>>> +			iommu->dirty_page_tracking = false;
> > > >>>> +
> > > >>>> +			mutex_lock(&iommu->lock);
> > > >>>> +			vfio_remove_unpinned_from_dma_list(iommu);
> > > >>>> +			mutex_unlock(&iommu->lock);
> > > >>>> +			return 0;
> > > >>>> +
> > > >>>> +		} else if (range.flags &
> > > >>>> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
> > > >>>> +			uint64_t iommu_pgmask;
> > > >>>> +			unsigned long pgshift = __ffs(range.pgsize);
> > > >>>> +			unsigned long *bitmap;
> > > >>>> +			long bsize;
> > > >>>> +
> > > >>>> +			iommu_pgmask =
> > > >>>> +			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> > > >>>> +
> > > >>>> +			if (((range.pgsize - 1) & iommu_pgmask) !=
> > > >>>> +			    (range.pgsize - 1))
> > > >>>> +				return -EINVAL;
> > > >>>> +
> > > >>>> +			if (range.iova & iommu_pgmask)
> > > >>>> +				return -EINVAL;
> > > >>>> +			if (!range.size || range.size > SIZE_MAX)
> > > >>>> +				return -EINVAL;
> > > >>>> +			if (range.iova + range.size < range.iova)
> > > >>>> +				return -EINVAL;
> > > >>>> +
> > > >>>> +			bsize = verify_bitmap_size(range.size >> pgshift,
> > > >>>> +						   range.bitmap_size);
> > > >>>> +			if (bsize)
> > > >>>> +				return ret;
> > > >>>> +
> > > >>>> +			bitmap = kmalloc(bsize, GFP_KERNEL);
> > > >>>> +			if (!bitmap)
> > > >>>> +				return -ENOMEM;
> > > >>>> +
> > > >>>> +			ret = copy_from_user(bitmap,
> > > >>>> +			     (void __user *)range.bitmap, bsize) ? -EFAULT : 0;
> > > >>>> +			if (ret)
> > > >>>> +				goto bitmap_exit;
> > > >>>> +
> > > >>>> +			iommu->dirty_page_tracking = false;
> > > >>> why iommu->dirty_page_tracking is false here?
> > > >>> suppose this ioctl can be called several times.
> > > >>>
> > > >>
> > > >> This ioctl can be called several times, but once this ioctl is called
> > > >> that means vCPUs are stopped and VFIO devices are stopped (i.e. in
> > > >> stop-and-copy phase) and dirty pages bitmap are being queried by user.
> > > >>
> > > > can't agree that VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP can only be
> > > > called in stop-and-copy phase.
> > > > As stated in last version, this will cause QEMU to get a wrong expectation
> > > > of VM downtime and this is also the reason for previously pinned pages
> > > > before log_sync cannot be treated as dirty. If this get bitmap ioctl can
> > > > be called early in save_setup phase, then it's no problem even all ram
> > > > is dirty.
> > > > 
> > > 
> > > The device can also write to pages which are pinned, and then there is no
> > > way to know which pages were dirtied by the device during the pre-copy phase.
> > > If the user asks for the dirty bitmap in the pre-copy phase, the user will
> > > still have to query the dirty bitmap in the stop-and-copy phase, which will
> > > be a superset including all pages reported during pre-copy. Then instead of
> > > copying all pages twice, it's better to do it once during the stop-and-copy
> > > phase.
> > >
> > I think the flow should be like this:
> > 1. save_setup --> GET_BITMAP ioctl --> return bitmap for currently + previously
> > pinned pages and clean all previously pinned pages
> > 
> > 2. save_pending --> GET_BITMAP ioctl  --> return bitmap of (currently
> > pinned pages + previously pinned pages since last clean) and clean all
> > previously pinned pages
> > 
> > 3. save_complete_precopy --> GET_BITMAP ioctl --> return bitmap of (currently
> > pinned pages + previously pinned pages since last clean) and clean all
> > previously pinned pages
> > 
> > 
> > Copying pinned pages multiple times is unavoidable because those pinned pages
> > are always treated as dirty. That's per vendor's implementation.
> > But if the pinned pages are not reported as dirty before stop-and-copy phase,
> > QEMU would think dirty pages have converged
> > and enter blackout phase, making downtime_limit severely incorrect.
> 
> I'm not sure it's any worse.
> I *think* we do a last sync after we've decided to go to stop-and-copy;
> won't that then mark all those pages as dirty again, so it'll have the
> same behaviour?
No, something will be different.
Currently, in Kirti's implementation, if the GET_BITMAP ioctl is called only
once, in the stop-and-copy phase, then before that phase QEMU does not know
those pages are dirty.
If we can report those dirty pages earlier, before the stop-and-copy phase,
QEMU can at least copy other pages to reduce the dirty pages to below the
threshold.

Take an example: let's assume the vfio dirty pages amount to 1Gb and the
network speed is also 1Gb/s, so the expected vm downtime is 1s.
Suppose that, before the stop-and-copy phase, the dirty pages produced by
other sources also amount to 1Gb. To meet the expected vm downtime, QEMU
should keep copying pages until the dirty pages are less than 1Gb; otherwise,
it should not complete live migration.
If vfio does not report this 1Gb of dirty pages, QEMU would think there is
only 1Gb and stop the vm. It would then find out there is actually 2Gb, and
the vm downtime would be 2s.
Though the expected vm downtime is never exactly the same as the true vm
downtime, that difference should be caused by a rapid dirty page rate, which
is not predictable.
Right?

Thanks
Yan



> Anyway, it seems wrong to repeatedly send pages that you know are
> pointless - but that probably means we need a way to mark those somehow
> to avoid it.
> 
> Dave
> 
> > Thanks
> > Yan
> > 
> > > >>>> +			mutex_lock(&iommu->lock);
> > > >>>> +			vfio_iova_dirty_bitmap(iommu, range.iova, range.size,
> > > >>>> +					     range.pgsize, range.iova, bitmap);
> > > >>>> +			mutex_unlock(&iommu->lock);
> > > >>>> +
> > > >>>> +			ret = copy_to_user((void __user *)range.bitmap, bitmap,
> > > >>>> +					   range.bitmap_size) ? -EFAULT : 0;
> > > >>>> +bitmap_exit:
> > > >>>> +			kfree(bitmap);
> > > >>>> +			return ret;
> > > >>>> +		}
> > > >>>>    	}
> > > >>>>    
> > > >>>>    	return -ENOTTY;
> > > >>>> -- 
> > > >>>> 2.7.0
> > > >>>>
> > 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
  2019-12-17 18:43       ` Alex Williamson
@ 2019-12-19 16:08         ` Kirti Wankhede
  2019-12-19 17:27           ` Alex Williamson
  0 siblings, 1 reply; 44+ messages in thread
From: Kirti Wankhede @ 2019-12-19 16:08 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm



On 12/18/2019 12:13 AM, Alex Williamson wrote:
> On Tue, 17 Dec 2019 11:58:44 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 12/17/2019 4:14 AM, Alex Williamson wrote:
>>> On Tue, 17 Dec 2019 01:51:36 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>    
>>>> - Defined MIGRATION region type and sub-type.
>>>>
>>>> - Defined vfio_device_migration_info structure which will be placed at 0th
>>>>     offset of migration region to get/set VFIO device related information.
>>>>     Defined members of structure and usage on read/write access.
>>>>
>>>> - Defined device states and added state transition details in the comment.
>>>>
>>>> - Added sequence to be followed while saving and resuming VFIO device state
>>>>
>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>>> ---
>>>>    include/uapi/linux/vfio.h | 180 ++++++++++++++++++++++++++++++++++++++++++++++
>>>>    1 file changed, 180 insertions(+)
>>>>
>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>> index 9e843a147ead..a0817ba267c1 100644
>>>> --- a/include/uapi/linux/vfio.h
>>>> +++ b/include/uapi/linux/vfio.h
>>>> @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
>>>>    #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
>>>>    #define VFIO_REGION_TYPE_GFX                    (1)
>>>>    #define VFIO_REGION_TYPE_CCW			(2)
>>>> +#define VFIO_REGION_TYPE_MIGRATION              (3)
>>>>    
>>>>    /* sub-types for VFIO_REGION_TYPE_PCI_* */
>>>>    
>>>> @@ -379,6 +380,185 @@ struct vfio_region_gfx_edid {
>>>>    /* sub-types for VFIO_REGION_TYPE_CCW */
>>>>    #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
>>>>    
>>>> +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
>>>> +#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
>>>> +
>>>> +/*
>>>> + * Structure vfio_device_migration_info is placed at 0th offset of
>>>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
>>>> + * information. Field accesses from this structure are only supported at their
>>>> + * native width and alignment, otherwise the result is undefined and vendor
>>>> + * drivers should return an error.
>>>> + *
>>>> + * device_state: (read/write)
>>>> + *      To indicate vendor driver the state VFIO device should be transitioned
>>>> + *      to. If device state transition fails, write on this field return error.
>>>> + *      It consists of 3 bits:
>>>> + *      - If bit 0 set, indicates _RUNNING state. When its clear, that indicates
>>>
>>> s/its/it's/
>>>    
>>>> + *        _STOP state. When device is changed to _STOP, driver should stop
>>>> + *        device before write() returns.
>>>> + *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
>>>> + *        should start gathering device state information which will be provided
>>>> + *        to VFIO user space application to save device's state.
>>>> + *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
>>>> + *        prepare to resume device, data provided through migration region
>>>> + *        should be used to resume device.
>>>> + *      Bits 3 - 31 are reserved for future use. User should perform
>>>> + *      read-modify-write operation on this field.
>>>> + *
>>>> + *  +------- _RESUMING
>>>> + *  |+------ _SAVING
>>>> + *  ||+----- _RUNNING
>>>> + *  |||
>>>> + *  000b => Device Stopped, not saving or resuming
>>>> + *  001b => Device running state, default state
>>>> + *  010b => Stop Device & save device state, stop-and-copy state
>>>> + *  011b => Device running and save device state, pre-copy state
>>>> + *  100b => Device stopped and device state is resuming
>>>> + *  101b => Invalid state
>>>
>>> Eventually this would be intended for post-copy, if supported by the
>>> device, right?
>>>    
>>
>> No, as Yan mentioned in the earlier version, _RESUMING + _RUNNING can't
>> be used for post-copy. New flag will be required for post-copy.
>>
>> https://www.mail-archive.com/qemu-devel@nongnu.org/msg658768.html
>>
>>>> + *  110b => Invalid state
>>>> + *  111b => Invalid state
>>>> + *
>>>> + * State transitions:
>>>> + *
>>>> + *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
>>>> + *                (100b)     (001b)     (011b)        (010b)       (000b)
>>>> + * 0. Running or Default state
>>>> + *                             |
>>>> + *
>>>> + * 1. Normal Shutdown
>>>
>>> Optional, userspace is under no obligation.
>>>    
>>>> + *                             |------------------------------------->|
>>>> + *
>>>> + * 2. Save state or Suspend
>>>> + *                             |------------------------->|---------->|
>>>> + *
>>>> + * 3. Save state during live migration
>>>> + *                             |----------->|------------>|---------->|
>>>> + *
>>>> + * 4. Resuming
>>>> + *                  |<---------|
>>>> + *
>>>> + * 5. Resumed
>>>> + *                  |--------->|
>>>> + *
>>>> + * 0. Default state of VFIO device is _RUNNNG when VFIO application starts.
>>>> + * 1. During normal VFIO application shutdown, vfio device state changes
>>>> + *    from _RUNNING to _STOP.
>>>
>>> We cannot impose this requirement on existing userspace.  Userspace may
>>> perform this action, but they are not required to and the vendor driver
>>> must not require it.
>>
>> Updated comment.
>>
>>>    
>>>> + * 2. When VFIO application save state or suspend application, VFIO device
>>>> + *    state transition is from _RUNNING to stop-and-copy state and then to
>>>> + *    _STOP.
>>>> + *    On state transition from _RUNNING to stop-and-copy, driver must
>>>> + *    stop device, save device state and send it to application through
>>>> + *    migration region.
>>>> + *    On _RUNNING to stop-and-copy state transition failure, application should
>>>> + *    set VFIO device state to _RUNNING.
>>>
>>> A state transition failure means that the user's write to device_state
>>> failed, so is it the user's responsibility to set the next state?
>>
>> Right.
> 
> If a transition failure occurs, ie. errno from write(2), what value is
> reported by a read(2) of device_state in the interim between the failure
> and a next state written by the user? 

Since the state transition has failed, the driver should return the previous state.

> If this is a valid state,
> wouldn't it be reasonable for the user to assume the device is already
> operating in that state?  If it's an invalid state, do we need to
> define the use cases for those invalid states?  If the user needs to
> set the state back to _RUNNING, that suggests the device might be
> stopped, which has implications beyond the migration state.
> 

Not necessarily stopped. For example, during live migration:

*              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
*                (100b)     (001b)     (011b)        (010b)       (000b)
*
* 3. Save state during live migration
*                             |----------->|------------>|---------->|

On any state transition failure, the user should set the _RUNNING state.
Pre-copy (011b) -> stop-and-copy (010b) failure ====> the _SAVING flag is
cleared and the device returns to _RUNNING.
Stop-and-copy (010b) -> _STOP (000b) failure ====> the device is already
stopped.


>>>   Why
>>> is it necessarily _RUNNING vs _STOP?
>>>   
>>
>> During the pre-copy to stop-and-copy transition, the device is
>> still running; only saving of device state has started. Now if the transition
>> to stop-and-copy fails, from the user's point of view the application or VM is
>> still running, so the device state should be set to _RUNNING and whatever the
>> application/VM is running should continue at the source.
> 
> Seems it's the users discretion whether to consider this continuable or
> fatal, the vfio interface specification should support a given usage
> model, not prescribe it.
> 

Updating comment.

>>>> + * 3. In VFIO application live migration, state transition is from _RUNNING
>>>> + *    to pre-copy to stop-and-copy to _STOP.
>>>> + *    On state transition from _RUNNING to pre-copy, driver should start
>>>> + *    gathering device state while application is still running and send device
>>>> + *    state data to application through migration region.
>>>> + *    On state transition from pre-copy to stop-and-copy, driver must stop
>>>> + *    device, save device state and send it to application through migration
>>>> + *    region.
>>>> + *    On any failure during any of these state transition, VFIO device state
>>>> + *    should be set to _RUNNING.
>>>
>>> Same comment as above regarding next state on failure.
>>>    
>>
>> If application or VM migration fails, it should continue to run at the
>> source. In the case of a VM, the guest user isn't aware of the migration, and
>> from their point of view the VM should be running.
> 
> vfio is not prescribing the migration semantics to userspace, it's
> presenting an interface that supports the user semantics.  Therefore,
> while it's useful to understand the expected usage model, I think we
> also need a mechanism that the user can always determine the
> device_state after a fault

If the state transition fails, the device is in the previous state and the
driver should return the previous state.

> and allowable state transitions independent
> of the expected usage model.  

Do you mean to define an array of ['from', 'to'] pairs, the same as the
runstate transition array in QEMU?
  static const RunStateTransition runstate_transitions_def[]
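
For illustration only, such a table for the device_state values described in
this patch could look like the sketch below; the bit encodings and the flows
they mirror come from the patch comment, while the names and the exact set of
allowed transitions are hypothetical (which transitions must be allowed is
part of this discussion).

/* Hypothetical transition table mirroring flows 0-5 in the patch comment. */
enum dev_mig_state {
        DEV_STATE_STOP      = 0x0,      /* 000b */
        DEV_STATE_RUNNING   = 0x1,      /* 001b: _RUNNING */
        DEV_STATE_STOPNCOPY = 0x2,      /* 010b: _SAVING */
        DEV_STATE_PRECOPY   = 0x3,      /* 011b: _SAVING | _RUNNING */
        DEV_STATE_RESUMING  = 0x4,      /* 100b: _RESUMING */
};

struct dev_state_transition {
        enum dev_mig_state from;
        enum dev_mig_state to;
};

static const struct dev_state_transition allowed_transitions[] = {
        { DEV_STATE_RUNNING,   DEV_STATE_STOP },        /* normal shutdown */
        { DEV_STATE_RUNNING,   DEV_STATE_STOPNCOPY },   /* save / suspend */
        { DEV_STATE_RUNNING,   DEV_STATE_PRECOPY },     /* live migration */
        { DEV_STATE_PRECOPY,   DEV_STATE_STOPNCOPY },
        { DEV_STATE_STOPNCOPY, DEV_STATE_STOP },
        { DEV_STATE_RUNNING,   DEV_STATE_RESUMING },    /* start resuming */
        { DEV_STATE_RESUMING,  DEV_STATE_RUNNING },     /* resumed */
};

Whether extra entries such as "any state -> DEV_STATE_STOP" belong in such a
table, as Alex suggests below for debugging, is exactly the open question.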


> For example, I think a user should always
> be allowed to transition a device to stopped regardless of the expected
> migration flow.  An error might have occurred elsewhere and we want to
> stop everything for debugging.  I think it's also allowable to switch
> directly from running to stop-and-copy, for example to save and resume
> a VM offline.
>   
>>> Also, it seems like it's the vendor driver's discretion to actually
>>> provide data during the pre-copy phase.  As we've defined it, the
>>> vendor driver needs to participate in the migration region regardless,
>>> they might just always report no pending_bytes until we enter
>>> stop-and-copy.
>>>    
>>
>> Yes. And if pending_bytes are reported as 0 in pre-copy by vendor driver
>> then QEMU doesn't reiterate for that device.
> 
> Maybe we can state that as the expected mechanism to avoid a vendor
> driver trying to invent alternative means, ex. failing transition to
> pre-copy, requesting new flags, etc.
> 

Isn't the "Sequence to be followed" description below sufficient to state that?


>>>> + * 4. To start resuming phase, VFIO device state should be transitioned from
>>>> + *    _RUNNING to _RESUMING state.
>>>> + *    In _RESUMING state, driver should use received device state data through
>>>> + *    migration region to resume device.
>>>> + *    On failure during this state transition, application should set _RUNNING
>>>> + *    state.
>>>
>>> Same comment regarding setting next state after failure.
>>
>> If device couldn't be transitioned to _RESUMING, then it should be set
>> to default state, that is _RUNNING.
>>
>>>    
>>>> + * 5. On providing saved device data to driver, appliation should change state
>>>> + *    from _RESUMING to _RUNNING.
>>>> + *    On failure to transition to _RUNNING state, VFIO application should reset
>>>> + *    the device and set _RUNNING state so that device doesn't remain in unknown
>>>> + *    or bad state. On reset, driver must reset device and device should be
>>>> + *    available in default usable state.
>>>
>>> Didn't we discuss that the reset ioctl should return the device to the
>>> initial state, including the transition to _RUNNING?
>>
>> Yes, that's default usable state, rewording it to initial state.
>>
>>>   Also, as above,
>>> it's the user write that triggers the failure, this register is listed
>>> as read-write, so what value does the vendor driver report for the
>>> state when read after a transition failure?  Is it reported as _RESUMING
>>> as it was prior to the attempted transition, or may the invalid states
>>> be used by the vendor driver to indicate the device is broken?
>>>    
>>
>> If the transition has failed, the device should report its previous state,
>> and a device reset should bring it back to the usable _RUNNING state.
> 
> If device_state reports previous state then user should reasonably
> infer that the device is already in that state without a need for them
> to set it, IMO.

But if there is any error in read()/write(), then the user should decide
which next state the device should be put in, which would be different from
the previous state.
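
A minimal sketch of that recovery pattern from the user side, assuming the
vfio_device_migration_info layout proposed in patch 1/5 and a 32-bit
device_state as described in its comment; device_fd and region_base (the file
offset of the migration region) are hypothetical:

#include <stdint.h>
#include <stddef.h>
#include <unistd.h>
#include <linux/vfio.h>         /* with the uAPI proposed in patch 1/5 */

/*
 * Write the desired device_state; on failure, read it back rather than
 * guessing, since the driver is expected to keep reporting the previous
 * state after a failed transition.
 */
static int set_device_state(int device_fd, off_t region_base,
                            uint32_t new_state, uint32_t *cur_state)
{
        off_t off = region_base +
                    offsetof(struct vfio_device_migration_info, device_state);

        if (pwrite(device_fd, &new_state, sizeof(new_state), off) ==
            sizeof(new_state)) {
                *cur_state = new_state;
                return 0;
        }

        /* Transition failed: learn which state the device is actually in. */
        if (pread(device_fd, cur_state, sizeof(*cur_state), off) !=
            sizeof(*cur_state))
                return -1;      /* state unknown; caller may reset the device */

        return -1;              /* transition refused; *cur_state is valid */
}

The caller then decides the next state (for example _RUNNING, or a device
reset) based on *cur_state instead of assuming it.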

> 
>>>> + *
>>>> + * pending bytes: (read only)
>>>> + *      Number of pending bytes yet to be migrated from vendor driver
>>>> + *
>>>> + * data_offset: (read only)
>>>> + *      User application should read data_offset in migration region from where
>>>> + *      user application should read device data during _SAVING state or write
>>>> + *      device data during _RESUMING state. See below for detail of sequence to
>>>> + *      be followed.
>>>> + *
>>>> + * data_size: (read/write)
>>>> + *      User application should read data_size to get size of data copied in
>>>> + *      bytes in migration region during _SAVING state and write size of data
>>>> + *      copied in bytes in migration region during _RESUMING state.
>>>> + *
>>>> + * Migration region looks like:
>>>> + *  ------------------------------------------------------------------
>>>> + * |vfio_device_migration_info|    data section                      |
>>>> + * |                          |     ///////////////////////////////  |
>>>> + * ------------------------------------------------------------------
>>>> + *   ^                              ^
>>>> + *  offset 0-trapped part        data_offset
>>>> + *
>>>> + * Structure vfio_device_migration_info is always followed by data section in
>>>> + * the region, so data_offset will always be non-0. Offset from where data is
>>>> + * copied is decided by kernel driver, data section can be trapped or mapped
>>>> + * or partitioned, depending on how kernel driver defines data section.
>>>> + * Data section partition can be defined as mapped by sparse mmap capability.
>>>> + * If mmapped, then data_offset should be page aligned, where as initial section
>>>> + * which contain vfio_device_migration_info structure might not end at offset
>>>> + * which is page aligned. The user is not required to access via mmap regardless
>>>> + * of the region mmap capabilities.
>>>> + * Vendor driver should decide whether to partition data section and how to
>>>> + * partition the data section. Vendor driver should return data_offset
>>>> + * accordingly.
>>>> + *
>>>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
>>>> + * and for _SAVING device state or stop-and-copy phase:
>>>> + * a. read pending_bytes, indicates start of new iteration to get device data.
>>>> + *    If there was previous iteration, then this read operation indicates
>>>> + *    previous iteration is done. If pending_bytes > 0, go through below steps.
>>>> + * b. read data_offset, indicates kernel driver to make data available through
>>>> + *    data section. Kernel driver should return this read operation only after
>>>> + *    data is available from (region + data_offset) to (region + data_offset +
>>>> + *    data_size).
>>>> + * c. read data_size, amount of data in bytes available through migration
>>>> + *    region.
>>>> + * d. read data of data_size bytes from (region + data_offset) from migration
>>>> + *    region.
>>>> + * e. process data.
>>>> + * f. Loop through a to e.
>>>
>>> It seems we always need to end an iteration by reading pending_bytes to
>>> signal to the vendor driver to release resources, so should the end of
>>> the loop be:
>>>
>>> e. Read pending_bytes
>>> f. Goto b. or optionally restart next iteration at a.
>>>
>>> I think this is defined such that reading data_offset commits resources
>>> and reading pending_bytes frees them, allowing userspace to restart at
>>> reading pending_bytes with no side-effects.  Therefore reading
>>> pending_bytes repeatedly is supported.  Is the same true for
>>> data_offset and data_size?  It seems reasonable that the vendor driver
>>> can simply return offset and size for the current buffer if the user
>>> reads these more than once.
>>>   
>>
>> Right.
> 
> Can we add that to the spec?
> 

ok.
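
For reference, a minimal user-space sketch of one save iteration following
steps a-e above, including the re-read semantics discussed here.  The
structure layout, the migr_offset region offset and the helper name are
illustrative assumptions; the real field offsets come from the final uapi
header and the region offset from VFIO_DEVICE_GET_REGION_INFO.

#include <stdint.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/types.h>

/* Assumed layout of the migration region header; the real definition
 * comes from <linux/vfio.h> once this series is merged. */
struct migr_info {
	uint32_t device_state;
	uint32_t reserved;
	uint64_t pending_bytes;
	uint64_t data_offset;
	uint64_t data_size;
};

/* One _SAVING iteration: returns bytes read, 0 when the vendor driver
 * reports no pending data, -1 on error. */
static ssize_t save_iteration(int device_fd, off_t migr_offset,
			      void *buf, size_t buf_size)
{
	uint64_t pending, data_offset, data_size;

	/* a. read pending_bytes; this also tells the vendor driver that
	 *    the previous iteration, if any, is done. */
	if (pread(device_fd, &pending, sizeof(pending), migr_offset +
		  offsetof(struct migr_info, pending_bytes)) !=
	    (ssize_t)sizeof(pending))
		return -1;
	if (!pending)
		return 0;

	/* b. read data_offset; the driver commits a buffer for us here. */
	if (pread(device_fd, &data_offset, sizeof(data_offset), migr_offset +
		  offsetof(struct migr_info, data_offset)) !=
	    (ssize_t)sizeof(data_offset))
		return -1;

	/* c. read data_size: amount of data staged for this iteration. */
	if (pread(device_fd, &data_size, sizeof(data_size), migr_offset +
		  offsetof(struct migr_info, data_size)) !=
	    (ssize_t)sizeof(data_size))
		return -1;
	if (data_size > buf_size)
		return -1;	/* caller's buffer must cover the data section */

	/* d. read the opaque device data from the data section. */
	if (pread(device_fd, buf, data_size, migr_offset + data_offset) !=
	    (ssize_t)data_size)
		return -1;

	/* e. caller processes buf; the next call (step a) releases it. */
	return data_size;
}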

>>> How is a protocol or device error signaled?  For example, we can have a
>>> user error where they read data_size before data_offset.  Should the
>>> vendor driver generate a fault reading data_size in this case.  We can
>>> also have internal errors in the vendor driver, should the vendor
>>> driver use a special errno or update device_state autonomously to
>>> indicate such an error?
>>
>> If there is any error during the sequence, vendor driver can return
>> error code for next read/write operation, that will terminate the loop
>> and migration would fail.
> 
> Please add to spec.
> 

Ok

>>> I believe it's also part of the intended protocol that the user can
>>> transition from _SAVING|_RUNNING to _SAVING at any point, regardless of
>>> pending_bytes.  This should be noted.
>>>    
>>
>> Ok. Updating comment.
>>
>>>> + *
>>>> + * Sequence to be followed while _RESUMING device state:
>>>> + * While data for this device is available, repeat below steps:
>>>> + * a. read data_offset from where user application should write data.
>>>> + * b. write data of data_size to migration region from data_offset.
>>>
>>> Whose's data_size, the _SAVING end or the _RESUMING end?  I think this
>>> is intended to be the transaction size from the _SAVING source,
>>
>> Not necessarily. data_size could be MIN(transaction size of source,
>> migration data section). If migration data section is smaller than data
>> packet size at source, then it has to be broken and iteratively sent.
> 
> So you're saying that a transaction from the source is divisible by the
> user under certain conditions.  What other conditions exist?

I don't think there are any other conditions than above.

>  Can the
> user decide arbitrary sizes less than the MIN() stated above?  This
> needs to be specified.
>

No, the user can't decide arbitrary sizes.


>>> but it
>>> could easily be misinterpreted as reading data_size on the _RESUMING
>>> end.
>>>    
>>>> + * c. write data_size which indicates vendor driver that data is written in
>>>> + *    staging buffer. Vendor driver should read this data from migration
>>>> + *    region and resume device's state.
>>>
>>> I think we also need to define the error protocol.  The user could
>>> mis-order transactions or there could be an internal error in the
>>> vendor driver or device.  Are all read(2)/write(2) operations
>>> susceptible to defined errnos to signal this?
>>
>> Yes.
> 
> And those defined errnos are specified...
> 

Those could be standard errors like -EINVAL, -ENOMEM, etc.

>>>   Is it reflected in
>>> device_state?
>>
>> No.
> 
> So a user should do what, just keep trying?
>

No, fail the migration process. If the error is at the source or the 
destination, then the user can decide to either resume at the source or 
terminate the application.

>>> What's the recovery protocol?
>>>    
>>
>> On read()/write() failure user should take necessary action.
> 
> Where is that necessary action defined?  Can they just try again?  Do
> they transition in and out of _RESUMING to try again?  Do they need to
> reset the device?
> 

User application should decide what action to take on failure, right?
  "vfio is not prescribing the migration semantics to userspace, it's
presenting an interface that support the user semantics."

>>>> + *
>>>> + * For user application, data is opaque. User should write data in the same
>>>> + * order as received.
>>>
>>> Order and transaction size, ie. each data_size chunk is indivisible by
>>> the user.
>>
>> Transaction size can differ, but order should remain same.
> 
> Under what circumstances and to what extent can transaction size
> differ?

It depends on the migration region size.

>  Is the MIN() algorithm above the absolute lower bound or just
> a suggestion?



>  Is the user allowed to concatenate transactions from the
> source together on the target if the region is sufficiently large?

Yes, that can be done, because the data is just a byte stream for the 
user. The vendor driver receives the byte stream and knows how to decode it.
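
To illustrate, here is a resuming-side sketch that treats the stream as
opaque bytes and feeds one source transaction to the destination in chunks
no larger than its data section.  The structure layout and helper name are
assumptions, and whether this kind of splitting is permitted at all is
exactly what is being debated in this thread.

#include <stdint.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/types.h>

/* Assumed migration region header layout, as in the earlier sketch. */
struct migr_info {
	uint32_t device_state;
	uint32_t reserved;
	uint64_t pending_bytes;
	uint64_t data_offset;
	uint64_t data_size;
};

/* Feed 'len' bytes of opaque device state to the _RESUMING device in
 * chunks no larger than the destination's data section.  Order is
 * preserved; whether chunking is allowed is the open question here. */
static int resume_write(int device_fd, off_t migr_offset,
			size_t region_data_size, const void *data, size_t len)
{
	const char *p = data;

	while (len) {
		uint64_t data_offset;
		uint64_t chunk = len < region_data_size ? len : region_data_size;

		/* a. ask the vendor driver where to place the data. */
		if (pread(device_fd, &data_offset, sizeof(data_offset),
			  migr_offset +
			  offsetof(struct migr_info, data_offset)) !=
		    (ssize_t)sizeof(data_offset))
			return -1;

		/* b. copy the opaque bytes into the data section. */
		if (pwrite(device_fd, p, chunk, migr_offset + data_offset) !=
		    (ssize_t)chunk)
			return -1;

		/* c. writing data_size tells the driver this chunk is
		 *    ready to be consumed. */
		if (pwrite(device_fd, &chunk, sizeof(chunk), migr_offset +
			   offsetof(struct migr_info, data_size)) !=
		    (ssize_t)sizeof(chunk))
			return -1;

		p += chunk;
		len -= chunk;
	}
	return 0;
}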

>  It
> seems like quite an imposition on the vendor driver to support this
> flexibility.
> 
>>>> + */
>>>> +
>>>> +struct vfio_device_migration_info {
>>>> +	__u32 device_state;         /* VFIO device state */
>>>> +#define VFIO_DEVICE_STATE_STOP      (1 << 0)
>>>> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
>>>
>>> Huh?  We should probably just refer to it consistently, ie. _RUNNING
>>> and !_RUNNING, otherwise we have the incongruity that setting the _STOP
>>> value is actually the opposite of the necessary logic value (_STOP = 1
>>> is _RUNNING, _STOP = 0 is !_RUNNING).
>>
>> Ops, my mistake, forgot to update to
>> #define VFIO_DEVICE_STATE_STOP      (0)
>>
>>>    
>>>> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
>>>> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
>>>> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
>>>> +				     VFIO_DEVICE_STATE_SAVING |  \
>>>> +				     VFIO_DEVICE_STATE_RESUMING)
>>>> +
>>>> +#define VFIO_DEVICE_STATE_INVALID_CASE1    (VFIO_DEVICE_STATE_SAVING | \
>>>> +					    VFIO_DEVICE_STATE_RESUMING)
>>>> +
>>>> +#define VFIO_DEVICE_STATE_INVALID_CASE2    (VFIO_DEVICE_STATE_RUNNING | \
>>>> +					    VFIO_DEVICE_STATE_RESUMING)
>>>
>>> Gack, we fixed these in the last iteration!
>>>    
>>
>> That solution doesn't scale when new flags will be added. I still prefer
>> to define as above.
> 
> I see, the argument was buried in a reply to Yan, sorry if I missed it:
> 
>>>> These seem difficult to use, maybe we just need a
>>>> VFIO_DEVICE_STATE_VALID macro?
>>>>
>>>> #define VFIO_DEVICE_STATE_VALID(state) \
>>>>     (state & VFIO_DEVICE_STATE_RESUMING ? \
>>>>     (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
>>>>   
>>
>> This will not work when use of other bits gets added in the future.
>> That's the reason I preferred to add individual invalid states which
>> user should check.
> 
> I would argue that what doesn't scale is having numerous CASE1, CASE2,
> CASEn conditions elsewhere in the kernel rather than have a unified,
> single macro that defines a valid state.  How do you worry this will be
> a problem when new flags are added, can't we just update the macro?

Adding the macro you suggested above. Let's figure out how to solve the 
problem with new flags when new flags get added.
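
For reference, a minimal sketch of how a vendor driver might apply that
validity macro when handling a device_state write.  Only the macro itself
is from the proposal above; the driver structure and the transition helper
are hypothetical, and <errno.h> stands in for <linux/errno.h>.

#include <errno.h>	/* userspace stand-in; a driver uses <linux/errno.h> */

typedef unsigned int u32;

#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
				     VFIO_DEVICE_STATE_SAVING |  \
				     VFIO_DEVICE_STATE_RESUMING)

/* The macro as proposed in this thread: _RESUMING must not be combined
 * with _RUNNING or _SAVING; the remaining combinations are valid. */
#define VFIO_DEVICE_STATE_VALID(state) \
	(state & VFIO_DEVICE_STATE_RESUMING ? \
	 (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)

/* Hypothetical vendor driver state; only what the sketch needs. */
struct my_mdev_state {
	u32 device_state;	/* last state that was successfully set */
};

/* Hypothetical hardware transition; a real driver programs the device. */
static int my_vendor_transition(struct my_mdev_state *mdev, u32 to)
{
	mdev->device_state = to;
	return 0;
}

/* Sketch of a device_state write handler built on the validity macro. */
static int my_set_device_state(struct my_mdev_state *mdev, u32 new_state)
{
	if (new_state & ~VFIO_DEVICE_STATE_MASK)
		return -EINVAL;		/* reserved bits set */

	if (!VFIO_DEVICE_STATE_VALID(new_state))
		return -EINVAL;		/* e.g. _SAVING | _RESUMING */

	/* On failure, device_state keeps reporting the previously set
	 * value, per the discussion above. */
	return my_vendor_transition(mdev, new_state);
}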

Thanks,
Kirti


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 4/5] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2019-12-19  0:57               ` Yan Zhao
@ 2019-12-19 16:21                 ` Kirti Wankhede
  2019-12-20  0:58                   ` Yan Zhao
  0 siblings, 1 reply; 44+ messages in thread
From: Kirti Wankhede @ 2019-12-19 16:21 UTC (permalink / raw)
  To: Yan Zhao, Dr. David Alan Gilbert
  Cc: alex.williamson, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng,
	Liu, Yi L, mlevitsk, eskultet, cohuck, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx@Alibaba-inc.com, shuangtai.tst,
	Ken.Xue, Wang, Zhi A, qemu-devel, kvm



On 12/19/2019 6:27 AM, Yan Zhao wrote:
> On Thu, Dec 19, 2019 at 04:05:52AM +0800, Dr. David Alan Gilbert wrote:
>> * Yan Zhao (yan.y.zhao@intel.com) wrote:
>>> On Tue, Dec 17, 2019 at 07:47:05PM +0800, Kirti Wankhede wrote:
>>>>
>>>>
>>>> On 12/17/2019 3:21 PM, Yan Zhao wrote:
>>>>> On Tue, Dec 17, 2019 at 05:24:14PM +0800, Kirti Wankhede wrote:
>>>>>>>>     
>>>>>>>>     		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>>>>>>>>     			-EFAULT : 0;
>>>>>>>> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
>>>>>>>> +		struct vfio_iommu_type1_dirty_bitmap range;
>>>>>>>> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
>>>>>>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
>>>>>>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
>>>>>>>> +		int ret;
>>>>>>>> +
>>>>>>>> +		if (!iommu->v2)
>>>>>>>> +			return -EACCES;
>>>>>>>> +
>>>>>>>> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
>>>>>>>> +				    bitmap);
>>>>>>>> +
>>>>>>>> +		if (copy_from_user(&range, (void __user *)arg, minsz))
>>>>>>>> +			return -EFAULT;
>>>>>>>> +
>>>>>>>> +		if (range.argsz < minsz || range.flags & ~mask)
>>>>>>>> +			return -EINVAL;
>>>>>>>> +
>>>>>>>> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
>>>>>>>> +			iommu->dirty_page_tracking = true;
>>>>>>>> +			return 0;
>>>>>>>> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
>>>>>>>> +			iommu->dirty_page_tracking = false;
>>>>>>>> +
>>>>>>>> +			mutex_lock(&iommu->lock);
>>>>>>>> +			vfio_remove_unpinned_from_dma_list(iommu);
>>>>>>>> +			mutex_unlock(&iommu->lock);
>>>>>>>> +			return 0;
>>>>>>>> +
>>>>>>>> +		} else if (range.flags &
>>>>>>>> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
>>>>>>>> +			uint64_t iommu_pgmask;
>>>>>>>> +			unsigned long pgshift = __ffs(range.pgsize);
>>>>>>>> +			unsigned long *bitmap;
>>>>>>>> +			long bsize;
>>>>>>>> +
>>>>>>>> +			iommu_pgmask =
>>>>>>>> +			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
>>>>>>>> +
>>>>>>>> +			if (((range.pgsize - 1) & iommu_pgmask) !=
>>>>>>>> +			    (range.pgsize - 1))
>>>>>>>> +				return -EINVAL;
>>>>>>>> +
>>>>>>>> +			if (range.iova & iommu_pgmask)
>>>>>>>> +				return -EINVAL;
>>>>>>>> +			if (!range.size || range.size > SIZE_MAX)
>>>>>>>> +				return -EINVAL;
>>>>>>>> +			if (range.iova + range.size < range.iova)
>>>>>>>> +				return -EINVAL;
>>>>>>>> +
>>>>>>>> +			bsize = verify_bitmap_size(range.size >> pgshift,
>>>>>>>> +						   range.bitmap_size);
>>>>>>>> +			if (bsize)
>>>>>>>> +				return ret;
>>>>>>>> +
>>>>>>>> +			bitmap = kmalloc(bsize, GFP_KERNEL);
>>>>>>>> +			if (!bitmap)
>>>>>>>> +				return -ENOMEM;
>>>>>>>> +
>>>>>>>> +			ret = copy_from_user(bitmap,
>>>>>>>> +			     (void __user *)range.bitmap, bsize) ? -EFAULT : 0;
>>>>>>>> +			if (ret)
>>>>>>>> +				goto bitmap_exit;
>>>>>>>> +
>>>>>>>> +			iommu->dirty_page_tracking = false;
>>>>>>> why iommu->dirty_page_tracking is false here?
>>>>>>> suppose this ioctl can be called several times.
>>>>>>>
>>>>>>
>>>>>> This ioctl can be called several times, but once this ioctl is called
>>>>>> that means vCPUs are stopped and VFIO devices are stopped (i.e. in
>>>>>> stop-and-copy phase) and dirty pages bitmap are being queried by user.
>>>>>>
>>>>> can't agree that VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP can only be
>>>>> called in stop-and-copy phase.
>>>>> As stated in last version, this will cause QEMU to get a wrong expectation
>>>>> of VM downtime and this is also the reason for previously pinned pages
>>>>> before log_sync cannot be treated as dirty. If this get bitmap ioctl can
>>>>> be called early in save_setup phase, then it's no problem even all ram
>>>>> is dirty.
>>>>>
>>>>
>>>> Device can also write to pages which are pinned, and then there is no
>>>> way to know pages dirtied by device during pre-copy phase.
>>>> If the user asks for the dirty bitmap in the pre-copy phase, even then the
>>>> user will have to query the dirty bitmap in the stop-and-copy phase, which
>>>> will be a superset including all pages reported during pre-copy. Then
>>>> instead of copying all pages twice, it's better to do it once during the
>>>> stop-and-copy phase.
>>>>
>>> I think the flow should be like this:
>>> 1. save_setup --> GET_BITMAP ioctl --> return bitmap for currently + previously
>>> pinned pages and clean all previously pinned pages
>>>
>>> 2. save_pending --> GET_BITMAP ioctl  --> return bitmap of (currently
>>> pinned pages + previously pinned pages since last clean) and clean all
>>> previously pinned pages
>>>
>>> 3. save_complete_precopy --> GET_BITMAP ioctl --> return bitmap of (currently
>>> pinned pages + previously pinned pages since last clean) and clean all
>>> previously pinned pages
>>>
>>>
>>> Copying pinned pages multiple times is unavoidable because those pinned pages
>>> are always treated as dirty. That's per vendor's implementation.
>>> But if the pinned pages are not reported as dirty before stop-and-copy phase,
>>> QEMU would think dirty pages has converged
>>> and enter blackout phase, making downtime_limit severely incorrect.
>>
>> I'm not sure it's any worse.
>> I *think* we do a last sync after we've decided to go to stop-and-copy;
>> wont that then mark all those pages as dirty again, so it'll have the
>> same behaviour?
> No. something will be different.
> currently, in kirti's implementation, if GET_BITMAP ioctl is called only
> once in stop-and-copy phase, then before that phase, QEMU does not know those
> pages are dirty.
> If we can report those dirty pages earlier before stop-and-copy phase,
> QEMU can at least copy other pages to reduce dirty pages to below threshold.
> 
> Take a example, let's assume those vfio dirty pages is 1Gb, and network speed is
> also 1Gb. Expected vm downtime is 1s.
> If before stop-and-copy phase, dirty pages produced by other pages is
> also 1Gb. To meet the expected vm downtime, QEMU should copy pages to
> let dirty pages be less than 1Gb, otherwise, it should not complete live
> migration.
> If vfio does not report this 1Gb dirty pages, QEMU would think there's
> only 1Gb and stop the vm. It would then find out there's actually 2Gb and vm
> downtime is 2s.
> Though the expected vm downtime is always not exactly the same as the
> true vm downtime, it should be caused by rapid dirty page rate, which is
> not predictable.
> Right?
> 

If you report the 1Gb of vfio dirty pages before the stop-and-copy phase 
(i.e. in the pre-copy phase) and then enter the stop-and-copy phase, how 
will you know which and how many pages were dirtied by the device between 
the time those pages were copied in pre-copy and the time the device is 
stopped? There is no way to know which pages the device dirtied, and in 
principle the device can write to any of the pinned pages. So we have to 
mark all those pinned pages dirty again in the stop-and-copy phase, 1Gb, 
and copy them to the destination. Now the same pages have been copied 
twice. Shouldn't we try not to copy pages twice?

Thanks,
Kirti


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
  2019-12-19 16:08         ` Kirti Wankhede
@ 2019-12-19 17:27           ` Alex Williamson
  2019-12-19 20:10             ` Kirti Wankhede
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2019-12-19 17:27 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Thu, 19 Dec 2019 21:38:44 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 12/18/2019 12:13 AM, Alex Williamson wrote:
> > On Tue, 17 Dec 2019 11:58:44 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 12/17/2019 4:14 AM, Alex Williamson wrote:  
> >>> On Tue, 17 Dec 2019 01:51:36 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>      
> >>>> - Defined MIGRATION region type and sub-type.
> >>>>
> >>>> - Defined vfio_device_migration_info structure which will be placed at 0th
> >>>>     offset of migration region to get/set VFIO device related information.
> >>>>     Defined members of structure and usage on read/write access.
> >>>>
> >>>> - Defined device states and added state transition details in the comment.
> >>>>
> >>>> - Added sequence to be followed while saving and resuming VFIO device state
> >>>>
> >>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >>>> ---
> >>>>    include/uapi/linux/vfio.h | 180 ++++++++++++++++++++++++++++++++++++++++++++++
> >>>>    1 file changed, 180 insertions(+)
> >>>>
> >>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >>>> index 9e843a147ead..a0817ba267c1 100644
> >>>> --- a/include/uapi/linux/vfio.h
> >>>> +++ b/include/uapi/linux/vfio.h
> >>>> @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
> >>>>    #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
> >>>>    #define VFIO_REGION_TYPE_GFX                    (1)
> >>>>    #define VFIO_REGION_TYPE_CCW			(2)
> >>>> +#define VFIO_REGION_TYPE_MIGRATION              (3)
> >>>>    
> >>>>    /* sub-types for VFIO_REGION_TYPE_PCI_* */
> >>>>    
> >>>> @@ -379,6 +380,185 @@ struct vfio_region_gfx_edid {
> >>>>    /* sub-types for VFIO_REGION_TYPE_CCW */
> >>>>    #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
> >>>>    
> >>>> +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
> >>>> +#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
> >>>> +
> >>>> +/*
> >>>> + * Structure vfio_device_migration_info is placed at 0th offset of
> >>>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> >>>> + * information. Field accesses from this structure are only supported at their
> >>>> + * native width and alignment, otherwise the result is undefined and vendor
> >>>> + * drivers should return an error.
> >>>> + *
> >>>> + * device_state: (read/write)
> >>>> + *      To indicate vendor driver the state VFIO device should be transitioned
> >>>> + *      to. If device state transition fails, write on this field return error.
> >>>> + *      It consists of 3 bits:
> >>>> + *      - If bit 0 set, indicates _RUNNING state. When its clear, that indicates  
> >>>
> >>> s/its/it's/
> >>>      
> >>>> + *        _STOP state. When device is changed to _STOP, driver should stop
> >>>> + *        device before write() returns.
> >>>> + *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
> >>>> + *        should start gathering device state information which will be provided
> >>>> + *        to VFIO user space application to save device's state.
> >>>> + *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
> >>>> + *        prepare to resume device, data provided through migration region
> >>>> + *        should be used to resume device.
> >>>> + *      Bits 3 - 31 are reserved for future use. User should perform
> >>>> + *      read-modify-write operation on this field.
> >>>> + *
> >>>> + *  +------- _RESUMING
> >>>> + *  |+------ _SAVING
> >>>> + *  ||+----- _RUNNING
> >>>> + *  |||
> >>>> + *  000b => Device Stopped, not saving or resuming
> >>>> + *  001b => Device running state, default state
> >>>> + *  010b => Stop Device & save device state, stop-and-copy state
> >>>> + *  011b => Device running and save device state, pre-copy state
> >>>> + *  100b => Device stopped and device state is resuming
> >>>> + *  101b => Invalid state  
> >>>
> >>> Eventually this would be intended for post-copy, if supported by the
> >>> device, right?
> >>>      
> >>
> >> No, as per Yan mentioned in earlier version, _RESUMING + _RUNNING can't
> >> be used for post-copy. New flag will be required for post-copy.
> >>
> >> https://www.mail-archive.com/qemu-devel@nongnu.org/msg658768.html
> >>  
> >>>> + *  110b => Invalid state
> >>>> + *  111b => Invalid state
> >>>> + *
> >>>> + * State transitions:
> >>>> + *
> >>>> + *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
> >>>> + *                (100b)     (001b)     (011b)        (010b)       (000b)
> >>>> + * 0. Running or Default state
> >>>> + *                             |
> >>>> + *
> >>>> + * 1. Normal Shutdown  
> >>>
> >>> Optional, userspace is under no obligation.
> >>>      
> >>>> + *                             |------------------------------------->|
> >>>> + *
> >>>> + * 2. Save state or Suspend
> >>>> + *                             |------------------------->|---------->|
> >>>> + *
> >>>> + * 3. Save state during live migration
> >>>> + *                             |----------->|------------>|---------->|
> >>>> + *
> >>>> + * 4. Resuming
> >>>> + *                  |<---------|
> >>>> + *
> >>>> + * 5. Resumed
> >>>> + *                  |--------->|
> >>>> + *
> >>>> + * 0. Default state of VFIO device is _RUNNING when VFIO application starts.
> >>>> + * 1. During normal VFIO application shutdown, vfio device state changes
> >>>> + *    from _RUNNING to _STOP.  
> >>>
> >>> We cannot impose this requirement on existing userspace.  Userspace may
> >>> perform this action, but they are not required to and the vendor driver
> >>> must not require it.  
> >>
> >> Updated comment.
> >>  
> >>>      
> >>>> + * 2. When VFIO application save state or suspend application, VFIO device
> >>>> + *    state transition is from _RUNNING to stop-and-copy state and then to
> >>>> + *    _STOP.
> >>>> + *    On state transition from _RUNNING to stop-and-copy, driver must
> >>>> + *    stop device, save device state and send it to application through
> >>>> + *    migration region.
> >>>> + *    On _RUNNING to stop-and-copy state transition failure, application should
> >>>> + *    set VFIO device state to _RUNNING.  
> >>>
> >>> A state transition failure means that the user's write to device_state
> >>> failed, so is it the user's responsibility to set the next state?  
> >>
> >> Right.  
> > 
> > If a transition failure occurs, ie. errno from write(2), what value is
> > reported by a read(2) of device_state in the interim between the failure
> > and a next state written by the user?   
> 
> Since state transition has failed, driver should return previous state.
> 
> > If this is a valid state,
> > wouldn't it be reasonable for the user to assume the device is already
> > operating in that state?

This ^^^

>  If it's an invalid state, do we need to
> > define the use cases for those invalid states?  If the user needs to
> > set the state back to _RUNNING, that suggests the device might be
> > stopped, which has implications beyond the migration state.
> >   
> 
> Not necessarily stopped. For example, during live migration:
> 
> *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
> *                (100b)     (001b)     (011b)        (010b)       (000b)
> *
> * 3. Save state during live migration
> *                             |----------->|------------>|---------->|
> 
> on any state transition failure, user should set _RUNNING state.
> pre-copy (011b) -> stop-and-copy(010b)  =====> _SAVING flag is cleared 
> and device returned back to _RUNNING.
> Stop-and-copy(010b) -> STOP (000b) ====> device is already stopped.

IMO, the user may modify the state, but the vendor driver should report
the current state of the device via device_state, which the user may
read after a transition error occurs.  The spec lacks a provision for
indicating the device is in a non-functional state.
 
> >>>   Why
> >>> is it necessarily _RUNNING vs _STOP?
> >>>     
> >>
> >> While changing From pre-copy to stop-and-copy transition, device is
> >> still running, only saving of device state started. Now if transition to
> >> stop-and-copy fails, from user point of view application or VM is still
> >> running, device state should be set to _RUNNING so that whatever the
> >> application/VM is running should continue at source.  
> > 
> > Seems it's the users discretion whether to consider this continuable or
> > fatal, the vfio interface specification should support a given usage
> > model, not prescribe it.
> >   
> 
> Updating comment.
> 
> >>>> + * 3. In VFIO application live migration, state transition is from _RUNNING
> >>>> + *    to pre-copy to stop-and-copy to _STOP.
> >>>> + *    On state transition from _RUNNING to pre-copy, driver should start
> >>>> + *    gathering device state while application is still running and send device
> >>>> + *    state data to application through migration region.
> >>>> + *    On state transition from pre-copy to stop-and-copy, driver must stop
> >>>> + *    device, save device state and send it to application through migration
> >>>> + *    region.
> >>>> + *    On any failure during any of these state transition, VFIO device state
> >>>> + *    should be set to _RUNNING.  
> >>>
> >>> Same comment as above regarding next state on failure.
> >>>      
> >>
> >> If application or VM migration fails, it should continue to run at
> >> source. In case of VM, guest user isn't aware of migration, and from his
> >> point VM should be running.  
> > 
> > vfio is not prescribing the migration semantics to userspace, it's
> > presenting an interface that support the user semantics.  Therefore,
> > while it's useful to understand the expected usage model, I think we
> > also need a mechanism that the user can always determine the
> > device_state after a fault  
> 
> If state transition fails, device is in previous state and driver should 
> return previous state

Then why is it stated that the user needs to set the _RUNNING state?
It's the user's choice.  But I do think we're lacking a state to
indicate an internal fault.
 
> > and allowable state transitions independent
> > of the expected usage model.    
> 
> Do you mean to define array of ['from','to'], same as runstate 
> transition array in QEMU?
>   static const RunStateTransition runstate_transitions_def[]

I'm thinking that independent of expected QEMU usage models, are there
any invalid transitions or is every state reachable from every other
state.  I'm afraid this design is so focused on a specific usage model
that vendor drivers are going to fall over if the user invokes a
transition outside of those listed above.  If there are invalid
transitions, those should be listed so they can be handled
consistently.  If there are no invalid transitions, it should be noted
in the spec to encourage vendor drivers to expect this.
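
Purely as an illustration of what "listing the transitions" could look
like (this is not part of the proposed uapi): a table keyed by the three
state bits, permitting the flows discussed in this thread plus stopping
from any state and going straight from running to stop-and-copy, as in
the quoted text below.

#include <stdbool.h>

/* Three state bits give eight combinations; 101b/110b/111b are invalid. */
enum {
	ST_STOP		= 0,	/* 000b                   */
	ST_RUNNING	= 1,	/* 001b _RUNNING          */
	ST_STOPNCOPY	= 2,	/* 010b _SAVING           */
	ST_PRECOPY	= 3,	/* 011b _SAVING|_RUNNING  */
	ST_RESUMING	= 4,	/* 100b _RESUMING         */
	ST_NR		= 8,
};

/* allowed[from][to], illustrative only: live migration, save/suspend and
 * resume flows, plus "stop from anywhere" and "running straight to
 * stop-and-copy".  Anything not listed would get -EINVAL. */
static const bool allowed[ST_NR][ST_NR] = {
	[ST_RUNNING]	= { [ST_STOP] = true, [ST_PRECOPY] = true,
			    [ST_STOPNCOPY] = true, [ST_RESUMING] = true },
	[ST_PRECOPY]	= { [ST_STOP] = true, [ST_RUNNING] = true,
			    [ST_STOPNCOPY] = true },
	[ST_STOPNCOPY]	= { [ST_STOP] = true, [ST_RUNNING] = true },
	[ST_RESUMING]	= { [ST_STOP] = true, [ST_RUNNING] = true },
	[ST_STOP]	= { [ST_RUNNING] = true },
};

static bool transition_allowed(unsigned int from, unsigned int to)
{
	return from < ST_NR && to < ST_NR && allowed[from][to];
}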

> > For example, I think a user should always
> > be allowed to transition a device to stopped regardless of the expected
> > migration flow.  An error might have occurred elsewhere and we want to
> > stop everything for debugging.  I think it's also allowable to switch
> > directly from running to stop-and-copy, for example to save and resume
> > a VM offline.
> >     
> >>> Also, it seems like it's the vendor driver's discretion to actually
> >>> provide data during the pre-copy phase.  As we've defined it, the
> >>> vendor driver needs to participate in the migration region regardless,
> >>> they might just always report no pending_bytes until we enter
> >>> stop-and-copy.
> >>>      
> >>
> >> Yes. And if pending_bytes are reported as 0 in pre-copy by vendor driver
> >> then QEMU doesn't reiterate for that device.  
> > 
> > Maybe we can state that as the expected mechanism to avoid a vendor
> > driver trying to invent alternative means, ex. failing transition to
> > pre-copy, requesting new flags, etc.
> >   
> 
> Isn't Sequence to be followed below sufficient to state that?

I think we understand it because we've been discussing it so long, but
without that background it could be subtle.
 
> >>>> + * 4. To start resuming phase, VFIO device state should be transitioned from
> >>>> + *    _RUNNING to _RESUMING state.
> >>>> + *    In _RESUMING state, driver should use received device state data through
> >>>> + *    migration region to resume device.
> >>>> + *    On failure during this state transition, application should set _RUNNING
> >>>> + *    state.  
> >>>
> >>> Same comment regarding setting next state after failure.  
> >>
> >> If device couldn't be transitioned to _RESUMING, then it should be set
> >> to default state, that is _RUNNING.
> >>  
> >>>      
> >>>> + * 5. On providing saved device data to driver, application should change state
> >>>> + *    from _RESUMING to _RUNNING.
> >>>> + *    On failure to transition to _RUNNING state, VFIO application should reset
> >>>> + *    the device and set _RUNNING state so that device doesn't remain in unknown
> >>>> + *    or bad state. On reset, driver must reset device and device should be
> >>>> + *    available in default usable state.  
> >>>
> >>> Didn't we discuss that the reset ioctl should return the device to the
> >>> initial state, including the transition to _RUNNING?  
> >>
> >> Yes, that's default usable state, rewording it to initial state.
> >>  
> >>>   Also, as above,
> >>> it's the user write that triggers the failure, this register is listed
> >>> as read-write, so what value does the vendor driver report for the
> >>> state when read after a transition failure?  Is it reported as _RESUMING
> >>> as it was prior to the attempted transition, or may the invalid states
> >>> be used by the vendor driver to indicate the device is broken?
> >>>      
> >>
> >> If the transition has failed, the device should report its previous state,
> >> and resetting the device should bring it back to the usable _RUNNING state.  
> > 
> > If device_state reports previous state then user should reasonably
> > infer that the device is already in that sate without a need for them
> > to set it, IMO.  
> 
> But if there is any error in read()/write(), then the user should decide 
> which state the device should be put in next, which could be different 
> from the previous state.

That's a different answer than telling the user the next state should
be _RUNNING.

> >>>> + *
> >>>> + * pending bytes: (read only)
> >>>> + *      Number of pending bytes yet to be migrated from vendor driver
> >>>> + *
> >>>> + * data_offset: (read only)
> >>>> + *      User application should read data_offset in migration region from where
> >>>> + *      user application should read device data during _SAVING state or write
> >>>> + *      device data during _RESUMING state. See below for detail of sequence to
> >>>> + *      be followed.
> >>>> + *
> >>>> + * data_size: (read/write)
> >>>> + *      User application should read data_size to get size of data copied in
> >>>> + *      bytes in migration region during _SAVING state and write size of data
> >>>> + *      copied in bytes in migration region during _RESUMING state.
> >>>> + *
> >>>> + * Migration region looks like:
> >>>> + *  ------------------------------------------------------------------
> >>>> + * |vfio_device_migration_info|    data section                      |
> >>>> + * |                          |     ///////////////////////////////  |
> >>>> + * ------------------------------------------------------------------
> >>>> + *   ^                              ^
> >>>> + *  offset 0-trapped part        data_offset
> >>>> + *
> >>>> + * Structure vfio_device_migration_info is always followed by data section in
> >>>> + * the region, so data_offset will always be non-0. Offset from where data is
> >>>> + * copied is decided by kernel driver, data section can be trapped or mapped
> >>>> + * or partitioned, depending on how kernel driver defines data section.
> >>>> + * Data section partition can be defined as mapped by sparse mmap capability.
> >>>> + * If mmapped, then data_offset should be page aligned, where as initial section
> >>>> + * which contain vfio_device_migration_info structure might not end at offset
> >>>> + * which is page aligned. The user is not required to access via mmap regardless
> >>>> + * of the region mmap capabilities.
> >>>> + * Vendor driver should decide whether to partition data section and how to
> >>>> + * partition the data section. Vendor driver should return data_offset
> >>>> + * accordingly.
> >>>> + *
> >>>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> >>>> + * and for _SAVING device state or stop-and-copy phase:
> >>>> + * a. read pending_bytes, indicates start of new iteration to get device data.
> >>>> + *    If there was previous iteration, then this read operation indicates
> >>>> + *    previous iteration is done. If pending_bytes > 0, go through below steps.
> >>>> + * b. read data_offset, indicates kernel driver to make data available through
> >>>> + *    data section. Kernel driver should return this read operation only after
> >>>> + *    data is available from (region + data_offset) to (region + data_offset +
> >>>> + *    data_size).
> >>>> + * c. read data_size, amount of data in bytes available through migration
> >>>> + *    region.
> >>>> + * d. read data of data_size bytes from (region + data_offset) from migration
> >>>> + *    region.
> >>>> + * e. process data.
> >>>> + * f. Loop through a to e.  
> >>>
> >>> It seems we always need to end an iteration by reading pending_bytes to
> >>> signal to the vendor driver to release resources, so should the end of
> >>> the loop be:
> >>>
> >>> e. Read pending_bytes
> >>> f. Goto b. or optionally restart next iteration at a.
> >>>
> >>> I think this is defined such that reading data_offset commits resources
> >>> and reading pending_bytes frees them, allowing userspace to restart at
> >>> reading pending_bytes with no side-effects.  Therefore reading
> >>> pending_bytes repeatedly is supported.  Is the same true for
> >>> data_offset and data_size?  It seems reasonable that the vendor driver
> >>> can simply return offset and size for the current buffer if the user
> >>> reads these more than once.
> >>>     
> >>
> >> Right.  
> > 
> > Can we add that to the spec?
> >   
> 
> ok.
> 
> >>> How is a protocol or device error signaled?  For example, we can have a
> >>> user error where they read data_size before data_offset.  Should the
> >>> vendor driver generate a fault reading data_size in this case.  We can
> >>> also have internal errors in the vendor driver, should the vendor
> >>> driver use a special errno or update device_state autonomously to
> >>> indicate such an error?  
> >>
> >> If there is any error during the sequence, vendor driver can return
> >> error code for next read/write operation, that will terminate the loop
> >> and migration would fail.  
> > 
> > Please add to spec.
> >   
> 
> Ok
> 
> >>> I believe it's also part of the intended protocol that the user can
> >>> transition from _SAVING|_RUNNING to _SAVING at any point, regardless of
> >>> pending_bytes.  This should be noted.
> >>>      
> >>
> >> Ok. Updating comment.
> >>  
> >>>> + *
> >>>> + * Sequence to be followed while _RESUMING device state:
> >>>> + * While data for this device is available, repeat below steps:
> >>>> + * a. read data_offset from where user application should write data.
> >>>> + * b. write data of data_size to migration region from data_offset.  
> >>>
> >>> Whose's data_size, the _SAVING end or the _RESUMING end?  I think this
> >>> is intended to be the transaction size from the _SAVING source,  
> >>
> >> Not necessarily. data_size could be MIN(transaction size of source,
> >> migration data section). If migration data section is smaller than data
> >> packet size at source, then it has to be broken and iteratively sent.  
> > 
> > So you're saying that a transaction from the source is divisible by the
> > user under certain conditions.  What other conditions exist?  
> 
> I don't think there are any other conditions than above.
> 
> >  Can the
> > user decide arbitrary sizes less than the MIN() stated above?  This
> > needs to be specified.
> >  
> 
> No, the user can't decide arbitrary sizes.

TBH, I'd expect a vendor driver that offers a different migration
region size, such that it becomes the user's responsibility to split
transactions, to simply claim it's not compatible with the source, as
determined by the previously defined compatibility protocol.  If we
really need this requirement, it needs to be justified, and the exact
conditions under which the user performs this splitting need to be
specified.

> >>> but it
> >>> could easily be misinterpreted as reading data_size on the _RESUMING
> >>> end.
> >>>      
> >>>> + * c. write data_size which indicates vendor driver that data is written in
> >>>> + *    staging buffer. Vendor driver should read this data from migration
> >>>> + *    region and resume device's state.  
> >>>
> >>> I think we also need to define the error protocol.  The user could
> >>> mis-order transactions or there could be an internal error in the
> >>> vendor driver or device.  Are all read(2)/write(2) operations
> >>> susceptible to defined errnos to signal this?  
> >>
> >> Yes.  
> > 
> > And those defined errnos are specified...
> >   
> 
> Those could be standard errors like -EINVAL, -ENOMEM, etc.

I thought we might specify specific errors to consistently indicate
non-continuable faults among vendor drivers.  Is anything other than
-EAGAIN considered non-fatal to the operation?  For example, could
EEXIST indicate duplicate data that the user has already written but
not be considered a fatal error?  Would EFAULT perhaps indicate a
continuable ordering error?  If any fault indicates the save/resume has
failed, shouldn't the user be able to see the device is in such a state
by reading device_state (except we have no state defined to indicate
that)?
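
To make the question concrete, a user-side sketch assuming the simplest
possible convention, which this thread has *not* settled on: only -EAGAIN
is retryable, anything else aborts the operation and the user then
inspects device_state (once a state exists that can express a fault).

#include <errno.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/types.h>

/* Write one data_size value and classify the errno.  Which errnos are
 * retryable vs. fatal is exactly what's unspecified; this assumes only
 * -EAGAIN is retryable and anything else aborts the operation. */
static int write_data_size(int device_fd, off_t data_size_offset,
			   uint64_t size)
{
	for (;;) {
		ssize_t ret = pwrite(device_fd, &size, sizeof(size),
				     data_size_offset);
		if (ret == (ssize_t)sizeof(size))
			return 0;
		if (ret < 0 && errno == EAGAIN)
			continue;	/* assumed retryable */
		return -1;		/* assumed fatal: abort, check device_state */
	}
}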
 
> >>>   Is it reflected in
> >>> device_state?  
> >>
> >> No.  
> > 
> > So a user should do what, just keep trying?
> >  
> 
> No, fail the migration process. If the error is at the source or the 
> destination, then the user can decide to either resume at the source or 
> terminate the application.

This is describing the expected QEMU protocol resolution, the question
is relative to the vfio API we're defining here.  If any fault in the
save/resume protocol results in the device being unusable, there should
be an indication (perhaps through device_state) that the device is in a
broken state, and the mechanism to put it into a new state should be
defined.  For instance, if the device is resuming, a fault occurs
writing state data to the device, and the user transitions to running.
Is the device incorporating the partial state data into its run state?
I suspect not, and wouldn't that be more obvious if we defined a
protocol where the device can be inspected to be in a bogus state via
reading device_state, at which point we might define performing a
device reset as the only mechanism to change the device_state after
that point.

> >>> What's the recovery protocol?
> >>>      
> >>
> >> On read()/write() failure user should take necessary action.  
> > 
> > Where is that necessary action defined?  Can they just try again?  Do
> > they transition in and out of _RESUMING to try again?  Do they need to
> > reset the device?
> >   
> 
> User application should decide what action to take on failure, right?
>   "vfio is not prescribing the migration semantics to userspace, it's
> presenting an interface that support the user semantics."

Exactly, we're not defining how QEMU handles a fault in this spec,
we're defining how a user interacting with the device knows a fault has
occurred, can inspect the device to determine that the device is in a
broken state, and the "necessary action" to advance the device forward
to a new state.

> >>>> + *
> >>>> + * For user application, data is opaque. User should write data in the same
> >>>> + * order as received.  
> >>>
> >>> Order and transaction size, ie. each data_size chunk is indivisible by
> >>> the user.  
> >>
> >> Transaction size can differ, but order should remain same.  
> > 
> > Under what circumstances and to what extent can transaction size
> > differ?  
> 
> It depends on the migration region size.
> 
> >  Is the MIN() algorithm above the absolute lower bound or just
> > a suggestion?  
> 
> 
> 
> >  Is the user allowed to concatenate transactions from the
> > source together on the target if the region is sufficiently large?  
> 
> Yes, that can be done, because the data is just a byte stream for the 
> user. The vendor driver receives the byte stream and knows how to decode it.

But that byte stream is opaque to the user, the vendor driver might
implement it such that every transaction has a header and splitting the
transaction might mean that the truncated transaction no longer fits
the expected size.  If we're lucky, the vendor driver's implementation
might detect that.  If we're not, the vendor driver might misinterpret
the next packet.  I think if the user is to consider all data as
opaque, they must also consider every transaction as indivisible or
else we're assuming something about the contents of that transaction.
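
To make that concrete, a purely hypothetical example of vendor framing;
nothing in the proposed uapi defines this, which is exactly why the user
cannot safely split a transaction.

#include <stdint.h>

/* Hypothetical vendor framing -- invisible to the user by definition. */
struct vendor_pkt_hdr {
	uint32_t magic;		/* sanity check                            */
	uint32_t type;		/* e.g. config vs. memory vs. device core  */
	uint64_t payload_len;	/* bytes of payload following the header   */
};

A chunk boundary that lands mid-payload leaves the driver holding a header
whose payload_len no longer matches what arrives next, unless the driver
explicitly reassembles partial packets.
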

> >  It
> > seems like quite an imposition on the vendor driver to support this
> > flexibility.
> >   
> >>>> + */
> >>>> +
> >>>> +struct vfio_device_migration_info {
> >>>> +	__u32 device_state;         /* VFIO device state */
> >>>> +#define VFIO_DEVICE_STATE_STOP      (1 << 0)
> >>>> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)  
> >>>
> >>> Huh?  We should probably just refer to it consistently, ie. _RUNNING
> >>> and !_RUNNING, otherwise we have the incongruity that setting the _STOP
> >>> value is actually the opposite of the necessary logic value (_STOP = 1
> >>> is _RUNNING, _STOP = 0 is !_RUNNING).  
> >>
> >> Ops, my mistake, forgot to update to
> >> #define VFIO_DEVICE_STATE_STOP      (0)
> >>  
> >>>      
> >>>> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> >>>> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> >>>> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> >>>> +				     VFIO_DEVICE_STATE_SAVING |  \
> >>>> +				     VFIO_DEVICE_STATE_RESUMING)
> >>>> +
> >>>> +#define VFIO_DEVICE_STATE_INVALID_CASE1    (VFIO_DEVICE_STATE_SAVING | \
> >>>> +					    VFIO_DEVICE_STATE_RESUMING)
> >>>> +
> >>>> +#define VFIO_DEVICE_STATE_INVALID_CASE2    (VFIO_DEVICE_STATE_RUNNING | \
> >>>> +					    VFIO_DEVICE_STATE_RESUMING)  
> >>>
> >>> Gack, we fixed these in the last iteration!
> >>>      
> >>
> >> That solution doesn't scale when new flags will be added. I still prefer
> >> to define as above.  
> > 
> > I see, the argument was buried in a reply to Yan, sorry if I missed it:
> >   
> >>>> These seem difficult to use, maybe we just need a
> >>>> VFIO_DEVICE_STATE_VALID macro?
> >>>>
> >>>> #define VFIO_DEVICE_STATE_VALID(state) \
> >>>>     (state & VFIO_DEVICE_STATE_RESUMING ? \
> >>>>     (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
> >>>>     
> >>
> >> This will not work when use of other bits gets added in the future.  
> >> That's the reason I preferred to add individual invalid states which
> >> user should check.  
> > 
> > I would argue that what doesn't scale is having numerous CASE1, CASE2,
> > CASEn conditions elsewhere in the kernel rather than have a unified,
> > single macro that defines a valid state.  How do you worry this will be
> > a problem when new flags are added, can't we just update the macro?  
> 
> Adding the macro you suggested above. Let's figure out how to solve the 
> problem with new flags when new flags get added.

Agree, I think it's sufficient to assume we'll update the macro at that
time.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 4/5] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2019-12-18 21:39       ` Alex Williamson
@ 2019-12-19 18:42         ` Kirti Wankhede
  2019-12-19 18:56           ` Alex Williamson
  0 siblings, 1 reply; 44+ messages in thread
From: Kirti Wankhede @ 2019-12-19 18:42 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yan Zhao, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng, Liu,
	Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm



On 12/19/2019 3:09 AM, Alex Williamson wrote:
> On Tue, 17 Dec 2019 14:54:14 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 12/17/2019 10:45 AM, Yan Zhao wrote:
>>> On Tue, Dec 17, 2019 at 04:21:39AM +0800, Kirti Wankhede wrote:
>>>> +		} else if (range.flags &
>>>> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
>>>> +			uint64_t iommu_pgmask;
>>>> +			unsigned long pgshift = __ffs(range.pgsize);
>>>> +			unsigned long *bitmap;
>>>> +			long bsize;
>>>> +
>>>> +			iommu_pgmask =
>>>> +			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
>>>> +
>>>> +			if (((range.pgsize - 1) & iommu_pgmask) !=
>>>> +			    (range.pgsize - 1))
>>>> +				return -EINVAL;
>>>> +
>>>> +			if (range.iova & iommu_pgmask)
>>>> +				return -EINVAL;
>>>> +			if (!range.size || range.size > SIZE_MAX)
>>>> +				return -EINVAL;
>>>> +			if (range.iova + range.size < range.iova)
>>>> +				return -EINVAL;
>>>> +
>>>> +			bsize = verify_bitmap_size(range.size >> pgshift,
>>>> +						   range.bitmap_size);
>>>> +			if (bsize)
>>>> +				return ret;
>>>> +
>>>> +			bitmap = kmalloc(bsize, GFP_KERNEL);
>>>> +			if (!bitmap)
>>>> +				return -ENOMEM;
>>>> +
>>>> +			ret = copy_from_user(bitmap,
>>>> +			     (void __user *)range.bitmap, bsize) ? -EFAULT : 0;
>>>> +			if (ret)
>>>> +				goto bitmap_exit;
>>>> +
>>>> +			iommu->dirty_page_tracking = false;
>>> why iommu->dirty_page_tracking is false here?
>>> suppose this ioctl can be called several times.
>>>    
>>
>> This ioctl can be called several times, but once this ioctl is called
>> that means vCPUs are stopped and VFIO devices are stopped (i.e. in
>> stop-and-copy phase) and dirty pages bitmap are being queried by user.
> 
> Do not assume how userspace works or its intent.  If dirty tracking is
> on, it should remain on until the user turns it off.  We cannot assume
> userspace uses a one-shot approach.  Thanks,
> 

Dirty tracking should be on until the user turns it off or until the user 
reads the bitmap, right? This ioctl is used to read the bitmap.


Thanks,
Kirti




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 4/5] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2019-12-19 18:42         ` Kirti Wankhede
@ 2019-12-19 18:56           ` Alex Williamson
  0 siblings, 0 replies; 44+ messages in thread
From: Alex Williamson @ 2019-12-19 18:56 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Yan Zhao, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng, Liu,
	Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

On Fri, 20 Dec 2019 00:12:30 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 12/19/2019 3:09 AM, Alex Williamson wrote:
> > On Tue, 17 Dec 2019 14:54:14 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 12/17/2019 10:45 AM, Yan Zhao wrote:  
> >>> On Tue, Dec 17, 2019 at 04:21:39AM +0800, Kirti Wankhede wrote:  
> >>>> +		} else if (range.flags &
> >>>> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
> >>>> +			uint64_t iommu_pgmask;
> >>>> +			unsigned long pgshift = __ffs(range.pgsize);
> >>>> +			unsigned long *bitmap;
> >>>> +			long bsize;
> >>>> +
> >>>> +			iommu_pgmask =
> >>>> +			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> >>>> +
> >>>> +			if (((range.pgsize - 1) & iommu_pgmask) !=
> >>>> +			    (range.pgsize - 1))
> >>>> +				return -EINVAL;
> >>>> +
> >>>> +			if (range.iova & iommu_pgmask)
> >>>> +				return -EINVAL;
> >>>> +			if (!range.size || range.size > SIZE_MAX)
> >>>> +				return -EINVAL;
> >>>> +			if (range.iova + range.size < range.iova)
> >>>> +				return -EINVAL;
> >>>> +
> >>>> +			bsize = verify_bitmap_size(range.size >> pgshift,
> >>>> +						   range.bitmap_size);
> >>>> +			if (bsize)
> >>>> +				return ret;
> >>>> +
> >>>> +			bitmap = kmalloc(bsize, GFP_KERNEL);
> >>>> +			if (!bitmap)
> >>>> +				return -ENOMEM;
> >>>> +
> >>>> +			ret = copy_from_user(bitmap,
> >>>> +			     (void __user *)range.bitmap, bsize) ? -EFAULT : 0;
> >>>> +			if (ret)
> >>>> +				goto bitmap_exit;
> >>>> +
> >>>> +			iommu->dirty_page_tracking = false;  
> >>> why iommu->dirty_page_tracking is false here?
> >>> suppose this ioctl can be called several times.
> >>>      
> >>
> >> This ioctl can be called several times, but once this ioctl is called
> >> that means vCPUs are stopped and VFIO devices are stopped (i.e. in
> >> stop-and-copy phase) and dirty pages bitmap are being queried by user.  
> > 
> > Do not assume how userspace works or its intent.  If dirty tracking is
> > on, it should remain on until the user turns it off.  We cannot assume
> > userspace uses a one-shot approach.  Thanks,
> >   
> 
> Dirty tracking should be on until the user turns it off or until the user 
> reads the bitmap, right? This ioctl is used to read the bitmap.

No, dirty bitmap tracking is on until the user turns it off, period.
Retrieving the bitmap is probably only looking at a portion of the
container address space at a time, anything else would place
impractical requirements on the user allocated bitmap.  We also need to
support a usage model where the user is making successive calls, where
each should report pages dirtied since the previous call.  If the user
is required to re-enable tracking, there's an irreconcilable gap
between the call to retrieve the dirty bitmap and their opportunity to
re-enable dirty tracking.  It's fundamentally broken to automatically
disable tracking on read.  Thanks,

Alex
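
For reference, a user-side sketch of that usage model: start tracking
once, call GET_BITMAP as often as needed over whatever window of the
container address space fits the user's bitmap, and stop tracking
explicitly.  The field names follow the patch hunk quoted above; the
exact structure layout, flag values and ioctl number come from the
patched <linux/vfio.h> and are still under review.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>		/* as patched by this series */

static int dirty_tracking_set(int container_fd, bool start)
{
	struct vfio_iommu_type1_dirty_bitmap db;

	memset(&db, 0, sizeof(db));
	db.argsz = sizeof(db);
	db.flags = start ? VFIO_IOMMU_DIRTY_PAGES_FLAG_START :
			   VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;

	return ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, &db);
}

/* Query dirty pages for [iova, iova + size) into 'bitmap', one bit per
 * 'pgsize' page.  Tracking stays enabled, so this can be called as many
 * times as needed over different windows. */
static int dirty_bitmap_get(int container_fd, uint64_t iova, uint64_t size,
			    uint64_t pgsize, void *bitmap,
			    uint64_t bitmap_size)
{
	struct vfio_iommu_type1_dirty_bitmap db;

	memset(&db, 0, sizeof(db));
	db.argsz = sizeof(db);
	db.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
	db.iova = iova;
	db.size = size;
	db.pgsize = pgsize;
	db.bitmap_size = bitmap_size;
	db.bitmap = (uintptr_t)bitmap;

	return ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, &db);
}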


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
  2019-12-19 17:27           ` Alex Williamson
@ 2019-12-19 20:10             ` Kirti Wankhede
  2019-12-19 21:09               ` Alex Williamson
  0 siblings, 1 reply; 44+ messages in thread
From: Kirti Wankhede @ 2019-12-19 20:10 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm



On 12/19/2019 10:57 PM, Alex Williamson wrote:

<Snip>


>>>>>> + * 2. When VFIO application save state or suspend application, VFIO device
>>>>>> + *    state transition is from _RUNNING to stop-and-copy state and then to
>>>>>> + *    _STOP.
>>>>>> + *    On state transition from _RUNNING to stop-and-copy, driver must
>>>>>> + *    stop device, save device state and send it to application through
>>>>>> + *    migration region.
>>>>>> + *    On _RUNNING to stop-and-copy state transition failure, application should
>>>>>> + *    set VFIO device state to _RUNNING.
>>>>>
>>>>> A state transition failure means that the user's write to device_state
>>>>> failed, so is it the user's responsibility to set the next state?
>>>>
>>>> Right.
>>>
>>> If a transition failure occurs, ie. errno from write(2), what value is
>>> reported by a read(2) of device_state in the interim between the failure
>>> and a next state written by the user?
>>
>> Since state transition has failed, driver should return previous state.
>>
>>> If this is a valid state,
>>> wouldn't it be reasonable for the user to assume the device is already
>>> operating in that state?
> 
> This ^^^
> 
>>   If it's an invalid state, do we need to
>>> define the use cases for those invalid states?  If the user needs to
>>> set the state back to _RUNNING, that suggests the device might be
>>> stopped, which has implications beyond the migration state.
>>>    
>>
>> Not necessarily stopped. For example, during live migration:
>>
>> *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
>> *                (100b)     (001b)     (011b)        (010b)       (000b)
>> *
>> * 3. Save state during live migration
>> *                             |----------->|------------>|---------->|
>>
>> on any state transition failure, user should set _RUNNING state.
>> pre-copy (011b) -> stop-and-copy(010b)  =====> _SAVING flag is cleared
>> and device returned back to _RUNNING.
>> Stop-and-copy(010b) -> STOP (000b) ====> device is already stopped.
> 
> IMO, the user may modify the state, but the vendor driver should report
> the current state of the device via device_state, which the user may
> read after a transition error occurs.  The spec lacks a provision for
> indicating the device is in a non-functional state.
>

Are you proposing to add a bit in device_state to report an error?

#define VFIO_DEVICE_STATE_ERROR  (1 << 3)

which can be set by the vendor driver, and when it's set, whether the 
other bits are set or clear doesn't matter.
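
If something like that proposed bit were added (it is not part of this
series), the user-side check after a failed transition might look like
the sketch below; VFIO_DEVICE_RESET is the existing ioctl, the rest is
assumption.

#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <linux/vfio.h>

/* Proposed in this thread only, not (yet) part of the uapi header. */
#define VFIO_DEVICE_STATE_ERROR		(1 << 3)

/* Read device_state after a failed transition; if the proposed error
 * bit is set, the only way forward is a device reset. */
static int check_device_state(int device_fd, off_t device_state_offset)
{
	uint32_t state;

	if (pread(device_fd, &state, sizeof(state), device_state_offset) !=
	    (ssize_t)sizeof(state))
		return -1;

	if (state & VFIO_DEVICE_STATE_ERROR)
		return ioctl(device_fd, VFIO_DEVICE_RESET);

	return 0;
}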


>>>>>    Why
>>>>> is it necessarily _RUNNING vs _STOP?
>>>>>      
>>>>
>>>> While changing From pre-copy to stop-and-copy transition, device is
>>>> still running, only saving of device state started. Now if transition to
>>>> stop-and-copy fails, from user point of view application or VM is still
>>>> running, device state should be set to _RUNNING so that whatever the
>>>> application/VM is running should continue at source.
>>>
>>> Seems it's the users discretion whether to consider this continuable or
>>> fatal, the vfio interface specification should support a given usage
>>> model, not prescribe it.
>>>    
>>
>> Updating comment.
>>
>>>>>> + * 3. In VFIO application live migration, state transition is from _RUNNING
>>>>>> + *    to pre-copy to stop-and-copy to _STOP.
>>>>>> + *    On state transition from _RUNNING to pre-copy, driver should start
>>>>>> + *    gathering device state while application is still running and send device
>>>>>> + *    state data to application through migration region.
>>>>>> + *    On state transition from pre-copy to stop-and-copy, driver must stop
>>>>>> + *    device, save device state and send it to application through migration
>>>>>> + *    region.
>>>>>> + *    On any failure during any of these state transition, VFIO device state
>>>>>> + *    should be set to _RUNNING.
>>>>>
>>>>> Same comment as above regarding next state on failure.
>>>>>       
>>>>
>>>> If application or VM migration fails, it should continue to run at
>>>> source. In case of VM, guest user isn't aware of migration, and from his
>>>> point VM should be running.
>>>
>>> vfio is not prescribing the migration semantics to userspace, it's
>>> presenting an interface that support the user semantics.  Therefore,
>>> while it's useful to understand the expected usage model, I think we
>>> also need a mechanism that the user can always determine the
>>> device_state after a fault
>>
>> If state transition fails, device is in previous state and driver should
>> return previous state
> 
> Then why is it stated that the user needs to set the _RUNNING state?
> It's the user's choice.  But I do think we're lacking a state to
> indicate an internal fault.
>


If the device state is at pre-copy (011b) and the transition, i.e. the 
write of stop-and-copy (010b) to device_state, failed, then by previous 
state I meant the device should report pre-copy (011b), i.e. the previous 
state which was successfully set, or, as you said, the current state 
which was successfully set.


>>> and allowable state transitions independent
>>> of the expected usage model.
>>
>> Do you mean to define array of ['from','to'], same as runstate
>> transition array in QEMU?
>>    static const RunStateTransition runstate_transitions_def[]
> 
> I'm thinking that independent of expected QEMU usage models, are there
> any invalid transitions or is every state reachable from every other
> state.  I'm afraid this design is so focused on a specific usage model
> that vendor drivers are going to fall over if the user invokes a
> transition outside of those listed above.  If there are invalid
> transitions, those should be listed so they can be handled
> consistently.  If there are no invalid transitions, it should be noted
> in the spec to encourage vendor drivers to expect this.
> 

I think the vendor driver can decide which state transitions it supports, 
rather than defining/prescribing them all.
For example, if a vendor driver doesn't want to support save-restore 
functionality, it can return -EINVAL for a write() operation on 
device_state that transitions from _RUNNING to the stop-and-copy (010b) 
state.

>>> For example, I think a user should always
>>> be allowed to transition a device to stopped regardless of the expected
>>> migration flow.  An error might have occurred elsewhere and we want to
>>> stop everything for debugging.  I think it's also allowable to switch
>>> directly from running to stop-and-copy, for example to save and resume
>>> a VM offline.
>>>      
>>>>> Also, it seems like it's the vendor driver's discretion to actually
>>>>> provide data during the pre-copy phase.  As we've defined it, the
>>>>> vendor driver needs to participate in the migration region regardless,
>>>>> they might just always report no pending_bytes until we enter
>>>>> stop-and-copy.
>>>>>       
>>>>
>>>> Yes. And if pending_bytes are reported as 0 in pre-copy by vendor driver
>>>> then QEMU doesn't reiterate for that device.
>>>
>>> Maybe we can state that as the expected mechanism to avoid a vendor
>>> driver trying to invent alternative means, ex. failing transition to
>>> pre-copy, requesting new flags, etc.
>>>    
>>
>> Isn't Sequence to be followed below sufficient to state that?
> 
> I think we understand it because we've been discussing it so long, but
> without that background it could be subtle.
>   
>>>>>> + * 4. To start resuming phase, VFIO device state should be transitioned from
>>>>>> + *    _RUNNING to _RESUMING state.
>>>>>> + *    In _RESUMING state, driver should use received device state data through
>>>>>> + *    migration region to resume device.
>>>>>> + *    On failure during this state transition, application should set _RUNNING
>>>>>> + *    state.
>>>>>
>>>>> Same comment regarding setting next state after failure.
>>>>
>>>> If device couldn't be transitioned to _RESUMING, then it should be set
>>>> to default state, that is _RUNNING.
>>>>   
>>>>>       
>>>>>> + * 5. On providing saved device data to driver, appliation should change state
>>>>>> + *    from _RESUMING to _RUNNING.
>>>>>> + *    On failure to transition to _RUNNING state, VFIO application should reset
>>>>>> + *    the device and set _RUNNING state so that device doesn't remain in unknown
>>>>>> + *    or bad state. On reset, driver must reset device and device should be
>>>>>> + *    available in default usable state.
>>>>>
>>>>> Didn't we discuss that the reset ioctl should return the device to the
>>>>> initial state, including the transition to _RUNNING?
>>>>
>>>> Yes, that's default usable state, rewording it to initial state.
>>>>   
>>>>>    Also, as above,
>>>>> it's the user write that triggers the failure, this register is listed
>>>>> as read-write, so what value does the vendor driver report for the
>>>>> state when read after a transition failure?  Is it reported as _RESUMING
>>>>> as it was prior to the attempted transition, or may the invalid states
>>>>> be used by the vendor driver to indicate the device is broken?
>>>>>       
>>>>
>>>> If transition as failed, device should report its previous state and
>>>> reset device should bring back to usable _RUNNING state.
>>>
>>> If device_state reports previous state then user should reasonably
>>> infer that the device is already in that sate without a need for them
>>> to set it, IMO.
>>
>> But if there is any error in read()/write() then user should device
>> which next state device should be put in, which would be different that
>> previous state.
> 
> That's a different answer than telling the user the next state should
> be _RUNNING.
> 
>>>>>> + *
>>>>>> + * pending bytes: (read only)
>>>>>> + *      Number of pending bytes yet to be migrated from vendor driver
>>>>>> + *
>>>>>> + * data_offset: (read only)
>>>>>> + *      User application should read data_offset in migration region from where
>>>>>> + *      user application should read device data during _SAVING state or write
>>>>>> + *      device data during _RESUMING state. See below for detail of sequence to
>>>>>> + *      be followed.
>>>>>> + *
>>>>>> + * data_size: (read/write)
>>>>>> + *      User application should read data_size to get size of data copied in
>>>>>> + *      bytes in migration region during _SAVING state and write size of data
>>>>>> + *      copied in bytes in migration region during _RESUMING state.
>>>>>> + *
>>>>>> + * Migration region looks like:
>>>>>> + *  ------------------------------------------------------------------
>>>>>> + * |vfio_device_migration_info|    data section                      |
>>>>>> + * |                          |     ///////////////////////////////  |
>>>>>> + * ------------------------------------------------------------------
>>>>>> + *   ^                              ^
>>>>>> + *  offset 0-trapped part        data_offset
>>>>>> + *
>>>>>> + * Structure vfio_device_migration_info is always followed by data section in
>>>>>> + * the region, so data_offset will always be non-0. Offset from where data is
>>>>>> + * copied is decided by kernel driver, data section can be trapped or mapped
>>>>>> + * or partitioned, depending on how kernel driver defines data section.
>>>>>> + * Data section partition can be defined as mapped by sparse mmap capability.
>>>>>> + * If mmapped, then data_offset should be page aligned, where as initial section
>>>>>> + * which contain vfio_device_migration_info structure might not end at offset
>>>>>> + * which is page aligned. The user is not required to access via mmap regardless
>>>>>> + * of the region mmap capabilities.
>>>>>> + * Vendor driver should decide whether to partition data section and how to
>>>>>> + * partition the data section. Vendor driver should return data_offset
>>>>>> + * accordingly.
>>>>>> + *
>>>>>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
>>>>>> + * and for _SAVING device state or stop-and-copy phase:
>>>>>> + * a. read pending_bytes, indicates start of new iteration to get device data.
>>>>>> + *    If there was previous iteration, then this read operation indicates
>>>>>> + *    previous iteration is done. If pending_bytes > 0, go through below steps.
>>>>>> + * b. read data_offset, indicates kernel driver to make data available through
>>>>>> + *    data section. Kernel driver should return this read operation only after
>>>>>> + *    data is available from (region + data_offset) to (region + data_offset +
>>>>>> + *    data_size).
>>>>>> + * c. read data_size, amount of data in bytes available through migration
>>>>>> + *    region.
>>>>>> + * d. read data of data_size bytes from (region + data_offset) from migration
>>>>>> + *    region.
>>>>>> + * e. process data.
>>>>>> + * f. Loop through a to e.
>>>>>
>>>>> It seems we always need to end an iteration by reading pending_bytes to
>>>>> signal to the vendor driver to release resources, so should the end of
>>>>> the loop be:
>>>>>
>>>>> e. Read pending_bytes
>>>>> f. Goto b. or optionally restart next iteration at a.
>>>>>
>>>>> I think this is defined such that reading data_offset commits resources
>>>>> and reading pending_bytes frees them, allowing userspace to restart at
>>>>> reading pending_bytes with no side-effects.  Therefore reading
>>>>> pending_bytes repeatedly is supported.  Is the same true for
>>>>> data_offset and data_size?  It seems reasonable that the vendor driver
>>>>> can simply return offset and size for the current buffer if the user
>>>>> reads these more than once.
>>>>>      
>>>>
>>>> Right.
>>>
>>> Can we add that to the spec?
>>>    
>>
>> ok.
>>
>>>>> How is a protocol or device error signaled?  For example, we can have a
>>>>> user error where they read data_size before data_offset.  Should the
>>>>> vendor driver generate a fault reading data_size in this case.  We can
>>>>> also have internal errors in the vendor driver, should the vendor
>>>>> driver use a special errno or update device_state autonomously to
>>>>> indicate such an error?
>>>>
>>>> If there is any error during the sequence, vendor driver can return
>>>> error code for next read/write operation, that will terminate the loop
>>>> and migration would fail.
>>>
>>> Please add to spec.
>>>    
>>
>> Ok
>>
>>>>> I believe it's also part of the intended protocol that the user can
>>>>> transition from _SAVING|_RUNNING to _SAVING at any point, regardless of
>>>>> pending_bytes.  This should be noted.
>>>>>       
>>>>
>>>> Ok. Updating comment.
>>>>   
>>>>>> + *
>>>>>> + * Sequence to be followed while _RESUMING device state:
>>>>>> + * While data for this device is available, repeat below steps:
>>>>>> + * a. read data_offset from where user application should write data.
>>>>>> + * b. write data of data_size to migration region from data_offset.
>>>>>
>>>>> Whose's data_size, the _SAVING end or the _RESUMING end?  I think this
>>>>> is intended to be the transaction size from the _SAVING source,
>>>>
>>>> Not necessarily. data_size could be MIN(transaction size of source,
>>>> migration data section). If migration data section is smaller than data
>>>> packet size at source, then it has to be broken and iteratively sent.
>>>
>>> So you're saying that a transaction from the source is divisible by the
>>> user under certain conditions.  What other conditions exist?
>>
>> I don't think there are any other conditions than above.
>>
>>>   Can the
>>> user decide arbitrary sizes less than the MIN() stated above?  This
>>> needs to be specified.
>>>   
>>
>> No, User can't decide arbitrary sizes.
> 
> TBH, I'd expect a vendor driver that offers a different migration
> region size, such that it becomes the user's responsibility to split
> transactions should just claim it's not compatible with the source, as
> determined by the previously defined compatibility protocol.  If we
> really need this requirement, it needs to be justified and the exact
> conditions under which the user performs this needs to be specified.
> 

Let the user decide whether it wants to support different migration region 
sizes at source and destination, rather than making it a hard requirement.

>>>>> but it
>>>>> could easily be misinterpreted as reading data_size on the _RESUMING
>>>>> end.
>>>>>       
>>>>>> + * c. write data_size which indicates vendor driver that data is written in
>>>>>> + *    staging buffer. Vendor driver should read this data from migration
>>>>>> + *    region and resume device's state.
>>>>>
>>>>> I think we also need to define the error protocol.  The user could
>>>>> mis-order transactions or there could be an internal error in the
>>>>> vendor driver or device.  Are all read(2)/write(2) operations
>>>>> susceptible to defined errnos to signal this?
>>>>
>>>> Yes.
>>>
>>> And those defined errnos are specified...
>>>    
>>
>> Those could be standard errors like -EINVAL, ENOMEM....
> 
> I thought we might specify specific errors to consistently indicate
> non-continuable faults among vendor drivers.  Is anything other than
> -EAGAIN considered non-fatal to the operation?  For example, could
> EEXIST indicate duplicate data that the user has already written but
> not be considered a fatal error?  Would EFAULT perhaps indicate a
> continuable ordering error?  If any fault indicates the save/resume has
> failed, shouldn't the user be able to see the device is in such a state
> by reading device_state (except we have no state defined to indicate
> that)?
>   

Do we have to define what each standard error returned here would mean?

Right from the initial versions of the migration reviews we assumed that 
device_state would only be set by the user; the vendor driver returning 
an error state was never considered. Returning an error from a 
read()/write() operation indicates that the device is not able to handle 
that operation, so the user decides what action to take next.
Are you now proposing to add a state that the vendor driver can set, as 
described in my comment above?

>>>>>    Is it reflected in
>>>>> device_state?
>>>>
>>>> No.
>>>
>>> So a user should do what, just keep trying?
>>>   
>>
>> No, fail migration process. If error is at source or destination then
>> user can decide either resume at source or terminate application.
> 
> This is describing the expected QEMU protocol resolution, the question
> is relative to the vfio API we're defining here.  If any fault in the
> save/resume protocol results in the device being unusable, there should
> be an indication (perhaps through device_state) that the device is in a
> broken state, and the mechanism to put it into a new state should be
> defined.  For instance, if the device is resuming, a fault occurs
> writing state data to the device, and the user transitions to running.
> Is the device incorporating the partial state data into its run state?
> I suspect not, and wouldn't that be more obvious if we defined a
> protocol where the device can be inspected to be in a bogus state via
> reading device_state, at which point we might define performing a
> device reset as the only mechanism to change the device_state after
> that point.
>

Same as my comment above.

>>>>> What's the recovery protocol?
>>>>>       
>>>>
>>>> On read()/write() failure user should take necessary action.
>>>
>>> Where is that necessary action defined?  Can they just try again?  Do
>>> they transition in and out of _RESUMING to try again?  Do they need to
>>> reset the device?
>>>    
>>
>> User application should decide what action to take on failure, right?
>>    "vfio is not prescribing the migration semantics to userspace, it's
>> presenting an interface that support the user semantics."
> 
> Exactly, we're not defining how QEMU handles a fault in this spec,
> we're defining how a user interacting with the device knows a fault has
> occurred, can inspect the device to determine that the device is in a
> broken state, and the "necessary action" to advance the device forward
> to a new state.
> 
>>>>>> + *
>>>>>> + * For user application, data is opaque. User should write data in the same
>>>>>> + * order as received.
>>>>>
>>>>> Order and transaction size, ie. each data_size chunk is indivisible by
>>>>> the user.
>>>>
>>>> Transaction size can differ, but order should remain same.
>>>
>>> Under what circumstances and to what extent can transaction size
>>> differ?
>>
>> It depends in migration region size.
>>
>>>   Is the MIN() algorithm above the absolute lower bound or just
>>> a suggestion?
>>
>>
>>
>>>   Is the user allowed to concatenate transactions from the
>>> source together on the target if the region is sufficiently large?
>>
>> Yes that can be done, because data is just byte stream for user. Vendor
>> driver receives the byte stream and knows how to decode it.
> 
> But that byte stream is opaque to the user, the vendor driver might
> implement it such that every transaction has a header and splitting the
> transaction might mean that the truncated transaction no longer fits
> the expected size.  If we're lucky, the vendor driver's implementation
> might detect that.  If we're not, the vendor driver might misinterpret
> the next packet.  I think if the user is to consider all data as
> opaque, they must also consider every transaction as indivisible or
> else we're assuming something about the contents of that transaction.
> 

The user shouldn't make assumptions about the contents of transactions.
I think the vendor driver should treat the incoming data as a byte stream, 
and decoding packets should not depend on the migration region size.


Thanks,
Kirti

>>>   It
>>> seems like quite an imposition on the vendor driver to support this
>>> flexibility.
>>>    
>>>>>> + */
>>>>>> +
>>>>>> +struct vfio_device_migration_info {
>>>>>> +	__u32 device_state;         /* VFIO device state */
>>>>>> +#define VFIO_DEVICE_STATE_STOP      (1 << 0)
>>>>>> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
>>>>>
>>>>> Huh?  We should probably just refer to it consistently, ie. _RUNNING
>>>>> and !_RUNNING, otherwise we have the incongruity that setting the _STOP
>>>>> value is actually the opposite of the necessary logic value (_STOP = 1
>>>>> is _RUNNING, _STOP = 0 is !_RUNNING).
>>>>
>>>> Ops, my mistake, forgot to update to
>>>> #define VFIO_DEVICE_STATE_STOP      (0)
>>>>   
>>>>>       
>>>>>> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
>>>>>> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
>>>>>> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
>>>>>> +				     VFIO_DEVICE_STATE_SAVING |  \
>>>>>> +				     VFIO_DEVICE_STATE_RESUMING)
>>>>>> +
>>>>>> +#define VFIO_DEVICE_STATE_INVALID_CASE1    (VFIO_DEVICE_STATE_SAVING | \
>>>>>> +					    VFIO_DEVICE_STATE_RESUMING)
>>>>>> +
>>>>>> +#define VFIO_DEVICE_STATE_INVALID_CASE2    (VFIO_DEVICE_STATE_RUNNING | \
>>>>>> +					    VFIO_DEVICE_STATE_RESUMING)
>>>>>
>>>>> Gack, we fixed these in the last iteration!
>>>>>       
>>>>
>>>> That solution doesn't scale when new flags will be added. I still prefer
>>>> to define as above.
>>>
>>> I see, the argument was buried in a reply to Yan, sorry if I missed it:
>>>    
>>>>>> These seem difficult to use, maybe we just need a
>>>>>> VFIO_DEVICE_STATE_VALID macro?
>>>>>>
>>>>>> #define VFIO_DEVICE_STATE_VALID(state) \
>>>>>>      (state & VFIO_DEVICE_STATE_RESUMING ? \
>>>>>>      (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
>>>>>>      
>>>>
>>>> This will not be work when use of other bits gets added in future.
>>>> That's the reason I preferred to add individual invalid states which
>>>> user should check.
>>>
>>> I would argue that what doesn't scale is having numerous CASE1, CASE2,
>>> CASEn conditions elsewhere in the kernel rather than have a unified,
>>> single macro that defines a valid state.  How do you worry this will be
>>> a problem when new flags are added, can't we just update the macro?
>>
>> Adding macro you suggested above. Lets figure out how to solve problem
>> with new flags when new flags gets added.
> 
> Agree, I think it's sufficient to assume we'll update the macro at that
> time.  Thanks,
> 
> Alex
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
  2019-12-19 20:10             ` Kirti Wankhede
@ 2019-12-19 21:09               ` Alex Williamson
  2020-01-02 18:25                 ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2019-12-19 21:09 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Fri, 20 Dec 2019 01:40:35 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 12/19/2019 10:57 PM, Alex Williamson wrote:
> 
> <Snip>
> 
> 
> >>>>>> + * 2. When VFIO application save state or suspend application, VFIO device
> >>>>>> + *    state transition is from _RUNNING to stop-and-copy state and then to
> >>>>>> + *    _STOP.
> >>>>>> + *    On state transition from _RUNNING to stop-and-copy, driver must
> >>>>>> + *    stop device, save device state and send it to application through
> >>>>>> + *    migration region.
> >>>>>> + *    On _RUNNING to stop-and-copy state transition failure, application should
> >>>>>> + *    set VFIO device state to _RUNNING.  
> >>>>>
> >>>>> A state transition failure means that the user's write to device_state
> >>>>> failed, so is it the user's responsibility to set the next state?  
> >>>>
> >>>> Right.  
> >>>
> >>> If a transition failure occurs, ie. errno from write(2), what value is
> >>> reported by a read(2) of device_state in the interim between the failure
> >>> and a next state written by the user?  
> >>
> >> Since state transition has failed, driver should return previous state.
> >>  
> >>> If this is a valid state,
> >>> wouldn't it be reasonable for the user to assume the device is already
> >>> operating in that state?  
> > 
> > This ^^^
> >   
> >>   If it's an invalid state, do we need to  
> >>> define the use cases for those invalid states?  If the user needs to
> >>> set the state back to _RUNNING, that suggests the device might be
> >>> stopped, which has implications beyond the migration state.
> >>>      
> >>
> >> Not necessarily stopped. For example, during live migration:
> >>
> >> *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
> >> *                (100b)     (001b)     (011b)        (010b)       (000b)
> >> *
> >> * 3. Save state during live migration
> >> *                             |----------->|------------>|---------->|
> >>
> >> on any state transition failure, user should set _RUNNING state.
> >> pre-copy (011b) -> stop-and-copy(010b)  =====> _SAVING flag is cleared
> >> and device returned back to _RUNNING.
> >> Stop-and-copy(010b) -> STOP (000b) ====> device is already stopped.  
> > 
> > IMO, the user may modify the state, but the vendor driver should report
> > the current state of the device via device_state, which the user may
> > read after a transition error occurs.  The spec lacks a provision for
> > indicating the device is in a non-functional state.
> >  
> 
> Are you proposing to add a bit in device state to report error?
> 
> #define VFIO_DEVICE_STATE_ERROR  (1 << 3)
> 
> which can be set by vendor driver and when its set, other bits set/clear 
> doesn't matter.

We can represent an invalid state with the bits we've defined, for
instance 110b (_SAVING|_RESUMING) is bogus.  We could define that as a
state the vendor driver can use to report that the device is in an error
condition.
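
As a rough sketch (not part of the posted patch): if 110b were reserved
for that purpose, a helper built only from the bits already defined in
this series could test for it, e.g.:

	#define VFIO_DEVICE_STATE_IS_ERROR(state) \
		(((state) & VFIO_DEVICE_STATE_MASK) == \
		 (VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING))

The user could then check device_state against this after any failed
read/write before deciding between retrying and resetting the device.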

> >>>>>    Why
> >>>>> is it necessarily _RUNNING vs _STOP?
> >>>>>        
> >>>>
> >>>> While changing From pre-copy to stop-and-copy transition, device is
> >>>> still running, only saving of device state started. Now if transition to
> >>>> stop-and-copy fails, from user point of view application or VM is still
> >>>> running, device state should be set to _RUNNING so that whatever the
> >>>> application/VM is running should continue at source.  
> >>>
> >>> Seems it's the users discretion whether to consider this continuable or
> >>> fatal, the vfio interface specification should support a given usage
> >>> model, not prescribe it.
> >>>      
> >>
> >> Updating comment.
> >>  
> >>>>>> + * 3. In VFIO application live migration, state transition is from _RUNNING
> >>>>>> + *    to pre-copy to stop-and-copy to _STOP.
> >>>>>> + *    On state transition from _RUNNING to pre-copy, driver should start
> >>>>>> + *    gathering device state while application is still running and send device
> >>>>>> + *    state data to application through migration region.
> >>>>>> + *    On state transition from pre-copy to stop-and-copy, driver must stop
> >>>>>> + *    device, save device state and send it to application through migration
> >>>>>> + *    region.
> >>>>>> + *    On any failure during any of these state transition, VFIO device state
> >>>>>> + *    should be set to _RUNNING.  
> >>>>>
> >>>>> Same comment as above regarding next state on failure.
> >>>>>         
> >>>>
> >>>> If application or VM migration fails, it should continue to run at
> >>>> source. In case of VM, guest user isn't aware of migration, and from his
> >>>> point VM should be running.  
> >>>
> >>> vfio is not prescribing the migration semantics to userspace, it's
> >>> presenting an interface that support the user semantics.  Therefore,
> >>> while it's useful to understand the expected usage model, I think we
> >>> also need a mechanism that the user can always determine the
> >>> device_state after a fault  
> >>
> >> If state transition fails, device is in previous state and driver should
> >> return previous state  
> > 
> > Then why is it stated that the user needs to set the _RUNNING state?
> > It's the user's choice.  But I do think we're lacking a state to
> > indicate an internal fault.
> >  
> 
> 
> If device state it at pre-copy state (011b).
> Transition, i.e., write to device state as stop-and-copy state (010b) 
> failed, then by previous state I meant device should return pre-copy 
> state(011b), i.e. previous state which was successfully set, or as you 
> said current state which was successfully set.

Yes, the point I'm trying to make is that this version of the spec
tries to tell the user what they should do upon error according to our
current interpretation of the QEMU migration protocol.  We're not
defining the QEMU migration protocol, we're defining something that can
be used in a way to support that protocol.  So I think we should be
concerned with defining our spec, for example my proposal would be: "If
a state transition fails the user can read device_state to determine the
current state of the device.  This should be the previous state of the
device unless the vendor driver has encountered an internal error, in
which case the device may report the invalid device_state 110b.  The
user must use the device reset ioctl in order to recover the device
from this state.  If the device is indicated in a valid device state
via reading device_state, the user may attempt to transition the device
to any valid state reachable from the current state."
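
To make that concrete, a minimal user-side sketch (illustration only, not
part of the series) might look like the following, assuming device_fd is
the VFIO device file descriptor, ds_off is the file offset of device_state
within the migration region, and the state defines are those proposed in
this series:

	#include <errno.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/vfio.h>

	static int vfio_set_device_state(int device_fd, off_t ds_off,
					 __u32 new_state)
	{
		__u32 cur;

		if (pwrite(device_fd, &new_state, sizeof(new_state),
			   ds_off) == sizeof(new_state))
			return 0;

		/* Transition failed: read back the current device_state. */
		if (pread(device_fd, &cur, sizeof(cur), ds_off) != sizeof(cur))
			return errno ? -errno : -EIO;

		if ((cur & VFIO_DEVICE_STATE_MASK) ==
		    (VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING)) {
			/* Internal error (110b): only a reset recovers it. */
			ioctl(device_fd, VFIO_DEVICE_RESET);
			return -EIO;
		}

		/*
		 * Otherwise the device is still in a valid state (normally
		 * the previous one); the caller may retry or pick any other
		 * state reachable from it.
		 */
		return -EAGAIN;
	}

Here ds_off would be derived from VFIO_DEVICE_GET_REGION_INFO for the
migration region plus the offset of device_state within struct
vfio_device_migration_info.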

> >>> and allowable state transitions independent
> >>> of the expected usage model.  
> >>
> >> Do you mean to define array of ['from','to'], same as runstate
> >> transition array in QEMU?
> >>    static const RunStateTransition runstate_transitions_def[]  
> > 
> > I'm thinking that independent of expected QEMU usage models, are there
> > any invalid transitions or is every state reachable from every other
> > state.  I'm afraid this design is so focused on a specific usage model
> > that vendor drivers are going to fall over if the user invokes a
> > transition outside of those listed above.  If there are invalid
> > transitions, those should be listed so they can be handled
> > consistently.  If there are no invalid transitions, it should be noted
> > in the spec to encourage vendor drivers to expect this.
> >   
> 
> I think vendor driver can decide which state transitions it can support, 
> rather than defining/prescribing that all.
> Suppose, if vendor driver doesn't want to support save-restore 
> functionality, then vendor driver can return error -EINVAL for write() 
> operation on device_state for transition from _RUNNING to 
> stop-and-copy(010b) state.

This is unsupportable.  If the vendor driver doesn't want to support
save-restore then they simply do not implement the migration
extensions.  If they expose this interface then the user (QEMU) will
rightfully assume that the device supports migration, only to find out
upon trying to use it that it's unsupported, or maybe broken.

> >>> For example, I think a user should always
> >>> be allowed to transition a device to stopped regardless of the expected
> >>> migration flow.  An error might have occurred elsewhere and we want to
> >>> stop everything for debugging.  I think it's also allowable to switch
> >>> directly from running to stop-and-copy, for example to save and resume
> >>> a VM offline.
> >>>        
> >>>>> Also, it seems like it's the vendor driver's discretion to actually
> >>>>> provide data during the pre-copy phase.  As we've defined it, the
> >>>>> vendor driver needs to participate in the migration region regardless,
> >>>>> they might just always report no pending_bytes until we enter
> >>>>> stop-and-copy.
> >>>>>         
> >>>>
> >>>> Yes. And if pending_bytes are reported as 0 in pre-copy by vendor driver
> >>>> then QEMU doesn't reiterate for that device.  
> >>>
> >>> Maybe we can state that as the expected mechanism to avoid a vendor
> >>> driver trying to invent alternative means, ex. failing transition to
> >>> pre-copy, requesting new flags, etc.
> >>>      
> >>
> >> Isn't Sequence to be followed below sufficient to state that?  
> > 
> > I think we understand it because we've been discussing it so long, but
> > without that background it could be subtle.
> >     
> >>>>>> + * 4. To start resuming phase, VFIO device state should be transitioned from
> >>>>>> + *    _RUNNING to _RESUMING state.
> >>>>>> + *    In _RESUMING state, driver should use received device state data through
> >>>>>> + *    migration region to resume device.
> >>>>>> + *    On failure during this state transition, application should set _RUNNING
> >>>>>> + *    state.  
> >>>>>
> >>>>> Same comment regarding setting next state after failure.  
> >>>>
> >>>> If device couldn't be transitioned to _RESUMING, then it should be set
> >>>> to default state, that is _RUNNING.
> >>>>     
> >>>>>         
> >>>>>> + * 5. On providing saved device data to driver, appliation should change state
> >>>>>> + *    from _RESUMING to _RUNNING.
> >>>>>> + *    On failure to transition to _RUNNING state, VFIO application should reset
> >>>>>> + *    the device and set _RUNNING state so that device doesn't remain in unknown
> >>>>>> + *    or bad state. On reset, driver must reset device and device should be
> >>>>>> + *    available in default usable state.  
> >>>>>
> >>>>> Didn't we discuss that the reset ioctl should return the device to the
> >>>>> initial state, including the transition to _RUNNING?  
> >>>>
> >>>> Yes, that's default usable state, rewording it to initial state.
> >>>>     
> >>>>>    Also, as above,
> >>>>> it's the user write that triggers the failure, this register is listed
> >>>>> as read-write, so what value does the vendor driver report for the
> >>>>> state when read after a transition failure?  Is it reported as _RESUMING
> >>>>> as it was prior to the attempted transition, or may the invalid states
> >>>>> be used by the vendor driver to indicate the device is broken?
> >>>>>         
> >>>>
> >>>> If transition as failed, device should report its previous state and
> >>>> reset device should bring back to usable _RUNNING state.  
> >>>
> >>> If device_state reports previous state then user should reasonably
> >>> infer that the device is already in that sate without a need for them
> >>> to set it, IMO.  
> >>
> >> But if there is any error in read()/write() then user should device
> >> which next state device should be put in, which would be different that
> >> previous state.  
> > 
> > That's a different answer than telling the user the next state should
> > be _RUNNING.
> >   
> >>>>>> + *
> >>>>>> + * pending bytes: (read only)
> >>>>>> + *      Number of pending bytes yet to be migrated from vendor driver
> >>>>>> + *
> >>>>>> + * data_offset: (read only)
> >>>>>> + *      User application should read data_offset in migration region from where
> >>>>>> + *      user application should read device data during _SAVING state or write
> >>>>>> + *      device data during _RESUMING state. See below for detail of sequence to
> >>>>>> + *      be followed.
> >>>>>> + *
> >>>>>> + * data_size: (read/write)
> >>>>>> + *      User application should read data_size to get size of data copied in
> >>>>>> + *      bytes in migration region during _SAVING state and write size of data
> >>>>>> + *      copied in bytes in migration region during _RESUMING state.
> >>>>>> + *
> >>>>>> + * Migration region looks like:
> >>>>>> + *  ------------------------------------------------------------------
> >>>>>> + * |vfio_device_migration_info|    data section                      |
> >>>>>> + * |                          |     ///////////////////////////////  |
> >>>>>> + * ------------------------------------------------------------------
> >>>>>> + *   ^                              ^
> >>>>>> + *  offset 0-trapped part        data_offset
> >>>>>> + *
> >>>>>> + * Structure vfio_device_migration_info is always followed by data section in
> >>>>>> + * the region, so data_offset will always be non-0. Offset from where data is
> >>>>>> + * copied is decided by kernel driver, data section can be trapped or mapped
> >>>>>> + * or partitioned, depending on how kernel driver defines data section.
> >>>>>> + * Data section partition can be defined as mapped by sparse mmap capability.
> >>>>>> + * If mmapped, then data_offset should be page aligned, where as initial section
> >>>>>> + * which contain vfio_device_migration_info structure might not end at offset
> >>>>>> + * which is page aligned. The user is not required to access via mmap regardless
> >>>>>> + * of the region mmap capabilities.
> >>>>>> + * Vendor driver should decide whether to partition data section and how to
> >>>>>> + * partition the data section. Vendor driver should return data_offset
> >>>>>> + * accordingly.
> >>>>>> + *
> >>>>>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> >>>>>> + * and for _SAVING device state or stop-and-copy phase:
> >>>>>> + * a. read pending_bytes, indicates start of new iteration to get device data.
> >>>>>> + *    If there was previous iteration, then this read operation indicates
> >>>>>> + *    previous iteration is done. If pending_bytes > 0, go through below steps.
> >>>>>> + * b. read data_offset, indicates kernel driver to make data available through
> >>>>>> + *    data section. Kernel driver should return this read operation only after
> >>>>>> + *    data is available from (region + data_offset) to (region + data_offset +
> >>>>>> + *    data_size).
> >>>>>> + * c. read data_size, amount of data in bytes available through migration
> >>>>>> + *    region.
> >>>>>> + * d. read data of data_size bytes from (region + data_offset) from migration
> >>>>>> + *    region.
> >>>>>> + * e. process data.
> >>>>>> + * f. Loop through a to e.  
> >>>>>
> >>>>> It seems we always need to end an iteration by reading pending_bytes to
> >>>>> signal to the vendor driver to release resources, so should the end of
> >>>>> the loop be:
> >>>>>
> >>>>> e. Read pending_bytes
> >>>>> f. Goto b. or optionally restart next iteration at a.
> >>>>>
> >>>>> I think this is defined such that reading data_offset commits resources
> >>>>> and reading pending_bytes frees them, allowing userspace to restart at
> >>>>> reading pending_bytes with no side-effects.  Therefore reading
> >>>>> pending_bytes repeatedly is supported.  Is the same true for
> >>>>> data_offset and data_size?  It seems reasonable that the vendor driver
> >>>>> can simply return offset and size for the current buffer if the user
> >>>>> reads these more than once.
> >>>>>        
> >>>>
> >>>> Right.  
> >>>
> >>> Can we add that to the spec?
> >>>      
> >>
> >> ok.
> >>  
> >>>>> How is a protocol or device error signaled?  For example, we can have a
> >>>>> user error where they read data_size before data_offset.  Should the
> >>>>> vendor driver generate a fault reading data_size in this case.  We can
> >>>>> also have internal errors in the vendor driver, should the vendor
> >>>>> driver use a special errno or update device_state autonomously to
> >>>>> indicate such an error?  
> >>>>
> >>>> If there is any error during the sequence, vendor driver can return
> >>>> error code for next read/write operation, that will terminate the loop
> >>>> and migration would fail.  
> >>>
> >>> Please add to spec.
> >>>      
> >>
> >> Ok
> >>  
> >>>>> I believe it's also part of the intended protocol that the user can
> >>>>> transition from _SAVING|_RUNNING to _SAVING at any point, regardless of
> >>>>> pending_bytes.  This should be noted.
> >>>>>         
> >>>>
> >>>> Ok. Updating comment.
> >>>>     
> >>>>>> + *
> >>>>>> + * Sequence to be followed while _RESUMING device state:
> >>>>>> + * While data for this device is available, repeat below steps:
> >>>>>> + * a. read data_offset from where user application should write data.
> >>>>>> + * b. write data of data_size to migration region from data_offset.  
> >>>>>
> >>>>> Whose's data_size, the _SAVING end or the _RESUMING end?  I think this
> >>>>> is intended to be the transaction size from the _SAVING source,  
> >>>>
> >>>> Not necessarily. data_size could be MIN(transaction size of source,
> >>>> migration data section). If migration data section is smaller than data
> >>>> packet size at source, then it has to be broken and iteratively sent.  
> >>>
> >>> So you're saying that a transaction from the source is divisible by the
> >>> user under certain conditions.  What other conditions exist?  
> >>
> >> I don't think there are any other conditions than above.
> >>  
> >>>   Can the
> >>> user decide arbitrary sizes less than the MIN() stated above?  This
> >>> needs to be specified.
> >>>     
> >>
> >> No, User can't decide arbitrary sizes.  
> > 
> > TBH, I'd expect a vendor driver that offers a different migration
> > region size, such that it becomes the user's responsibility to split
> > transactions should just claim it's not compatible with the source, as
> > determined by the previously defined compatibility protocol.  If we
> > really need this requirement, it needs to be justified and the exact
> > conditions under which the user performs this needs to be specified.
> >   
> 
> Let User decide whether it wants to support different migration region 
> sizes at source and destination or not instead of putting hard requirement.

The requirement is set forth by the vendor driver.  The user has no
choice how big a BAR region is for a vfio-pci device.  Perhaps if they
have internal knowledge of the device they might be able to only use
part of the BAR, but as decided previously the migration data is
intended to be opaque to the user.  If the user gets to decide the
migration region size then we're back to an interface where the vendor
driver is required to support arbitrary transaction sizes dictated by
the user splitting the data.

> >>>>> but it
> >>>>> could easily be misinterpreted as reading data_size on the _RESUMING
> >>>>> end.
> >>>>>         
> >>>>>> + * c. write data_size which indicates vendor driver that data is written in
> >>>>>> + *    staging buffer. Vendor driver should read this data from migration
> >>>>>> + *    region and resume device's state.  
> >>>>>
> >>>>> I think we also need to define the error protocol.  The user could
> >>>>> mis-order transactions or there could be an internal error in the
> >>>>> vendor driver or device.  Are all read(2)/write(2) operations
> >>>>> susceptible to defined errnos to signal this?  
> >>>>
> >>>> Yes.  
> >>>
> >>> And those defined errnos are specified...
> >>>      
> >>
> >> Those could be standard errors like -EINVAL, ENOMEM....  
> > 
> > I thought we might specify specific errors to consistently indicate
> > non-continuable faults among vendor drivers.  Is anything other than
> > -EAGAIN considered non-fatal to the operation?  For example, could
> > EEXIST indicate duplicate data that the user has already written but
> > not be considered a fatal error?  Would EFAULT perhaps indicate a
> > continuable ordering error?  If any fault indicates the save/resume has
> > failed, shouldn't the user be able to see the device is in such a state
> > by reading device_state (except we have no state defined to indicate
> > that)?
> >     
> 
> Do we have to define all standard errors returned here would mean meant 
> what?
> 
> Right from initial versions of migration reviews we always thought that 
> device_state should be only set by user, vendor driver could return 
> error state was never thought of. Returning error to read()/write() 
> operation indicate that device is not able to handle that operation so 
> user will decide what action to be taken next.
> Now you are proposing to add a state that vendor driver can set, as 
> defined in my above comment?

The question is how does the user decide what action to be taken next.
Defining specific errnos (not all of them) might aid in that decision.
Maybe the simplest course of action is to require that the user is
perfect: if they duplicate a transaction or misorder anything on the
resuming end, we generate an errno on write(2) to the data area and the
device goes to device_state 110b.  This makes it clear that we are no
longer in a resuming state and the user needs to start over by
resetting the device.  An internal error on resume would be handled the
same.

On saving, the responsibility is primarily with the vendor driver, but
transitioning the device_state to invalid and requiring a reset should
not be taken as lightly as the resume path.  So does any errno on
read(2) during the saving protocol indicate failure?  Would the vendor
driver self-transition to !_SAVING in device_state?

For both cases, do we want to make an exception for -EAGAIN as
non-fatal?
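
On the resuming side, a sketch of one write transaction under that "user
must be perfect" option could look like this (illustration only; device_fd
and region_off are assumed to come from VFIO_DEVICE_GET_REGION_INFO, and
the field layout is the struct vfio_device_migration_info proposed in this
series):

	#include <errno.h>
	#include <stddef.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/vfio.h>

	static int vfio_resume_write_chunk(int device_fd, off_t region_off,
					   const void *buf, __u64 len)
	{
		__u64 data_offset;
		int err;

		/* a. ask the vendor driver where to place the data */
		if (pread(device_fd, &data_offset, sizeof(data_offset),
			  region_off +
			  offsetof(struct vfio_device_migration_info,
				   data_offset)) != sizeof(data_offset))
			goto fatal;

		/* b. write the opaque device data into the data section */
		if (pwrite(device_fd, buf, len,
			   region_off + data_offset) != (ssize_t)len)
			goto fatal;

		/* c. tell the vendor driver how much data was written */
		if (pwrite(device_fd, &len, sizeof(len),
			   region_off +
			   offsetof(struct vfio_device_migration_info,
				    data_size)) != sizeof(len))
			goto fatal;

		return 0;

	fatal:
		/*
		 * Any fault ends the resume; the device may now report 110b
		 * and is recovered only by reset.
		 */
		err = errno ? -errno : -EIO;
		ioctl(device_fd, VFIO_DEVICE_RESET);
		return err;
	}

If -EAGAIN were carved out as non-fatal, the fatal: path would first check
for it and retry instead of resetting.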

> >>>>>    Is it reflected in
> >>>>> device_state?  
> >>>>
> >>>> No.  
> >>>
> >>> So a user should do what, just keep trying?
> >>>     
> >>
> >> No, fail migration process. If error is at source or destination then
> >> user can decide either resume at source or terminate application.  
> > 
> > This is describing the expected QEMU protocol resolution, the question
> > is relative to the vfio API we're defining here.  If any fault in the
> > save/resume protocol results in the device being unusable, there should
> > be an indication (perhaps through device_state) that the device is in a
> > broken state, and the mechanism to put it into a new state should be
> > defined.  For instance, if the device is resuming, a fault occurs
> > writing state data to the device, and the user transitions to running.
> > Is the device incorporating the partial state data into its run state?
> > I suspect not, and wouldn't that be more obvious if we defined a
> > protocol where the device can be inspected to be in a bogus state via
> > reading device_state, at which point we might define performing a
> > device reset as the only mechanism to change the device_state after
> > that point.
> >  
> 
> Same as above my comment.
> 
> >>>>> What's the recovery protocol?
> >>>>>         
> >>>>
> >>>> On read()/write() failure user should take necessary action.  
> >>>
> >>> Where is that necessary action defined?  Can they just try again?  Do
> >>> they transition in and out of _RESUMING to try again?  Do they need to
> >>> reset the device?
> >>>      
> >>
> >> User application should decide what action to take on failure, right?
> >>    "vfio is not prescribing the migration semantics to userspace, it's
> >> presenting an interface that support the user semantics."  
> > 
> > Exactly, we're not defining how QEMU handles a fault in this spec,
> > we're defining how a user interacting with the device knows a fault has
> > occurred, can inspect the device to determine that the device is in a
> > broken state, and the "necessary action" to advance the device forward
> > to a new state.
> >   
> >>>>>> + *
> >>>>>> + * For user application, data is opaque. User should write data in the same
> >>>>>> + * order as received.  
> >>>>>
> >>>>> Order and transaction size, ie. each data_size chunk is indivisible by
> >>>>> the user.  
> >>>>
> >>>> Transaction size can differ, but order should remain same.  
> >>>
> >>> Under what circumstances and to what extent can transaction size
> >>> differ?  
> >>
> >> It depends in migration region size.
> >>  
> >>>   Is the MIN() algorithm above the absolute lower bound or just
> >>> a suggestion?  
> >>
> >>
> >>  
> >>>   Is the user allowed to concatenate transactions from the
> >>> source together on the target if the region is sufficiently large?  
> >>
> >> Yes that can be done, because data is just byte stream for user. Vendor
> >> driver receives the byte stream and knows how to decode it.  
> > 
> > But that byte stream is opaque to the user, the vendor driver might
> > implement it such that every transaction has a header and splitting the
> > transaction might mean that the truncated transaction no longer fits
> > the expected size.  If we're lucky, the vendor driver's implementation
> > might detect that.  If we're not, the vendor driver might misinterpret
> > the next packet.  I think if the user is to consider all data as
> > opaque, they must also consider every transaction as indivisible or
> > else we're assuming something about the contents of that transaction.
> >   
> 
> User shouldn't assume about contents of transactions.
> I think vendor driver should consider incomming data as data byte stream 
> and decoding packets should not be based on migration region size.

I think this likely matches your implementation, but even defining the
migration stream as a byte stream imposes implementation constraints on
other vendor drivers.  They may intend to implement it as discrete
packets with headers known only to them.  We can't both say it's "opaque
and left to the vendor driver" and also define it as a position-less
byte stream.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 4/5] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2019-12-19 16:21                 ` Kirti Wankhede
@ 2019-12-20  0:58                   ` Yan Zhao
  2020-01-03 19:44                     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 44+ messages in thread
From: Yan Zhao @ 2019-12-20  0:58 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Dr. David Alan Gilbert, alex.williamson, cjia, Tian, Kevin, Yang,
	Ziye, Liu, Changpeng, Liu, Yi L, mlevitsk, eskultet, cohuck,
	jonathan.davies, eauger, aik, pasic, felipe,
	Zhengxiao.zx@Alibaba-inc.com, shuangtai.tst, Ken.Xue, Wang,
	Zhi A, qemu-devel, kvm

On Fri, Dec 20, 2019 at 12:21:45AM +0800, Kirti Wankhede wrote:
> 
> 
> On 12/19/2019 6:27 AM, Yan Zhao wrote:
> > On Thu, Dec 19, 2019 at 04:05:52AM +0800, Dr. David Alan Gilbert wrote:
> >> * Yan Zhao (yan.y.zhao@intel.com) wrote:
> >>> On Tue, Dec 17, 2019 at 07:47:05PM +0800, Kirti Wankhede wrote:
> >>>>
> >>>>
> >>>> On 12/17/2019 3:21 PM, Yan Zhao wrote:
> >>>>> On Tue, Dec 17, 2019 at 05:24:14PM +0800, Kirti Wankhede wrote:
> >>>>>>>>     
> >>>>>>>>     		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> >>>>>>>>     			-EFAULT : 0;
> >>>>>>>> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
> >>>>>>>> +		struct vfio_iommu_type1_dirty_bitmap range;
> >>>>>>>> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
> >>>>>>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
> >>>>>>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> >>>>>>>> +		int ret;
> >>>>>>>> +
> >>>>>>>> +		if (!iommu->v2)
> >>>>>>>> +			return -EACCES;
> >>>>>>>> +
> >>>>>>>> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
> >>>>>>>> +				    bitmap);
> >>>>>>>> +
> >>>>>>>> +		if (copy_from_user(&range, (void __user *)arg, minsz))
> >>>>>>>> +			return -EFAULT;
> >>>>>>>> +
> >>>>>>>> +		if (range.argsz < minsz || range.flags & ~mask)
> >>>>>>>> +			return -EINVAL;
> >>>>>>>> +
> >>>>>>>> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
> >>>>>>>> +			iommu->dirty_page_tracking = true;
> >>>>>>>> +			return 0;
> >>>>>>>> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
> >>>>>>>> +			iommu->dirty_page_tracking = false;
> >>>>>>>> +
> >>>>>>>> +			mutex_lock(&iommu->lock);
> >>>>>>>> +			vfio_remove_unpinned_from_dma_list(iommu);
> >>>>>>>> +			mutex_unlock(&iommu->lock);
> >>>>>>>> +			return 0;
> >>>>>>>> +
> >>>>>>>> +		} else if (range.flags &
> >>>>>>>> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
> >>>>>>>> +			uint64_t iommu_pgmask;
> >>>>>>>> +			unsigned long pgshift = __ffs(range.pgsize);
> >>>>>>>> +			unsigned long *bitmap;
> >>>>>>>> +			long bsize;
> >>>>>>>> +
> >>>>>>>> +			iommu_pgmask =
> >>>>>>>> +			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> >>>>>>>> +
> >>>>>>>> +			if (((range.pgsize - 1) & iommu_pgmask) !=
> >>>>>>>> +			    (range.pgsize - 1))
> >>>>>>>> +				return -EINVAL;
> >>>>>>>> +
> >>>>>>>> +			if (range.iova & iommu_pgmask)
> >>>>>>>> +				return -EINVAL;
> >>>>>>>> +			if (!range.size || range.size > SIZE_MAX)
> >>>>>>>> +				return -EINVAL;
> >>>>>>>> +			if (range.iova + range.size < range.iova)
> >>>>>>>> +				return -EINVAL;
> >>>>>>>> +
> >>>>>>>> +			bsize = verify_bitmap_size(range.size >> pgshift,
> >>>>>>>> +						   range.bitmap_size);
> >>>>>>>> +			if (bsize)
> >>>>>>>> +				return ret;
> >>>>>>>> +
> >>>>>>>> +			bitmap = kmalloc(bsize, GFP_KERNEL);
> >>>>>>>> +			if (!bitmap)
> >>>>>>>> +				return -ENOMEM;
> >>>>>>>> +
> >>>>>>>> +			ret = copy_from_user(bitmap,
> >>>>>>>> +			     (void __user *)range.bitmap, bsize) ? -EFAULT : 0;
> >>>>>>>> +			if (ret)
> >>>>>>>> +				goto bitmap_exit;
> >>>>>>>> +
> >>>>>>>> +			iommu->dirty_page_tracking = false;
> >>>>>>> why iommu->dirty_page_tracking is false here?
> >>>>>>> suppose this ioctl can be called several times.
> >>>>>>>
> >>>>>>
> >>>>>> This ioctl can be called several times, but once this ioctl is called
> >>>>>> that means vCPUs are stopped and VFIO devices are stopped (i.e. in
> >>>>>> stop-and-copy phase) and dirty pages bitmap are being queried by user.
> >>>>>>
> >>>>> can't agree that VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP can only be
> >>>>> called in stop-and-copy phase.
> >>>>> As stated in last version, this will cause QEMU to get a wrong expectation
> >>>>> of VM downtime and this is also the reason for previously pinned pages
> >>>>> before log_sync cannot be treated as dirty. If this get bitmap ioctl can
> >>>>> be called early in save_setup phase, then it's no problem even all ram
> >>>>> is dirty.
> >>>>>
> >>>>
> >>>> Device can also write to pages which are pinned, and then there is no
> >>>> way to know pages dirtied by device during pre-copy phase.
> >>>> If user ask dirty bitmap in per-copy phase, even then user will have to
> >>>> query dirty bitmap in stop-and-copy phase where this will be superset
> >>>> including all pages reported during pre-copy. Then instead of copying
> >>>> all pages twice, its better to do it once during stop-and-copy phase.
> >>>>
> >>> I think the flow should be like this:
> >>> 1. save_setup --> GET_BITMAP ioctl --> return bitmap for currently + previously
> >>> pinned pages and clean all previously pinned pages
> >>>
> >>> 2. save_pending --> GET_BITMAP ioctl  --> return bitmap of (currently
> >>> pinned pages + previously pinned pages since last clean) and clean all
> >>> previously pinned pages
> >>>
> >>> 3. save_complete_precopy --> GET_BITMAP ioctl --> return bitmap of (currently
> >>> pinned pages + previously pinned pages since last clean) and clean all
> >>> previously pinned pages
> >>>
> >>>
> >>> Copying pinned pages multiple times is unavoidable because those pinned pages
> >>> are always treated as dirty. That's per vendor's implementation.
> >>> But if the pinned pages are not reported as dirty before stop-and-copy phase,
> >>> QEMU would think dirty pages has converged
> >>> and enter blackout phase, making downtime_limit severely incorrect.
> >>
> >> I'm not sure it's any worse.
> >> I *think* we do a last sync after we've decided to go to stop-and-copy;
> >> wont that then mark all those pages as dirty again, so it'll have the
> >> same behaviour?
> > No. something will be different.
> > currently, in kirti's implementation, if GET_BITMAP ioctl is called only
> > once in stop-and-copy phase, then before that phase, QEMU does not know those
> > pages are dirty.
> > If we can report those dirty pages earlier before stop-and-copy phase,
> > QEMU can at least copy other pages to reduce dirty pages to below threshold.
> > 
> > Take a example, let's assume those vfio dirty pages is 1Gb, and network speed is
> > also 1Gb. Expected vm downtime is 1s.
> > If before stop-and-copy phase, dirty pages produced by other pages is
> > also 1Gb. To meet the expected vm downtime, QEMU should copy pages to
> > let dirty pages be less than 1Gb, otherwise, it should not complete live
> > migration.
> > If vfio does not report this 1Gb dirty pages, QEMU would think there's
> > only 1Gb and stop the vm. It would then find out there's actually 2Gb and vm
> > downtime is 2s.
> > Though the expected vm downtime is always not exactly the same as the
> > true vm downtime, it should be caused by rapid dirty page rate, which is
> > not predictable.
> > Right?
> > 
> 
> If you report vfio dirty pages 1Gb before stop-and-copy phase (i.e. in 
> pre-copy phase), enter into stop-and-copy phase, how will you know which 
> and how many pages are dirtied by device from the time when pages copied 
> in pre-copy phase to that time where device is stopped? You don't have a 
> way to know which pages are dirtied by device. So ideally device can 
> write to all pages which are pinned. Then we have to mark all those 
> pinned pages dirty in stop-and-copy phase, 1Gb, and copy to destination. 
> Now you had copied same pages twice. Shouldn't we try not to copy pages 
> twice?
>
For mdevs that rely on treating all pinned pages as dirty pages, under the
condition mentioned above, repeated page copying can be avoided by
(1) adding an ioctl to get the size of dirty pages and reporting it to QEMU
through the vfio_save_pending() interface (see the sketch below), and
(2) having this GET_BITMAP ioctl return an empty bitmap until the
stop-and-copy phase.

But for devices that know fine-grained dirty pages (e.g. devices with dirty
page tracking in hardware), the GET_BITMAP ioctl has to return incremental
dirty bitmaps in each iteration, and step (1) should return 0 to avoid
reporting twice the dirty page size.
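
A very rough sketch of what (1) might look like; the structure and field
below are purely hypothetical (not part of this series) and only
illustrate reporting a byte count instead of a bitmap:

	/* Hypothetical, for illustration only -- not defined by this series. */
	struct vfio_iommu_type1_dirty_size {
		__u32 argsz;
		__u32 flags;
		__u64 dirty_bytes;	/* out: bytes currently treated as dirty */
	};

QEMU's vfio_save_pending() would add dirty_bytes to its pending estimate
so the downtime calculation accounts for pinned pages, while
VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP keeps returning an empty bitmap
until the device enters stop-and-copy.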

Thanks
Yan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
  2019-12-19 21:09               ` Alex Williamson
@ 2020-01-02 18:25                 ` Dr. David Alan Gilbert
  2020-01-06 23:18                   ` Alex Williamson
  0 siblings, 1 reply; 44+ messages in thread
From: Dr. David Alan Gilbert @ 2020-01-02 18:25 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, cjia, kevin.tian, ziye.yang, changpeng.liu,
	yi.l.liu, mlevitsk, eskultet, cohuck, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	zhi.a.wang, yan.y.zhao, qemu-devel, kvm

* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Fri, 20 Dec 2019 01:40:35 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > On 12/19/2019 10:57 PM, Alex Williamson wrote:
> > 
> > <Snip>
> > 

<snip>

> > 
> > If device state it at pre-copy state (011b).
> > Transition, i.e., write to device state as stop-and-copy state (010b) 
> > failed, then by previous state I meant device should return pre-copy 
> > state(011b), i.e. previous state which was successfully set, or as you 
> > said current state which was successfully set.
> 
> Yes, the point I'm trying to make is that this version of the spec
> tries to tell the user what they should do upon error according to our
> current interpretation of the QEMU migration protocol.  We're not
> defining the QEMU migration protocol, we're defining something that can
> be used in a way to support that protocol.  So I think we should be
> concerned with defining our spec, for example my proposal would be: "If
> a state transition fails the user can read device_state to determine the
> current state of the device.  This should be the previous state of the
> device unless the vendor driver has encountered an internal error, in
> which case the device may report the invalid device_state 110b.  The
> user must use the device reset ioctl in order to recover the device
> from this state.  If the device is indicated in a valid device state
> via reading device_state, the user may attempt to transition the device
> to any valid state reachable from the current state."

We might want to be able to distinguish between:
  a) The device has failed and needs a reset
  b) The migration has failed

If some part of the device's migration mechanics fails, but the device
is otherwise operational, then we should be able to decide to fail the
migration without taking the device down, which might be very bad for
the VM.
Losing a VM during migration due to a problem with migration really
annoys users; it's one thing for the migration to fail, but taking the VM
out as well really gets to them.

Having the device automatically transition back to the 'running' state
seems a bad idea to me; it's much better to tell the hypervisor and
provide it with a way to clean up.  For example, imagine a system with
multiple devices that are being migrated: most of them have happily
transitioned to stop-and-copy, but then the last device fails, so now
someone is going to have to take all of them back to running.

Dave

> > >>> and allowable state transitions independent
> > >>> of the expected usage model.  
> > >>
> > >> Do you mean to define array of ['from','to'], same as runstate
> > >> transition array in QEMU?
> > >>    static const RunStateTransition runstate_transitions_def[]  
> > > 
> > > I'm thinking that independent of expected QEMU usage models, are there
> > > any invalid transitions or is every state reachable from every other
> > > state.  I'm afraid this design is so focused on a specific usage model
> > > that vendor drivers are going to fall over if the user invokes a
> > > transition outside of those listed above.  If there are invalid
> > > transitions, those should be listed so they can be handled
> > > consistently.  If there are no invalid transitions, it should be noted
> > > in the spec to encourage vendor drivers to expect this.
> > >   
> > 
> > I think vendor driver can decide which state transitions it can support,
> > rather than defining/prescribing them all.
> > Suppose, if vendor driver doesn't want to support save-restore 
> > functionality, then vendor driver can return error -EINVAL for write() 
> > operation on device_state for transition from _RUNNING to 
> > stop-and-copy(010b) state.
> 
> This is unsupportable.  If the vendor driver doesn't want to support
> save-restore then they simply do not implement the migration
> extensions.  If they expose this interface then the user (QEMU) will
> rightfully assume that the device supports migration, only to find out
> upon trying to use it that it's unsupported, or maybe broken.
> 
> > >>> For example, I think a user should always
> > >>> be allowed to transition a device to stopped regardless of the expected
> > >>> migration flow.  An error might have occurred elsewhere and we want to
> > >>> stop everything for debugging.  I think it's also allowable to switch
> > >>> directly from running to stop-and-copy, for example to save and resume
> > >>> a VM offline.
> > >>>        
> > >>>>> Also, it seems like it's the vendor driver's discretion to actually
> > >>>>> provide data during the pre-copy phase.  As we've defined it, the
> > >>>>> vendor driver needs to participate in the migration region regardless,
> > >>>>> they might just always report no pending_bytes until we enter
> > >>>>> stop-and-copy.
> > >>>>>         
> > >>>>
> > >>>> Yes. And if pending_bytes are reported as 0 in pre-copy by vendor driver
> > >>>> then QEMU doesn't reiterate for that device.  
> > >>>
> > >>> Maybe we can state that as the expected mechanism to avoid a vendor
> > >>> driver trying to invent alternative means, ex. failing transition to
> > >>> pre-copy, requesting new flags, etc.
> > >>>      
> > >>
> > >> Isn't Sequence to be followed below sufficient to state that?  
> > > 
> > > I think we understand it because we've been discussing it so long, but
> > > without that background it could be subtle.
> > >     
> > >>>>>> + * 4. To start resuming phase, VFIO device state should be transitioned from
> > >>>>>> + *    _RUNNING to _RESUMING state.
> > >>>>>> + *    In _RESUMING state, driver should use received device state data through
> > >>>>>> + *    migration region to resume device.
> > >>>>>> + *    On failure during this state transition, application should set _RUNNING
> > >>>>>> + *    state.  
> > >>>>>
> > >>>>> Same comment regarding setting next state after failure.  
> > >>>>
> > >>>> If device couldn't be transitioned to _RESUMING, then it should be set
> > >>>> to default state, that is _RUNNING.
> > >>>>     
> > >>>>>         
> > >>>>>> + * 5. On providing saved device data to driver, application should change state
> > >>>>>> + *    from _RESUMING to _RUNNING.
> > >>>>>> + *    On failure to transition to _RUNNING state, VFIO application should reset
> > >>>>>> + *    the device and set _RUNNING state so that device doesn't remain in unknown
> > >>>>>> + *    or bad state. On reset, driver must reset device and device should be
> > >>>>>> + *    available in default usable state.  
> > >>>>>
> > >>>>> Didn't we discuss that the reset ioctl should return the device to the
> > >>>>> initial state, including the transition to _RUNNING?  
> > >>>>
> > >>>> Yes, that's default usable state, rewording it to initial state.
> > >>>>     
> > >>>>>    Also, as above,
> > >>>>> it's the user write that triggers the failure, this register is listed
> > >>>>> as read-write, so what value does the vendor driver report for the
> > >>>>> state when read after a transition failure?  Is it reported as _RESUMING
> > >>>>> as it was prior to the attempted transition, or may the invalid states
> > >>>>> be used by the vendor driver to indicate the device is broken?
> > >>>>>         
> > >>>>
> > >>>> If transition has failed, device should report its previous state, and
> > >>>> a device reset should bring it back to the usable _RUNNING state.
> > >>>
> > >>> If device_state reports previous state then user should reasonably
> > >>> infer that the device is already in that state without a need for them
> > >>> to set it, IMO.  
> > >>
> > >> But if there is any error in read()/write() then the user should decide
> > >> which next state the device should be put in, which would be different from
> > >> the previous state.
> > > 
> > > That's a different answer than telling the user the next state should
> > > be _RUNNING.
> > >   
> > >>>>>> + *
> > >>>>>> + * pending bytes: (read only)
> > >>>>>> + *      Number of pending bytes yet to be migrated from vendor driver
> > >>>>>> + *
> > >>>>>> + * data_offset: (read only)
> > >>>>>> + *      User application should read data_offset in migration region from where
> > >>>>>> + *      user application should read device data during _SAVING state or write
> > >>>>>> + *      device data during _RESUMING state. See below for detail of sequence to
> > >>>>>> + *      be followed.
> > >>>>>> + *
> > >>>>>> + * data_size: (read/write)
> > >>>>>> + *      User application should read data_size to get size of data copied in
> > >>>>>> + *      bytes in migration region during _SAVING state and write size of data
> > >>>>>> + *      copied in bytes in migration region during _RESUMING state.
> > >>>>>> + *
> > >>>>>> + * Migration region looks like:
> > >>>>>> + *  ------------------------------------------------------------------
> > >>>>>> + * |vfio_device_migration_info|    data section                      |
> > >>>>>> + * |                          |     ///////////////////////////////  |
> > >>>>>> + * ------------------------------------------------------------------
> > >>>>>> + *   ^                              ^
> > >>>>>> + *  offset 0-trapped part        data_offset
> > >>>>>> + *
> > >>>>>> + * Structure vfio_device_migration_info is always followed by data section in
> > >>>>>> + * the region, so data_offset will always be non-0. Offset from where data is
> > >>>>>> + * copied is decided by kernel driver, data section can be trapped or mapped
> > >>>>>> + * or partitioned, depending on how kernel driver defines data section.
> > >>>>>> + * Data section partition can be defined as mapped by sparse mmap capability.
> > >>>>>> + * If mmapped, then data_offset should be page aligned, whereas the initial section
> > >>>>>> + * which contains the vfio_device_migration_info structure might not end at an offset
> > >>>>>> + * which is page aligned. The user is not required to access via mmap regardless
> > >>>>>> + * of the region mmap capabilities.
> > >>>>>> + * Vendor driver should decide whether to partition data section and how to
> > >>>>>> + * partition the data section. Vendor driver should return data_offset
> > >>>>>> + * accordingly.
> > >>>>>> + *
> > >>>>>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> > >>>>>> + * and for _SAVING device state or stop-and-copy phase:
> > >>>>>> + * a. read pending_bytes, indicates start of new iteration to get device data.
> > >>>>>> + *    If there was previous iteration, then this read operation indicates
> > >>>>>> + *    previous iteration is done. If pending_bytes > 0, go through below steps.
> > >>>>>> + * b. read data_offset, indicates kernel driver to make data available through
> > >>>>>> + *    data section. Kernel driver should return this read operation only after
> > >>>>>> + *    data is available from (region + data_offset) to (region + data_offset +
> > >>>>>> + *    data_size).
> > >>>>>> + * c. read data_size, amount of data in bytes available through migration
> > >>>>>> + *    region.
> > >>>>>> + * d. read data of data_size bytes from (region + data_offset) from migration
> > >>>>>> + *    region.
> > >>>>>> + * e. process data.
> > >>>>>> + * f. Loop through a to e.  
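
As a rough user-side sketch of the a-e loop above, against a local mirror of the
vfio_device_migration_info layout proposed here (field order is illustrative, buf
is assumed large enough for one chunk, and error checks on steps b-d are trimmed):

#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

struct mig_info {			/* illustrative mirror, not uapi */
	uint32_t device_state;
	uint32_t reserved;
	uint64_t pending_bytes;
	uint64_t data_offset;
	uint64_t data_size;
};

static int save_iterate(int device_fd, uint64_t mig_off, void *buf)
{
	uint64_t pending, data_offset, data_size;

	for (;;) {
		/* a. read pending_bytes: starts a new iteration and tells the
		 * vendor driver the previous one is done. */
		if (pread(device_fd, &pending, sizeof(pending),
			  mig_off + offsetof(struct mig_info, pending_bytes)) !=
		    sizeof(pending))
			return -errno;
		if (!pending)
			return 0;

		/* b. read data_offset: returns once data has been staged */
		pread(device_fd, &data_offset, sizeof(data_offset),
		      mig_off + offsetof(struct mig_info, data_offset));
		/* c. read data_size */
		pread(device_fd, &data_size, sizeof(data_size),
		      mig_off + offsetof(struct mig_info, data_size));
		/* d. read the staged data (an mmap'd read works as well) */
		pread(device_fd, buf, data_size, mig_off + data_offset);
		/* e. process/send buf, then loop back to a. */
	}
}
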
> > >>>>>
> > >>>>> It seems we always need to end an iteration by reading pending_bytes to
> > >>>>> signal to the vendor driver to release resources, so should the end of
> > >>>>> the loop be:
> > >>>>>
> > >>>>> e. Read pending_bytes
> > >>>>> f. Goto b. or optionally restart next iteration at a.
> > >>>>>
> > >>>>> I think this is defined such that reading data_offset commits resources
> > >>>>> and reading pending_bytes frees them, allowing userspace to restart at
> > >>>>> reading pending_bytes with no side-effects.  Therefore reading
> > >>>>> pending_bytes repeatedly is supported.  Is the same true for
> > >>>>> data_offset and data_size?  It seems reasonable that the vendor driver
> > >>>>> can simply return offset and size for the current buffer if the user
> > >>>>> reads these more than once.
> > >>>>>        
> > >>>>
> > >>>> Right.  
> > >>>
> > >>> Can we add that to the spec?
> > >>>      
> > >>
> > >> ok.
> > >>  
> > >>>>> How is a protocol or device error signaled?  For example, we can have a
> > >>>>> user error where they read data_size before data_offset.  Should the
> > >>>>> vendor driver generate a fault reading data_size in this case.  We can
> > >>>>> also have internal errors in the vendor driver, should the vendor
> > >>>>> driver use a special errno or update device_state autonomously to
> > >>>>> indicate such an error?  
> > >>>>
> > >>>> If there is any error during the sequence, vendor driver can return
> > >>>> error code for next read/write operation, that will terminate the loop
> > >>>> and migration would fail.  
> > >>>
> > >>> Please add to spec.
> > >>>      
> > >>
> > >> Ok
> > >>  
> > >>>>> I believe it's also part of the intended protocol that the user can
> > >>>>> transition from _SAVING|_RUNNING to _SAVING at any point, regardless of
> > >>>>> pending_bytes.  This should be noted.
> > >>>>>         
> > >>>>
> > >>>> Ok. Updating comment.
> > >>>>     
> > >>>>>> + *
> > >>>>>> + * Sequence to be followed while _RESUMING device state:
> > >>>>>> + * While data for this device is available, repeat below steps:
> > >>>>>> + * a. read data_offset from where user application should write data.
> > >>>>>> + * b. write data of data_size to migration region from data_offset.  
> > >>>>>
> > >>>>> Whose's data_size, the _SAVING end or the _RESUMING end?  I think this
> > >>>>> is intended to be the transaction size from the _SAVING source,  
> > >>>>
> > >>>> Not necessarily. data_size could be MIN(transaction size of source,
> > >>>> migration data section). If migration data section is smaller than data
> > >>>> packet size at source, then it has to be broken and iteratively sent.  
> > >>>
> > >>> So you're saying that a transaction from the source is divisible by the
> > >>> user under certain conditions.  What other conditions exist?  
> > >>
> > >> I don't think there are any other conditions than above.
> > >>  
> > >>>   Can the
> > >>> user decide arbitrary sizes less than the MIN() stated above?  This
> > >>> needs to be specified.
> > >>>     
> > >>
> > >> No, User can't decide arbitrary sizes.  
> > > 
> > > TBH, I'd expect a vendor driver that offers a different migration
> > > region size, such that it becomes the user's responsibility to split
> > > transactions should just claim it's not compatible with the source, as
> > > determined by the previously defined compatibility protocol.  If we
> > > really need this requirement, it needs to be justified and the exact
> > > conditions under which the user performs this needs to be specified.
> > >   
> > 
> > Let the user decide whether it wants to support different migration region
> > sizes at source and destination or not, instead of imposing a hard requirement.
> 
> The requirement is set forth by the vendor driver.  The user has no
> choice how big a BAR region is for a vfio-pci device.  Perhaps if they
> have internal knowledge of the device they might be able to only use
> part of the BAR, but as decided previously the migration data is
> intended to be opaque to the user.  If the user gets to decide the
> migration region size then we're back to an interface where the vendor
> driver is required to support arbitrary transaction sizes dictated by
> the user splitting the data.
> 
> > >>>>> but it
> > >>>>> could easily be misinterpreted as reading data_size on the _RESUMING
> > >>>>> end.
> > >>>>>         
> > >>>>>> + * c. write data_size which indicates vendor driver that data is written in
> > >>>>>> + *    staging buffer. Vendor driver should read this data from migration
> > >>>>>> + *    region and resume device's state.  
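
A matching sketch for the resuming side, reusing struct mig_info and the includes
from the saving sketch above; the chunk in buf is written in the same order it
was produced on the source:

static int resume_write_chunk(int device_fd, uint64_t mig_off,
			      const void *buf, uint64_t size)
{
	uint64_t data_offset;

	/* a. ask the vendor driver where the next chunk should be placed */
	if (pread(device_fd, &data_offset, sizeof(data_offset),
		  mig_off + offsetof(struct mig_info, data_offset)) !=
	    sizeof(data_offset))
		return -errno;

	/* b. write the opaque chunk into the data section */
	if (pwrite(device_fd, buf, size, mig_off + data_offset) !=
	    (ssize_t)size)
		return -errno;

	/* c. writing data_size tells the vendor driver the chunk is complete
	 * and can be consumed to restore device state. */
	if (pwrite(device_fd, &size, sizeof(size),
		   mig_off + offsetof(struct mig_info, data_size)) !=
	    sizeof(size))
		return -errno;

	return 0;
}
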
> > >>>>>
> > >>>>> I think we also need to define the error protocol.  The user could
> > >>>>> mis-order transactions or there could be an internal error in the
> > >>>>> vendor driver or device.  Are all read(2)/write(2) operations
> > >>>>> susceptible to defined errnos to signal this?  
> > >>>>
> > >>>> Yes.  
> > >>>
> > >>> And those defined errnos are specified...
> > >>>      
> > >>
> > >> Those could be standard errors like -EINVAL, ENOMEM....  
> > > 
> > > I thought we might specify specific errors to consistently indicate
> > > non-continuable faults among vendor drivers.  Is anything other than
> > > -EAGAIN considered non-fatal to the operation?  For example, could
> > > EEXIST indicate duplicate data that the user has already written but
> > > not be considered a fatal error?  Would EFAULT perhaps indicate a
> > > continuable ordering error?  If any fault indicates the save/resume has
> > > failed, shouldn't the user be able to see the device is in such a state
> > > by reading device_state (except we have no state defined to indicate
> > > that)?
> > >     
> > 
> > Do we have to define what all the standard errors returned here would
> > mean?
> > 
> > Right from the initial versions of the migration reviews we always thought
> > that device_state should only be set by the user; the vendor driver
> > returning an error state was never considered. Returning an error to a
> > read()/write() operation indicates that the device is not able to handle
> > that operation, so the user will decide what action to take next.
> > Now you are proposing to add a state that vendor driver can set, as 
> > defined in my above comment?
> 
> The question is how does the user decide what action to be taken next.
> Defining specific errnos (not all of them) might aid in that decision.
> Maybe the simplest course of action is to require that the user is
> perfect, if they duplicate a transaction or misorder anything on the
> resuming end, we generate an errno on write(2) to the data area and the
> device goes to device_state 110b.  This makes it clear that we are no
> longer in a resuming state and the user needs to start over by
> resetting the device.  An internal error on resume would be handled the
> same.
> 
> On saving, the responsibility is primarily with the vendor driver, but
> transitioning the device_state to invalid and requiring a reset should
> not be taken as lightly as the resume path.  So does any errno on
> read(2) during the saving protocol indicate failure?  Would the vendor
> driver self-transition to !_SAVING in device_state?
> 
> For both cases, do we want to make an exception for -EAGAIN as
> non-fatal?
> 
> > >>>>>    Is it reflected in
> > >>>>> device_state?  
> > >>>>
> > >>>> No.  
> > >>>
> > >>> So a user should do what, just keep trying?
> > >>>     
> > >>
> > >> No, fail migration process. If error is at source or destination then
> > >> user can decide either resume at source or terminate application.  
> > > 
> > > This is describing the expected QEMU protocol resolution, the question
> > > is relative to the vfio API we're defining here.  If any fault in the
> > > save/resume protocol results in the device being unusable, there should
> > > be an indication (perhaps through device_state) that the device is in a
> > > broken state, and the mechanism to put it into a new state should be
> > > defined.  For instance, if the device is resuming, a fault occurs
> > > writing state data to the device, and the user transitions to running.
> > > Is the device incorporating the partial state data into its run state?
> > > I suspect not, and wouldn't that be more obvious if we defined a
> > > protocol where the device can be inspected to be in a bogus state via
> > > reading device_state, at which point we might define performing a
> > > device reset as the only mechanism to change the device_state after
> > > that point.
> > >  
> > 
> > Same as above my comment.
> > 
> > >>>>> What's the recovery protocol?
> > >>>>>         
> > >>>>
> > >>>> On read()/write() failure user should take necessary action.  
> > >>>
> > >>> Where is that necessary action defined?  Can they just try again?  Do
> > >>> they transition in and out of _RESUMING to try again?  Do they need to
> > >>> reset the device?
> > >>>      
> > >>
> > >> User application should decide what action to take on failure, right?
> > >>    "vfio is not prescribing the migration semantics to userspace, it's
> > >> presenting an interface that support the user semantics."  
> > > 
> > > Exactly, we're not defining how QEMU handles a fault in this spec,
> > > we're defining how a user interacting with the device knows a fault has
> > > occurred, can inspect the device to determine that the device is in a
> > > broken state, and the "necessary action" to advance the device forward
> > > to a new state.
> > >   
> > >>>>>> + *
> > >>>>>> + * For user application, data is opaque. User should write data in the same
> > >>>>>> + * order as received.  
> > >>>>>
> > >>>>> Order and transaction size, ie. each data_size chunk is indivisible by
> > >>>>> the user.  
> > >>>>
> > >>>> Transaction size can differ, but order should remain same.  
> > >>>
> > >>> Under what circumstances and to what extent can transaction size
> > >>> differ?  
> > >>
> > >> It depends on the migration region size.
> > >>  
> > >>>   Is the MIN() algorithm above the absolute lower bound or just
> > >>> a suggestion?  
> > >>
> > >>
> > >>  
> > >>>   Is the user allowed to concatenate transactions from the
> > >>> source together on the target if the region is sufficiently large?  
> > >>
> > >> Yes that can be done, because data is just byte stream for user. Vendor
> > >> driver receives the byte stream and knows how to decode it.  
> > > 
> > > But that byte stream is opaque to the user, the vendor driver might
> > > implement it such that every transaction has a header and splitting the
> > > transaction might mean that the truncated transaction no longer fits
> > > the expected size.  If we're lucky, the vendor driver's implementation
> > > might detect that.  If we're not, the vendor driver might misinterpret
> > > the next packet.  I think if the user is to consider all data as
> > > opaque, they must also consider every transaction as indivisible or
> > > else we're assuming something about the contents of that transaction.
> > >   
> > 
> > The user shouldn't make assumptions about the contents of transactions.
> > I think the vendor driver should consider incoming data as a byte stream,
> > and decoding of packets should not be based on the migration region size.
> 
> I think this likely matches your implementation, but even defining the
> migration stream as a byte stream imposes implementation constraints on
> other vendor drivers.  They may intend to implement it as discrete
> packets with headers known only to them.  We can't both say it's "opaque
> and left to the vendor driver" and also define it as a position-less
> byte stream.  Thanks,
> 
> Alex
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 4/5] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2019-12-20  0:58                   ` Yan Zhao
@ 2020-01-03 19:44                     ` Dr. David Alan Gilbert
  2020-01-04  3:53                       ` Yan Zhao
  0 siblings, 1 reply; 44+ messages in thread
From: Dr. David Alan Gilbert @ 2020-01-03 19:44 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Kirti Wankhede, alex.williamson, cjia, Tian, Kevin, Yang, Ziye,
	Liu, Changpeng, Liu, Yi L, mlevitsk, eskultet, cohuck,
	jonathan.davies, eauger, aik, pasic, felipe,
	Zhengxiao.zx@Alibaba-inc.com, shuangtai.tst, Ken.Xue, Wang,
	Zhi A, qemu-devel, kvm

* Yan Zhao (yan.y.zhao@intel.com) wrote:
> On Fri, Dec 20, 2019 at 12:21:45AM +0800, Kirti Wankhede wrote:
> > 
> > 
> > On 12/19/2019 6:27 AM, Yan Zhao wrote:
> > > On Thu, Dec 19, 2019 at 04:05:52AM +0800, Dr. David Alan Gilbert wrote:
> > >> * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > >>> On Tue, Dec 17, 2019 at 07:47:05PM +0800, Kirti Wankhede wrote:
> > >>>>
> > >>>>
> > >>>> On 12/17/2019 3:21 PM, Yan Zhao wrote:
> > >>>>> On Tue, Dec 17, 2019 at 05:24:14PM +0800, Kirti Wankhede wrote:
> > >>>>>>>>     
> > >>>>>>>>     		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> > >>>>>>>>     			-EFAULT : 0;
> > >>>>>>>> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
> > >>>>>>>> +		struct vfio_iommu_type1_dirty_bitmap range;
> > >>>>>>>> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
> > >>>>>>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
> > >>>>>>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> > >>>>>>>> +		int ret;
> > >>>>>>>> +
> > >>>>>>>> +		if (!iommu->v2)
> > >>>>>>>> +			return -EACCES;
> > >>>>>>>> +
> > >>>>>>>> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
> > >>>>>>>> +				    bitmap);
> > >>>>>>>> +
> > >>>>>>>> +		if (copy_from_user(&range, (void __user *)arg, minsz))
> > >>>>>>>> +			return -EFAULT;
> > >>>>>>>> +
> > >>>>>>>> +		if (range.argsz < minsz || range.flags & ~mask)
> > >>>>>>>> +			return -EINVAL;
> > >>>>>>>> +
> > >>>>>>>> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
> > >>>>>>>> +			iommu->dirty_page_tracking = true;
> > >>>>>>>> +			return 0;
> > >>>>>>>> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
> > >>>>>>>> +			iommu->dirty_page_tracking = false;
> > >>>>>>>> +
> > >>>>>>>> +			mutex_lock(&iommu->lock);
> > >>>>>>>> +			vfio_remove_unpinned_from_dma_list(iommu);
> > >>>>>>>> +			mutex_unlock(&iommu->lock);
> > >>>>>>>> +			return 0;
> > >>>>>>>> +
> > >>>>>>>> +		} else if (range.flags &
> > >>>>>>>> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
> > >>>>>>>> +			uint64_t iommu_pgmask;
> > >>>>>>>> +			unsigned long pgshift = __ffs(range.pgsize);
> > >>>>>>>> +			unsigned long *bitmap;
> > >>>>>>>> +			long bsize;
> > >>>>>>>> +
> > >>>>>>>> +			iommu_pgmask =
> > >>>>>>>> +			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> > >>>>>>>> +
> > >>>>>>>> +			if (((range.pgsize - 1) & iommu_pgmask) !=
> > >>>>>>>> +			    (range.pgsize - 1))
> > >>>>>>>> +				return -EINVAL;
> > >>>>>>>> +
> > >>>>>>>> +			if (range.iova & iommu_pgmask)
> > >>>>>>>> +				return -EINVAL;
> > >>>>>>>> +			if (!range.size || range.size > SIZE_MAX)
> > >>>>>>>> +				return -EINVAL;
> > >>>>>>>> +			if (range.iova + range.size < range.iova)
> > >>>>>>>> +				return -EINVAL;
> > >>>>>>>> +
> > >>>>>>>> +			bsize = verify_bitmap_size(range.size >> pgshift,
> > >>>>>>>> +						   range.bitmap_size);
> > >>>>>>>> +			if (bsize)
> > >>>>>>>> +				return ret;
> > >>>>>>>> +
> > >>>>>>>> +			bitmap = kmalloc(bsize, GFP_KERNEL);
> > >>>>>>>> +			if (!bitmap)
> > >>>>>>>> +				return -ENOMEM;
> > >>>>>>>> +
> > >>>>>>>> +			ret = copy_from_user(bitmap,
> > >>>>>>>> +			     (void __user *)range.bitmap, bsize) ? -EFAULT : 0;
> > >>>>>>>> +			if (ret)
> > >>>>>>>> +				goto bitmap_exit;
> > >>>>>>>> +
> > >>>>>>>> +			iommu->dirty_page_tracking = false;
> > >>>>>>> why iommu->dirty_page_tracking is false here?
> > >>>>>>> suppose this ioctl can be called several times.
> > >>>>>>>
> > >>>>>>
> > >>>>>> This ioctl can be called several times, but once this ioctl is called
> > >>>>>> that means vCPUs are stopped and VFIO devices are stopped (i.e. in
> > >>>>>> stop-and-copy phase) and dirty pages bitmap are being queried by user.
> > >>>>>>
> > >>>>> can't agree that VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP can only be
> > >>>>> called in stop-and-copy phase.
> > >>>>> As stated in last version, this will cause QEMU to get a wrong expectation
> > >>>>> of VM downtime and this is also the reason for previously pinned pages
> > >>>>> before log_sync cannot be treated as dirty. If this get bitmap ioctl can
> > >>>>> be called early in save_setup phase, then it's no problem even all ram
> > >>>>> is dirty.
> > >>>>>
> > >>>>
> > >>>> Device can also write to pages which are pinned, and then there is no
> > >>>> way to know pages dirtied by device during pre-copy phase.
> > >>>> If user asks for dirty bitmap in pre-copy phase, even then user will have to
> > >>>> query dirty bitmap in stop-and-copy phase where this will be superset
> > >>>> including all pages reported during pre-copy. Then instead of copying
> > >>>> all pages twice, its better to do it once during stop-and-copy phase.
> > >>>>
> > >>> I think the flow should be like this:
> > >>> 1. save_setup --> GET_BITMAP ioctl --> return bitmap for currently + previously
> > >>> pinned pages and clean all previously pinned pages
> > >>>
> > >>> 2. save_pending --> GET_BITMAP ioctl  --> return bitmap of (currently
> > >>> pinned pages + previously pinned pages since last clean) and clean all
> > >>> previously pinned pages
> > >>>
> > >>> 3. save_complete_precopy --> GET_BITMAP ioctl --> return bitmap of (currently
> > >>> pinned pages + previously pinned pages since last clean) and clean all
> > >>> previously pinned pages
> > >>>
> > >>>
> > >>> Copying pinned pages multiple times is unavoidable because those pinned pages
> > >>> are always treated as dirty. That's per vendor's implementation.
> > >>> But if the pinned pages are not reported as dirty before stop-and-copy phase,
> > >>> QEMU would think dirty pages has converged
> > >>> and enter blackout phase, making downtime_limit severely incorrect.
> > >>
> > >> I'm not sure it's any worse.
> > >> I *think* we do a last sync after we've decided to go to stop-and-copy;
> > >> wont that then mark all those pages as dirty again, so it'll have the
> > >> same behaviour?
> > > No. something will be different.
> > > currently, in kirti's implementation, if GET_BITMAP ioctl is called only
> > > once in stop-and-copy phase, then before that phase, QEMU does not know those
> > > pages are dirty.
> > > If we can report those dirty pages earlier before stop-and-copy phase,
> > > QEMU can at least copy other pages to reduce dirty pages to below threshold.
> > > 
> > > Take an example, let's assume those vfio dirty pages are 1Gb, and network speed is
> > > also 1Gb. Expected vm downtime is 1s.
> > > If before stop-and-copy phase, dirty pages produced by other pages is
> > > also 1Gb. To meet the expected vm downtime, QEMU should copy pages to
> > > let dirty pages be less than 1Gb, otherwise, it should not complete live
> > > migration.
> > > If vfio does not report this 1Gb dirty pages, QEMU would think there's
> > > only 1Gb and stop the vm. It would then find out there's actually 2Gb and vm
> > > downtime is 2s.
> > > Though the expected vm downtime is always not exactly the same as the
> > > true vm downtime, it should be caused by rapid dirty page rate, which is
> > > not predictable.
> > > Right?
> > > 
> > 
> > If you report vfio dirty pages 1Gb before stop-and-copy phase (i.e. in 
> > pre-copy phase), enter into stop-and-copy phase, how will you know which 
> > and how many pages are dirtied by device from the time when pages copied 
> > in pre-copy phase to that time where device is stopped? You don't have a 
> > way to know which pages are dirtied by device. So ideally device can 
> > write to all pages which are pinned. Then we have to mark all those 
> > pinned pages dirty in stop-and-copy phase, 1Gb, and copy to destination. 
> > Now you had copied same pages twice. Shouldn't we try not to copy pages 
> > twice?
> >
> For mdevs that rely on treating all pinned pages as dirty pages, under the
> above mentioned condition, repeated page copying can be avoided by
> (1) adding an ioctl to get size of dirty pages and report it to QEMU through
> vfio_save_pending() interface
> (2) in this GET_BITMAP ioctl, empty bitmap is returned until stop-and-copy phase.
> 
> But for devices who know fine-grained dirty pages, (e.g. devices have dirty page
> tracking in hardware), GET_BITMAP ioctl has to return incremental dirty
> bitmaps in each iteration and step (1) should return 0 to avoid 2*size
> of dirty page reported.
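
A user-side sketch of that flow, written against the ioctl and flags this series
adds (field names are the ones used in the hunk quoted above; this only builds
against the header patched by the series, and the .bitmap cast may need adjusting
to its declared type):

#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>		/* as patched by this series */

/* START is issued once at migration setup, GET_BITMAP per log_sync,
 * STOP on completion or cancellation. */
static int dirty_pages_op(int container_fd, uint32_t flags,
			  uint64_t iova, uint64_t size, uint64_t pgsize,
			  unsigned long *bitmap, uint64_t bitmap_size)
{
	struct vfio_iommu_type1_dirty_bitmap range = {
		.argsz = sizeof(range),
		.flags = flags,		/* _START, _STOP or _GET_BITMAP */
	};

	if (flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
		range.pgsize      = pgsize;	/* a supported IOMMU page size */
		range.iova        = iova;
		range.size        = size;
		range.bitmap_size = bitmap_size;
		range.bitmap      = (uintptr_t)bitmap;
	}

	return ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, &range) ? -errno : 0;
}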

I think you're right that something is needed; but it's starting to get
a bit complex.

As well as the size, you need the addresses to know which areas to avoid
- it's not just a simple size, because I think you only care about areas
that the guest has registered/pinned to the device; so it would have to
be a list somehow.  So then if you got that list you'd add the size
to the amount you knew was pending, and avoid sending that area until
stop-and-copy.

However, if it is only areas that the guest has registered, then what
happens if the guest (de)registers an area during the migration process?
That says the list itself has to be refreshed.  So it's getting messy.

Dave

> Thanks
> Yan
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 4/5] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2020-01-03 19:44                     ` Dr. David Alan Gilbert
@ 2020-01-04  3:53                       ` Yan Zhao
  0 siblings, 0 replies; 44+ messages in thread
From: Yan Zhao @ 2020-01-04  3:53 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Kirti Wankhede, alex.williamson, cjia, Tian, Kevin, Yang, Ziye,
	Liu, Changpeng, Liu, Yi L, mlevitsk, eskultet, cohuck,
	jonathan.davies, eauger, aik, pasic, felipe,
	Zhengxiao.zx@Alibaba-inc.com, shuangtai.tst, Ken.Xue, Wang,
	Zhi A, qemu-devel, kvm

On Sat, Jan 04, 2020 at 03:44:34AM +0800, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > On Fri, Dec 20, 2019 at 12:21:45AM +0800, Kirti Wankhede wrote:
> > > 
> > > 
> > > On 12/19/2019 6:27 AM, Yan Zhao wrote:
> > > > On Thu, Dec 19, 2019 at 04:05:52AM +0800, Dr. David Alan Gilbert wrote:
> > > >> * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > > >>> On Tue, Dec 17, 2019 at 07:47:05PM +0800, Kirti Wankhede wrote:
> > > >>>>
> > > >>>>
> > > >>>> On 12/17/2019 3:21 PM, Yan Zhao wrote:
> > > >>>>> On Tue, Dec 17, 2019 at 05:24:14PM +0800, Kirti Wankhede wrote:
> > > >>>>>>>>     
> > > >>>>>>>>     		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> > > >>>>>>>>     			-EFAULT : 0;
> > > >>>>>>>> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
> > > >>>>>>>> +		struct vfio_iommu_type1_dirty_bitmap range;
> > > >>>>>>>> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
> > > >>>>>>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
> > > >>>>>>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> > > >>>>>>>> +		int ret;
> > > >>>>>>>> +
> > > >>>>>>>> +		if (!iommu->v2)
> > > >>>>>>>> +			return -EACCES;
> > > >>>>>>>> +
> > > >>>>>>>> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
> > > >>>>>>>> +				    bitmap);
> > > >>>>>>>> +
> > > >>>>>>>> +		if (copy_from_user(&range, (void __user *)arg, minsz))
> > > >>>>>>>> +			return -EFAULT;
> > > >>>>>>>> +
> > > >>>>>>>> +		if (range.argsz < minsz || range.flags & ~mask)
> > > >>>>>>>> +			return -EINVAL;
> > > >>>>>>>> +
> > > >>>>>>>> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
> > > >>>>>>>> +			iommu->dirty_page_tracking = true;
> > > >>>>>>>> +			return 0;
> > > >>>>>>>> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
> > > >>>>>>>> +			iommu->dirty_page_tracking = false;
> > > >>>>>>>> +
> > > >>>>>>>> +			mutex_lock(&iommu->lock);
> > > >>>>>>>> +			vfio_remove_unpinned_from_dma_list(iommu);
> > > >>>>>>>> +			mutex_unlock(&iommu->lock);
> > > >>>>>>>> +			return 0;
> > > >>>>>>>> +
> > > >>>>>>>> +		} else if (range.flags &
> > > >>>>>>>> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
> > > >>>>>>>> +			uint64_t iommu_pgmask;
> > > >>>>>>>> +			unsigned long pgshift = __ffs(range.pgsize);
> > > >>>>>>>> +			unsigned long *bitmap;
> > > >>>>>>>> +			long bsize;
> > > >>>>>>>> +
> > > >>>>>>>> +			iommu_pgmask =
> > > >>>>>>>> +			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> > > >>>>>>>> +
> > > >>>>>>>> +			if (((range.pgsize - 1) & iommu_pgmask) !=
> > > >>>>>>>> +			    (range.pgsize - 1))
> > > >>>>>>>> +				return -EINVAL;
> > > >>>>>>>> +
> > > >>>>>>>> +			if (range.iova & iommu_pgmask)
> > > >>>>>>>> +				return -EINVAL;
> > > >>>>>>>> +			if (!range.size || range.size > SIZE_MAX)
> > > >>>>>>>> +				return -EINVAL;
> > > >>>>>>>> +			if (range.iova + range.size < range.iova)
> > > >>>>>>>> +				return -EINVAL;
> > > >>>>>>>> +
> > > >>>>>>>> +			bsize = verify_bitmap_size(range.size >> pgshift,
> > > >>>>>>>> +						   range.bitmap_size);
> > > >>>>>>>> +			if (bsize)
> > > >>>>>>>> +				return ret;
> > > >>>>>>>> +
> > > >>>>>>>> +			bitmap = kmalloc(bsize, GFP_KERNEL);
> > > >>>>>>>> +			if (!bitmap)
> > > >>>>>>>> +				return -ENOMEM;
> > > >>>>>>>> +
> > > >>>>>>>> +			ret = copy_from_user(bitmap,
> > > >>>>>>>> +			     (void __user *)range.bitmap, bsize) ? -EFAULT : 0;
> > > >>>>>>>> +			if (ret)
> > > >>>>>>>> +				goto bitmap_exit;
> > > >>>>>>>> +
> > > >>>>>>>> +			iommu->dirty_page_tracking = false;
> > > >>>>>>> why iommu->dirty_page_tracking is false here?
> > > >>>>>>> suppose this ioctl can be called several times.
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>> This ioctl can be called several times, but once this ioctl is called
> > > >>>>>> that means vCPUs are stopped and VFIO devices are stopped (i.e. in
> > > >>>>>> stop-and-copy phase) and dirty pages bitmap are being queried by user.
> > > >>>>>>
> > > >>>>> can't agree that VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP can only be
> > > >>>>> called in stop-and-copy phase.
> > > >>>>> As stated in last version, this will cause QEMU to get a wrong expectation
> > > >>>>> of VM downtime and this is also the reason for previously pinned pages
> > > >>>>> before log_sync cannot be treated as dirty. If this get bitmap ioctl can
> > > >>>>> be called early in save_setup phase, then it's no problem even all ram
> > > >>>>> is dirty.
> > > >>>>>
> > > >>>>
> > > >>>> Device can also write to pages which are pinned, and then there is no
> > > >>>> way to know pages dirtied by device during pre-copy phase.
> > > >>>> If user asks for dirty bitmap in pre-copy phase, even then user will have to
> > > >>>> query dirty bitmap in stop-and-copy phase where this will be superset
> > > >>>> including all pages reported during pre-copy. Then instead of copying
> > > >>>> all pages twice, its better to do it once during stop-and-copy phase.
> > > >>>>
> > > >>> I think the flow should be like this:
> > > >>> 1. save_setup --> GET_BITMAP ioctl --> return bitmap for currently + previously
> > > >>> pinned pages and clean all previously pinned pages
> > > >>>
> > > >>> 2. save_pending --> GET_BITMAP ioctl  --> return bitmap of (currently
> > > >>> pinned pages + previously pinned pages since last clean) and clean all
> > > >>> previously pinned pages
> > > >>>
> > > >>> 3. save_complete_precopy --> GET_BITMAP ioctl --> return bitmap of (currently
> > > >>> pinned pages + previously pinned pages since last clean) and clean all
> > > >>> previously pinned pages
> > > >>>
> > > >>>
> > > >>> Copying pinned pages multiple times is unavoidable because those pinned pages
> > > >>> are always treated as dirty. That's per vendor's implementation.
> > > >>> But if the pinned pages are not reported as dirty before stop-and-copy phase,
> > > >>> QEMU would think dirty pages has converged
> > > >>> and enter blackout phase, making downtime_limit severely incorrect.
> > > >>
> > > >> I'm not sure it's any worse.
> > > >> I *think* we do a last sync after we've decided to go to stop-and-copy;
> > > >> wont that then mark all those pages as dirty again, so it'll have the
> > > >> same behaviour?
> > > > No. something will be different.
> > > > currently, in kirti's implementation, if GET_BITMAP ioctl is called only
> > > > once in stop-and-copy phase, then before that phase, QEMU does not know those
> > > > pages are dirty.
> > > > If we can report those dirty pages earlier before stop-and-copy phase,
> > > > QEMU can at least copy other pages to reduce dirty pages to below threshold.
> > > > 
> > > > Take an example, let's assume those vfio dirty pages are 1Gb, and network speed is
> > > > also 1Gb. Expected vm downtime is 1s.
> > > > If before stop-and-copy phase, dirty pages produced by other pages is
> > > > also 1Gb. To meet the expected vm downtime, QEMU should copy pages to
> > > > let dirty pages be less than 1Gb, otherwise, it should not complete live
> > > > migration.
> > > > If vfio does not report this 1Gb dirty pages, QEMU would think there's
> > > > only 1Gb and stop the vm. It would then find out there's actually 2Gb and vm
> > > > downtime is 2s.
> > > > Though the expected vm downtime is always not exactly the same as the
> > > > true vm downtime, it should be caused by rapid dirty page rate, which is
> > > > not predictable.
> > > > Right?
> > > > 
> > > 
> > > If you report vfio dirty pages 1Gb before stop-and-copy phase (i.e. in 
> > > pre-copy phase), enter into stop-and-copy phase, how will you know which 
> > > and how many pages are dirtied by device from the time when pages copied 
> > > in pre-copy phase to that time where device is stopped? You don't have a 
> > > way to know which pages are dirtied by device. So ideally device can 
> > > write to all pages which are pinned. Then we have to mark all those 
> > > pinned pages dirty in stop-and-copy phase, 1Gb, and copy to destination. 
> > > Now you had copied same pages twice. Shouldn't we try not to copy pages 
> > > twice?
> > >
> > For mdevs that rely on treating all pinned pages as dirty pages, under the
> > above mentioned condition, repeated page copying can be avoided by
> > (1) adding an ioctl to get size of dirty pages and report it to QEMU through
> > vfio_save_pending() interface
> > (2) in this GET_BITMAP ioctl, empty bitmap is returned until stop-and-copy phase.
> > 
> > But for devices who know fine-grained dirty pages, (e.g. devices have dirty page
> > tracking in hardware), GET_BITMAP ioctl has to return incremental dirty
> > bitmaps in each iteration and step (1) should return 0 to avoid 2*size
> > of dirty page reported.
> 
> I think you're right that something is needed; but it's starting to get
> a bit complex.
> 
> As well as the size, you need the addresses to know which areas to avoid
> - it's not just a simple size, because I think you only care about areas
> that the guest has registered/pinned to the device; so it would have to
> be a list somehow.  So then if you got that list you'd add the size
> to the amount you knew was pending, and avoid sending that area until
> stop-and-copy.
> 
> However, if it is only areas that the guest has registered, then what
> happens if the guest (de)registers an area during the migration process?
> That says the list itself has to be refreshed.  So it's getting messy.
> 
> Dave
>

hi Dave,
I think this GET_BITMAP interface is only for system memory (not device
internal memory) and Kirti has already maintained a vpfn list to record
dirty pages list. It's constantly refreshed during migration.
a. for devices who do not know fine-grained dirty pages,
they can report dirty page size (in vfio module) and not report dirty bitmap
to qemu ram module until stop-and-copy phase.
As it does not report dirty bitmaps, dirty status of unpinned pages during
migration will not be cleared.
b. for devices who know fine-grained dirty pages,
it can report dirty page size (to qemu ram module, not through vfio module)
by reporting dirty bitmap to qemu ram module.
dirty status of unpinned pages during migration is cleared after each iteration.

The key is to differentiate whether a device has the ability to track fine-grained
dirty pages.
Though more complex than current implementation, it's right. Agree? :)

Thanks
Yan

> > 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
  2020-01-02 18:25                 ` Dr. David Alan Gilbert
@ 2020-01-06 23:18                   ` Alex Williamson
  2020-01-07  7:28                     ` Kirti Wankhede
  2020-01-07  9:57                     ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 44+ messages in thread
From: Alex Williamson @ 2020-01-06 23:18 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Kirti Wankhede, cjia, kevin.tian, ziye.yang, changpeng.liu,
	yi.l.liu, mlevitsk, eskultet, cohuck, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	zhi.a.wang, yan.y.zhao, qemu-devel, kvm

On Thu, 2 Jan 2020 18:25:37 +0000
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Alex Williamson (alex.williamson@redhat.com) wrote:
> > On Fri, 20 Dec 2019 01:40:35 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> > > On 12/19/2019 10:57 PM, Alex Williamson wrote:
> > > 
> > > <Snip>
> > >   
> 
> <snip>
> 
> > > 
> > > If device state it at pre-copy state (011b).
> > > Transition, i.e., write to device state as stop-and-copy state (010b) 
> > > failed, then by previous state I meant device should return pre-copy 
> > > state(011b), i.e. previous state which was successfully set, or as you 
> > > said current state which was successfully set.  
> > 
> > Yes, the point I'm trying to make is that this version of the spec
> > tries to tell the user what they should do upon error according to our
> > current interpretation of the QEMU migration protocol.  We're not
> > defining the QEMU migration protocol, we're defining something that can
> > be used in a way to support that protocol.  So I think we should be
> > concerned with defining our spec, for example my proposal would be: "If
> > a state transition fails the user can read device_state to determine the
> > current state of the device.  This should be the previous state of the
> > device unless the vendor driver has encountered an internal error, in
> > which case the device may report the invalid device_state 110b.  The
> > user must use the device reset ioctl in order to recover the device
> > from this state.  If the device is indicated in a valid device state
> > via reading device_state, the user may attempt to transition the device
> > to any valid state reachable from the current state."  
> 
> We might want to be able to distinguish between:
>   a) The device has failed and needs a reset
>   b) The migration has failed

I think the above provides this.  For Kirti's example above of
transitioning from pre-copy to stop-and-copy, the device could refuse
to transition to stop-and-copy, generating an error on the write() of
device_state.  The user re-reading device_state would allow them to
determine the current device state, still in pre-copy or failed.  Only
the latter would require a device reset.

> If some part of the devices mechanics for migration fail, but the device
> is otherwise operational then we should be able to decide to fail the
> migration without taking the device down, which might be very bad for
> the VM.
> Losing a VM during migration due to a problem with migration really
> annoys users; it's one thing the migration failing, but taking the VM
> out as well really gets to them.
> 
> Having the device automatically transition back to the 'running' state
> seems a bad idea to me; much better to tell the hypervisor and provide
> it with a way to clean up; for example, imagine a system with multiple
> devices that are being migrated, most of them have happily transitioned
> to stop-and-copy, but then the last device decides to fail - so now
> someone is going to have to take all of them back to running.

Right, unless I'm missing one, it seems invalid->running is the only
self transition the device should make, though still by way of user
interaction via the reset ioctl.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
  2020-01-06 23:18                   ` Alex Williamson
@ 2020-01-07  7:28                     ` Kirti Wankhede
  2020-01-07 17:09                       ` Alex Williamson
  2020-01-07  9:57                     ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 44+ messages in thread
From: Kirti Wankhede @ 2020-01-07  7:28 UTC (permalink / raw)
  To: Alex Williamson, Dr. David Alan Gilbert
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, jonathan.davies, eauger, aik, pasic, felipe,
	Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang, yan.y.zhao,
	qemu-devel, kvm



On 1/7/2020 4:48 AM, Alex Williamson wrote:
> On Thu, 2 Jan 2020 18:25:37 +0000
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
>> * Alex Williamson (alex.williamson@redhat.com) wrote:
>>> On Fri, 20 Dec 2019 01:40:35 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>    
>>>> On 12/19/2019 10:57 PM, Alex Williamson wrote:
>>>>
>>>> <Snip>
>>>>    
>>
>> <snip>
>>
>>>>
>>>> If device state it at pre-copy state (011b).
>>>> Transition, i.e., write to device state as stop-and-copy state (010b)
>>>> failed, then by previous state I meant device should return pre-copy
>>>> state(011b), i.e. previous state which was successfully set, or as you
>>>> said current state which was successfully set.
>>>
>>> Yes, the point I'm trying to make is that this version of the spec
>>> tries to tell the user what they should do upon error according to our
>>> current interpretation of the QEMU migration protocol.  We're not
>>> defining the QEMU migration protocol, we're defining something that can
>>> be used in a way to support that protocol.  So I think we should be
>>> concerned with defining our spec, for example my proposal would be: "If
>>> a state transition fails the user can read device_state to determine the
>>> current state of the device.  This should be the previous state of the
>>> device unless the vendor driver has encountered an internal error, in
>>> which case the device may report the invalid device_state 110b.  The
>>> user must use the device reset ioctl in order to recover the device
>>> from this state.  If the device is indicated in a valid device state
>>> via reading device_state, the user may attempt to transition the device
>>> to any valid state reachable from the current state."
>>
>> We might want to be able to distinguish between:
>>    a) The device has failed and needs a reset
>>    b) The migration has failed
> 
> I think the above provides this.  For Kirti's example above of
> transitioning from pre-copy to stop-and-copy, the device could refuse
> to transition to stop-and-copy, generating an error on the write() of
> device_state.  The user re-reading device_state would allow them to
> determine the current device state, still in pre-copy or failed.  Only
> the latter would require a device reset.
> 
>> If some part of the devices mechanics for migration fail, but the device
>> is otherwise operational then we should be able to decide to fail the
>> migration without taking the device down, which might be very bad for
>> the VM.
>> Losing a VM during migration due to a problem with migration really
>> annoys users; it's one thing the migration failing, but taking the VM
>> out as well really gets to them.
>>
>> Having the device automatically transition back to the 'running' state
>> seems a bad idea to me; much better to tell the hypervisor and provide
>> it with a way to clean up; for example, imagine a system with multiple
>> devices that are being migrated, most of them have happily transitioned
>> to stop-and-copy, but then the last device decides to fail - so now
>> someone is going to have to take all of them back to running.
> 
> Right, unless I'm missing one, it seems invalid->running is the only
> self transition the device should make, though still by way of user
> interaction via the reset ioctl.  Thanks,
> 

Instead of the vendor driver using the invalid state on device failure, I
think it is better to reserve one bit in device_state which the vendor driver
can set on device failure. When the error bit is set, the other bits in
device_state should be ignored.
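
For illustration, one hypothetical encoding of that idea; the bit name and
position are invented here, nothing like it exists in the series yet, and
device_fd/mig_off/DEVICE_STATE_OFF are as in the sketch earlier in the thread:

#define VFIO_DEVICE_STATE_ERROR	(1 << 3)	/* hypothetical "device failed" bit */

	uint32_t state;

	pread(device_fd, &state, sizeof(state), mig_off + DEVICE_STATE_OFF);
	if (state & VFIO_DEVICE_STATE_ERROR) {
		/* remaining state bits are meaningless; only a reset recovers */
		ioctl(device_fd, VFIO_DEVICE_RESET);
	}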

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
  2020-01-06 23:18                   ` Alex Williamson
  2020-01-07  7:28                     ` Kirti Wankhede
@ 2020-01-07  9:57                     ` Dr. David Alan Gilbert
  2020-01-07 16:54                       ` Alex Williamson
  1 sibling, 1 reply; 44+ messages in thread
From: Dr. David Alan Gilbert @ 2020-01-07  9:57 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, cjia, kevin.tian, ziye.yang, changpeng.liu,
	yi.l.liu, mlevitsk, eskultet, cohuck, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	zhi.a.wang, yan.y.zhao, qemu-devel, kvm

* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Thu, 2 Jan 2020 18:25:37 +0000
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> > * Alex Williamson (alex.williamson@redhat.com) wrote:
> > > On Fri, 20 Dec 2019 01:40:35 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >   
> > > > On 12/19/2019 10:57 PM, Alex Williamson wrote:
> > > > 
> > > > <Snip>
> > > >   
> > 
> > <snip>
> > 
> > > > 
> > > > If device state it at pre-copy state (011b).
> > > > Transition, i.e., write to device state as stop-and-copy state (010b) 
> > > > failed, then by previous state I meant device should return pre-copy 
> > > > state(011b), i.e. previous state which was successfully set, or as you 
> > > > said current state which was successfully set.  
> > > 
> > > Yes, the point I'm trying to make is that this version of the spec
> > > tries to tell the user what they should do upon error according to our
> > > current interpretation of the QEMU migration protocol.  We're not
> > > defining the QEMU migration protocol, we're defining something that can
> > > be used in a way to support that protocol.  So I think we should be
> > > concerned with defining our spec, for example my proposal would be: "If
> > > a state transition fails the user can read device_state to determine the
> > > current state of the device.  This should be the previous state of the
> > > device unless the vendor driver has encountered an internal error, in
> > > which case the device may report the invalid device_state 110b.  The
> > > user must use the device reset ioctl in order to recover the device
> > > from this state.  If the device is indicated in a valid device state
> > > via reading device_state, the user may attempt to transition the device
> > > to any valid state reachable from the current state."  
> > 
> > We might want to be able to distinguish between:
> >   a) The device has failed and needs a reset
> >   b) The migration has failed
> 
> I think the above provides this.  For Kirti's example above of
> transitioning from pre-copy to stop-and-copy, the device could refuse
> to transition to stop-and-copy, generating an error on the write() of
> device_state.  The user re-reading device_state would allow them to
> determine the current device state, still in pre-copy or failed.  Only
> the latter would require a device reset.

OK - but that doesn't give you any way to figure out 'why' it failed;
I guess I was expecting you to then read an 'error' register to find
out what happened.
Assuming the write() to transition to stop-and-copy fails and you're
still in pre-copy, what's the defined thing you're supposed to do next?
Decide migration has failed and then do a write() to transition to running?

> > If some part of the devices mechanics for migration fail, but the device
> > is otherwise operational then we should be able to decide to fail the
> > migration without taking the device down, which might be very bad for
> > the VM.
> > Losing a VM during migration due to a problem with migration really
> > annoys users; it's one thing the migration failing, but taking the VM
> > out as well really gets to them.
> > 
> > Having the device automatically transition back to the 'running' state
> > seems a bad idea to me; much better to tell the hypervisor and provide
> > it with a way to clean up; for example, imagine a system with multiple
> > devices that are being migrated, most of them have happily transitioned
> > to stop-and-copy, but then the last device decides to fail - so now
> > someone is going to have to take all of them back to running.
> 
> Right, unless I'm missing one, it seems invalid->running is the only
> self transition the device should make, though still by way of user
> interaction via the reset ioctl.  Thanks,
> 
Dave
> Alex
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
  2020-01-07  9:57                     ` Dr. David Alan Gilbert
@ 2020-01-07 16:54                       ` Alex Williamson
  2020-01-07 17:50                         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2020-01-07 16:54 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Kirti Wankhede, cjia, kevin.tian, ziye.yang, changpeng.liu,
	yi.l.liu, mlevitsk, eskultet, cohuck, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	zhi.a.wang, yan.y.zhao, qemu-devel, kvm

On Tue, 7 Jan 2020 09:57:40 +0000
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Alex Williamson (alex.williamson@redhat.com) wrote:
> > On Thu, 2 Jan 2020 18:25:37 +0000
> > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> >   
> > > * Alex Williamson (alex.williamson@redhat.com) wrote:  
> > > > On Fri, 20 Dec 2019 01:40:35 +0530
> > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > >     
> > > > > On 12/19/2019 10:57 PM, Alex Williamson wrote:
> > > > > 
> > > > > <Snip>
> > > > >     
> > > 
> > > <snip>
> > >   
> > > > > 
> > > > > If device state it at pre-copy state (011b).
> > > > > Transition, i.e., write to device state as stop-and-copy state (010b) 
> > > > > failed, then by previous state I meant device should return pre-copy 
> > > > > state(011b), i.e. previous state which was successfully set, or as you 
> > > > > said current state which was successfully set.    
> > > > 
> > > > Yes, the point I'm trying to make is that this version of the spec
> > > > tries to tell the user what they should do upon error according to our
> > > > current interpretation of the QEMU migration protocol.  We're not
> > > > defining the QEMU migration protocol, we're defining something that can
> > > > be used in a way to support that protocol.  So I think we should be
> > > > concerned with defining our spec, for example my proposal would be: "If
> > > > a state transition fails the user can read device_state to determine the
> > > > current state of the device.  This should be the previous state of the
> > > > device unless the vendor driver has encountered an internal error, in
> > > > which case the device may report the invalid device_state 110b.  The
> > > > user must use the device reset ioctl in order to recover the device
> > > > from this state.  If the device is indicated in a valid device state
> > > > via reading device_state, the user may attempt to transition the device
> > > > to any valid state reachable from the current state."    
> > > 
> > > We might want to be able to distinguish between:
> > >   a) The device has failed and needs a reset
> > >   b) The migration has failed  
> > 
> > I think the above provides this.  For Kirti's example above of
> > transitioning from pre-copy to stop-and-copy, the device could refuse
> > to transition to stop-and-copy, generating an error on the write() of
> > device_state.  The user re-reading device_state would allow them to
> > determine the current device state, still in pre-copy or failed.  Only
> > the latter would require a device reset.  
> 
> OK - but that doesn't give you any way to figure out 'why' it failed;
> I guess I was expecting you to then read an 'error' register to find
> out what happened.
> Assuming the write() to transition to stop-and-copy fails and you're
> still in pre-copy, what's the defined thing you're supposed to do next?
> Decide migration has failed and then do a write() to transition to running?

Defining semantics for an error register seems like a project on its
own.  We do have flags, we could use them to add an error register
later, but I think it's only going to rat hole this effort to try to
incorporate that now.  The state machine is fairly small, so in the
scenario you present, I think the user would assume a failure at
pre-copy to stop-and-copy transition would fail the migration and the
device could go back to running state.  If the device then fails to
return to the running state, we might be stuck with a device with
reduced performance or overhead and the user could warn about that and
continue with the device as-is.  The vendor drivers could make use of
-EAGAIN on transition failure to indicate a temporary issue, but
otherwise the user should probably consider it a persistent error until
either a device reset or start of a new migration sequence (ie. return
to running and start over).  Thanks,

Alex
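
A minimal user-side sketch of the recovery flow described above, assuming
the migration region layout and the _RUNNING/_SAVING/_RESUMING bits
proposed in patch 1/5; the fd, offset and helper names are placeholders
rather than proposed uAPI, and the 110b error encoding is only the working
assumption at this point in the thread:

/*
 * Sketch only: handling a failed pre-copy -> stop-and-copy transition.
 */
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define ST_RUNNING   (1u << 0)
#define ST_SAVING    (1u << 1)
#define ST_RESUMING  (1u << 2)
#define ST_MASK      0x7u

static int set_state(int fd, off_t offset, uint32_t state)
{
	uint32_t val;

	/* Read-modify-write so reserved bits 3-31 are preserved. */
	if (pread(fd, &val, sizeof(val), offset) != sizeof(val))
		return -errno;
	val = (val & ~ST_MASK) | (state & ST_MASK);
	if (pwrite(fd, &val, sizeof(val), offset) != sizeof(val))
		return -errno;
	return 0;
}

static int enter_stop_and_copy(int fd, off_t offset)
{
	uint32_t cur;
	int ret = set_state(fd, offset, ST_SAVING);	/* 010b */

	if (!ret)
		return 0;		/* transition succeeded */
	if (ret == -EAGAIN)
		return ret;		/* temporary, caller may retry */

	/* Re-read device_state to see where the device ended up. */
	if (pread(fd, &cur, sizeof(cur), offset) != sizeof(cur))
		return -errno;

	if ((cur & ST_MASK) == (ST_SAVING | ST_RESUMING))
		return -EIO;	/* failed device: only reset recovers it */

	/*
	 * Device is still in a valid state (e.g. pre-copy): fail the
	 * migration and try to put the device back to running.
	 */
	fprintf(stderr, "migration failed, returning device to running\n");
	return set_state(fd, offset, ST_RUNNING);	/* 001b */
}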



* Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
  2020-01-07  7:28                     ` Kirti Wankhede
@ 2020-01-07 17:09                       ` Alex Williamson
  2020-01-07 17:53                         ` Kirti Wankhede
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2020-01-07 17:09 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Dr. David Alan Gilbert, cjia, kevin.tian, ziye.yang,
	changpeng.liu, yi.l.liu, mlevitsk, eskultet, cohuck,
	jonathan.davies, eauger, aik, pasic, felipe, Zhengxiao.zx,
	shuangtai.tst, Ken.Xue, zhi.a.wang, yan.y.zhao, qemu-devel, kvm

On Tue, 7 Jan 2020 12:58:22 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 1/7/2020 4:48 AM, Alex Williamson wrote:
> > On Thu, 2 Jan 2020 18:25:37 +0000
> > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> >   
> >> * Alex Williamson (alex.williamson@redhat.com) wrote:  
> >>> On Fri, 20 Dec 2019 01:40:35 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>      
> >>>> On 12/19/2019 10:57 PM, Alex Williamson wrote:
> >>>>
> >>>> <Snip>
> >>>>      
> >>
> >> <snip>
> >>  
> >>>>
> >>>> If device state it at pre-copy state (011b).
> >>>> Transition, i.e., write to device state as stop-and-copy state (010b)
> >>>> failed, then by previous state I meant device should return pre-copy
> >>>> state(011b), i.e. previous state which was successfully set, or as you
> >>>> said current state which was successfully set.  
> >>>
> >>> Yes, the point I'm trying to make is that this version of the spec
> >>> tries to tell the user what they should do upon error according to our
> >>> current interpretation of the QEMU migration protocol.  We're not
> >>> defining the QEMU migration protocol, we're defining something that can
> >>> be used in a way to support that protocol.  So I think we should be
> >>> concerned with defining our spec, for example my proposal would be: "If
> >>> a state transition fails the user can read device_state to determine the
> >>> current state of the device.  This should be the previous state of the
> >>> device unless the vendor driver has encountered an internal error, in
> >>> which case the device may report the invalid device_state 110b.  The
> >>> user must use the device reset ioctl in order to recover the device
> >>> from this state.  If the device is indicated in a valid device state
> >>> via reading device_state, the user may attempt to transition the device
> >>> to any valid state reachable from the current state."  
> >>
> >> We might want to be able to distinguish between:
> >>    a) The device has failed and needs a reset
> >>    b) The migration has failed  
> > 
> > I think the above provides this.  For Kirti's example above of
> > transitioning from pre-copy to stop-and-copy, the device could refuse
> > to transition to stop-and-copy, generating an error on the write() of
> > device_state.  The user re-reading device_state would allow them to
> > determine the current device state, still in pre-copy or failed.  Only
> > the latter would require a device reset.
> >   
> >> If some part of the devices mechanics for migration fail, but the device
> >> is otherwise operational then we should be able to decide to fail the
> >> migration without taking the device down, which might be very bad for
> >> the VM.
> >> Losing a VM during migration due to a problem with migration really
> >> annoys users; it's one thing the migration failing, but taking the VM
> >> out as well really gets to them.
> >>
> >> Having the device automatically transition back to the 'running' state
> >> seems a bad idea to me; much better to tell the hypervisor and provide
> >> it with a way to clean up; for example, imagine a system with multiple
> >> devices that are being migrated, most of them have happily transitioned
> >> to stop-and-copy, but then the last device decides to fail - so now
> >> someone is going to have to take all of them back to running.  
> > 
> > Right, unless I'm missing one, it seems invalid->running is the only
> > self transition the device should make, though still by way of user
> > interaction via the reset ioctl.  Thanks,
> >   
> 
> Instead of using invalid state by vendor driver on device failure, I 
> think better to reserve one bit in device state which vendor driver can 
> set on device failure. When error bit is set, other bits in device state 
> should be ignored.

Why is a separate bit better?  Saving and Restoring states are mutually
exclusive, so we have an unused and invalid device state already
without burning another bit.  Thanks,

Alex



* Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
  2020-01-07 16:54                       ` Alex Williamson
@ 2020-01-07 17:50                         ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 44+ messages in thread
From: Dr. David Alan Gilbert @ 2020-01-07 17:50 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, cjia, kevin.tian, ziye.yang, changpeng.liu,
	yi.l.liu, mlevitsk, eskultet, cohuck, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	zhi.a.wang, yan.y.zhao, qemu-devel, kvm

* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Tue, 7 Jan 2020 09:57:40 +0000
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> > * Alex Williamson (alex.williamson@redhat.com) wrote:
> > > On Thu, 2 Jan 2020 18:25:37 +0000
> > > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> > >   
> > > > * Alex Williamson (alex.williamson@redhat.com) wrote:  
> > > > > On Fri, 20 Dec 2019 01:40:35 +0530
> > > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > >     
> > > > > > On 12/19/2019 10:57 PM, Alex Williamson wrote:
> > > > > > 
> > > > > > <Snip>
> > > > > >     
> > > > 
> > > > <snip>
> > > >   
> > > > > > 
> > > > > > If device state it at pre-copy state (011b).
> > > > > > Transition, i.e., write to device state as stop-and-copy state (010b) 
> > > > > > failed, then by previous state I meant device should return pre-copy 
> > > > > > state(011b), i.e. previous state which was successfully set, or as you 
> > > > > > said current state which was successfully set.    
> > > > > 
> > > > > Yes, the point I'm trying to make is that this version of the spec
> > > > > tries to tell the user what they should do upon error according to our
> > > > > current interpretation of the QEMU migration protocol.  We're not
> > > > > defining the QEMU migration protocol, we're defining something that can
> > > > > be used in a way to support that protocol.  So I think we should be
> > > > > concerned with defining our spec, for example my proposal would be: "If
> > > > > a state transition fails the user can read device_state to determine the
> > > > > current state of the device.  This should be the previous state of the
> > > > > device unless the vendor driver has encountered an internal error, in
> > > > > which case the device may report the invalid device_state 110b.  The
> > > > > user must use the device reset ioctl in order to recover the device
> > > > > from this state.  If the device is indicated in a valid device state
> > > > > via reading device_state, the user may attempt to transition the device
> > > > > to any valid state reachable from the current state."    
> > > > 
> > > > We might want to be able to distinguish between:
> > > >   a) The device has failed and needs a reset
> > > >   b) The migration has failed  
> > > 
> > > I think the above provides this.  For Kirti's example above of
> > > transitioning from pre-copy to stop-and-copy, the device could refuse
> > > to transition to stop-and-copy, generating an error on the write() of
> > > device_state.  The user re-reading device_state would allow them to
> > > determine the current device state, still in pre-copy or failed.  Only
> > > the latter would require a device reset.  
> > 
> > OK - but that doesn't give you any way to figure out 'why' it failed;
> > I guess I was expecting you to then read an 'error' register to find
> > out what happened.
> > Assuming the write() to transition to stop-and-copy fails and you're
> > still in pre-copy, what's the defined thing you're supposed to do next?
> > Decide migration has failed and then do a write() to transition to running?
> 
> Defining semantics for an error register seems like a project on its
> own.  We do have flags, we could use them to add an error register
> later, but I think it's only going to rat hole this effort to try to
> incorporate that now.

OK, to be honest I didn't really mean for that to be used by code to
decide on its next action, rather to have something to report when it
failed.

> The state machine is fairly small, so in the
> scenario you present, I think the user would assume a failure at
> pre-copy to stop-and-copy transition would fail the migration and the
> device could go back to running state.  If the device then fails to
> return to the running state, we might be stuck with a device with
> reduced performance or overhead and the user could warn about that and
> continue with the device as-is.  The vendor drivers could make use of
> -EAGAIN on transition failure to indicate a temporary issue, but
> otherwise the user should probably consider it a persistent error until
> either a device reset or start of a new migration sequence (ie. return
> to running and start over).  Thanks,

OK, as long as we define somewhere that the action on a failed
transition is to try and transition back to running before restarting
the VM and failing the migration.

Dave

> Alex
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
  2020-01-07 17:09                       ` Alex Williamson
@ 2020-01-07 17:53                         ` Kirti Wankhede
  2020-01-07 18:56                           ` Alex Williamson
  0 siblings, 1 reply; 44+ messages in thread
From: Kirti Wankhede @ 2020-01-07 17:53 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Dr. David Alan Gilbert, cjia, kevin.tian, ziye.yang,
	changpeng.liu, yi.l.liu, mlevitsk, eskultet, cohuck,
	jonathan.davies, eauger, aik, pasic, felipe, Zhengxiao.zx,
	shuangtai.tst, Ken.Xue, zhi.a.wang, yan.y.zhao, qemu-devel, kvm



On 1/7/2020 10:39 PM, Alex Williamson wrote:
> On Tue, 7 Jan 2020 12:58:22 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 1/7/2020 4:48 AM, Alex Williamson wrote:
>>> On Thu, 2 Jan 2020 18:25:37 +0000
>>> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
>>>    
>>>> * Alex Williamson (alex.williamson@redhat.com) wrote:
>>>>> On Fri, 20 Dec 2019 01:40:35 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>       
>>>>>> On 12/19/2019 10:57 PM, Alex Williamson wrote:
>>>>>>
>>>>>> <Snip>
>>>>>>       
>>>>
>>>> <snip>
>>>>   
>>>>>>
>>>>>> If device state it at pre-copy state (011b).
>>>>>> Transition, i.e., write to device state as stop-and-copy state (010b)
>>>>>> failed, then by previous state I meant device should return pre-copy
>>>>>> state(011b), i.e. previous state which was successfully set, or as you
>>>>>> said current state which was successfully set.
>>>>>
>>>>> Yes, the point I'm trying to make is that this version of the spec
>>>>> tries to tell the user what they should do upon error according to our
>>>>> current interpretation of the QEMU migration protocol.  We're not
>>>>> defining the QEMU migration protocol, we're defining something that can
>>>>> be used in a way to support that protocol.  So I think we should be
>>>>> concerned with defining our spec, for example my proposal would be: "If
>>>>> a state transition fails the user can read device_state to determine the
>>>>> current state of the device.  This should be the previous state of the
>>>>> device unless the vendor driver has encountered an internal error, in
>>>>> which case the device may report the invalid device_state 110b.  The
>>>>> user must use the device reset ioctl in order to recover the device
>>>>> from this state.  If the device is indicated in a valid device state
>>>>> via reading device_state, the user may attempt to transition the device
>>>>> to any valid state reachable from the current state."
>>>>
>>>> We might want to be able to distinguish between:
>>>>     a) The device has failed and needs a reset
>>>>     b) The migration has failed
>>>
>>> I think the above provides this.  For Kirti's example above of
>>> transitioning from pre-copy to stop-and-copy, the device could refuse
>>> to transition to stop-and-copy, generating an error on the write() of
>>> device_state.  The user re-reading device_state would allow them to
>>> determine the current device state, still in pre-copy or failed.  Only
>>> the latter would require a device reset.
>>>    
>>>> If some part of the devices mechanics for migration fail, but the device
>>>> is otherwise operational then we should be able to decide to fail the
>>>> migration without taking the device down, which might be very bad for
>>>> the VM.
>>>> Losing a VM during migration due to a problem with migration really
>>>> annoys users; it's one thing the migration failing, but taking the VM
>>>> out as well really gets to them.
>>>>
>>>> Having the device automatically transition back to the 'running' state
>>>> seems a bad idea to me; much better to tell the hypervisor and provide
>>>> it with a way to clean up; for example, imagine a system with multiple
>>>> devices that are being migrated, most of them have happily transitioned
>>>> to stop-and-copy, but then the last device decides to fail - so now
>>>> someone is going to have to take all of them back to running.
>>>
>>> Right, unless I'm missing one, it seems invalid->running is the only
>>> self transition the device should make, though still by way of user
>>> interaction via the reset ioctl.  Thanks,
>>>    
>>
>> Instead of using invalid state by vendor driver on device failure, I
>> think better to reserve one bit in device state which vendor driver can
>> set on device failure. When error bit is set, other bits in device state
>> should be ignored.
> 
> Why is a separate bit better?  Saving and Restoring states are mutually
> exclusive, so we have an unused and invalid device state already
> without burning another bit.  Thanks,
> 

There are 3 invalid states:
  *  101b => Invalid state
  *  110b => Invalid state
  *  111b => Invalid state

Why should only 110b be used by the vendor driver to report an error?
Aren't we adding more confusion to the interface?

Only 3 of the 32 bits are used so far; one bit can be spared to
represent an error state from the vendor driver.

Thanks,
Kirti


* Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
  2020-01-07 17:53                         ` Kirti Wankhede
@ 2020-01-07 18:56                           ` Alex Williamson
  2020-01-08 14:59                             ` Cornelia Huck
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2020-01-07 18:56 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Dr. David Alan Gilbert, cjia, kevin.tian, ziye.yang,
	changpeng.liu, yi.l.liu, mlevitsk, eskultet, cohuck,
	jonathan.davies, eauger, aik, pasic, felipe, Zhengxiao.zx,
	shuangtai.tst, Ken.Xue, zhi.a.wang, yan.y.zhao, qemu-devel, kvm

On Tue, 7 Jan 2020 23:23:17 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 1/7/2020 10:39 PM, Alex Williamson wrote:
> > On Tue, 7 Jan 2020 12:58:22 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 1/7/2020 4:48 AM, Alex Williamson wrote:  
> >>> On Thu, 2 Jan 2020 18:25:37 +0000
> >>> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> >>>      
> >>>> * Alex Williamson (alex.williamson@redhat.com) wrote:  
> >>>>> On Fri, 20 Dec 2019 01:40:35 +0530
> >>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>         
> >>>>>> On 12/19/2019 10:57 PM, Alex Williamson wrote:
> >>>>>>
> >>>>>> <Snip>
> >>>>>>         
> >>>>
> >>>> <snip>
> >>>>     
> >>>>>>
> >>>>>> If device state it at pre-copy state (011b).
> >>>>>> Transition, i.e., write to device state as stop-and-copy state (010b)
> >>>>>> failed, then by previous state I meant device should return pre-copy
> >>>>>> state(011b), i.e. previous state which was successfully set, or as you
> >>>>>> said current state which was successfully set.  
> >>>>>
> >>>>> Yes, the point I'm trying to make is that this version of the spec
> >>>>> tries to tell the user what they should do upon error according to our
> >>>>> current interpretation of the QEMU migration protocol.  We're not
> >>>>> defining the QEMU migration protocol, we're defining something that can
> >>>>> be used in a way to support that protocol.  So I think we should be
> >>>>> concerned with defining our spec, for example my proposal would be: "If
> >>>>> a state transition fails the user can read device_state to determine the
> >>>>> current state of the device.  This should be the previous state of the
> >>>>> device unless the vendor driver has encountered an internal error, in
> >>>>> which case the device may report the invalid device_state 110b.  The
> >>>>> user must use the device reset ioctl in order to recover the device
> >>>>> from this state.  If the device is indicated in a valid device state
> >>>>> via reading device_state, the user may attempt to transition the device
> >>>>> to any valid state reachable from the current state."  
> >>>>
> >>>> We might want to be able to distinguish between:
> >>>>     a) The device has failed and needs a reset
> >>>>     b) The migration has failed  
> >>>
> >>> I think the above provides this.  For Kirti's example above of
> >>> transitioning from pre-copy to stop-and-copy, the device could refuse
> >>> to transition to stop-and-copy, generating an error on the write() of
> >>> device_state.  The user re-reading device_state would allow them to
> >>> determine the current device state, still in pre-copy or failed.  Only
> >>> the latter would require a device reset.
> >>>      
> >>>> If some part of the devices mechanics for migration fail, but the device
> >>>> is otherwise operational then we should be able to decide to fail the
> >>>> migration without taking the device down, which might be very bad for
> >>>> the VM.
> >>>> Losing a VM during migration due to a problem with migration really
> >>>> annoys users; it's one thing the migration failing, but taking the VM
> >>>> out as well really gets to them.
> >>>>
> >>>> Having the device automatically transition back to the 'running' state
> >>>> seems a bad idea to me; much better to tell the hypervisor and provide
> >>>> it with a way to clean up; for example, imagine a system with multiple
> >>>> devices that are being migrated, most of them have happily transitioned
> >>>> to stop-and-copy, but then the last device decides to fail - so now
> >>>> someone is going to have to take all of them back to running.  
> >>>
> >>> Right, unless I'm missing one, it seems invalid->running is the only
> >>> self transition the device should make, though still by way of user
> >>> interaction via the reset ioctl.  Thanks,
> >>>      
> >>
> >> Instead of using invalid state by vendor driver on device failure, I
> >> think better to reserve one bit in device state which vendor driver can
> >> set on device failure. When error bit is set, other bits in device state
> >> should be ignored.  
> > 
> > Why is a separate bit better?  Saving and Restoring states are mutually
> > exclusive, so we have an unused and invalid device state already
> > without burning another bit.  Thanks,
> >   
> 
> There are 3 invalid states:
>   *  101b => Invalid state
>   *  110b => Invalid state
>   *  111b => Invalid state
> 
> why only 110b should be used to report error from vendor driver to 
> report error? Aren't we adding more confusions in the interface?

I think the only chance of confusion is poor documentation.  If we
define all of the above as invalid and then say any invalid state
indicates an error condition, then the burden is on the user to
enumerate all the invalid states.  That's not a good idea.  Instead we
could say 101b (_RESUMING|_RUNNING) is reserved, it's not currently
used but it might be useful some day.  Therefore there are no valid
transitions into or out of this state.  A vendor driver should fail a
write(2) attempting to enter this state.

That leaves 11Xb, where we consider _RESUMING and _SAVING as mutually
exclusive, so neither are likely to ever be valid states.  Logically,
if the device is in a failed state such that it needs to be reset to be
recovered, I would hope the device is not running, so !_RUNNING (110b)
seems appropriate.  I'm not sure we need that level of detail yet
though, so I was actually just assuming both 11Xb states would indicate
an error state and the undefined _RUNNING bit might differentiate
something in the future.

Therefore, I think we'd have:

 * 101b => Reserved
 * 11Xb => Error

Where the device can only self transition into the Error state on a
failed device_state transition and the only exit from the Error state
is via the reset ioctl.  The Reserved state is unreachable.  The vendor
driver must error on device_state writes to enter or exit the Error
state and must error on writes to enter Reserved states.  Is that still
confusing?

> Only 3 bits from 32 bits are yet used, one bit can be spared to 
> represent error state from vendor driver.

I just don't see that it adds any value to use a separate bit, we
already have more error states than we need with the 3 bits we have.
Thanks,

Alex
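
A rough header-style sketch of the encoding proposed here, under the bit
definitions from patch 1/5 (the helper names are illustrative, not
proposed uAPI):

#include <stdbool.h>
#include <stdint.h>

#define ST_RUNNING   (1u << 0)
#define ST_SAVING    (1u << 1)
#define ST_RESUMING  (1u << 2)
#define ST_MASK      0x7u

/* 101b (_RESUMING | _RUNNING) is reserved: no valid transitions in or out. */
static inline bool state_is_reserved(uint32_t device_state)
{
	return (device_state & ST_MASK) == (ST_RESUMING | ST_RUNNING);
}

/*
 * 11Xb: _SAVING and _RESUMING are mutually exclusive, so 110b and 111b
 * are error states.  The device enters them only on a failed transition
 * and the only exit is the reset ioctl.
 */
static inline bool state_is_error(uint32_t device_state)
{
	return (device_state & (ST_SAVING | ST_RESUMING)) ==
	       (ST_SAVING | ST_RESUMING);
}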



* Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
  2020-01-07 18:56                           ` Alex Williamson
@ 2020-01-08 14:59                             ` Cornelia Huck
  2020-01-08 18:31                               ` Alex Williamson
  0 siblings, 1 reply; 44+ messages in thread
From: Cornelia Huck @ 2020-01-08 14:59 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, Dr. David Alan Gilbert, cjia, kevin.tian,
	ziye.yang, changpeng.liu, yi.l.liu, mlevitsk, eskultet,
	jonathan.davies, eauger, aik, pasic, felipe, Zhengxiao.zx,
	shuangtai.tst, Ken.Xue, zhi.a.wang, yan.y.zhao, qemu-devel, kvm

On Tue, 7 Jan 2020 11:56:02 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue, 7 Jan 2020 23:23:17 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:

> > There are 3 invalid states:
> >   *  101b => Invalid state
> >   *  110b => Invalid state
> >   *  111b => Invalid state
> > 
> > why only 110b should be used to report error from vendor driver to 
> > report error? Aren't we adding more confusions in the interface?  
> 
> I think the only chance of confusion is poor documentation.  If we
> define all of the above as invalid and then say any invalid state
> indicates an error condition, then the burden is on the user to
> enumerate all the invalid states.  That's not a good idea.  Instead we
> could say 101b (_RESUMING|_RUNNING) is reserved, it's not currently
> used but it might be useful some day.  Therefore there are no valid
> transitions into or out of this state.  A vendor driver should fail a
> write(2) attempting to enter this state.
> 
> That leaves 11Xb, where we consider _RESUMING and _SAVING as mutually
> exclusive, so neither are likely to ever be valid states.  Logically,
> if the device is in a failed state such that it needs to be reset to be
> recovered, I would hope the device is not running, so !_RUNNING (110b)
> seems appropriate.  I'm not sure we need that level of detail yet
> though, so I was actually just assuming both 11Xb states would indicate
> an error state and the undefined _RUNNING bit might differentiate
> something in the future.
> 
> Therefore, I think we'd have:
> 
>  * 101b => Reserved
>  * 11Xb => Error
> 
> Where the device can only self transition into the Error state on a
> failed device_state transition and the only exit from the Error state
> is via the reset ioctl.  The Reserved state is unreachable.  The vendor
> driver must error on device_state writes to enter or exit the Error
> state and must error on writes to enter Reserved states.  Is that still
> confusing?

I think one thing we could do is start to tie the meaning more to the
actual state (bit combination) and less to the individual bits. I.e.

- bit 0 indicates 'running',
- bit 1 indicates 'saving',
- bit 2 indicates 'resuming',
- bits 3-31 are reserved. [Aside: reserved-and-ignored or
  reserved-and-must-be-zero?]

[Note that I don't specify what happens when a bit is set or unset.]

States are then defined as:
000b => stopped state (not saving or resuming)
001b => running state (not saving or resuming)
010b => stop-and-copy state
011b => pre-copy state
100b => resuming state

[Transitions between these states defined, as before.]

101b => reserved [for post-copy; no transitions defined]
111b => reserved [state does not make sense; no transitions defined]
110b => error state [state does not make sense per se, but it does not
        indicate running; transitions into this state *are* possible]

To a 'reserved' state, we can later assign a different meaning (we
could even re-use 111b for a different error state, if needed); while
the error state must always stay the error state.

We should probably use some kind of feature indication to signify
whether a 'reserved' state actually has a meaning. Also, maybe we
should designate the states > 111b as 'reserved'.

Does that make sense?
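
The table above, written out as a header sketch; the state names are
illustrative only, and just the _RUNNING/_SAVING/_RESUMING bits come from
patch 1/5:

#define ST_RUNNING   (1u << 0)	/* bit 0: running  */
#define ST_SAVING    (1u << 1)	/* bit 1: saving   */
#define ST_RESUMING  (1u << 2)	/* bit 2: resuming */

/* Low three bits of device_state decoded as states. */
enum mig_state {
	MIG_STOPPED       = 0,				/* 000b */
	MIG_RUNNING       = ST_RUNNING,			/* 001b */
	MIG_STOP_AND_COPY = ST_SAVING,			/* 010b */
	MIG_PRE_COPY      = ST_SAVING | ST_RUNNING,	/* 011b */
	MIG_RESUMING      = ST_RESUMING,		/* 100b */
	MIG_RESERVED_101  = ST_RESUMING | ST_RUNNING,	/* 101b, e.g. post-copy later */
	MIG_ERROR         = ST_RESUMING | ST_SAVING,	/* 110b, leave only via reset */
	MIG_RESERVED_111  = 7,				/* 111b */
};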



* Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
  2020-01-08 14:59                             ` Cornelia Huck
@ 2020-01-08 18:31                               ` Alex Williamson
  2020-01-08 20:41                                 ` Kirti Wankhede
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2020-01-08 18:31 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Kirti Wankhede, Dr. David Alan Gilbert, cjia, kevin.tian,
	ziye.yang, changpeng.liu, yi.l.liu, mlevitsk, eskultet,
	jonathan.davies, eauger, aik, pasic, felipe, Zhengxiao.zx,
	shuangtai.tst, Ken.Xue, zhi.a.wang, yan.y.zhao, qemu-devel, kvm

On Wed, 8 Jan 2020 15:59:55 +0100
Cornelia Huck <cohuck@redhat.com> wrote:

> On Tue, 7 Jan 2020 11:56:02 -0700
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Tue, 7 Jan 2020 23:23:17 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:  
> 
> > > There are 3 invalid states:
> > >   *  101b => Invalid state
> > >   *  110b => Invalid state
> > >   *  111b => Invalid state
> > > 
> > > why only 110b should be used to report error from vendor driver to 
> > > report error? Aren't we adding more confusions in the interface?    
> > 
> > I think the only chance of confusion is poor documentation.  If we
> > define all of the above as invalid and then say any invalid state
> > indicates an error condition, then the burden is on the user to
> > enumerate all the invalid states.  That's not a good idea.  Instead we
> > could say 101b (_RESUMING|_RUNNING) is reserved, it's not currently
> > used but it might be useful some day.  Therefore there are no valid
> > transitions into or out of this state.  A vendor driver should fail a
> > write(2) attempting to enter this state.
> > 
> > That leaves 11Xb, where we consider _RESUMING and _SAVING as mutually
> > exclusive, so neither are likely to ever be valid states.  Logically,
> > if the device is in a failed state such that it needs to be reset to be
> > recovered, I would hope the device is not running, so !_RUNNING (110b)
> > seems appropriate.  I'm not sure we need that level of detail yet
> > though, so I was actually just assuming both 11Xb states would indicate
> > an error state and the undefined _RUNNING bit might differentiate
> > something in the future.
> > 
> > Therefore, I think we'd have:
> > 
> >  * 101b => Reserved
> >  * 11Xb => Error
> > 
> > Where the device can only self transition into the Error state on a
> > failed device_state transition and the only exit from the Error state
> > is via the reset ioctl.  The Reserved state is unreachable.  The vendor
> > driver must error on device_state writes to enter or exit the Error
> > state and must error on writes to enter Reserved states.  Is that still
> > confusing?  
> 
> I think one thing we could do is start to tie the meaning more to the
> actual state (bit combination) and less to the individual bits. I.e.
> 
> - bit 0 indicates 'running',
> - bit 1 indicates 'saving',
> - bit 2 indicates 'resuming',
> - bits 3-31 are reserved. [Aside: reserved-and-ignored or
>   reserved-and-must-be-zero?]

This version specified them as:

	Bits 3 - 31 are reserved for future use. User should perform
	read-modify-write operation on this field.

The intention is that the user should not make any assumptions about
the state of the reserved bits, but should preserve them when changing
known bits.  Therefore I think it's ignored but preserved.  If we
specify them as zero, then I think we lose any chance to define them
later.
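
For illustration, the read-modify-write described above might look like
this on the user side (fd and offset stand in for the device_state field
in the migration region; they are assumptions, not defined API):

#include <errno.h>
#include <stdint.h>
#include <unistd.h>

#define ST_MASK 0x7u	/* only bits 0-2 are defined today */

/* Update the state bits while leaving reserved bits 3-31 untouched. */
static int write_device_state(int fd, off_t offset, uint32_t new_state)
{
	uint32_t val;

	if (pread(fd, &val, sizeof(val), offset) != sizeof(val))
		return -errno;

	val &= ~ST_MASK;		/* keep reserved bits as read  */
	val |= new_state & ST_MASK;	/* request only the known bits */

	if (pwrite(fd, &val, sizeof(val), offset) != sizeof(val))
		return -errno;

	return 0;
}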

> [Note that I don't specify what happens when a bit is set or unset.]
> 
> States are then defined as:
> 000b => stopped state (not saving or resuming)
> 001b => running state (not saving or resuming)
> 010b => stop-and-copy state
> 011b => pre-copy state
> 100b => resuming state
> 
> [Transitions between these states defined, as before.]
> 
> 101b => reserved [for post-copy; no transitions defined]
> 111b => reserved [state does not make sense; no transitions defined]
> 110b => error state [state does not make sense per se, but it does not
>         indicate running; transitions into this state *are* possible]
> 
> To a 'reserved' state, we can later assign a different meaning (we
> could even re-use 111b for a different error state, if needed); while
> the error state must always stay the error state.
> 
> We should probably use some kind of feature indication to signify
> whether a 'reserved' state actually has a meaning. Also, maybe we also
> should designate the states > 111b as 'reserved'.
> 
> Does that make sense?

It seems you have an opinion to restrict this particular error state to
110b rather than 11Xb, reserving 111b for some future error condition.
That's fine and I think we agree that using the state with _RUNNING set
to zero is more logical as we expect the device to be non-operational
in this state.

I'm also thinking more of these as states, but at the same time we're
not doing away with the bit definitions.  I think the states are much
easier to decode and use if we think about the function of each bit,
which leads to the logical incongruity that the 11Xb states are
impossible and therefore must be error states.

I took a look at drawing a state transitions diagram, but I think we're
fully interconnected for the 6 states we're defining.  The user can
invoke transition to any of the 5 states Connie lists above from any of
those states and the 6th error state is only reached via failed
transition and only exited via device reset, returning the user to the
running state.  There are a couple transitions of questionable value,
particularly 01Xb -> 100b (_SAVING -> _RESUMING), but I can't convince
myself that it's worthwhile to force the user to pass through another
state in order to restrict those.  Are there any cases I'm missing
where the vendor driver has good reason not to support arbitrary
transitions between the above 5 states?  Thanks,

Alex
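
A sketch of that transition rule (names made up for illustration): any of
the five defined states may be written from any other, writes naming the
error or reserved encodings are rejected, and the error state is left
only via the reset ioctl.

#include <stdbool.h>
#include <stdint.h>

#define ST_RUNNING   (1u << 0)
#define ST_SAVING    (1u << 1)
#define ST_RESUMING  (1u << 2)
#define ST_MASK      0x7u

/* True for the five user-reachable states: 000b, 001b, 010b, 011b, 100b. */
static bool state_is_defined(uint32_t state)
{
	state &= ST_MASK;
	return state != (ST_SAVING | ST_RESUMING) &&		  /* 110b error    */
	       state != (ST_SAVING | ST_RESUMING | ST_RUNNING) && /* 111b          */
	       state != (ST_RESUMING | ST_RUNNING);		  /* 101b reserved */
}

/* Fully interconnected: any defined state to any defined state is allowed. */
static bool transition_allowed(uint32_t cur, uint32_t next)
{
	return state_is_defined(cur) && state_is_defined(next);
}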



* Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
  2020-01-08 18:31                               ` Alex Williamson
@ 2020-01-08 20:41                                 ` Kirti Wankhede
  2020-01-08 22:44                                   ` Alex Williamson
  0 siblings, 1 reply; 44+ messages in thread
From: Kirti Wankhede @ 2020-01-08 20:41 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck
  Cc: Dr. David Alan Gilbert, cjia, kevin.tian, ziye.yang,
	changpeng.liu, yi.l.liu, mlevitsk, eskultet, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	zhi.a.wang, yan.y.zhao, qemu-devel, kvm



On 1/9/2020 12:01 AM, Alex Williamson wrote:
> On Wed, 8 Jan 2020 15:59:55 +0100
> Cornelia Huck <cohuck@redhat.com> wrote:
> 
>> On Tue, 7 Jan 2020 11:56:02 -0700
>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>
>>> On Tue, 7 Jan 2020 23:23:17 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>
>>>> There are 3 invalid states:
>>>>    *  101b => Invalid state
>>>>    *  110b => Invalid state
>>>>    *  111b => Invalid state
>>>>
>>>> why only 110b should be used to report error from vendor driver to
>>>> report error? Aren't we adding more confusions in the interface?
>>>
>>> I think the only chance of confusion is poor documentation.  If we
>>> define all of the above as invalid and then say any invalid state
>>> indicates an error condition, then the burden is on the user to
>>> enumerate all the invalid states.  That's not a good idea.  Instead we
>>> could say 101b (_RESUMING|_RUNNING) is reserved, it's not currently
>>> used but it might be useful some day.  Therefore there are no valid
>>> transitions into or out of this state.  A vendor driver should fail a
>>> write(2) attempting to enter this state.
>>>
>>> That leaves 11Xb, where we consider _RESUMING and _SAVING as mutually
>>> exclusive, so neither are likely to ever be valid states.  Logically,
>>> if the device is in a failed state such that it needs to be reset to be
>>> recovered, I would hope the device is not running, so !_RUNNING (110b)
>>> seems appropriate.  I'm not sure we need that level of detail yet
>>> though, so I was actually just assuming both 11Xb states would indicate
>>> an error state and the undefined _RUNNING bit might differentiate
>>> something in the future.
>>>
>>> Therefore, I think we'd have:
>>>
>>>   * 101b => Reserved
>>>   * 11Xb => Error
>>>
>>> Where the device can only self transition into the Error state on a
>>> failed device_state transition and the only exit from the Error state
>>> is via the reset ioctl.  The Reserved state is unreachable.  The vendor
>>> driver must error on device_state writes to enter or exit the Error
>>> state and must error on writes to enter Reserved states.  Is that still
>>> confusing?
>>
>> I think one thing we could do is start to tie the meaning more to the
>> actual state (bit combination) and less to the individual bits. I.e.
>>
>> - bit 0 indicates 'running',
>> - bit 1 indicates 'saving',
>> - bit 2 indicates 'resuming',
>> - bits 3-31 are reserved. [Aside: reserved-and-ignored or
>>    reserved-and-must-be-zero?]
> 
> This version specified them as:
> 
> 	Bits 3 - 31 are reserved for future use. User should perform
> 	read-modify-write operation on this field.
> 
> The intention is that the user should not make any assumptions about
> the state of the reserved bits, but should preserve them when changing
> known bits.  Therefore I think it's ignored but preserved.  If we
> specify them as zero, then I think we lose any chance to define them
> later.
> 
>> [Note that I don't specify what happens when a bit is set or unset.]
>>
>> States are then defined as:
>> 000b => stopped state (not saving or resuming)
>> 001b => running state (not saving or resuming)
>> 010b => stop-and-copy state
>> 011b => pre-copy state
>> 100b => resuming state
>>
>> [Transitions between these states defined, as before.]
>>
>> 101b => reserved [for post-copy; no transitions defined]
>> 111b => reserved [state does not make sense; no transitions defined]
>> 110b => error state [state does not make sense per se, but it does not
>>          indicate running; transitions into this state *are* possible]
>>
>> To a 'reserved' state, we can later assign a different meaning (we
>> could even re-use 111b for a different error state, if needed); while
>> the error state must always stay the error state.
>>
>> We should probably use some kind of feature indication to signify
>> whether a 'reserved' state actually has a meaning. Also, maybe we also
>> should designate the states > 111b as 'reserved'.
>>
>> Does that make sense?
> 
> It seems you have an opinion to restrict this particular error state to
> 110b rather than 11Xb, reserving 111b for some future error condition.
> That's fine and I think we agree that using the state with _RUNNING set
> to zero is more logical as we expect the device to be non-operational
> in this state.
> 
> I'm also thinking more of these as states, but at the same time we're
> not doing away with the bit definitions.  I think the states are much
> easier to decode and use if we think about the function of each bit,
> which leads to the logical incongruity that the 11Xb states are
> impossible and therefore must be error states.
> 

I agree that the bit definitions are better.

OK. Should there be a defined value for the error state which the
vendor driver can use?

#define VFIO_DEVICE_STATE_ERROR			\
		(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING)

Thanks,
Kirti

> I took a look at drawing a state transitions diagram, but I think we're
> fully interconnected for the 6 states we're defining.  The user can
> invoke transition to any of the 5 states Connie lists above from any of
> those states and the 6th error state is only reached via failed
> transition and only exited via device reset, returning the user to the
> running state.  There are a couple transitions of questionable value,
> particularly 01Xb -> 100b (_SAVING -> _RESUMING), but I can't convince
> myself that it's worthwhile to force the user to pass through another
> state in order to restrict those.  Are there any cases I'm missing
> where the vendor driver has good reason not to support arbitrary
> transitions between the above 5 states?  Thanks,
> 
> Alex
> 
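
As a usage sketch (vendor-driver side, names and structure purely
illustrative, bit values as proposed in patch 1/5): with such a define, a
driver could latch the error value on an internal failure so every later
read of device_state reports it until the user issues the reset ioctl.

#include <stdint.h>

#define VFIO_DEVICE_STATE_RUNNING	(1u << 0)
#define VFIO_DEVICE_STATE_SAVING	(1u << 1)
#define VFIO_DEVICE_STATE_RESUMING	(1u << 2)
#define VFIO_DEVICE_STATE_ERROR \
		(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING)

/* Hypothetical per-device migration context kept by a vendor driver. */
struct mig_ctx {
	uint32_t device_state;
};

/* On an internal failure during a transition, latch the error state. */
static void mig_fail(struct mig_ctx *ctx)
{
	ctx->device_state = VFIO_DEVICE_STATE_ERROR;
}

/* The reset ioctl is the only way out: back to running. */
static void mig_reset(struct mig_ctx *ctx)
{
	ctx->device_state = VFIO_DEVICE_STATE_RUNNING;
}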


* Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
  2020-01-08 20:41                                 ` Kirti Wankhede
@ 2020-01-08 22:44                                   ` Alex Williamson
  2020-01-10 14:21                                     ` Cornelia Huck
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2020-01-08 22:44 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Cornelia Huck, Dr. David Alan Gilbert, cjia, kevin.tian,
	ziye.yang, changpeng.liu, yi.l.liu, mlevitsk, eskultet,
	jonathan.davies, eauger, aik, pasic, felipe, Zhengxiao.zx,
	shuangtai.tst, Ken.Xue, zhi.a.wang, yan.y.zhao, qemu-devel, kvm

On Thu, 9 Jan 2020 02:11:11 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 1/9/2020 12:01 AM, Alex Williamson wrote:
> > On Wed, 8 Jan 2020 15:59:55 +0100
> > Cornelia Huck <cohuck@redhat.com> wrote:
> >   
> >> On Tue, 7 Jan 2020 11:56:02 -0700
> >> Alex Williamson <alex.williamson@redhat.com> wrote:
> >>  
> >>> On Tue, 7 Jan 2020 23:23:17 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:  
> >>  
> >>>> There are 3 invalid states:
> >>>>    *  101b => Invalid state
> >>>>    *  110b => Invalid state
> >>>>    *  111b => Invalid state
> >>>>
> >>>> why only 110b should be used to report error from vendor driver to
> >>>> report error? Aren't we adding more confusions in the interface?  
> >>>
> >>> I think the only chance of confusion is poor documentation.  If we
> >>> define all of the above as invalid and then say any invalid state
> >>> indicates an error condition, then the burden is on the user to
> >>> enumerate all the invalid states.  That's not a good idea.  Instead we
> >>> could say 101b (_RESUMING|_RUNNING) is reserved, it's not currently
> >>> used but it might be useful some day.  Therefore there are no valid
> >>> transitions into or out of this state.  A vendor driver should fail a
> >>> write(2) attempting to enter this state.
> >>>
> >>> That leaves 11Xb, where we consider _RESUMING and _SAVING as mutually
> >>> exclusive, so neither are likely to ever be valid states.  Logically,
> >>> if the device is in a failed state such that it needs to be reset to be
> >>> recovered, I would hope the device is not running, so !_RUNNING (110b)
> >>> seems appropriate.  I'm not sure we need that level of detail yet
> >>> though, so I was actually just assuming both 11Xb states would indicate
> >>> an error state and the undefined _RUNNING bit might differentiate
> >>> something in the future.
> >>>
> >>> Therefore, I think we'd have:
> >>>
> >>>   * 101b => Reserved
> >>>   * 11Xb => Error
> >>>
> >>> Where the device can only self transition into the Error state on a
> >>> failed device_state transition and the only exit from the Error state
> >>> is via the reset ioctl.  The Reserved state is unreachable.  The vendor
> >>> driver must error on device_state writes to enter or exit the Error
> >>> state and must error on writes to enter Reserved states.  Is that still
> >>> confusing?  
> >>
> >> I think one thing we could do is start to tie the meaning more to the
> >> actual state (bit combination) and less to the individual bits. I.e.
> >>
> >> - bit 0 indicates 'running',
> >> - bit 1 indicates 'saving',
> >> - bit 2 indicates 'resuming',
> >> - bits 3-31 are reserved. [Aside: reserved-and-ignored or
> >>    reserved-and-must-be-zero?]  
> > 
> > This version specified them as:
> > 
> > 	Bits 3 - 31 are reserved for future use. User should perform
> > 	read-modify-write operation on this field.
> > 
> > The intention is that the user should not make any assumptions about
> > the state of the reserved bits, but should preserve them when changing
> > known bits.  Therefore I think it's ignored but preserved.  If we
> > specify them as zero, then I think we lose any chance to define them
> > later.
> >   
> >> [Note that I don't specify what happens when a bit is set or unset.]
> >>
> >> States are then defined as:
> >> 000b => stopped state (not saving or resuming)
> >> 001b => running state (not saving or resuming)
> >> 010b => stop-and-copy state
> >> 011b => pre-copy state
> >> 100b => resuming state
> >>
> >> [Transitions between these states defined, as before.]
> >>
> >> 101b => reserved [for post-copy; no transitions defined]
> >> 111b => reserved [state does not make sense; no transitions defined]
> >> 110b => error state [state does not make sense per se, but it does not
> >>          indicate running; transitions into this state *are* possible]
> >>
> >> To a 'reserved' state, we can later assign a different meaning (we
> >> could even re-use 111b for a different error state, if needed); while
> >> the error state must always stay the error state.
> >>
> >> We should probably use some kind of feature indication to signify
> >> whether a 'reserved' state actually has a meaning. Also, maybe we also
> >> should designate the states > 111b as 'reserved'.
> >>
> >> Does that make sense?  
> > 
> > It seems you have an opinion to restrict this particular error state to
> > 110b rather than 11Xb, reserving 111b for some future error condition.
> > That's fine and I think we agree that using the state with _RUNNING set
> > to zero is more logical as we expect the device to be non-operational
> > in this state.
> > 
> > I'm also thinking more of these as states, but at the same time we're
> > not doing away with the bit definitions.  I think the states are much
> > easier to decode and use if we think about the function of each bit,
> > which leads to the logical incongruity that the 11Xb states are
> > impossible and therefore must be error states.
> >   
> 
> I agree on bit definition is better.
> 
> Ok. Should there be a defined value for error, which can be used by 
> vendor driver for error state?
> 
> #define VFIO_DEVICE_STATE_ERROR			\
> 		(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING)

Seems like a good idea for consistency.  Thanks,

Alex



* Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
  2020-01-08 22:44                                   ` Alex Williamson
@ 2020-01-10 14:21                                     ` Cornelia Huck
  0 siblings, 0 replies; 44+ messages in thread
From: Cornelia Huck @ 2020-01-10 14:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, Dr. David Alan Gilbert, cjia, kevin.tian,
	ziye.yang, changpeng.liu, yi.l.liu, mlevitsk, eskultet,
	jonathan.davies, eauger, aik, pasic, felipe, Zhengxiao.zx,
	shuangtai.tst, Ken.Xue, zhi.a.wang, yan.y.zhao, qemu-devel, kvm

On Wed, 8 Jan 2020 15:44:28 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Thu, 9 Jan 2020 02:11:11 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > On 1/9/2020 12:01 AM, Alex Williamson wrote:  
> > > On Wed, 8 Jan 2020 15:59:55 +0100
> > > Cornelia Huck <cohuck@redhat.com> wrote:

> > >> I think one thing we could do is start to tie the meaning more to the
> > >> actual state (bit combination) and less to the individual bits. I.e.
> > >>
> > >> - bit 0 indicates 'running',
> > >> - bit 1 indicates 'saving',
> > >> - bit 2 indicates 'resuming',
> > >> - bits 3-31 are reserved. [Aside: reserved-and-ignored or
> > >>    reserved-and-must-be-zero?]    
> > > 
> > > This version specified them as:
> > > 
> > > 	Bits 3 - 31 are reserved for future use. User should perform
> > > 	read-modify-write operation on this field.
> > > 
> > > The intention is that the user should not make any assumptions about
> > > the state of the reserved bits, but should preserve them when changing
> > > known bits.  Therefore I think it's ignored but preserved.  If we
> > > specify them as zero, then I think we lose any chance to define them
> > > later.

Nod. What about extending the description to:

"Bits 3-31 are reserved for future use. In order to preserve them, a
read-modify-write operation on this field should be used when modifying
the specified bits."

?

> > >     
> > >> [Note that I don't specify what happens when a bit is set or unset.]
> > >>
> > >> States are then defined as:
> > >> 000b => stopped state (not saving or resuming)
> > >> 001b => running state (not saving or resuming)
> > >> 010b => stop-and-copy state
> > >> 011b => pre-copy state
> > >> 100b => resuming state
> > >>
> > >> [Transitions between these states defined, as before.]
> > >>
> > >> 101b => reserved [for post-copy; no transitions defined]
> > >> 111b => reserved [state does not make sense; no transitions defined]
> > >> 110b => error state [state does not make sense per se, but it does not
> > >>          indicate running; transitions into this state *are* possible]
> > >>
> > >> To a 'reserved' state, we can later assign a different meaning (we
> > >> could even re-use 111b for a different error state, if needed); while
> > >> the error state must always stay the error state.
> > >>
> > >> We should probably use some kind of feature indication to signify
> > >> whether a 'reserved' state actually has a meaning. Also, maybe we also
> > >> should designate the states > 111b as 'reserved'.
> > >>
> > >> Does that make sense?    
> > > 
> > > It seems you have an opinion to restrict this particular error state to
> > > 110b rather than 11Xb, reserving 111b for some future error condition.
> > > That's fine and I think we agree that using the state with _RUNNING set
> > > to zero is more logical as we expect the device to be non-operational
> > > in this state.

Good.

> > > 
> > > I'm also thinking more of these as states, but at the same time we're
> > > not doing away with the bit definitions.  I think the states are much
> > > easier to decode and use if we think about the function of each bit,
> > > which leads to the logical incongruity that the 11Xb states are
> > > impossible and therefore must be error states.

Yes, that's fine.

> > >     
> > 
> > I agree on bit definition is better.
> > 
> > Ok. Should there be a defined value for error, which can be used by 
> > vendor driver for error state?
> > 
> > #define VFIO_DEVICE_STATE_ERROR			\
> > 		(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING)  
> 
> Seems like a good idea for consistency.  Thanks,
> 
> Alex

Agreed, I like that as well.



end of thread

Thread overview: 44+ messages
2019-12-16 20:21 [PATCH v10 Kernel 0/5] KABIs to support migration for VFIO devices Kirti Wankhede
2019-12-16 20:21 ` [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state Kirti Wankhede
2019-12-16 22:44   ` Alex Williamson
2019-12-17  6:28     ` Kirti Wankhede
2019-12-17  7:12       ` Yan Zhao
2019-12-17 18:43       ` Alex Williamson
2019-12-19 16:08         ` Kirti Wankhede
2019-12-19 17:27           ` Alex Williamson
2019-12-19 20:10             ` Kirti Wankhede
2019-12-19 21:09               ` Alex Williamson
2020-01-02 18:25                 ` Dr. David Alan Gilbert
2020-01-06 23:18                   ` Alex Williamson
2020-01-07  7:28                     ` Kirti Wankhede
2020-01-07 17:09                       ` Alex Williamson
2020-01-07 17:53                         ` Kirti Wankhede
2020-01-07 18:56                           ` Alex Williamson
2020-01-08 14:59                             ` Cornelia Huck
2020-01-08 18:31                               ` Alex Williamson
2020-01-08 20:41                                 ` Kirti Wankhede
2020-01-08 22:44                                   ` Alex Williamson
2020-01-10 14:21                                     ` Cornelia Huck
2020-01-07  9:57                     ` Dr. David Alan Gilbert
2020-01-07 16:54                       ` Alex Williamson
2020-01-07 17:50                         ` Dr. David Alan Gilbert
2019-12-16 20:21 ` [PATCH v10 Kernel 2/5] vfio iommu: Adds flag to indicate dirty pages tracking capability support Kirti Wankhede
2019-12-16 23:16   ` Alex Williamson
2019-12-17  6:32     ` Kirti Wankhede
2019-12-16 20:21 ` [PATCH v10 Kernel 3/5] vfio iommu: Add ioctl defination for dirty pages tracking Kirti Wankhede
2019-12-16 20:21 ` [PATCH v10 Kernel 4/5] vfio iommu: Implementation of ioctl to " Kirti Wankhede
2019-12-17  5:15   ` Yan Zhao
2019-12-17  9:24     ` Kirti Wankhede
2019-12-17  9:51       ` Yan Zhao
2019-12-17 11:47         ` Kirti Wankhede
2019-12-18  1:04           ` Yan Zhao
2019-12-18 20:05             ` Dr. David Alan Gilbert
2019-12-19  0:57               ` Yan Zhao
2019-12-19 16:21                 ` Kirti Wankhede
2019-12-20  0:58                   ` Yan Zhao
2020-01-03 19:44                     ` Dr. David Alan Gilbert
2020-01-04  3:53                       ` Yan Zhao
2019-12-18 21:39       ` Alex Williamson
2019-12-19 18:42         ` Kirti Wankhede
2019-12-19 18:56           ` Alex Williamson
2019-12-16 20:21 ` [PATCH v10 Kernel 5/5] vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap Kirti Wankhede
