KVM Archive on lore.kernel.org
* [PATCH v9 Kernel 0/5] Add KABIs to support migration for VFIO devices
@ 2019-11-12 17:03 Kirti Wankhede
  2019-11-12 17:03 ` [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state Kirti Wankhede
                   ` (4 more replies)
  0 siblings, 5 replies; 46+ messages in thread
From: Kirti Wankhede @ 2019-11-12 17:03 UTC
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

Hi Alex,

To keep the kernel and QEMU patches in sync, this patch set keeps the v9
version number. Up to v8, the KABI was discussed in the QEMU patch
series[1]. As discussed in the earlier version's mail thread and in person
at KVM Forum, this patch set adds:
* New IOCTL VFIO_IOMMU_GET_DIRTY_BITMAP to get the dirty pages bitmap with
  respect to the IOMMU container rather than per device. All pages pinned
  by the vendor driver through the vfio_pin_pages external API have to be
  marked as dirty during migration.
  When there are CPU writes, CPU dirty page tracking can identify dirtied
  pages, but any page pinned by the vendor driver can also be written by
  the device. As of now there is no device which has hardware support for
  dirty page tracking, so all pages which are pinned by the vendor driver
  should be considered dirty.
* New IOCTL VFIO_IOMMU_UNMAP_DMA_GET_BITMAP to get the dirty pages bitmap
  before unmapping an IO virtual address range.
  With vIOMMU, during the pre-copy phase of migration, while CPUs are
  still running, an IO virtual address unmap can happen while the device
  still holds references to guest pfns. Those pages should be reported as
  dirty before unmap, so that the VFIO user space application can copy the
  content of those pages from source to destination.

Yet TODO:
Since there is no device which has hardware support for system memory
dirty bitmap tracking, right now there is no other API from the vendor
driver to the VFIO IOMMU module to report dirty pages. In future, when
such hardware support is implemented, an API will be required so that the
vendor driver can report dirty pages to the VFIO module during migration
phases.

[1] https://www.mail-archive.com/qemu-devel@nongnu.org/msg640400.html

Adding revision history from previous QEMU patch set to understand KABI
changes done till now

v8 -> v9:
- Split patch set in 2 sets, Kernel and QEMU.
- Dirty pages bitmap is queried from the IOMMU container rather than from
  the vendor driver per device. Added 2 ioctls to achieve this.

v7 -> v8:
- Updated comments for KABI
- Added BAR address validation check during PCI device's config space load
  as suggested by Dr. David Alan Gilbert.
- Changed vfio_migration_set_state() to set or clear device state flags.
- Some nit fixes.

v6 -> v7:
- Fix build failures.

v5 -> v6:
- Fix build failure.

v4 -> v5:
- Added descriptive comment about the sequence of access of members of
  structure vfio_device_migration_info to be followed based on Alex's
  suggestion
- Updated get dirty pages sequence.
- As per Cornelia Huck's suggestion, added callbacks to VFIODeviceOps to
  get_object, save_config and load_config.
- Fixed multiple nit picks.
- Tested live migration with multiple vfio devices assigned to a VM.

v3 -> v4:
- Added one more bit for _RESUMING flag to be set explicitly.
- data_offset field is read-only for user space application.
- data_size is read for every iteration before reading data from the
  migration region; that is, removed the assumption that data extends to
  the end of the migration region.
- If the vendor driver supports mappable sparse regions, map those regions
  during the setup state of save/load, and similarly unmap them from the
  cleanup routines.
- Handled a race condition that caused data corruption in the migration
  region during save device state by adding a mutex and serializing the
  save_buffer and get_dirty_pages routines.
- Skipped calling the get_dirty_pages routine for mapped MMIO regions of
  the device.
- Added trace events.
- Split into multiple functional patches.

v2 -> v3:
- Removed enum of VFIO device states. Defined VFIO device state with 2
  bits.
- Re-structured vfio_device_migration_info to keep it minimal and defined
  action on read and write access on its members.

v1 -> v2:
- Defined MIGRATION region type and sub-type which should be used with
  region type capability.
- Re-structured vfio_device_migration_info. This structure will be placed
  at 0th offset of migration region.
- Replaced ioctl with read/write for trapped part of migration region.
- Added support for both types of access, trapped or mmapped, for the data
  section of the region.
- Moved PCI device functions to pci file.
- Added iteration to get dirty page bitmap until bitmap for all requested
  pages are copied.

Thanks,
Kirti

Kirti Wankhede (5):
  vfio: KABI for migration interface for device state
  vfio iommu: Add ioctl defination to get dirty pages bitmap.
  vfio iommu: Add ioctl defination to unmap IOVA and return dirty bitmap
  vfio iommu: Implementation of ioctl to get dirty pages bitmap.
  vfio iommu: Implementation of ioctl to get dirty bitmap before unmap

 drivers/vfio/vfio_iommu_type1.c | 163 ++++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/vfio.h       | 164 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 325 insertions(+), 2 deletions(-)

-- 
2.7.0



* [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
  2019-11-12 17:03 [PATCH v9 Kernel 0/5] Add KABIs to support migration for VFIO devices Kirti Wankhede
@ 2019-11-12 17:03 ` Kirti Wankhede
  2019-11-12 22:30   ` Alex Williamson
  2019-11-12 17:03 ` [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap Kirti Wankhede
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 46+ messages in thread
From: Kirti Wankhede @ 2019-11-12 17:03 UTC
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

- Defined MIGRATION region type and sub-type.
- Used 3 bits to define VFIO device states.
    Bit 0 => _RUNNING
    Bit 1 => _SAVING
    Bit 2 => _RESUMING
    Combination of these bits defines VFIO device's state during migration
    _RUNNING => Normal VFIO device running state. When this bit is
		cleared, it indicates _STOPPED state. When device is
		changed to _STOPPED, driver should stop device before
		write() returns.
    _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
                          start saving state of device i.e. pre-copy state
    _SAVING  => vCPUs are stopped, VFIO device should be stopped, and
                save device state, i.e. stop-n-copy state
    _RESUMING => VFIO device resuming state.
    _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states
    Bits 3 - 31 are reserved for future use. User should perform
    read-modify-write operation on this field.
- Defined vfio_device_migration_info structure which will be placed at 0th
  offset of migration region to get/set VFIO device related information.
  Defined members of structure and usage on read/write access:
    * device_state: (read/write)
        To convey VFIO device state to be transitioned to. Only 3 bits are
        used as of now; bits 3 - 31 are reserved for future use.
    * pending_bytes: (read only)
        To get pending bytes yet to be migrated for VFIO device.
    * data_offset: (read only)
        To get data offset in migration region from where data exists
        during _SAVING and from where data should be written by user space
        application during _RESUMING state.
    * data_size: (read/write)
        To get and set size in bytes of data copied in migration region
        during _SAVING and _RESUMING state.

Migration region looks like:
 ------------------------------------------------------------------
|vfio_device_migration_info|    data section                      |
|                          |     ///////////////////////////////  |
 ------------------------------------------------------------------
 ^                              ^
 offset 0-trapped part        data_offset

Structure vfio_device_migration_info is always followed by the data section
in the region, so data_offset will always be non-0. The offset from which
data is copied is decided by the kernel driver; the data section can be
trapped or mapped depending on how the kernel driver defines it.
A data section partition can be defined as mapped by the sparse mmap
capability. If mmapped, then data_offset should be page aligned, whereas
the initial section which contains the vfio_device_migration_info structure
might not end at an offset which is page aligned.
The vendor driver should decide whether to partition the data section and
how to partition it. The vendor driver should return data_offset
accordingly.

For user application, data is opaque. User should write data in the same
order as received.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 include/uapi/linux/vfio.h | 108 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 108 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9e843a147ead..35b09427ad9f 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
 #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
 #define VFIO_REGION_TYPE_GFX                    (1)
 #define VFIO_REGION_TYPE_CCW			(2)
+#define VFIO_REGION_TYPE_MIGRATION              (3)
 
 /* sub-types for VFIO_REGION_TYPE_PCI_* */
 
@@ -379,6 +380,113 @@ struct vfio_region_gfx_edid {
 /* sub-types for VFIO_REGION_TYPE_CCW */
 #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
 
+/* sub-types for VFIO_REGION_TYPE_MIGRATION */
+#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
+
+/*
+ * Structure vfio_device_migration_info is placed at 0th offset of
+ * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
+ * information. Field accesses from this structure are only supported at their
+ * native width and alignment, otherwise the result is undefined and vendor
+ * drivers should return an error.
+ *
+ * device_state: (read/write)
+ *      Indicates to the vendor driver the state the VFIO device should be
+ *      transitioned to. If device state transition fails, write on this field
+ *      returns error. It consists of 3 bits:
+ *      - If bit 0 set, indicates _RUNNING state. When it is cleared, that
+ *        indicates _STOPPED state. When device is changed to _STOPPED, driver
+ *        should stop device before write() returns.
+ *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
+ *        should start gathering device state information which will be provided
+ *        to VFIO user space application to save device's state.
+ *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
+ *        prepare to resume device, data provided through migration region
+ *        should be used to resume device.
+ *      Bits 3 - 31 are reserved for future use. User should perform
+ *      read-modify-write operation on this field.
+ *      _SAVING and _RESUMING bits set at the same time is invalid state.
+ *      Similarly _RUNNING and _RESUMING bits set is invalid state.
+ *
+ * pending_bytes: (read only)
+ *      Number of pending bytes yet to be migrated from vendor driver
+ *
+ * data_offset: (read only)
+ *      User application should read data_offset in migration region from where
+ *      user application should read device data during _SAVING state or write
+ *      device data during _RESUMING state. See below for detail of sequence to
+ *      be followed.
+ *
+ * data_size: (read/write)
+ *      User application should read data_size to get size of data copied in
+ *      bytes in migration region during _SAVING state and write size of data
+ *      copied in bytes in migration region during _RESUMING state.
+ *
+ * Migration region looks like:
+ *  ------------------------------------------------------------------
+ * |vfio_device_migration_info|    data section                      |
+ * |                          |     ///////////////////////////////  |
+ *  ------------------------------------------------------------------
+ *   ^                              ^
+ *  offset 0-trapped part        data_offset
+ *
+ * Structure vfio_device_migration_info is always followed by data section in
+ * the region, so data_offset will always be non-0. Offset from where data is
+ * copied is decided by kernel driver, data section can be trapped or mapped
+ * or partitioned, depending on how kernel driver defines data section.
+ * Data section partition can be defined as mapped by sparse mmap capability.
+ * If mmapped, then data_offset should be page aligned, whereas initial section
+ * which contains vfio_device_migration_info structure might not end at offset
+ * which is page aligned.
+ * Vendor driver should decide whether to partition data section and how to
+ * partition the data section. Vendor driver should return data_offset
+ * accordingly.
+ *
+ * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
+ * and for _SAVING device state or stop-and-copy phase:
+ * a. read pending_bytes. If pending_bytes > 0, go through below steps.
+ * b. read data_offset, indicates kernel driver to write data to staging buffer.
+ *    Kernel driver should return this read operation only after writing data to
+ *    staging buffer is done.
+ * c. read data_size, amount of data in bytes written by vendor driver in
+ *    migration region.
+ * d. read data_size bytes of data from data_offset in the migration region.
+ * e. process data.
+ * f. Loop through a to e. Next read on pending_bytes indicates that read data
+ *    operation from migration region for previous iteration is done.
+ *
+ * Sequence to be followed while _RESUMING device state:
+ * While data for this device is available, repeat below steps:
+ * a. read data_offset from where user application should write data.
+ * b. write data of data_size to migration region from data_offset.
+ * c. write data_size which indicates vendor driver that data is written in
+ *    staging buffer. Vendor driver should read this data from migration
+ *    region and resume device's state.
+ *
+ * For user application, data is opaque. User should write data in the same
+ * order as received.
+ */
+
+struct vfio_device_migration_info {
+	__u32 device_state;         /* VFIO device state */
+#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
+#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
+#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
+#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
+				     VFIO_DEVICE_STATE_SAVING |  \
+				     VFIO_DEVICE_STATE_RESUMING)
+
+#define VFIO_DEVICE_STATE_INVALID_CASE1    (VFIO_DEVICE_STATE_SAVING | \
+					    VFIO_DEVICE_STATE_RESUMING)
+
+#define VFIO_DEVICE_STATE_INVALID_CASE2    (VFIO_DEVICE_STATE_RUNNING | \
+					    VFIO_DEVICE_STATE_RESUMING)
+	__u32 reserved;
+	__u64 pending_bytes;
+	__u64 data_offset;
+	__u64 data_size;
+} __attribute__((packed));
+
 /*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within
-- 
2.7.0
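
Editorial sketch (not part of the posted patch): the state bits and steps
a. to d. of the _SAVING sequence described in the header comment above could
be exercised in user space roughly as follows. The bit values and structure
layout are copied from the patch. The names device_state_is_valid(),
device_state_update(), region_read() and save_one_chunk() are hypothetical
and exist only for illustration, and the migration region is modelled as a
plain memory buffer rather than a real device fd.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bit values and layout copied from the patch above. */
#define VFIO_DEVICE_STATE_RUNNING   (1u << 0)
#define VFIO_DEVICE_STATE_SAVING    (1u << 1)
#define VFIO_DEVICE_STATE_RESUMING  (1u << 2)
#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
				     VFIO_DEVICE_STATE_SAVING |  \
				     VFIO_DEVICE_STATE_RESUMING)

struct vfio_device_migration_info {
	uint32_t device_state;
	uint32_t reserved;
	uint64_t pending_bytes;
	uint64_t data_offset;
	uint64_t data_size;
} __attribute__((packed));

/* Hypothetical: true if the low 3 bits form a state the patch allows. */
static bool device_state_is_valid(uint32_t device_state)
{
	uint32_t s = device_state & VFIO_DEVICE_STATE_MASK;

	/* _RESUMING may not be combined with _SAVING or _RUNNING. */
	if ((s & VFIO_DEVICE_STATE_RESUMING) &&
	    (s & (VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING)))
		return false;
	return true;
}

/*
 * Hypothetical: value to write back to device_state. Replaces the 3 state
 * bits and preserves reserved bits 3 - 31 (read-modify-write).
 */
static uint32_t device_state_update(uint32_t current, uint32_t new_state)
{
	return (current & ~VFIO_DEVICE_STATE_MASK) |
	       (new_state & VFIO_DEVICE_STATE_MASK);
}

/* Hypothetical stand-in for pread() on the device fd. */
static void region_read(const uint8_t *region, uint64_t off,
			void *buf, size_t len)
{
	memcpy(buf, region + off, len);
}

/*
 * One pass over steps a. to d. Returns the number of bytes copied to out,
 * or 0 when nothing is pending or out is too small for the chunk.
 */
static uint64_t save_one_chunk(const uint8_t *region, uint8_t *out,
			       uint64_t out_cap)
{
	uint64_t pending, data_offset, data_size;

	/* a. read pending_bytes; nothing to do when it is 0 */
	region_read(region,
		    offsetof(struct vfio_device_migration_info, pending_bytes),
		    &pending, sizeof(pending));
	if (pending == 0)
		return 0;

	/* b. read data_offset (a real driver fills its staging buffer here) */
	region_read(region,
		    offsetof(struct vfio_device_migration_info, data_offset),
		    &data_offset, sizeof(data_offset));

	/* c. read data_size */
	region_read(region,
		    offsetof(struct vfio_device_migration_info, data_size),
		    &data_size, sizeof(data_size));

	/* d. copy data_size bytes starting at data_offset */
	if (data_size > out_cap)
		return 0;
	region_read(region, data_offset, out, (size_t)data_size);
	return data_size;
}
```

A real caller would repeat save_one_chunk() until pending_bytes reads 0 and
would treat the copied bytes as opaque (steps e. and f.). With a real vendor
driver, pending_bytes changes between iterations, which this static buffer
does not model.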



* [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.
  2019-11-12 17:03 [PATCH v9 Kernel 0/5] Add KABIs to support migration for VFIO devices Kirti Wankhede
  2019-11-12 17:03 ` [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state Kirti Wankhede
@ 2019-11-12 17:03 ` Kirti Wankhede
  2019-11-12 22:30   ` Alex Williamson
  2019-11-12 17:03 ` [PATCH v9 Kernel 3/5] vfio iommu: Add ioctl defination to unmap IOVA and return dirty bitmap Kirti Wankhede
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 46+ messages in thread
From: Kirti Wankhede @ 2019-11-12 17:03 UTC
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

All pages pinned by the vendor driver through the vfio_pin_pages API should
be considered dirty during migration. The IOMMU container maintains a list
of all such pinned pages. Added an ioctl definition to get a bitmap of such
pinned pages for the requested IO virtual address range.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 include/uapi/linux/vfio.h | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 35b09427ad9f..6fd3822aa610 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -902,6 +902,29 @@ struct vfio_iommu_type1_dma_unmap {
 #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
 #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
 
+/**
+ * VFIO_IOMMU_GET_DIRTY_BITMAP - _IOWR(VFIO_TYPE, VFIO_BASE + 17,
+ *                                     struct vfio_iommu_type1_dirty_bitmap)
+ *
+ * IOCTL to get dirty pages bitmap for IOMMU container during migration.
+ * Get dirty pages bitmap of given IO virtual addresses range using
+ * struct vfio_iommu_type1_dirty_bitmap. Caller sets argsz, which is size of
+ * struct vfio_iommu_type1_dirty_bitmap. User should allocate memory to get
+ * bitmap and should set size of allocated memory in bitmap_size field.
+ * One bit represents one page, consecutively starting from iova
+ * offset. Bit set indicates page at that offset from iova is dirty.
+ */
+struct vfio_iommu_type1_dirty_bitmap {
+	__u32        argsz;
+	__u32        flags;
+	__u64        iova;                      /* IO virtual address */
+	__u64        size;                      /* Size of iova range */
+	__u64        bitmap_size;               /* in bytes */
+	void __user *bitmap;                    /* one bit per page */
+};
+
+#define VFIO_IOMMU_GET_DIRTY_BITMAP             _IO(VFIO_TYPE, VFIO_BASE + 17)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.7.0
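
Editorial sketch (not part of the posted patch): the "one bit per page,
starting from iova" layout described above implies the following sizing and
lookup arithmetic on the user-space side. The helper names and
EXAMPLE_PAGE_SHIFT are hypothetical. The sketch assumes 4 KiB pages and a
64-bit kernel, so the bitmap is handled in 64-bit words; real code would
query the page size instead of hard-coding it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for illustration only: 4 KiB pages. */
#define EXAMPLE_PAGE_SHIFT 12

/* Bytes needed for one bit per page, rounded up to whole 64-bit words. */
static uint64_t dirty_bitmap_bytes(uint64_t range_size)
{
	uint64_t npages = range_size >> EXAMPLE_PAGE_SHIFT;

	return ((npages + 63) / 64) * 8;
}

/*
 * True if the page containing addr is marked dirty. Bit 0 of the bitmap
 * corresponds to the page at iova, the start of the queried range.
 */
static bool iova_page_is_dirty(const uint64_t *bitmap, uint64_t iova,
			       uint64_t addr)
{
	uint64_t bit = (addr - iova) >> EXAMPLE_PAGE_SHIFT;

	return (bitmap[bit / 64] >> (bit % 64)) & 1;
}
```

The value from dirty_bitmap_bytes() is what a caller would place in
bitmap_size after allocating that many zeroed bytes for the bitmap pointer.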



* [PATCH v9 Kernel 3/5] vfio iommu: Add ioctl defination to unmap IOVA and return dirty bitmap
  2019-11-12 17:03 [PATCH v9 Kernel 0/5] Add KABIs to support migration for VFIO devices Kirti Wankhede
  2019-11-12 17:03 ` [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state Kirti Wankhede
  2019-11-12 17:03 ` [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap Kirti Wankhede
@ 2019-11-12 17:03 ` Kirti Wankhede
  2019-11-12 22:30   ` Alex Williamson
  2019-11-12 17:03 ` [PATCH v9 Kernel 4/5] vfio iommu: Implementation of ioctl to get dirty pages bitmap Kirti Wankhede
  2019-11-12 17:03 ` [PATCH v9 Kernel 5/5] vfio iommu: Implementation of ioctl to get dirty bitmap before unmap Kirti Wankhede
  4 siblings, 1 reply; 46+ messages in thread
From: Kirti Wankhede @ 2019-11-12 17:03 UTC
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

With vIOMMU, during the pre-copy phase of migration, while CPUs are still
running, an IO virtual address unmap can happen while the device still
holds references to guest pfns. Those pages should be reported as dirty
before unmap, so that the VFIO user space application can copy the content
of those pages from source to destination.

The ioctl definition added here adds a bitmap pointer, size and flag. If
flag VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP is set, bitmap memory is
allocated and bitmap_size is set, then the ioctl will create a bitmap of
pinned pages and then unmap those.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 include/uapi/linux/vfio.h | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 6fd3822aa610..72fd297baf52 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -925,6 +925,39 @@ struct vfio_iommu_type1_dirty_bitmap {
 
 #define VFIO_IOMMU_GET_DIRTY_BITMAP             _IO(VFIO_TYPE, VFIO_BASE + 17)
 
+/**
+ * VFIO_IOMMU_UNMAP_DMA_GET_BITMAP - _IOWR(VFIO_TYPE, VFIO_BASE + 18,
+ *				      struct vfio_iommu_type1_dma_unmap_bitmap)
+ *
+ * Unmap IO virtual addresses using the provided struct
+ * vfio_iommu_type1_dma_unmap_bitmap.  Caller sets argsz.
+ * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get dirty bitmap
+ * before unmapping IO virtual addresses. If this flag is not set, only IO
+ * virtual addresses are unmapped without creating pinned pages bitmap, that
+ * is, it behaves the same as VFIO_IOMMU_UNMAP_DMA ioctl.
+ * User should allocate memory to get bitmap and should set size of allocated
+ * memory in bitmap_size field. One bit in bitmap represents one page,
+ * consecutively starting from iova offset. Bit set indicates page at that
+ * offset from iova is dirty.
+ * The actual unmapped size is returned in the size field and bitmap of pages
+ * in the range of unmapped size is returned in bitmap if flag
+ * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP is set.
+ *
+ * No guarantee is made to the user that arbitrary unmaps of iova or size
+ * different from those used in the original mapping call will succeed.
+ */
+struct vfio_iommu_type1_dma_unmap_bitmap {
+	__u32        argsz;
+	__u32        flags;
+#define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
+	__u64        iova;                        /* IO virtual address */
+	__u64        size;                        /* Size of mapping (bytes) */
+	__u64        bitmap_size;                 /* in bytes */
+	void __user *bitmap;                      /* one bit per page */
+};
+
+#define VFIO_IOMMU_UNMAP_DMA_GET_BITMAP _IO(VFIO_TYPE, VFIO_BASE + 18)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.7.0
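
Editorial sketch (not part of the posted patch): a caller following the
comment above would fill the request roughly like this before issuing the
ioctl. The struct is a user-space mirror of the one proposed in this patch
(with the __user annotation dropped), and prepare_unmap_with_bitmap() is a
hypothetical helper. The ioctl call itself is omitted because the interface
exists only as proposed in this series.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)

/* User-space mirror of the struct proposed in this patch. */
struct vfio_iommu_type1_dma_unmap_bitmap {
	uint32_t argsz;
	uint32_t flags;
	uint64_t iova;        /* IO virtual address */
	uint64_t size;        /* Size of mapping (bytes) */
	uint64_t bitmap_size; /* in bytes */
	void    *bitmap;      /* one bit per page */
};

/*
 * Hypothetical helper: prepare a request that asks for the dirty bitmap.
 * Returns 0 on success, -1 if the range is smaller than one page or the
 * allocation fails. The caller frees req->bitmap after the ioctl.
 */
static int prepare_unmap_with_bitmap(
		struct vfio_iommu_type1_dma_unmap_bitmap *req,
		uint64_t iova, uint64_t size, unsigned int page_shift)
{
	uint64_t npages = size >> page_shift;
	/* One bit per page, rounded up to whole 64-bit words. */
	uint64_t bytes = ((npages + 63) / 64) * 8;

	memset(req, 0, sizeof(*req));
	if (npages == 0)
		return -1;

	req->bitmap = calloc(1, (size_t)bytes);
	if (!req->bitmap)
		return -1;

	req->argsz = sizeof(*req);
	req->flags = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
	req->iova = iova;
	req->size = size;
	req->bitmap_size = bytes;
	return 0;
}
```

Leaving flags at 0 and bitmap at NULL would correspond to the plain unmap
behaviour the comment describes.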



* [PATCH v9 Kernel 4/5] vfio iommu: Implementation of ioctl to get dirty pages bitmap.
  2019-11-12 17:03 [PATCH v9 Kernel 0/5] Add KABIs to support migration for VFIO devices Kirti Wankhede
                   ` (2 preceding siblings ...)
  2019-11-12 17:03 ` [PATCH v9 Kernel 3/5] vfio iommu: Add ioctl defination to unmap IOVA and return dirty bitmap Kirti Wankhede
@ 2019-11-12 17:03 ` Kirti Wankhede
  2019-11-12 22:30   ` Alex Williamson
  2019-11-12 17:03 ` [PATCH v9 Kernel 5/5] vfio iommu: Implementation of ioctl to get dirty bitmap before unmap Kirti Wankhede
  4 siblings, 1 reply; 46+ messages in thread
From: Kirti Wankhede @ 2019-11-12 17:03 UTC
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

The IOMMU container maintains a list of externally pinned pages. A bitmap
of pinned pages for the input IO virtual address range is created and
returned. The IO virtual address range should be from a single mapping
created by a map request. Input bitmap_size is validated by calculating
the size of the requested range.
This ioctl returns a bitmap of dirty pages; it is the user space
application's responsibility to copy the content of dirty pages from source
to destination during migration.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 drivers/vfio/vfio_iommu_type1.c | 92 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 92 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..ac176e672857 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -850,6 +850,81 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
 	return bitmap;
 }
 
+/*
+ * start_iova is the reference from where bitmapping started. This is called
+ * from DMA_UNMAP where start_iova can be different than iova
+ */
+
+static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
+				  size_t size, dma_addr_t start_iova,
+				  unsigned long *bitmap)
+{
+	struct vfio_dma *dma;
+	dma_addr_t temp_iova = iova;
+
+	dma = vfio_find_dma(iommu, iova, size);
+	if (!dma)
+		return -EINVAL;
+
+	/*
+	 * Range should be from a single mapping created by map request.
+	 */
+
+	if ((iova < dma->iova) ||
+	    ((dma->iova + dma->size) < (iova + size)))
+		return -EINVAL;
+
+	while (temp_iova < iova + size) {
+		struct vfio_pfn *vpfn = vfio_find_vpfn(dma, temp_iova);
+
+		if (vpfn)
+			__bitmap_set(bitmap,
+				     (vpfn->iova - start_iova) >> PAGE_SHIFT, 1);
+
+		temp_iova += PAGE_SIZE;
+	}
+
+	return 0;
+}
+
+static int verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
+{
+	unsigned long bsize = ALIGN(npages, BITS_PER_LONG) / 8;
+
+	if ((bitmap_size == 0) || (bitmap_size < bsize))
+		return -EINVAL;
+	return 0;
+}
+
+static int vfio_iova_get_dirty_bitmap(struct vfio_iommu *iommu,
+				struct vfio_iommu_type1_dirty_bitmap *range)
+{
+	unsigned long *bitmap;
+	int ret;
+
+	ret = verify_bitmap_size(range->size >> PAGE_SHIFT, range->bitmap_size);
+	if (ret)
+		return ret;
+
+	/* one bit per page */
+	bitmap = bitmap_zalloc(range->size >> PAGE_SHIFT, GFP_KERNEL);
+	if (!bitmap)
+		return -ENOMEM;
+
+	mutex_lock(&iommu->lock);
+	ret = vfio_iova_dirty_bitmap(iommu, range->iova, range->size,
+				     range->iova, bitmap);
+	mutex_unlock(&iommu->lock);
+
+	if (!ret) {
+		if (copy_to_user(range->bitmap, bitmap, range->bitmap_size))
+			ret = -EFAULT;
+	}
+
+	bitmap_free(bitmap);
+	return ret;
+}
+
 static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 			     struct vfio_iommu_type1_dma_unmap *unmap)
 {
@@ -2297,6 +2372,23 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 
 		return copy_to_user((void __user *)arg, &unmap, minsz) ?
 			-EFAULT : 0;
+	} else if (cmd == VFIO_IOMMU_GET_DIRTY_BITMAP) {
+		struct vfio_iommu_type1_dirty_bitmap range;
+
+		/* Supported for v2 version only */
+		if (!iommu->v2)
+			return -EACCES;
+
+		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
+					bitmap);
+
+		if (copy_from_user(&range, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (range.argsz < minsz)
+			return -EINVAL;
+
+		return vfio_iova_get_dirty_bitmap(iommu, &range);
 	}
 
 	return -ENOTTY;
-- 
2.7.0
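
Editorial sketch (not part of the posted patch): a user-space model of the
marking loop in vfio_iova_dirty_bitmap(). A plain array of pinned IOVAs
stands in for the container's vfio_pfn tracking, and mark_pinned_pages() and
EXAMPLE_PAGE_SHIFT are hypothetical names. The sketch assumes 4 KiB pages
and converts the byte offset from start_iova into a page index before
setting the bit, so the bitmap holds exactly one bit per page.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption for illustration only: 4 KiB pages. */
#define EXAMPLE_PAGE_SHIFT 12

/*
 * Set one bit for every pinned page inside [iova, iova + size). The bit
 * index is relative to start_iova, which may be lower than iova when several
 * mappings share one bitmap (the unmap case).
 */
static void mark_pinned_pages(uint64_t *bitmap, uint64_t start_iova,
			      uint64_t iova, uint64_t size,
			      const uint64_t *pinned, size_t npinned)
{
	for (size_t i = 0; i < npinned; i++) {
		uint64_t bit;

		/* Skip pinned pages outside the requested range. */
		if (pinned[i] < iova || pinned[i] >= iova + size)
			continue;

		bit = (pinned[i] - start_iova) >> EXAMPLE_PAGE_SHIFT;
		bitmap[bit / 64] |= 1ull << (bit % 64);
	}
}
```

The caller is expected to size the bitmap so that every page from start_iova
to the end of the range has a bit.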



* [PATCH v9 Kernel 5/5] vfio iommu: Implementation of ioctl to get dirty bitmap before unmap
  2019-11-12 17:03 [PATCH v9 Kernel 0/5] Add KABIs to support migration for VFIO devices Kirti Wankhede
                   ` (3 preceding siblings ...)
  2019-11-12 17:03 ` [PATCH v9 Kernel 4/5] vfio iommu: Implementation of ioctl to get dirty pages bitmap Kirti Wankhede
@ 2019-11-12 17:03 ` Kirti Wankhede
  2019-11-12 22:30   ` Alex Williamson
  4 siblings, 1 reply; 46+ messages in thread
From: Kirti Wankhede @ 2019-11-12 17:03 UTC
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

If pages are pinned by the external interface for the requested IO virtual
address range, a bitmap of such pages is created and then that range is
unmapped. To get the bitmap during unmap, the user should set flag
VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP, allocate bitmap memory and set
bitmap_size. If the flag is not set, it behaves the same as the
VFIO_IOMMU_UNMAP_DMA ioctl.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 drivers/vfio/vfio_iommu_type1.c | 71 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 69 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index ac176e672857..d6b988452ba6 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -926,7 +926,8 @@ static int vfio_iova_get_dirty_bitmap(struct vfio_iommu *iommu,
 }
 
 static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
-			     struct vfio_iommu_type1_dma_unmap *unmap)
+			     struct vfio_iommu_type1_dma_unmap *unmap,
+			     unsigned long *bitmap)
 {
 	uint64_t mask;
 	struct vfio_dma *dma, *dma_last = NULL;
@@ -1026,6 +1027,12 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 						    &nb_unmap);
 			goto again;
 		}
+
+		if (bitmap) {
+			vfio_iova_dirty_bitmap(iommu, dma->iova, dma->size,
+					       unmap->iova, bitmap);
+		}
+
 		unmapped += dma->size;
 		vfio_remove_dma(iommu, dma);
 	}
@@ -1039,6 +1046,43 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 	return ret;
 }
 
+static int vfio_dma_do_unmap_bitmap(struct vfio_iommu *iommu,
+		struct vfio_iommu_type1_dma_unmap_bitmap *unmap_bitmap)
+{
+	struct vfio_iommu_type1_dma_unmap unmap;
+	unsigned long *bitmap = NULL;
+	int ret;
+
+	/* check bitmap size */
+	if (unmap_bitmap->flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
+		ret = verify_bitmap_size(unmap_bitmap->size >> PAGE_SHIFT,
+					 unmap_bitmap->bitmap_size);
+		if (ret)
+			return ret;
+
+		/* one bit per page */
+		bitmap = bitmap_zalloc(unmap_bitmap->size >> PAGE_SHIFT,
+					GFP_KERNEL);
+		if (!bitmap)
+			return -ENOMEM;
+	}
+
+	unmap.iova = unmap_bitmap->iova;
+	unmap.size = unmap_bitmap->size;
+	ret = vfio_dma_do_unmap(iommu, &unmap, bitmap);
+	if (!ret)
+		unmap_bitmap->size = unmap.size;
+
+	if (bitmap) {
+		if (!ret && copy_to_user(unmap_bitmap->bitmap, bitmap,
+					 unmap_bitmap->bitmap_size))
+			ret = -EFAULT;
+		bitmap_free(bitmap);
+	}
+
+	return ret;
+}
+
 static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
 			  unsigned long pfn, long npage, int prot)
 {
@@ -2366,7 +2410,7 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 		if (unmap.argsz < minsz || unmap.flags)
 			return -EINVAL;
 
-		ret = vfio_dma_do_unmap(iommu, &unmap);
+		ret = vfio_dma_do_unmap(iommu, &unmap, NULL);
 		if (ret)
 			return ret;
 
@@ -2389,6 +2433,29 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 			return -EINVAL;
 
 		return vfio_iova_get_dirty_bitmap(iommu, &range);
+	} else if (cmd == VFIO_IOMMU_UNMAP_DMA_GET_BITMAP) {
+		struct vfio_iommu_type1_dma_unmap_bitmap unmap_bitmap;
+		long ret;
+
+		/* Supported for v2 version only */
+		if (!iommu->v2)
+			return -EACCES;
+
+		minsz = offsetofend(struct vfio_iommu_type1_dma_unmap_bitmap,
+				    bitmap);
+
+		if (copy_from_user(&unmap_bitmap, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (unmap_bitmap.argsz < minsz)
+			return -EINVAL;
+
+		ret = vfio_dma_do_unmap_bitmap(iommu, &unmap_bitmap);
+		if (ret)
+			return ret;
+
+		return copy_to_user((void __user *)arg, &unmap_bitmap, minsz) ?
+			-EFAULT : 0;
 	}
 
 	return -ENOTTY;
-- 
2.7.0



* Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
  2019-11-12 17:03 ` [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state Kirti Wankhede
@ 2019-11-12 22:30   ` Alex Williamson
  2019-11-13  3:23     ` Yan Zhao
  2019-11-13 10:24     ` Cornelia Huck
  0 siblings, 2 replies; 46+ messages in thread
From: Alex Williamson @ 2019-11-12 22:30 UTC
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Tue, 12 Nov 2019 22:33:36 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> - Defined MIGRATION region type and sub-type.
> - Used 3 bits to define VFIO device states.
>     Bit 0 => _RUNNING
>     Bit 1 => _SAVING
>     Bit 2 => _RESUMING
>     Combination of these bits defines VFIO device's state during migration
>     _RUNNING => Normal VFIO device running state. When its reset, it
> 		indicates _STOPPED state. when device is changed to
> 		_STOPPED, driver should stop device before write()
> 		returns.
>     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
>                           start saving state of device i.e. pre-copy state
>     _SAVING  => vCPUs are stopped, VFIO device should be stopped, and

s/should/must/

>                 save device state,i.e. stop-n-copy state
>     _RESUMING => VFIO device resuming state.
>     _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states

A table might be useful here and in the uapi header to indicate valid
states:

| _RESUMING | _SAVING | _RUNNING | Description
+-----------+---------+----------+------------------------------------------
|     0     |    0    |     0    | Stopped, not saving or resuming (a)
+-----------+---------+----------+------------------------------------------
|     0     |    0    |     1    | Running, default state
+-----------+---------+----------+------------------------------------------
|     0     |    1    |     0    | Stopped, migration interface in save mode
+-----------+---------+----------+------------------------------------------
|     0     |    1    |     1    | Running, save mode interface, iterative
+-----------+---------+----------+------------------------------------------
|     1     |    0    |     0    | Stopped, migration resume interface active
+-----------+---------+----------+------------------------------------------
|     1     |    0    |     1    | Invalid (b)
+-----------+---------+----------+------------------------------------------
|     1     |    1    |     0    | Invalid (c)
+-----------+---------+----------+------------------------------------------
|     1     |    1    |     1    | Invalid (d)

I think we need to consider whether we define (a) as generally
available, for instance we might want to use it for diagnostics or a
fatal error condition outside of migration.

Are there hidden assumptions between state transitions here or are
there specific next possible state diagrams that we need to include as
well?

I'm curious if Intel agrees with the states marked invalid with their
push for post-copy support.

>     Bits 3 - 31 are reserved for future use. User should perform
>     read-modify-write operation on this field.
> - Defined vfio_device_migration_info structure which will be placed at 0th
>   offset of migration region to get/set VFIO device related information.
>   Defined members of structure and usage on read/write access:
>     * device_state: (read/write)
>         To convey VFIO device state to be transitioned to. Only 3 bits are
> 	used as of now, Bits 3 - 31 are reserved for future use.
>     * pending bytes: (read only)
>         To get pending bytes yet to be migrated for VFIO device.
>     * data_offset: (read only)
>         To get data offset in migration region from where data exist
> 	during _SAVING and from where data should be written by user space
> 	application during _RESUMING state.
>     * data_size: (read/write)
>         To get and set size in bytes of data copied in migration region
> 	during _SAVING and _RESUMING state.
> 
> Migration region looks like:
>  ------------------------------------------------------------------
> |vfio_device_migration_info|    data section                      |
> |                          |     ///////////////////////////////  |
>  ------------------------------------------------------------------
>  ^                              ^
>  offset 0-trapped part        data_offset
> 
> Structure vfio_device_migration_info is always followed by data section
> in the region, so data_offset will always be non-0. Offset from where data
> to be copied is decided by kernel driver, data section can be trapped or
> mapped depending on how kernel driver defines data section.
> Data section partition can be defined as mapped by sparse mmap capability.
> If mmapped, then data_offset should be page aligned, where as initial
> section which contain vfio_device_migration_info structure might not end
> at offset which is page aligned.
> Vendor driver should decide whether to partition data section and how to
> partition the data section. Vendor driver should return data_offset
> accordingly.
> 
> For user application, data is opaque. User should write data in the same
> order as received.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  include/uapi/linux/vfio.h | 108 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 108 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9e843a147ead..35b09427ad9f 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
>  #define VFIO_REGION_TYPE_GFX                    (1)
>  #define VFIO_REGION_TYPE_CCW			(2)
> +#define VFIO_REGION_TYPE_MIGRATION              (3)
>  
>  /* sub-types for VFIO_REGION_TYPE_PCI_* */
>  
> @@ -379,6 +380,113 @@ struct vfio_region_gfx_edid {
>  /* sub-types for VFIO_REGION_TYPE_CCW */
>  #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
>  
> +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
> +#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
> +
> +/*
> + * Structure vfio_device_migration_info is placed at 0th offset of
> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> + * information. Field accesses from this structure are only supported at their
> + * native width and alignment, otherwise the result is undefined and vendor
> + * drivers should return an error.
> + *
> + * device_state: (read/write)
> + *      To indicate vendor driver the state VFIO device should be transitioned
> + *      to. If device state transition fails, write on this field return error.
> + *      It consists of 3 bits:
> + *      - If bit 0 set, indicates _RUNNING state. When its reset, that indicates

Let's use set/cleared or 1/0 to indicate bit values, 'reset' is somewhat
ambiguous.

> + *        _STOPPED state. When device is changed to _STOPPED, driver should stop
> + *        device before write() returns.
> + *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
> + *        should start gathering device state information which will be provided
> + *        to VFIO user space application to save device's state.
> + *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
> + *        prepare to resume device, data provided through migration region
> + *        should be used to resume device.
> + *      Bits 3 - 31 are reserved for future use. User should perform
> + *      read-modify-write operation on this field.
> + *      _SAVING and _RESUMING bits set at the same time is invalid state.
> + *	Similarly _RUNNING and _RESUMING bits set is invalid state.
> + *
> + * pending bytes: (read only)
> + *      Number of pending bytes yet to be migrated from vendor driver
> + *
> + * data_offset: (read only)
> + *      User application should read data_offset in migration region from where
> + *      user application should read device data during _SAVING state or write
> + *      device data during _RESUMING state. See below for detail of sequence to
> + *      be followed.
> + *
> + * data_size: (read/write)
> + *      User application should read data_size to get size of data copied in
> + *      bytes in migration region during _SAVING state and write size of data
> + *      copied in bytes in migration region during _RESUMING state.
> + *
> + * Migration region looks like:
> + *  ------------------------------------------------------------------
> + * |vfio_device_migration_info|    data section                      |
> + * |                          |     ///////////////////////////////  |
> + * ------------------------------------------------------------------
> + *   ^                              ^
> + *  offset 0-trapped part        data_offset
> + *
> + * Structure vfio_device_migration_info is always followed by data section in
> + * the region, so data_offset will always be non-0. Offset from where data is
> + * copied is decided by kernel driver, data section can be trapped or mapped
> + * or partitioned, depending on how kernel driver defines data section.
> + * Data section partition can be defined as mapped by sparse mmap capability.
> + * If mmapped, then data_offset should be page aligned, where as initial section
> + * which contain vfio_device_migration_info structure might not end at offset
> + * which is page aligned.

"The user is not required to access via mmap regardless of the
region mmap capabilities."

> + * Vendor driver should decide whether to partition data section and how to
> + * partition the data section. Vendor driver should return data_offset
> + * accordingly.
> + *
> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> + * and for _SAVING device state or stop-and-copy phase:
> + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
> + * b. read data_offset, indicates kernel driver to write data to staging buffer.
> + *    Kernel driver should return this read operation only after writing data to
> + *    staging buffer is done.

"staging buffer" implies a vendor driver implementation, perhaps we
could just state that data is available from (region + data_offset) to
(region + data_offset + data_size) upon return of this read operation.

> + * c. read data_size, amount of data in bytes written by vendor driver in
> + *    migration region.
> + * d. read data_size bytes of data from data_offset in the migration region.
> + * e. process data.
> + * f. Loop through a to e. Next read on pending_bytes indicates that read data
> + *    operation from migration region for previous iteration is done.

I think this indicates that step (f) should be to read pending_bytes, the
read sequence is not complete until this step.  Optionally the user can
then proceed to step (b).  There are no read side-effects of (a) afaict.

Is the user required to reach pending_bytes == 0 before changing
device_state, particularly transitioning to !_RUNNING?  Presumably the
user can exit this sequence at any time by clearing _SAVING.
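
A minimal user-space sketch of one pass through steps (a) to (d) of the
save sequence, for illustration only. It assumes the region is reached
with pread() on the device fd at a base offset `region` (as reported by
VFIO_DEVICE_GET_REGION_INFO); the struct merely mirrors the layout
proposed in the quoted patch, and `read_field` and `save_one_chunk` are
hypothetical helper names, not part of any header.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Mirrors the layout proposed in the quoted patch; not a released header. */
struct migration_info {
	uint32_t device_state;
	uint32_t reserved;
	uint64_t pending_bytes;
	uint64_t data_offset;
	uint64_t data_size;
} __attribute__((packed));

/* Read one field at its native width; 0 on success, -1 on short read. */
int read_field(int fd, off_t off, void *val, size_t len)
{
	return pread(fd, val, len, off) == (ssize_t)len ? 0 : -1;
}

/*
 * One pass of steps a-d.  Returns the number of bytes copied into buf,
 * 0 when pending_bytes is 0, or -1 on I/O error or if the chunk does
 * not fit in cap bytes.
 */
ssize_t save_one_chunk(int fd, off_t region, void *buf, size_t cap)
{
	uint64_t pending, data_offset, data_size;

	/* a. nothing to do if the driver reports no pending data */
	if (read_field(fd, region + offsetof(struct migration_info,
					     pending_bytes),
		       &pending, sizeof(pending)))
		return -1;
	if (!pending)
		return 0;

	/* b. reading data_offset asks the driver to make data available */
	if (read_field(fd, region + offsetof(struct migration_info,
					     data_offset),
		       &data_offset, sizeof(data_offset)))
		return -1;

	/* c. how many bytes the driver made available */
	if (read_field(fd, region + offsetof(struct migration_info,
					     data_size),
		       &data_size, sizeof(data_size)))
		return -1;
	if (data_size > cap)
		return -1;

	/* d. copy the opaque device data out of the region */
	if (pread(fd, buf, (size_t)data_size, region + (off_t)data_offset) !=
	    (ssize_t)data_size)
		return -1;

	return (ssize_t)data_size;
}
```

A caller would repeat this until it returns 0, or stop early by clearing
_SAVING if that turns out to be permitted.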

> + *
> + * Sequence to be followed while _RESUMING device state:
> + * While data for this device is available, repeat below steps:
> + * a. read data_offset from where user application should write data.
> + * b. write data of data_size to migration region from data_offset.
> + * c. write data_size which indicates vendor driver that data is written in
> + *    staging buffer. Vendor driver should read this data from migration
> + *    region and resume device's state.

The device defaults to _RUNNING state, so a prerequisite is to set
_RESUMING and clear _RUNNING, right?

> + *
> + * For user application, data is opaque. User should write data in the same
> + * order as received.
> + */
> +
> +struct vfio_device_migration_info {
> +	__u32 device_state;         /* VFIO device state */
> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> +				     VFIO_DEVICE_STATE_SAVING |  \
> +				     VFIO_DEVICE_STATE_RESUMING)
> +
> +#define VFIO_DEVICE_STATE_INVALID_CASE1    (VFIO_DEVICE_STATE_SAVING | \
> +					    VFIO_DEVICE_STATE_RESUMING)
> +
> +#define VFIO_DEVICE_STATE_INVALID_CASE2    (VFIO_DEVICE_STATE_RUNNING | \
> +					    VFIO_DEVICE_STATE_RESUMING)

These seem difficult to use, maybe we just need a
VFIO_DEVICE_STATE_VALID macro?

#define VFIO_DEVICE_STATE_VALID(state) \
  (state & VFIO_DEVICE_STATE_RESUMING ? \
  (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
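
As a sanity check, the macro can be written as a function and run
against all eight rows of the state table. This is only a sketch: the
bit definitions are copied from the quoted patch, `state_valid` is a
hypothetical name, and like the macro it does not look at reserved
bits 3 - 31 unless _RESUMING is set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values as proposed in the quoted patch. */
#define VFIO_DEVICE_STATE_RUNNING   (1u << 0)
#define VFIO_DEVICE_STATE_SAVING    (1u << 1)
#define VFIO_DEVICE_STATE_RESUMING  (1u << 2)
#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
				     VFIO_DEVICE_STATE_SAVING |  \
				     VFIO_DEVICE_STATE_RESUMING)

/*
 * Function form of the suggested macro: once _RESUMING is set, no other
 * state bit may be set; every combination without _RESUMING is accepted.
 */
bool state_valid(uint32_t state)
{
	if (state & VFIO_DEVICE_STATE_RESUMING)
		return (state & VFIO_DEVICE_STATE_MASK) ==
		       VFIO_DEVICE_STATE_RESUMING;
	return true;
}
```

With these definitions, values 0 through 4 are accepted and 5, 6 and 7
are rejected, matching rows (b), (c) and (d) of the table.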

Thanks,
Alex

> +	__u32 reserved;
> +	__u64 pending_bytes;
> +	__u64 data_offset;
> +	__u64 data_size;
> +} __attribute__((packed));
> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within



* Re: [PATCH v9 Kernel 5/5] vfio iommu: Implementation of ioctl to get dirty bitmap before unmap
  2019-11-12 17:03 ` [PATCH v9 Kernel 5/5] vfio iommu: Implementation of ioctl to get dirty bitmap before unmap Kirti Wankhede
@ 2019-11-12 22:30   ` Alex Williamson
  0 siblings, 0 replies; 46+ messages in thread
From: Alex Williamson @ 2019-11-12 22:30 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Tue, 12 Nov 2019 22:33:40 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> If pages are pinned by external interface for requested IO virtual address
> range, bitmap of such pages is created and then that range is unmapped.
> To get bitmap during unmap, user should set flag
> VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP, bitmap memory should be allocated and
> bitmap_size should be set. If flag is not set, then it behaves same as
> VFIO_IOMMU_UNMAP_DMA ioctl.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 71 +++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 69 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index ac176e672857..d6b988452ba6 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -926,7 +926,8 @@ static int vfio_iova_get_dirty_bitmap(struct vfio_iommu *iommu,
>  }
>  
>  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> -			     struct vfio_iommu_type1_dma_unmap *unmap)
> +			     struct vfio_iommu_type1_dma_unmap *unmap,
> +			     unsigned long *bitmap)
>  {
>  	uint64_t mask;
>  	struct vfio_dma *dma, *dma_last = NULL;
> @@ -1026,6 +1027,12 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  						    &nb_unmap);
>  			goto again;
>  		}
> +
> +		if (bitmap) {
> +			vfio_iova_dirty_bitmap(iommu, dma->iova, dma->size,
> +					       unmap->iova, bitmap);
> +		}
> +
>  		unmapped += dma->size;
>  		vfio_remove_dma(iommu, dma);
>  	}
> @@ -1039,6 +1046,43 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  	return ret;
>  }
>  
> +static int vfio_dma_do_unmap_bitmap(struct vfio_iommu *iommu,
> +		struct vfio_iommu_type1_dma_unmap_bitmap *unmap_bitmap)
> +{
> +	struct vfio_iommu_type1_dma_unmap unmap;
> +	unsigned long *bitmap = NULL;
> +	int ret;
> +
> +	/* check bitmap size */
> +	if ((unmap_bitmap->flags | VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)) {

It's required to enforce other flag bits are zero or else we can never
guarantee we can use them in the future without breaking existing
userspace, but I'd really rather extend the existing ioctl.
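
Note that the quoted test uses `|`, which is non-zero for any flags
value, so the bitmap path is always taken. A sketch of the kind of check
meant here, written as a stand-alone function: `UNMAP_KNOWN_FLAGS` and
`check_unmap_flags` are hypothetical names used only for illustration.

```c
#include <errno.h>
#include <stdint.h>

#define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1u << 0)
/* Hypothetical: the set of flag bits this interface version understands. */
#define UNMAP_KNOWN_FLAGS VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP

/*
 * Reject any flag bit that is not defined yet, so the bit can be given
 * a meaning later without breaking callers that passed garbage.
 * On success, *want_bitmap says whether the dirty bitmap was requested.
 */
int check_unmap_flags(uint32_t flags, int *want_bitmap)
{
	if (flags & ~UNMAP_KNOWN_FLAGS)
		return -EINVAL;

	/* '&' tests the bit; '|' would be true for every input */
	*want_bitmap = !!(flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP);
	return 0;
}
```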

Should we provide any optimization to indicate to the user that dirty
bits were set?  Thanks,

Alex

> +		ret = verify_bitmap_size(unmap_bitmap->size >> PAGE_SHIFT,
> +					 unmap_bitmap->bitmap_size);
> +		if (ret)
> +			return ret;
> +
> +		/* one bit per page */
> +		bitmap = bitmap_zalloc(unmap_bitmap->size >> PAGE_SHIFT,
> +					GFP_KERNEL);
> +		if (!bitmap)
> +			return -ENOMEM;
> +	}
> +
> +	unmap.iova = unmap_bitmap->iova;
> +	unmap.size = unmap_bitmap->size;
> +	ret = vfio_dma_do_unmap(iommu, &unmap, bitmap);
> +	if (!ret)
> +		unmap_bitmap->size = unmap.size;
> +
> +	if (bitmap) {
> +		if (!ret && copy_to_user(unmap_bitmap->bitmap, bitmap,
> +					 unmap_bitmap->bitmap_size))
> +			ret = -EFAULT;
> +		bitmap_free(bitmap);
> +	}
> +
> +	return ret;
> +}
> +
>  static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
>  			  unsigned long pfn, long npage, int prot)
>  {
> @@ -2366,7 +2410,7 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  		if (unmap.argsz < minsz || unmap.flags)
>  			return -EINVAL;
>  
> -		ret = vfio_dma_do_unmap(iommu, &unmap);
> +		ret = vfio_dma_do_unmap(iommu, &unmap, NULL);
>  		if (ret)
>  			return ret;
>  
> @@ -2389,6 +2433,29 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  			return -EINVAL;
>  
>  		return vfio_iova_get_dirty_bitmap(iommu, &range);
> +	} else if (cmd == VFIO_IOMMU_UNMAP_DMA_GET_BITMAP) {
> +		struct vfio_iommu_type1_dma_unmap_bitmap unmap_bitmap;
> +		long ret;
> +
> +		/* Supported for v2 version only */
> +		if (!iommu->v2)
> +			return -EACCES;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_dma_unmap_bitmap,
> +				    bitmap);
> +
> +		if (copy_from_user(&unmap_bitmap, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (unmap_bitmap.argsz < minsz)
> +			return -EINVAL;
> +
> +		ret = vfio_dma_do_unmap_bitmap(iommu, &unmap_bitmap);
> +		if (ret)
> +			return ret;
> +
> +		return copy_to_user((void __user *)arg, &unmap_bitmap, minsz) ?
> +			-EFAULT : 0;
>  	}
>  
>  	return -ENOTTY;



* Re: [PATCH v9 Kernel 4/5] vfio iommu: Implementation of ioctl to get dirty pages bitmap.
  2019-11-12 17:03 ` [PATCH v9 Kernel 4/5] vfio iommu: Implementation of ioctl to get dirty pages bitmap Kirti Wankhede
@ 2019-11-12 22:30   ` Alex Williamson
  0 siblings, 0 replies; 46+ messages in thread
From: Alex Williamson @ 2019-11-12 22:30 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Tue, 12 Nov 2019 22:33:39 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> IOMMU container maintains list of external pinned pages. Bitmap of pinned
> pages for input IO virtual address range is created and returned.
> IO virtual address range should be from a single mapping created by
> map request. Input bitmap_size is validated by calculating the size of
> requested range.
> This ioctl returns bitmap of dirty pages, its user space application
> responsibility to copy content of dirty pages from source to destination
> during migration.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 92 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 92 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 2ada8e6cdb88..ac176e672857 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -850,6 +850,81 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
>  	return bitmap;
>  }
>  
> +/*
> + * start_iova is the reference from where bitmaping started. This is called
> + * from DMA_UNMAP where start_iova can be different than iova

Why not simply call this with a pointer to the bitmap relative to the
start of the iova?

> + */
> +
> +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> +				  size_t size, dma_addr_t start_iova,
> +				  unsigned long *bitmap)
> +{
> +	struct vfio_dma *dma;
> +	dma_addr_t temp_iova = iova;
> +
> +	dma = vfio_find_dma(iommu, iova, size);
> +	if (!dma)

The UAPI did not define that the user can only ask for the dirty bitmap
across a mapped range.

> +		return -EINVAL;
> +
> +	/*
> +	 * Range should be from a single mapping created by map request.
> +	 */

The UAPI also did not specify this as a requirement.

> +
> +	if ((iova < dma->iova) ||
> +	    ((dma->iova + dma->size) < (iova + size)))
> +		return -EINVAL;

Nor this.

So the actual implemented UAPI is that the user must call this over
some portion of, but not exceeding a single previously mapped DMA
range.  Why so restrictive?

> +
> +	while (temp_iova < iova + size) {
> +		struct vfio_pfn *vpfn = NULL;
> +
> +		vpfn = vfio_find_vpfn(dma, temp_iova);
> +		if (vpfn)
> +			__bitmap_set(bitmap, vpfn->iova - start_iova, 1);
> +
> +		temp_iova += PAGE_SIZE;

Seems like walking the rb tree would be far more efficient.  Also, if
dma->iommu_mapped, mark all pages dirty until we figure out how to
avoid it.

> +	}
> +
> +	return 0;
> +}
> +
> +static int verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
> +{
> +	unsigned long bsize = ALIGN(npages, BITS_PER_LONG) / 8;
> +
> +	if ((bitmap_size == 0) || (bitmap_size < bsize))
> +		return -EINVAL;
> +	return 0;
> +}
> +
> +static int vfio_iova_get_dirty_bitmap(struct vfio_iommu *iommu,
> +				struct vfio_iommu_type1_dirty_bitmap *range)
> +{
> +	unsigned long *bitmap;
> +	int ret;
> +
> +	ret = verify_bitmap_size(range->size >> PAGE_SHIFT, range->bitmap_size);
> +	if (ret)
> +		return ret;
> +
> +	/* one bit per page */
> +	bitmap = bitmap_zalloc(range->size >> PAGE_SHIFT, GFP_KERNEL);

This creates a DoS vector, we need to be able to directly use the user
bitmap or chunk words into it using a confined size (ex. a user can
call with args 0 to UINT64_MAX). Thanks,

Alex
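
One way the "chunk words into it using a confined size" idea could look,
sketched in user-space C rather than kernel code. Everything here is an
assumption made for illustration: `CHUNK_BITS` is an arbitrary bound,
the `page_dirty_fn` callback stands in for the pinned-page lookup, and
memcpy() stands in for copy_to_user(). The point is only that the
temporary buffer has a fixed size no matter how large a range is
requested.

```c
#include <stdint.h>
#include <string.h>

#define CHUNK_BITS  512u                /* hypothetical per-pass bound */
#define CHUNK_WORDS (CHUNK_BITS / 64u)

/* Stand-in for the pinned-page lookup: non-zero if page is dirty. */
typedef int (*page_dirty_fn)(uint64_t page, void *ctx);

/* Demo predicate: every third page is reported dirty. */
int every_third_page(uint64_t page, void *ctx)
{
	(void)ctx;
	return page % 3 == 0;
}

/*
 * Fill dst (one bit per page, 64-bit words) for npages pages using a
 * fixed-size scratch buffer instead of one allocation sized by the
 * caller-supplied range.
 */
void fill_bitmap_chunked(uint64_t npages, page_dirty_fn is_dirty,
			 void *ctx, uint64_t *dst)
{
	uint64_t chunk[CHUNK_WORDS];
	uint64_t base;

	for (base = 0; base < npages; base += CHUNK_BITS) {
		uint64_t left = npages - base;
		uint64_t n = left < CHUNK_BITS ? left : CHUNK_BITS;
		uint64_t words = (n + 63) / 64;
		uint64_t i;

		memset(chunk, 0, sizeof(chunk));
		for (i = 0; i < n; i++)
			if (is_dirty(base + i, ctx))
				chunk[i / 64] |= UINT64_C(1) << (i % 64);

		/* memcpy stands in for copy_to_user() */
		memcpy(dst + base / 64, chunk, words * sizeof(uint64_t));
	}
}
```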

> +	if (!bitmap)
> +		return -ENOMEM;
> +
> +	mutex_lock(&iommu->lock);
> +	ret = vfio_iova_dirty_bitmap(iommu, range->iova, range->size,
> +				     range->iova, bitmap);
> +	mutex_unlock(&iommu->lock);
> +
> +	if (!ret) {
> +		if (copy_to_user(range->bitmap, bitmap, range->bitmap_size))
> +			ret = -EFAULT;
> +	}
> +
> +	bitmap_free(bitmap);
> +	return ret;
> +}
> +
>  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  			     struct vfio_iommu_type1_dma_unmap *unmap)
>  {
> @@ -2297,6 +2372,23 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  
>  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>  			-EFAULT : 0;
> +	} else if (cmd == VFIO_IOMMU_GET_DIRTY_BITMAP) {
> +		struct vfio_iommu_type1_dirty_bitmap range;
> +
> +		/* Supported for v2 version only */
> +		if (!iommu->v2)
> +			return -EACCES;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
> +					bitmap);
> +
> +		if (copy_from_user(&range, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (range.argsz < minsz)
> +			return -EINVAL;
> +
> +		return vfio_iova_get_dirty_bitmap(iommu, &range);
>  	}
>  
>  	return -ENOTTY;



* Re: [PATCH v9 Kernel 3/5] vfio iommu: Add ioctl defination to unmap IOVA and return dirty bitmap
  2019-11-12 17:03 ` [PATCH v9 Kernel 3/5] vfio iommu: Add ioctl defination to unmap IOVA and return dirty bitmap Kirti Wankhede
@ 2019-11-12 22:30   ` Alex Williamson
  2019-11-13 19:52     ` Kirti Wankhede
  0 siblings, 1 reply; 46+ messages in thread
From: Alex Williamson @ 2019-11-12 22:30 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Tue, 12 Nov 2019 22:33:38 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> With vIOMMU, during pre-copy phase of migration, while CPUs are still
> running, IO virtual address unmap can happen while device still keeping
> reference of guest pfns. Those pages should be reported as dirty before
> unmap, so that VFIO user space application can copy content of those pages
> from source to destination.
> 
> IOCTL defination added here add bitmap pointer, size and flag. If flag

definition, adds

> VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP is set and bitmap memory is allocated
> and bitmap_size of set, then ioctl will create bitmap of pinned pages and

s/of/is/

> then unmap those.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  include/uapi/linux/vfio.h | 33 +++++++++++++++++++++++++++++++++
>  1 file changed, 33 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 6fd3822aa610..72fd297baf52 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -925,6 +925,39 @@ struct vfio_iommu_type1_dirty_bitmap {
>  
>  #define VFIO_IOMMU_GET_DIRTY_BITMAP             _IO(VFIO_TYPE, VFIO_BASE + 17)
>  
> +/**
> + * VFIO_IOMMU_UNMAP_DMA_GET_BITMAP - _IOWR(VFIO_TYPE, VFIO_BASE + 18,
> + *				      struct vfio_iommu_type1_dma_unmap_bitmap)
> + *
> + * Unmap IO virtual addresses using the provided struct
> + * vfio_iommu_type1_dma_unmap_bitmap.  Caller sets argsz.
> + * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get dirty bitmap
> + * before unmapping IO virtual addresses. If this flag is not set, only IO
> + * virtual address are unmapped without creating pinned pages bitmap, that
> + * is, behave same as VFIO_IOMMU_UNMAP_DMA ioctl.
> + * User should allocate memory to get bitmap and should set size of allocated
> + * memory in bitmap_size field. One bit in bitmap is used to represent per page
> + * consecutively starting from iova offset. Bit set indicates page at that
> + * offset from iova is dirty.
> + * The actual unmapped size is returned in the size field and bitmap of pages
> + * in the range of unmapped size is returned in bitmap if flag
> + * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP is set.
> + *
> + * No guarantee is made to the user that arbitrary unmaps of iova or size
> + * different from those used in the original mapping call will succeed.
> + */
> +struct vfio_iommu_type1_dma_unmap_bitmap {
> +	__u32        argsz;
> +	__u32        flags;
> +#define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
> +	__u64        iova;                        /* IO virtual address */
> +	__u64        size;                        /* Size of mapping (bytes) */
> +	__u64        bitmap_size;                 /* in bytes */
> +	void __user *bitmap;                      /* one bit per page */
> +};
> +
> +#define VFIO_IOMMU_UNMAP_DMA_GET_BITMAP _IO(VFIO_TYPE, VFIO_BASE + 18)
> +

Why not extend VFIO_IOMMU_UNMAP_DMA to support this rather than add an
ioctl that duplicates the functionality and extends it?  Otherwise
same comments as previous, in fact it's too bad we can't use this ioctl
for both, but a DONT_UNMAP flag on the UNMAP_DMA ioctl seems a bit
absurd.

I suspect we also want a flags bit in VFIO_IOMMU_GET_INFO to indicate
these capabilities are supported.

Maybe for both ioctls we also want to define it as the user's
responsibility to zero the bitmap, requiring the kernel to only set
bits as necessary.  Thanks,

Alex

>  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>  
>  /*



* Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.
  2019-11-12 17:03 ` [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap Kirti Wankhede
@ 2019-11-12 22:30   ` Alex Williamson
  2019-11-13 19:37     ` Kirti Wankhede
  0 siblings, 1 reply; 46+ messages in thread
From: Alex Williamson @ 2019-11-12 22:30 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Tue, 12 Nov 2019 22:33:37 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> All pages pinned by vendor driver through vfio_pin_pages API should be
> considered as dirty during migration. IOMMU container maintains a list of
> all such pinned pages. Added an ioctl defination to get bitmap of such

definition

> pinned pages for requested IO virtual address range.

Additionally, all mapped pages are considered dirty when physically
mapped through to an IOMMU, modulo we discussed devices opting in to
per page pinning to indicate finer granularity with a TBD mechanism to
figure out if any non-opt-in devices remain.

> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  include/uapi/linux/vfio.h | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 35b09427ad9f..6fd3822aa610 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -902,6 +902,29 @@ struct vfio_iommu_type1_dma_unmap {
>  #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
>  #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
>  
> +/**
> + * VFIO_IOMMU_GET_DIRTY_BITMAP - _IOWR(VFIO_TYPE, VFIO_BASE + 17,
> + *                                     struct vfio_iommu_type1_dirty_bitmap)
> + *
> + * IOCTL to get dirty pages bitmap for IOMMU container during migration.
> + * Get dirty pages bitmap of given IO virtual addresses range using
> + * struct vfio_iommu_type1_dirty_bitmap. Caller sets argsz, which is size of
> + * struct vfio_iommu_type1_dirty_bitmap. User should allocate memory to get
> + * bitmap and should set size of allocated memory in bitmap_size field.
> + * One bit is used to represent per page consecutively starting from iova
> + * offset. Bit set indicates page at that offset from iova is dirty.
> + */
> +struct vfio_iommu_type1_dirty_bitmap {
> +	__u32        argsz;
> +	__u32        flags;
> +	__u64        iova;                      /* IO virtual address */
> +	__u64        size;                      /* Size of iova range */
> +	__u64        bitmap_size;               /* in bytes */

This seems redundant.  We can calculate the size of the bitmap based on
the iova size.
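
For illustration, the calculation could look roughly like this, assuming
one bit per page and a bitmap made of 64-bit words. The page size is
passed in as a parameter because nothing in the proposed struct fixes
it; `dirty_bitmap_bytes` is a hypothetical helper name.

```c
#include <stdint.h>

/*
 * Bytes needed for a one-bit-per-page bitmap covering `size` bytes of
 * iova space, rounded up to whole 64-bit words.  Division plus
 * remainder is used instead of "add then divide" so that sizes near
 * UINT64_MAX do not overflow.
 */
uint64_t dirty_bitmap_bytes(uint64_t size, uint64_t pgsize)
{
	uint64_t npages, words;

	if (!pgsize)
		return 0;

	npages = size / pgsize + (size % pgsize != 0);
	words = npages / 64 + (npages % 64 != 0);
	return words * sizeof(uint64_t);
}
```

With 4 KiB pages, a 256 KiB range (64 pages) needs 8 bytes and a
260 KiB range (65 pages) needs 16.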

> +	void __user *bitmap;                    /* one bit per page */

Should we define that as a __u64* to (a) help with the size
calculation, and (b) assure that we can use 8-byte ops on it?

However, who defines page size?  Is it necessarily the processor page
size?  A physical IOMMU may support page sizes other than the CPU page
size.  It might be more important to indicate the expected page size
than the bitmap size.  Thanks,

Alex

> +};
> +
> +#define VFIO_IOMMU_GET_DIRTY_BITMAP             _IO(VFIO_TYPE, VFIO_BASE + 17)
> +
>  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>  
>  /*



* Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
  2019-11-12 22:30   ` Alex Williamson
@ 2019-11-13  3:23     ` Yan Zhao
  2019-11-13 19:02       ` Kirti Wankhede
  2019-11-13 10:24     ` Cornelia Huck
  1 sibling, 1 reply; 46+ messages in thread
From: Yan Zhao @ 2019-11-13  3:23 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng,
	Liu, Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

On Wed, Nov 13, 2019 at 06:30:05AM +0800, Alex Williamson wrote:
> On Tue, 12 Nov 2019 22:33:36 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > - Defined MIGRATION region type and sub-type.
> > - Used 3 bits to define VFIO device states.
> >     Bit 0 => _RUNNING
> >     Bit 1 => _SAVING
> >     Bit 2 => _RESUMING
> >     Combination of these bits defines VFIO device's state during migration
> >     _RUNNING => Normal VFIO device running state. When its reset, it
> > 		indicates _STOPPED state. when device is changed to
> > 		_STOPPED, driver should stop device before write()
> > 		returns.
> >     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
> >                           start saving state of device i.e. pre-copy state
> >     _SAVING  => vCPUs are stopped, VFIO device should be stopped, and
> 
> s/should/must/
> 
> >                 save device state,i.e. stop-n-copy state
> >     _RESUMING => VFIO device resuming state.
> >     _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states
> 
> A table might be useful here and in the uapi header to indicate valid
> states:
> 
> | _RESUMING | _SAVING | _RUNNING | Description
> +-----------+---------+----------+------------------------------------------
> |     0     |    0    |     0    | Stopped, not saving or resuming (a)
> +-----------+---------+----------+------------------------------------------
> |     0     |    0    |     1    | Running, default state
> +-----------+---------+----------+------------------------------------------
> |     0     |    1    |     0    | Stopped, migration interface in save mode
> +-----------+---------+----------+------------------------------------------
> |     0     |    1    |     1    | Running, save mode interface, iterative
> +-----------+---------+----------+------------------------------------------
> |     1     |    0    |     0    | Stopped, migration resume interface active
> +-----------+---------+----------+------------------------------------------
> |     1     |    0    |     1    | Invalid (b)
> +-----------+---------+----------+------------------------------------------
> |     1     |    1    |     0    | Invalid (c)
> +-----------+---------+----------+------------------------------------------
> |     1     |    1    |     1    | Invalid (d)
> 
> I think we need to consider whether we define (a) as generally
> available, for instance we might want to use it for diagnostics or a
> fatal error condition outside of migration.
> 
> Are there hidden assumptions between state transitions here or are
> there specific next possible state diagrams that we need to include as
> well?
> 
> I'm curious if Intel agrees with the states marked invalid with their
> push for post-copy support.
> 
Hi Alex and Kirti,
Actually, for postcopy, I think we need an extra POSTCOPY state anyway.
The reasons are as follows:
- On the target side, the _RESUMING state is set at the beginning of
  migration. It cannot also be used to tell the device that it is
  currently in postcopy and that device DMAs are to be trapped and
  pre-faulted.
  We also cannot use the state (_RESUMING + _RUNNING) as an indicator of
  postcopy, because before the device and VM are running on the target
  side, some device state is already loaded (e.g. page tables, pending
  workloads), and the target side can pre-fault pages during that period
  before the target VM is up.
- On the source side, after the device is stopped, postcopy needs to
  save device state only (as compared to device state + remaining dirty
  pages in precopy). The state (!_RUNNING + _SAVING) again cannot
  differentiate precopy from postcopy.

> >     Bits 3 - 31 are reserved for future use. User should perform
> >     read-modify-write operation on this field.
> > - Defined vfio_device_migration_info structure which will be placed at 0th
> >   offset of migration region to get/set VFIO device related information.
> >   Defined members of structure and usage on read/write access:
> >     * device_state: (read/write)
> >         To convey VFIO device state to be transitioned to. Only 3 bits are
> > 	used as of now, Bits 3 - 31 are reserved for future use.
> >     * pending bytes: (read only)
> >         To get pending bytes yet to be migrated for VFIO device.
> >     * data_offset: (read only)
> >         To get data offset in migration region from where data exist
> > 	during _SAVING and from where data should be written by user space
> > 	application during _RESUMING state.
> >     * data_size: (read/write)
> >         To get and set size in bytes of data copied in migration region
> > 	during _SAVING and _RESUMING state.
> > 
> > Migration region looks like:
> >  ------------------------------------------------------------------
> > |vfio_device_migration_info|    data section                      |
> > |                          |     ///////////////////////////////  |
> >  ------------------------------------------------------------------
> >  ^                              ^
> >  offset 0-trapped part        data_offset
> > 
> > Structure vfio_device_migration_info is always followed by data section
> > in the region, so data_offset will always be non-0. Offset from where data
> > to be copied is decided by kernel driver, data section can be trapped or
> > mapped depending on how kernel driver defines data section.
> > Data section partition can be defined as mapped by sparse mmap capability.
> > If mmapped, then data_offset should be page aligned, where as initial
> > section which contain vfio_device_migration_info structure might not end
> > at offset which is page aligned.
> > Vendor driver should decide whether to partition data section and how to
> > partition the data section. Vendor driver should return data_offset
> > accordingly.
> > 
> > For user application, data is opaque. User should write data in the same
> > order as received.
> > 
> > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > ---
> >  include/uapi/linux/vfio.h | 108 ++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 108 insertions(+)
> > 
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 9e843a147ead..35b09427ad9f 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
> >  #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
> >  #define VFIO_REGION_TYPE_GFX                    (1)
> >  #define VFIO_REGION_TYPE_CCW			(2)
> > +#define VFIO_REGION_TYPE_MIGRATION              (3)
> >  
> >  /* sub-types for VFIO_REGION_TYPE_PCI_* */
> >  
> > @@ -379,6 +380,113 @@ struct vfio_region_gfx_edid {
> >  /* sub-types for VFIO_REGION_TYPE_CCW */
> >  #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
> >  
> > +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
> > +#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
> > +
> > +/*
> > + * Structure vfio_device_migration_info is placed at 0th offset of
> > + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> > + * information. Field accesses from this structure are only supported at their
> > + * native width and alignment, otherwise the result is undefined and vendor
> > + * drivers should return an error.
> > + *
> > + * device_state: (read/write)
> > + *      To indicate vendor driver the state VFIO device should be transitioned
> > + *      to. If device state transition fails, write on this field return error.
> > + *      It consists of 3 bits:
> > + *      - If bit 0 set, indicates _RUNNING state. When its reset, that indicates
> 
> Let's use set/cleared or 1/0 to indicate bit values, 'reset' is somewhat
> ambiguous.
> 
> > + *        _STOPPED state. When device is changed to _STOPPED, driver should stop
> > + *        device before write() returns.
> > + *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
> > + *        should start gathering device state information which will be provided
> > + *        to VFIO user space application to save device's state.
> > + *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
> > + *        prepare to resume device, data provided through migration region
> > + *        should be used to resume device.
> > + *      Bits 3 - 31 are reserved for future use. User should perform
> > + *      read-modify-write operation on this field.
> > + *      _SAVING and _RESUMING bits set at the same time is invalid state.
> > + *	Similarly _RUNNING and _RESUMING bits set is invalid state.
> > + *
> > + * pending bytes: (read only)
> > + *      Number of pending bytes yet to be migrated from vendor driver
> > + *
> > + * data_offset: (read only)
> > + *      User application should read data_offset in migration region from where
> > + *      user application should read device data during _SAVING state or write
> > + *      device data during _RESUMING state. See below for detail of sequence to
> > + *      be followed.
> > + *
> > + * data_size: (read/write)
> > + *      User application should read data_size to get size of data copied in
> > + *      bytes in migration region during _SAVING state and write size of data
> > + *      copied in bytes in migration region during _RESUMING state.
> > + *
> > + * Migration region looks like:
> > + *  ------------------------------------------------------------------
> > + * |vfio_device_migration_info|    data section                      |
> > + * |                          |     ///////////////////////////////  |
> > + * ------------------------------------------------------------------
> > + *   ^                              ^
> > + *  offset 0-trapped part        data_offset
> > + *
> > + * Structure vfio_device_migration_info is always followed by data section in
> > + * the region, so data_offset will always be non-0. Offset from where data is
> > + * copied is decided by kernel driver, data section can be trapped or mapped
> > + * or partitioned, depending on how kernel driver defines data section.
> > + * Data section partition can be defined as mapped by sparse mmap capability.
> > + * If mmapped, then data_offset should be page aligned, where as initial section
> > + * which contain vfio_device_migration_info structure might not end at offset
> > + * which is page aligned.
> 
> "The user is not required to access via mmap regardless of the
> region mmap capabilities."
>
But once the user decides to access via mmap, it has to read data_size
bytes each time; otherwise the vendor driver has no idea how much data
the user has already read. Agree?

> > + * Vendor driver should decide whether to partition data section and how to
> > + * partition the data section. Vendor driver should return data_offset
> > + * accordingly.
> > + *
> > + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> > + * and for _SAVING device state or stop-and-copy phase:
> > + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
> > + * b. read data_offset, indicates kernel driver to write data to staging buffer.
> > + *    Kernel driver should return this read operation only after writing data to
> > + *    staging buffer is done.
May I know under what conditions data_offset changes from one iteration
of a-f to the next?

> 
> "staging buffer" implies a vendor driver implementation, perhaps we
> could just state that data is available from (region + data_offset) to
> (region + data_offset + data_size) upon return of this read operation.
> 
> > + * c. read data_size, amount of data in bytes written by vendor driver in
> > + *    migration region.
> > + * d. read data_size bytes of data from data_offset in the migration region.
> > + * e. process data.
> > + * f. Loop through a to e. Next read on pending_bytes indicates that read data
> > + *    operation from migration region for previous iteration is done.
> 
> I think this indicates that step (f) should be to read pending_bytes, the
> read sequence is not complete until this step.  Optionally the user can
> then proceed to step (b).  There are no read side-effects of (a) afaict.
> 
> Is the user required to reach pending_bytes == 0 before changing
> device_state, particularly transitioning to !_RUNNING?  Presumably the
> user can exit this sequence at any time by clearing _SAVING.
> 
> > + *
> > + * Sequence to be followed while _RESUMING device state:
> > + * While data for this device is available, repeat below steps:
> > + * a. read data_offset from where user application should write data.
Before proceeding to step b, the user needs to read data_size from the
vendor driver to determine the maximum length of data to write. I think
this is necessary when the source and target vendor drivers do not offer
data sections of the same size, e.g. the data section is 100M on the
source side but only 50M on the target side. Rather than simply failing,
looping and writing in chunks seems better.
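
A minimal sketch of the loop-and-write idea, assuming the target reports
a maximum chunk length (max_chunk below) before each write. Everything
here is hypothetical: the chunk_writer callback stands in for "write the
chunk at data_offset, then write data_size", and count_chunk() is only a
stand-in that records what it was given:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: push one chunk to the device, 0 on success. */
typedef int (*chunk_writer)(const uint8_t *chunk, size_t len, void *opaque);

/* Feed len bytes to the target in pieces of at most max_chunk bytes. */
static int resume_in_chunks(const uint8_t *buf, size_t len, size_t max_chunk,
			    chunk_writer write_chunk, void *opaque)
{
	size_t done = 0;

	assert(max_chunk > 0);

	while (done < len) {
		size_t n = len - done;
		int ret;

		if (n > max_chunk)
			n = max_chunk;

		ret = write_chunk(buf + done, n, opaque);
		if (ret)
			return ret;	/* stop at the first failing chunk */

		done += n;
	}

	return 0;
}

/* Example writer that only counts chunks and bytes. */
struct chunk_stats {
	size_t chunks;
	size_t bytes;
};

static int count_chunk(const uint8_t *chunk, size_t len, void *opaque)
{
	struct chunk_stats *st = opaque;

	(void)chunk;
	st->chunks++;
	st->bytes += len;
	return 0;
}
```

Scaled down to the example above, 100 bytes of source data and a
50-byte target data section would be delivered as two chunks.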

Thanks
Yan
> > + * b. write data of data_size to migration region from data_offset.
> > + * c. write data_size which indicates vendor driver that data is written in
> > + *    staging buffer. Vendor driver should read this data from migration
> > + *    region and resume device's state.
> 
> The device defaults to _RUNNING state, so a prerequisite is to set
> _RESUMING and clear _RUNNING, right?
> 
> > + *
> > + * For user application, data is opaque. User should write data in the same
> > + * order as received.
> > + */
> > +
> > +struct vfio_device_migration_info {
> > +	__u32 device_state;         /* VFIO device state */
> > +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> > +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> > +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> > +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> > +				     VFIO_DEVICE_STATE_SAVING |  \
> > +				     VFIO_DEVICE_STATE_RESUMING)
> > +
> > +#define VFIO_DEVICE_STATE_INVALID_CASE1    (VFIO_DEVICE_STATE_SAVING | \
> > +					    VFIO_DEVICE_STATE_RESUMING)
> > +
> > +#define VFIO_DEVICE_STATE_INVALID_CASE2    (VFIO_DEVICE_STATE_RUNNING | \
> > +					    VFIO_DEVICE_STATE_RESUMING)
> 
> These seem difficult to use, maybe we just need a
> VFIO_DEVICE_STATE_VALID macro?
> 
> #define VFIO_DEVICE_STATE_VALID(state) \
>   (state & VFIO_DEVICE_STATE_RESUMING ? \
>   (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
> 
> Thanks,
> Alex
> 
> > +	__u32 reserved;
> > +	__u64 pending_bytes;
> > +	__u64 data_offset;
> > +	__u64 data_size;
> > +} __attribute__((packed));
> > +
> >  /*
> >   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
> >   * which allows direct access to non-MSIX registers which happened to be within
> 


* Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
  2019-11-12 22:30   ` Alex Williamson
  2019-11-13  3:23     ` Yan Zhao
@ 2019-11-13 10:24     ` Cornelia Huck
  2019-11-13 18:27       ` Alex Williamson
  1 sibling, 1 reply; 46+ messages in thread
From: Cornelia Huck @ 2019-11-13 10:24 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, cjia, kevin.tian, ziye.yang, changpeng.liu,
	yi.l.liu, mlevitsk, eskultet, dgilbert, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	zhi.a.wang, yan.y.zhao, qemu-devel, kvm

On Tue, 12 Nov 2019 15:30:05 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue, 12 Nov 2019 22:33:36 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > - Defined MIGRATION region type and sub-type.
> > - Used 3 bits to define VFIO device states.
> >     Bit 0 => _RUNNING
> >     Bit 1 => _SAVING
> >     Bit 2 => _RESUMING
> >     Combination of these bits defines VFIO device's state during migration
> >     _RUNNING => Normal VFIO device running state. When its reset, it
> > 		indicates _STOPPED state. when device is changed to
> > 		_STOPPED, driver should stop device before write()
> > 		returns.
> >     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
> >                           start saving state of device i.e. pre-copy state
> >     _SAVING  => vCPUs are stopped, VFIO device should be stopped, and  
> 
> s/should/must/
> 
> >                 save device state,i.e. stop-n-copy state
> >     _RESUMING => VFIO device resuming state.
> >     _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states  
> 
> A table might be useful here and in the uapi header to indicate valid
> states:

I like that.

> 
> | _RESUMING | _SAVING | _RUNNING | Description
> +-----------+---------+----------+------------------------------------------
> |     0     |    0    |     0    | Stopped, not saving or resuming (a)
> +-----------+---------+----------+------------------------------------------
> |     0     |    0    |     1    | Running, default state
> +-----------+---------+----------+------------------------------------------
> |     0     |    1    |     0    | Stopped, migration interface in save mode
> +-----------+---------+----------+------------------------------------------
> |     0     |    1    |     1    | Running, save mode interface, iterative
> +-----------+---------+----------+------------------------------------------
> |     1     |    0    |     0    | Stopped, migration resume interface active
> +-----------+---------+----------+------------------------------------------
> |     1     |    0    |     1    | Invalid (b)
> +-----------+---------+----------+------------------------------------------
> |     1     |    1    |     0    | Invalid (c)
> +-----------+---------+----------+------------------------------------------
> |     1     |    1    |     1    | Invalid (d)
> 
> I think we need to consider whether we define (a) as generally
> available, for instance we might want to use it for diagnostics or a
> fatal error condition outside of migration.
> 
> Are there hidden assumptions between state transitions here or are
> there specific next possible state diagrams that we need to include as
> well?

Some kind of state-change diagram might be useful in addition to the
textual description anyway. Let me try, just to make sure I understand
this correctly:

1) 0/0/1 ---(trigger driver to start gathering state info)---> 0/1/1
2) 0/0/1 ---(tell driver to stop)---> 0/0/0
3) 0/1/1 ---(tell driver to stop)---> 0/1/0
4) 0/0/1 ---(tell driver to resume with provided info)---> 1/0/0
5) 1/0/0 ---(driver is ready)---> 0/0/1
6) 0/1/1 ---(tell driver to stop saving)---> 0/0/1

Not sure about the usefulness of 2). Also, is 4) the only way to
trigger resuming? And is the change in 5) performed by the driver, or
by userspace?

Are any other state transitions valid?
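
To make the list above easier to check, here is a small sketch that
encodes transitions 1) to 6) as a table (read as
_RESUMING/_SAVING/_RUNNING). The bit values mirror the ones proposed in
the patch; struct transition and transition_listed() are made-up names,
and the table only reflects my reading above, not an agreed state
machine:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit values as proposed in the patch under review. */
#define VFIO_DEVICE_STATE_RUNNING   (1u << 0)
#define VFIO_DEVICE_STATE_SAVING    (1u << 1)
#define VFIO_DEVICE_STATE_RESUMING  (1u << 2)
#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
				     VFIO_DEVICE_STATE_SAVING |  \
				     VFIO_DEVICE_STATE_RESUMING)

struct transition {
	uint32_t from;
	uint32_t to;
};

/* Transitions 1) to 6) from the list above. */
static const struct transition listed[] = {
	/* 1) 0/0/1 -> 0/1/1 */
	{ VFIO_DEVICE_STATE_RUNNING,
	  VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING },
	/* 2) 0/0/1 -> 0/0/0 */
	{ VFIO_DEVICE_STATE_RUNNING, 0 },
	/* 3) 0/1/1 -> 0/1/0 */
	{ VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING,
	  VFIO_DEVICE_STATE_SAVING },
	/* 4) 0/0/1 -> 1/0/0 */
	{ VFIO_DEVICE_STATE_RUNNING, VFIO_DEVICE_STATE_RESUMING },
	/* 5) 1/0/0 -> 0/0/1 */
	{ VFIO_DEVICE_STATE_RESUMING, VFIO_DEVICE_STATE_RUNNING },
	/* 6) 0/1/1 -> 0/0/1 */
	{ VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING,
	  VFIO_DEVICE_STATE_RUNNING },
};

/* True if (from -> to) is one of the transitions listed above. */
static bool transition_listed(uint32_t from, uint32_t to)
{
	size_t i;

	/* reserved bits 3 - 31 are expected to be clear in this sketch */
	assert(((from | to) & ~VFIO_DEVICE_STATE_MASK) == 0);

	for (i = 0; i < sizeof(listed) / sizeof(listed[0]); i++) {
		if (listed[i].from == from && listed[i].to == to)
			return true;
	}

	return false;
}
```

Anything not in the table, such as 0/0/1 -> 0/1/0 or 0/0/0 -> 1/0/0,
is simply reported as "not listed", which is exactly the open question.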

(...)

> > + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> > + * and for _SAVING device state or stop-and-copy phase:
> > + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
> > + * b. read data_offset, indicates kernel driver to write data to staging buffer.
> > + *    Kernel driver should return this read operation only after writing data to
> > + *    staging buffer is done.  
> 
> "staging buffer" implies a vendor driver implementation, perhaps we
> could just state that data is available from (region + data_offset) to
> (region + data_offset + data_size) upon return of this read operation.
> 
> > + * c. read data_size, amount of data in bytes written by vendor driver in
> > + *    migration region.
> > + * d. read data_size bytes of data from data_offset in the migration region.
> > + * e. process data.
> > + * f. Loop through a to e. Next read on pending_bytes indicates that read data
> > + *    operation from migration region for previous iteration is done.  
> 
> I think this indicates that step (f) should be to read pending_bytes, the
> read sequence is not complete until this step.  Optionally the user can
> then proceed to step (b).  There are no read side-effects of (a) afaict.
> 
> Is the user required to reach pending_bytes == 0 before changing
> device_state, particularly transitioning to !_RUNNING?  Presumably the
> user can exit this sequence at any time by clearing _SAVING.

That would be transition 6) above (abort saving and continue). I think
it makes sense not to forbid this.

> 
> > + *
> > + * Sequence to be followed while _RESUMING device state:
> > + * While data for this device is available, repeat below steps:
> > + * a. read data_offset from where user application should write data.
> > + * b. write data of data_size to migration region from data_offset.
> > + * c. write data_size which indicates vendor driver that data is written in
> > + *    staging buffer. Vendor driver should read this data from migration
> > + *    region and resume device's state.  
> 
> The device defaults to _RUNNING state, so a prerequisite is to set
> _RESUMING and clear _RUNNING, right?

Transition 4) above. Do we need
7) 0/0/0 ---(tell driver to resume with provided info)---> 1/0/0
as well? (Probably depends on how sensible the 0/0/0 state is.)

> 
> > + *
> > + * For user application, data is opaque. User should write data in the same
> > + * order as received.
> > + */



* Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
  2019-11-13 10:24     ` Cornelia Huck
@ 2019-11-13 18:27       ` Alex Williamson
  2019-11-13 19:29         ` Kirti Wankhede
  0 siblings, 1 reply; 46+ messages in thread
From: Alex Williamson @ 2019-11-13 18:27 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Kirti Wankhede, cjia, kevin.tian, ziye.yang, changpeng.liu,
	yi.l.liu, mlevitsk, eskultet, dgilbert, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	zhi.a.wang, yan.y.zhao, qemu-devel, kvm

On Wed, 13 Nov 2019 11:24:17 +0100
Cornelia Huck <cohuck@redhat.com> wrote:

> On Tue, 12 Nov 2019 15:30:05 -0700
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Tue, 12 Nov 2019 22:33:36 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> > > - Defined MIGRATION region type and sub-type.
> > > - Used 3 bits to define VFIO device states.
> > >     Bit 0 => _RUNNING
> > >     Bit 1 => _SAVING
> > >     Bit 2 => _RESUMING
> > >     Combination of these bits defines VFIO device's state during migration
> > >     _RUNNING => Normal VFIO device running state. When its reset, it
> > > 		indicates _STOPPED state. when device is changed to
> > > 		_STOPPED, driver should stop device before write()
> > > 		returns.
> > >     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
> > >                           start saving state of device i.e. pre-copy state
> > >     _SAVING  => vCPUs are stopped, VFIO device should be stopped, and    
> > 
> > s/should/must/
> >   
> > >                 save device state,i.e. stop-n-copy state
> > >     _RESUMING => VFIO device resuming state.
> > >     _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states    
> > 
> > A table might be useful here and in the uapi header to indicate valid
> > states:  
> 
> I like that.
> 
> > 
> > | _RESUMING | _SAVING | _RUNNING | Description
> > +-----------+---------+----------+------------------------------------------
> > |     0     |    0    |     0    | Stopped, not saving or resuming (a)
> > +-----------+---------+----------+------------------------------------------
> > |     0     |    0    |     1    | Running, default state
> > +-----------+---------+----------+------------------------------------------
> > |     0     |    1    |     0    | Stopped, migration interface in save mode
> > +-----------+---------+----------+------------------------------------------
> > |     0     |    1    |     1    | Running, save mode interface, iterative
> > +-----------+---------+----------+------------------------------------------
> > |     1     |    0    |     0    | Stopped, migration resume interface active
> > +-----------+---------+----------+------------------------------------------
> > |     1     |    0    |     1    | Invalid (b)
> > +-----------+---------+----------+------------------------------------------
> > |     1     |    1    |     0    | Invalid (c)
> > +-----------+---------+----------+------------------------------------------
> > |     1     |    1    |     1    | Invalid (d)
> > 
> > I think we need to consider whether we define (a) as generally
> > available, for instance we might want to use it for diagnostics or a
> > fatal error condition outside of migration.
> > 
> > Are there hidden assumptions between state transitions here or are
> > there specific next possible state diagrams that we need to include as
> > well?  
> 
> Some kind of state-change diagram might be useful in addition to the
> textual description anyway. Let me try, just to make sure I understand
> this correctly:
> 
> 1) 0/0/1 ---(trigger driver to start gathering state info)---> 0/1/1
> 2) 0/0/1 ---(tell driver to stop)---> 0/0/0
> 3) 0/1/1 ---(tell driver to stop)---> 0/1/0
> 4) 0/0/1 ---(tell driver to resume with provided info)---> 1/0/0

I think this is to switch into resuming mode, the data will follow

> 5) 1/0/0 ---(driver is ready)---> 0/0/1
> 6) 0/1/1 ---(tell driver to stop saving)---> 0/0/1

I think also:

0/0/1 --> 0/1/0 If user chooses to go directly to stop and copy

0/0/0 and 0/0/1 should be reachable from any state, though I could see
that a vendor driver could fail transition from 1/0/0 -> 0/0/1 if the
received state is incomplete.  Somehow though a user always needs to
return the device to the initial state, so how does device_state
interact with the reset ioctl?  Would this automatically manipulate
device_state back to 0/0/1?
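
Roughly, and only as a sketch of the two additions above (it does not
include transitions 1) to 6)), the rule could be expressed as follows.
state_valid() follows the logic of the VFIO_DEVICE_STATE_VALID macro
suggested earlier in the thread; user_may_request() is a hypothetical
name, and a vendor driver could still fail a transition it accepts, for
instance 1/0/0 -> 0/0/1 with incomplete data:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit values as proposed in the patch under review. */
#define VFIO_DEVICE_STATE_RUNNING   (1u << 0)
#define VFIO_DEVICE_STATE_SAVING    (1u << 1)
#define VFIO_DEVICE_STATE_RESUMING  (1u << 2)
#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
				     VFIO_DEVICE_STATE_SAVING |  \
				     VFIO_DEVICE_STATE_RESUMING)

/* _RESUMING is only valid on its own; every other combination is valid. */
static bool state_valid(uint32_t state)
{
	if (state & VFIO_DEVICE_STATE_RESUMING)
		return (state & VFIO_DEVICE_STATE_MASK) ==
		       VFIO_DEVICE_STATE_RESUMING;

	return true;
}

/* Hypothetical check covering only the two rules added above. */
static bool user_may_request(uint32_t from, uint32_t to)
{
	/* reserved bits 3 - 31 are expected to be clear in this sketch */
	assert(((from | to) & ~VFIO_DEVICE_STATE_MASK) == 0);

	if (!state_valid(from) || !state_valid(to))
		return false;

	/* 0/0/0 and 0/0/1 are reachable from any valid state */
	if (to == 0 || to == VFIO_DEVICE_STATE_RUNNING)
		return true;

	/* 0/0/1 -> 0/1/0: go directly to stop and copy */
	return from == VFIO_DEVICE_STATE_RUNNING &&
	       to == VFIO_DEVICE_STATE_SAVING;
}
```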
 
> Not sure about the usefulness of 2). Also, is 4) the only way to
> trigger resuming? And is the change in 5) performed by the driver, or
> by userspace?
> 
> Are any other state transitions valid?
> 
> (...)
> 
> > > + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> > > + * and for _SAVING device state or stop-and-copy phase:
> > > + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
> > > + * b. read data_offset, indicates kernel driver to write data to staging buffer.
> > > + *    Kernel driver should return this read operation only after writing data to
> > > + *    staging buffer is done.    
> > 
> > "staging buffer" implies a vendor driver implementation, perhaps we
> > could just state that data is available from (region + data_offset) to
> > (region + data_offset + data_size) upon return of this read operation.
> >   
> > > + * c. read data_size, amount of data in bytes written by vendor driver in
> > > + *    migration region.
> > > + * d. read data_size bytes of data from data_offset in the migration region.
> > > + * e. process data.
> > > + * f. Loop through a to e. Next read on pending_bytes indicates that read data
> > > + *    operation from migration region for previous iteration is done.    
> > 
> > I think this indicates that step (f) should be to read pending_bytes, the
> > read sequence is not complete until this step.  Optionally the user can
> > then proceed to step (b).  There are no read side-effects of (a) afaict.
> > 
> > Is the user required to reach pending_bytes == 0 before changing
> > device_state, particularly transitioning to !_RUNNING?  Presumably the
> > user can exit this sequence at any time by clearing _SAVING.  
> 
> That would be transition 6) above (abort saving and continue). I think
> it makes sense not to forbid this.
> 
> >   
> > > + *
> > > + * Sequence to be followed while _RESUMING device state:
> > > + * While data for this device is available, repeat below steps:
> > > + * a. read data_offset from where user application should write data.
> > > + * b. write data of data_size to migration region from data_offset.
> > > + * c. write data_size which indicates vendor driver that data is written in
> > > + *    staging buffer. Vendor driver should read this data from migration
> > > + *    region and resume device's state.    
> > 
> > The device defaults to _RUNNING state, so a prerequisite is to set
> > _RESUMING and clear _RUNNING, right?  
> 
> Transition 4) above. Do we need
> 7) 0/0/0 ---(tell driver to resume with provided info)---> 1/0/0
> as well? (Probably depends on how sensible the 0/0/0 state is.)

I think we must unless we require the user to transition from 0/0/1 to
1/0/0 in a single operation, but I'd prefer to make 0/0/0 generally
available.  Thanks,

Alex



* Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
  2019-11-13  3:23     ` Yan Zhao
@ 2019-11-13 19:02       ` Kirti Wankhede
  2019-11-14  0:36         ` Yan Zhao
  0 siblings, 1 reply; 46+ messages in thread
From: Kirti Wankhede @ 2019-11-13 19:02 UTC (permalink / raw)
  To: Yan Zhao, Alex Williamson
  Cc: cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng, Liu, Yi L,
	mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, Wang,
	Zhi A, qemu-devel, kvm



On 11/13/2019 8:53 AM, Yan Zhao wrote:
> On Wed, Nov 13, 2019 at 06:30:05AM +0800, Alex Williamson wrote:
>> On Tue, 12 Nov 2019 22:33:36 +0530
>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>
>>> - Defined MIGRATION region type and sub-type.
>>> - Used 3 bits to define VFIO device states.
>>>      Bit 0 => _RUNNING
>>>      Bit 1 => _SAVING
>>>      Bit 2 => _RESUMING
>>>      Combination of these bits defines VFIO device's state during migration
>>>      _RUNNING => Normal VFIO device running state. When its reset, it
>>> 		indicates _STOPPED state. when device is changed to
>>> 		_STOPPED, driver should stop device before write()
>>> 		returns.
>>>      _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
>>>                            start saving state of device i.e. pre-copy state
>>>      _SAVING  => vCPUs are stopped, VFIO device should be stopped, and
>>
>> s/should/must/
>>
>>>                  save device state,i.e. stop-n-copy state
>>>      _RESUMING => VFIO device resuming state.
>>>      _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states
>>
>> A table might be useful here and in the uapi header to indicate valid
>> states:
>>
>> | _RESUMING | _SAVING | _RUNNING | Description
>> +-----------+---------+----------+------------------------------------------
>> |     0     |    0    |     0    | Stopped, not saving or resuming (a)
>> +-----------+---------+----------+------------------------------------------
>> |     0     |    0    |     1    | Running, default state
>> +-----------+---------+----------+------------------------------------------
>> |     0     |    1    |     0    | Stopped, migration interface in save mode
>> +-----------+---------+----------+------------------------------------------
>> |     0     |    1    |     1    | Running, save mode interface, iterative
>> +-----------+---------+----------+------------------------------------------
>> |     1     |    0    |     0    | Stopped, migration resume interface active
>> +-----------+---------+----------+------------------------------------------
>> |     1     |    0    |     1    | Invalid (b)
>> +-----------+---------+----------+------------------------------------------
>> |     1     |    1    |     0    | Invalid (c)
>> +-----------+---------+----------+------------------------------------------
>> |     1     |    1    |     1    | Invalid (d)
>>
>> I think we need to consider whether we define (a) as generally
>> available, for instance we might want to use it for diagnostics or a
>> fatal error condition outside of migration.
>>

We have to set it as the initial state. I'll add this.

>> Are there hidden assumptions between state transitions here or are
>> there specific next possible state diagrams that we need to include as
>> well?
>>
>> I'm curious if Intel agrees with the states marked invalid with their
>> push for post-copy support.
>>
> hi Alex and Kirti,
> Actually, for postcopy, I think we anyway need an extra POSTCOPY state
> introduced. Reasons as below:
> - in the target side, _RESUMING state is set in the beginning of
>    migration. It cannot be used as a state to inform device of that
>    currently it's in postcopy state and device DMAs are to be trapped and
>    pre-faulted.
>    we also cannot use state (_RESUMING + _RUNNING) as an indicator of
>    postcopy, because before device & vm running in target side, some device
>    state are already loaded (e.g. page tables, pending workloads),
>    target side can do pre-pagefault at that period before target vm up.
> - in the source side, after device is stopped, postcopy needs saving
>    device state only (as compared to device state + remaining dirty
>    pages in precopy). state (!_RUNNING + _SAVING) here again cannot
>    differentiate precopy and postcopy here.
> 
>>>      Bits 3 - 31 are reserved for future use. User should perform
>>>      read-modify-write operation on this field.
>>> - Defined vfio_device_migration_info structure which will be placed at 0th
>>>    offset of migration region to get/set VFIO device related information.
>>>    Defined members of structure and usage on read/write access:
>>>      * device_state: (read/write)
>>>          To convey VFIO device state to be transitioned to. Only 3 bits are
>>> 	used as of now, Bits 3 - 31 are reserved for future use.
>>>      * pending bytes: (read only)
>>>          To get pending bytes yet to be migrated for VFIO device.
>>>      * data_offset: (read only)
>>>          To get data offset in migration region from where data exist
>>> 	during _SAVING and from where data should be written by user space
>>> 	application during _RESUMING state.
>>>      * data_size: (read/write)
>>>          To get and set size in bytes of data copied in migration region
>>> 	during _SAVING and _RESUMING state.
>>>
>>> Migration region looks like:
>>>   ------------------------------------------------------------------
>>> |vfio_device_migration_info|    data section                      |
>>> |                          |     ///////////////////////////////  |
>>>   ------------------------------------------------------------------
>>>   ^                              ^
>>>   offset 0-trapped part        data_offset
>>>
>>> Structure vfio_device_migration_info is always followed by data section
>>> in the region, so data_offset will always be non-0. Offset from where data
>>> to be copied is decided by kernel driver, data section can be trapped or
>>> mapped depending on how kernel driver defines data section.
>>> Data section partition can be defined as mapped by sparse mmap capability.
>>> If mmapped, then data_offset should be page aligned, where as initial
>>> section which contain vfio_device_migration_info structure might not end
>>> at offset which is page aligned.
>>> Vendor driver should decide whether to partition data section and how to
>>> partition the data section. Vendor driver should return data_offset
>>> accordingly.
>>>
>>> For user application, data is opaque. User should write data in the same
>>> order as received.
>>>
>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>> ---
>>>   include/uapi/linux/vfio.h | 108 ++++++++++++++++++++++++++++++++++++++++++++++
>>>   1 file changed, 108 insertions(+)
>>>
>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>> index 9e843a147ead..35b09427ad9f 100644
>>> --- a/include/uapi/linux/vfio.h
>>> +++ b/include/uapi/linux/vfio.h
>>> @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
>>>   #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
>>>   #define VFIO_REGION_TYPE_GFX                    (1)
>>>   #define VFIO_REGION_TYPE_CCW			(2)
>>> +#define VFIO_REGION_TYPE_MIGRATION              (3)
>>>   
>>>   /* sub-types for VFIO_REGION_TYPE_PCI_* */
>>>   
>>> @@ -379,6 +380,113 @@ struct vfio_region_gfx_edid {
>>>   /* sub-types for VFIO_REGION_TYPE_CCW */
>>>   #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
>>>   
>>> +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
>>> +#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
>>> +
>>> +/*
>>> + * Structure vfio_device_migration_info is placed at 0th offset of
>>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
>>> + * information. Field accesses from this structure are only supported at their
>>> + * native width and alignment, otherwise the result is undefined and vendor
>>> + * drivers should return an error.
>>> + *
>>> + * device_state: (read/write)
>>> + *      To indicate vendor driver the state VFIO device should be transitioned
>>> + *      to. If device state transition fails, write on this field return error.
>>> + *      It consists of 3 bits:
>>> + *      - If bit 0 set, indicates _RUNNING state. When its reset, that indicates
>>
>> Let's use set/cleared or 1/0 to indicate bit values, 'reset' is somewhat
>> ambiguous.

Ok. Updating it.

>>
>>> + *        _STOPPED state. When device is changed to _STOPPED, driver should stop
>>> + *        device before write() returns.
>>> + *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
>>> + *        should start gathering device state information which will be provided
>>> + *        to VFIO user space application to save device's state.
>>> + *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
>>> + *        prepare to resume device, data provided through migration region
>>> + *        should be used to resume device.
>>> + *      Bits 3 - 31 are reserved for future use. User should perform
>>> + *      read-modify-write operation on this field.
>>> + *      _SAVING and _RESUMING bits set at the same time is invalid state.
>>> + *	Similarly _RUNNING and _RESUMING bits set is invalid state.
>>> + *
>>> + * pending bytes: (read only)
>>> + *      Number of pending bytes yet to be migrated from vendor driver
>>> + *
>>> + * data_offset: (read only)
>>> + *      User application should read data_offset in migration region from where
>>> + *      user application should read device data during _SAVING state or write
>>> + *      device data during _RESUMING state. See below for detail of sequence to
>>> + *      be followed.
>>> + *
>>> + * data_size: (read/write)
>>> + *      User application should read data_size to get size of data copied in
>>> + *      bytes in migration region during _SAVING state and write size of data
>>> + *      copied in bytes in migration region during _RESUMING state.
>>> + *
>>> + * Migration region looks like:
>>> + *  ------------------------------------------------------------------
>>> + * |vfio_device_migration_info|    data section                      |
>>> + * |                          |     ///////////////////////////////  |
>>> + * ------------------------------------------------------------------
>>> + *   ^                              ^
>>> + *  offset 0-trapped part        data_offset
>>> + *
>>> + * Structure vfio_device_migration_info is always followed by data section in
>>> + * the region, so data_offset will always be non-0. Offset from where data is
>>> + * copied is decided by kernel driver, data section can be trapped or mapped
>>> + * or partitioned, depending on how kernel driver defines data section.
>>> + * Data section partition can be defined as mapped by sparse mmap capability.
>>> + * If mmapped, then data_offset should be page aligned, where as initial section
>>> + * which contain vfio_device_migration_info structure might not end at offset
>>> + * which is page aligned.
>>
>> "The user is not required to access via mmap regardless of the
>> region mmap capabilities."
>>
> But once the user decides to access via mmap, it has to read data of
> data_size each time, otherwise the vendor driver has no idea of how many
> data are already read from user. Agree?
> 

that's right.

>>> + * Vendor driver should decide whether to partition data section and how to
>>> + * partition the data section. Vendor driver should return data_offset
>>> + * accordingly.
>>> + *
>>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
>>> + * and for _SAVING device state or stop-and-copy phase:
>>> + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
>>> + * b. read data_offset, indicates kernel driver to write data to staging buffer.
>>> + *    Kernel driver should return this read operation only after writing data to
>>> + *    staging buffer is done.
> May I know under what condition this data_offset will be changed per
> each iteration from a-f ?
> 

It's up to the vendor driver; data_offset can change if the vendor driver
maintains multiple partitions in the data section.

>>
>> "staging buffer" implies a vendor driver implementation, perhaps we
>> could just state that data is available from (region + data_offset) to
>> (region + data_offset + data_size) upon return of this read operation.
>>

Makes sense. Updating it.

>>> + * c. read data_size, amount of data in bytes written by vendor driver in
>>> + *    migration region.
>>> + * d. read data_size bytes of data from data_offset in the migration region.
>>> + * e. process data.
>>> + * f. Loop through a to e. Next read on pending_bytes indicates that read data
>>> + *    operation from migration region for previous iteration is done.
>>
>> I think this indicate that step (f) should be to read pending_bytes, the
>> read sequence is not complete until this step.  Optionally the user can
>> then proceed to step (b).  There are no read side-effects of (a) afaict.
>>

I tried to reword this sequence to be more specific:

* Sequence to be followed for _SAVING|_RUNNING device state or pre-copy 
* phase and for _SAVING device state or stop-and-copy phase:
* a. read pending_bytes, indicates start of new iteration to get device 
*    data. If there was previous iteration, then this read operation
*    indicates previous iteration is done. If pending_bytes > 0, go
*    through below steps.
* b. read data_offset, indicates kernel driver to make data available
*    through data section. Kernel driver should return this read
*    operation only after data is available from (region + data_offset)
*    to (region + data_offset + data_size).
* c. read data_size, amount of data in bytes available through migration
*    region.
* d. read data of data_size bytes from (region + data_offset) from
*    migration region.
* e. process data.
* f. Loop through a to e.

Hope this is more clear.
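
For illustration only, one pass over steps a to d of the reworded sequence
could look roughly like the sketch below. It is not part of the patch: the
helper names are hypothetical, the structure is a local mirror of the layout
proposed here, and `region` is assumed to be the file offset of the migration
region on the device fd (as reported by VFIO_DEVICE_GET_REGION_INFO). Short
reads are simply treated as errors.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Local mirror of the structure proposed in this patch. */
struct vfio_device_migration_info {
	uint32_t device_state;
	uint32_t reserved;
	uint64_t pending_bytes;
	uint64_t data_offset;
	uint64_t data_size;
} __attribute__((packed));

static_assert(sizeof(struct vfio_device_migration_info) == 32,
	      "unexpected layout");

#define MIG_FIELD(f) ((off_t)offsetof(struct vfio_device_migration_info, f))

/* Read one 64-bit field at its native width. */
static int mig_read_u64(int fd, off_t off, uint64_t *val)
{
	if (pread(fd, val, sizeof(*val), off) != (ssize_t)sizeof(*val)) {
		perror("pread");
		return -1;
	}
	return 0;
}

/*
 * Hypothetical helper: one pass over steps a-d.
 * Returns 0 when pending_bytes is 0, 1 when *len bytes were copied
 * into buf, -1 on error (including a buffer that is too small).
 */
static int mig_save_one_chunk(int fd, off_t region, void *buf, size_t cap,
			      size_t *len)
{
	uint64_t pending, data_offset, data_size;

	/* a. Starts a new iteration and ends the previous one, if any. */
	if (mig_read_u64(fd, region + MIG_FIELD(pending_bytes), &pending))
		return -1;
	if (pending == 0)
		return 0;

	/* b. Returns only once data is available in the data section. */
	if (mig_read_u64(fd, region + MIG_FIELD(data_offset), &data_offset))
		return -1;

	/* c. Amount of data available for this iteration. */
	if (mig_read_u64(fd, region + MIG_FIELD(data_size), &data_size))
		return -1;
	if (data_size > cap)
		return -1;

	/* d. Copy the data out of the region. */
	if (pread(fd, buf, data_size, region + (off_t)data_offset) !=
	    (ssize_t)data_size) {
		perror("pread");
		return -1;
	}

	*len = data_size;
	return 1;	/* e. caller processes buf, then f. calls again */
}
```

A caller would invoke mig_save_one_chunk() repeatedly, processing buf after
each return of 1, and stop once it returns 0.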

>> Is the user required to reach pending_bytes == 0 before changing
>> device_state, particularly transitioning to !_RUNNING?

No, it's not necessary to reach pending_bytes == 0 in the pre-copy phase.

>>  Presumably the
>> user can exit this sequence at any time by clearing _SAVING.

In that case the device state data is incomplete, so the device cannot
be resumed from that data.
In the stop-and-copy phase, the user should iterate until pending_bytes is 0.

>>
>>> + *
>>> + * Sequence to be followed while _RESUMING device state:
>>> + * While data for this device is available, repeat below steps:
>>> + * a. read data_offset from where user application should write data.
> before proceed to step b, need to read data_size from vendor driver to determine
> the max len of data to write. I think it's necessary in such a condition
> that source vendor driver and target vendor driver do not offer data sections of
> the same size. e.g. in source side, the data section is of size 100M,
> but in target side, the data section is only of 50M size.
> rather than simply fail, loop and write seems better.
> 

Makes sense. Doing this change for next version.

> Thanks
> Yan
>>> + * b. write data of data_size to migration region from data_offset.
>>> + * c. write data_size which indicates vendor driver that data is written in
>>> + *    staging buffer. Vendor driver should read this data from migration
>>> + *    region and resume device's state.
>>
>> The device defaults to _RUNNING state, so a prerequisite is to set
>> _RESUMING and clear _RUNNING, right?

Yes.
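
As a rough illustration of the _RESUMING sequence quoted above (steps a to c),
combined with the "loop and write" idea, user space could do something like
the following once _RESUMING is set. This is only a sketch under assumptions:
the helper name and the max_chunk parameter are hypothetical, and how the
user learns the maximum chunk size is not defined by this version of the
patch. `region` is assumed to be the file offset of the migration region on
the device fd.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Local mirror of the structure proposed in this patch. */
struct vfio_device_migration_info {
	uint32_t device_state;
	uint32_t reserved;
	uint64_t pending_bytes;
	uint64_t data_offset;
	uint64_t data_size;
} __attribute__((packed));

static_assert(sizeof(struct vfio_device_migration_info) == 32,
	      "unexpected layout");

#define MIG_FIELD(f) ((off_t)offsetof(struct vfio_device_migration_info, f))

/*
 * Hypothetical helper: feed len bytes of opaque device data to the
 * vendor driver in chunks of at most max_chunk bytes, in order.
 * Returns 0 on success, -1 on error.
 */
static int mig_resume_write(int fd, off_t region, const void *data,
			    size_t len, size_t max_chunk)
{
	const char *p = data;

	if (max_chunk == 0)
		return -1;

	while (len > 0) {
		uint64_t data_offset;
		uint64_t chunk = len < max_chunk ? len : max_chunk;

		/* a. Ask where this chunk should be written. */
		if (pread(fd, &data_offset, sizeof(data_offset),
			  region + MIG_FIELD(data_offset)) !=
		    (ssize_t)sizeof(data_offset)) {
			perror("pread data_offset");
			return -1;
		}

		/* b. Write the chunk into the data section. */
		if (pwrite(fd, p, chunk, region + (off_t)data_offset) !=
		    (ssize_t)chunk) {
			perror("pwrite data");
			return -1;
		}

		/* c. Writing data_size tells the driver the chunk is ready. */
		if (pwrite(fd, &chunk, sizeof(chunk),
			   region + MIG_FIELD(data_size)) !=
		    (ssize_t)sizeof(chunk)) {
			perror("pwrite data_size");
			return -1;
		}

		p += chunk;
		len -= chunk;
	}
	return 0;
}
```

The data stays opaque to the caller and is written in the same order as it
was received from the source.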

>>
>>> + *
>>> + * For user application, data is opaque. User should write data in the same
>>> + * order as received.
>>> + */
>>> +
>>> +struct vfio_device_migration_info {
>>> +	__u32 device_state;         /* VFIO device state */
>>> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
>>> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
>>> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
>>> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
>>> +				     VFIO_DEVICE_STATE_SAVING |  \
>>> +				     VFIO_DEVICE_STATE_RESUMING)
>>> +
>>> +#define VFIO_DEVICE_STATE_INVALID_CASE1    (VFIO_DEVICE_STATE_SAVING | \
>>> +					    VFIO_DEVICE_STATE_RESUMING)
>>> +
>>> +#define VFIO_DEVICE_STATE_INVALID_CASE2    (VFIO_DEVICE_STATE_RUNNING | \
>>> +					    VFIO_DEVICE_STATE_RESUMING)
>>
>> These seem difficult to use, maybe we just need a
>> VFIO_DEVICE_STATE_VALID macro?
>>
>> #define VFIO_DEVICE_STATE_VALID(state) \
>>    (state & VFIO_DEVICE_STATE_RESUMING ? \
>>    (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
>>

This will not work once other bits come into use in the future.
That's the reason I preferred to add individual invalid states which
the user should check.
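
For reference, a check that looks only at the three bits defined so far could
also be written as a small function instead of a macro. This is a minimal
sketch, not part of the patch: the name vfio_device_state_valid() is
hypothetical, and ignoring the reserved bits 3 - 31 is an assumption that
would need revisiting once any of those bits gets a meaning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit definitions as proposed in this patch. */
#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
				     VFIO_DEVICE_STATE_SAVING |  \
				     VFIO_DEVICE_STATE_RESUMING)

static_assert(VFIO_DEVICE_STATE_MASK == 0x7, "three state bits defined so far");

/*
 * Hypothetical helper: true unless _RESUMING is combined with _RUNNING
 * or _SAVING. Reserved bits are masked off before the check.
 */
static bool vfio_device_state_valid(uint32_t state)
{
	uint32_t s = state & VFIO_DEVICE_STATE_MASK;

	if (s & VFIO_DEVICE_STATE_RESUMING)
		return s == VFIO_DEVICE_STATE_RESUMING;
	return true;
}
```

With this shape, the five valid rows of the state table pass and the three
rows marked invalid are rejected.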

Thanks,
Kirti

>> Thanks,
>> Alex
>>
>>> +	__u32 reserved;
>>> +	__u64 pending_bytes;
>>> +	__u64 data_offset;
>>> +	__u64 data_size;
>>> +} __attribute__((packed));
>>> +
>>>   /*
>>>    * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>>>    * which allows direct access to non-MSIX registers which happened to be within
>>


* Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
  2019-11-13 18:27       ` Alex Williamson
@ 2019-11-13 19:29         ` Kirti Wankhede
  2019-11-13 19:48           ` Alex Williamson
  0 siblings, 1 reply; 46+ messages in thread
From: Kirti Wankhede @ 2019-11-13 19:29 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, dgilbert, jonathan.davies, eauger, aik, pasic, felipe,
	Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang, yan.y.zhao,
	qemu-devel, kvm



On 11/13/2019 11:57 PM, Alex Williamson wrote:
> On Wed, 13 Nov 2019 11:24:17 +0100
> Cornelia Huck <cohuck@redhat.com> wrote:
> 
>> On Tue, 12 Nov 2019 15:30:05 -0700
>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>
>>> On Tue, 12 Nov 2019 22:33:36 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>    
>>>> - Defined MIGRATION region type and sub-type.
>>>> - Used 3 bits to define VFIO device states.
>>>>      Bit 0 => _RUNNING
>>>>      Bit 1 => _SAVING
>>>>      Bit 2 => _RESUMING
>>>>      Combination of these bits defines VFIO device's state during migration
>>>>      _RUNNING => Normal VFIO device running state. When its reset, it
>>>> 		indicates _STOPPED state. when device is changed to
>>>> 		_STOPPED, driver should stop device before write()
>>>> 		returns.
>>>>      _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
>>>>                            start saving state of device i.e. pre-copy state
>>>>      _SAVING  => vCPUs are stopped, VFIO device should be stopped, and
>>>
>>> s/should/must/
>>>    
>>>>                  save device state,i.e. stop-n-copy state
>>>>      _RESUMING => VFIO device resuming state.
>>>>      _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states
>>>
>>> A table might be useful here and in the uapi header to indicate valid
>>> states:
>>
>> I like that.
>>
>>>
>>> | _RESUMING | _SAVING | _RUNNING | Description
>>> +-----------+---------+----------+------------------------------------------
>>> |     0     |    0    |     0    | Stopped, not saving or resuming (a)
>>> +-----------+---------+----------+------------------------------------------
>>> |     0     |    0    |     1    | Running, default state
>>> +-----------+---------+----------+------------------------------------------
>>> |     0     |    1    |     0    | Stopped, migration interface in save mode
>>> +-----------+---------+----------+------------------------------------------
>>> |     0     |    1    |     1    | Running, save mode interface, iterative
>>> +-----------+---------+----------+------------------------------------------
>>> |     1     |    0    |     0    | Stopped, migration resume interface active
>>> +-----------+---------+----------+------------------------------------------
>>> |     1     |    0    |     1    | Invalid (b)
>>> +-----------+---------+----------+------------------------------------------
>>> |     1     |    1    |     0    | Invalid (c)
>>> +-----------+---------+----------+------------------------------------------
>>> |     1     |    1    |     1    | Invalid (d)
>>>
>>> I think we need to consider whether we define (a) as generally
>>> available, for instance we might want to use it for diagnostics or a
>>> fatal error condition outside of migration.
>>>
>>> Are there hidden assumptions between state transitions here or are
>>> there specific next possible state diagrams that we need to include as
>>> well?
>>
>> Some kind of state-change diagram might be useful in addition to the
>> textual description anyway. Let me try, just to make sure I understand
>> this correctly:
>>

During User application initialization, there is one more state change:

0) 0/0/0 ---- stop to running -----> 0/0/1

>> 1) 0/0/1 ---(trigger driver to start gathering state info)---> 0/1/1

Not just gathering state info, but also copying the device state to be
transferred during the pre-copy phase.

The two transitions below do more than tell the driver to stop, and they
differ. In 2) the device state changes from running to stopped; this happens
when the VM shuts down cleanly, so there is no need to save device state.

>> 2) 0/0/1 ---(tell driver to stop)---> 0/0/0 

>> 3) 0/1/1 ---(tell driver to stop)---> 0/1/0

The above is the transition from the pre-copy phase to the stop-and-copy
phase, where device data should be made available to the user to transfer to
the destination, or to save to a file in the case of save VM or suspend.
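
Since the header comment asks the user to perform a read-modify-write on
device_state, the value written for such a transition could be computed as in
the sketch below. The helper name is hypothetical and nothing here is part of
the patch. It only shows how the reserved bits can be carried over unchanged
while the three defined bits are updated.

```c
#include <assert.h>
#include <stdint.h>

/* Bit definitions as proposed in this patch. */
#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
				     VFIO_DEVICE_STATE_SAVING |  \
				     VFIO_DEVICE_STATE_RESUMING)

/*
 * Hypothetical helper: given the value just read from device_state,
 * return the value to write back. Only the three defined bits may be
 * set or cleared, so reserved bits in cur pass through untouched.
 */
static uint32_t vfio_state_update(uint32_t cur, uint32_t set, uint32_t clear)
{
	assert(((set | clear) & ~(uint32_t)VFIO_DEVICE_STATE_MASK) == 0);
	return (cur & ~clear) | set;
}
```

For transition 3), the user would read device_state, call
vfio_state_update(cur, 0, VFIO_DEVICE_STATE_RUNNING), and write the result
back.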


>> 4) 0/0/1 ---(tell driver to resume with provided info)---> 1/0/0
> 
> I think this is to switch into resuming mode, the data will follow
>
>> 5) 1/0/0 ---(driver is ready)---> 0/0/1
>> 6) 0/1/1 ---(tell driver to stop saving)---> 0/0/1
>

The above can occur when migration is cancelled or fails.


> I think also:
> 
> 0/0/1 --> 0/1/0 If user chooses to go directly to stop and copy

that's right, this happens in case of save VM or suspend VM.

> 
> 0/0/0 and 0/0/1 should be reachable from any state, though I could see
> that a vendor driver could fail transition from 1/0/0 -> 0/0/1 if the
> received state is incomplete.  Somehow though a user always needs to
> return the device to the initial state, so how does device_state
> interact with the reset ioctl?  Would this automatically manipulate
> device_state back to 0/0/1?

Why would a reset occur on a 1/0/0 -> 0/0/1 failure?

If 1/0/0 -> 0/0/1 fails, the user should convey to the source that the
migration has failed, and then resume at the source.

>   
>> Not sure about the usefulness of 2).

I explained this above.

>> Also, is 4) the only way to
>> trigger resuming? 
Yes.

>> And is the change in 5) performed by the driver, or
>> by userspace?
>>
By userspace.

>> Are any other state transitions valid?
>>
>> (...)
>>
>>>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
>>>> + * and for _SAVING device state or stop-and-copy phase:
>>>> + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
>>>> + * b. read data_offset, indicates kernel driver to write data to staging buffer.
>>>> + *    Kernel driver should return this read operation only after writing data to
>>>> + *    staging buffer is done.
>>>
>>> "staging buffer" implies a vendor driver implementation, perhaps we
>>> could just state that data is available from (region + data_offset) to
>>> (region + data_offset + data_size) upon return of this read operation.
>>>    
>>>> + * c. read data_size, amount of data in bytes written by vendor driver in
>>>> + *    migration region.
>>>> + * d. read data_size bytes of data from data_offset in the migration region.
>>>> + * e. process data.
>>>> + * f. Loop through a to e. Next read on pending_bytes indicates that read data
>>>> + *    operation from migration region for previous iteration is done.
>>>
>>> I think this indicate that step (f) should be to read pending_bytes, the
>>> read sequence is not complete until this step.  Optionally the user can
>>> then proceed to step (b).  There are no read side-effects of (a) afaict.
>>>
>>> Is the user required to reach pending_bytes == 0 before changing
>>> device_state, particularly transitioning to !_RUNNING?  Presumably the
>>> user can exit this sequence at any time by clearing _SAVING.
>>
>> That would be transition 6) above (abort saving and continue). I think
>> it makes sense not to forbid this.
>>
>>>    
>>>> + *
>>>> + * Sequence to be followed while _RESUMING device state:
>>>> + * While data for this device is available, repeat below steps:
>>>> + * a. read data_offset from where user application should write data.
>>>> + * b. write data of data_size to migration region from data_offset.
>>>> + * c. write data_size which indicates vendor driver that data is written in
>>>> + *    staging buffer. Vendor driver should read this data from migration
>>>> + *    region and resume device's state.
>>>
>>> The device defaults to _RUNNING state, so a prerequisite is to set
>>> _RESUMING and clear _RUNNING, right?
>>

Sorry, I replied yes in my previous reply, but the answer is no. The default
device state is _STOPPED. During resume the transition is _STOPPED -> _RESUMING.

>> Transition 4) above. Do we need

I think it's not required.

>> 7) 0/0/0 ---(tell driver to resume with provided info)---> 1/0/0
>> as well? (Probably depends on how sensible the 0/0/0 state is.)
> 
> I think we must unless we require the user to transition from 0/0/1 to
> 1/0/0 in a single operation, but I'd prefer to make 0/0/0 generally
> available.  Thanks,
> 

It's 0/0/0 -> 1/0/0 while resuming.
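
To keep track of what has been proposed so far, the transitions enumerated in
this mail could be captured in a small lookup table such as the sketch below.
It only records what was listed in the discussion, not an agreed state
machine: the thread has not settled which of these are valid, and all names
here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ST_RUNNING   (1u << 0)
#define ST_SAVING    (1u << 1)
#define ST_RESUMING  (1u << 2)

static_assert((ST_RUNNING | ST_SAVING | ST_RESUMING) == 0x7,
	      "three state bits");

struct transition {
	uint32_t from;
	uint32_t to;
};

/* States are written _RESUMING/_SAVING/_RUNNING, as in the table. */
static const struct transition discussed[] = {
	{ 0,                      ST_RUNNING },              /* 0) 0/0/0 -> 0/0/1 */
	{ ST_RUNNING,             ST_SAVING | ST_RUNNING },  /* 1) 0/0/1 -> 0/1/1 */
	{ ST_RUNNING,             0 },                       /* 2) 0/0/1 -> 0/0/0 */
	{ ST_SAVING | ST_RUNNING, ST_SAVING },               /* 3) 0/1/1 -> 0/1/0 */
	{ ST_RUNNING,             ST_RESUMING },             /* 4) 0/0/1 -> 1/0/0 */
	{ ST_RESUMING,            ST_RUNNING },              /* 5) 1/0/0 -> 0/0/1 */
	{ ST_SAVING | ST_RUNNING, ST_RUNNING },              /* 6) 0/1/1 -> 0/0/1 */
	{ 0,                      ST_RESUMING },             /* 7) 0/0/0 -> 1/0/0 */
	{ ST_RUNNING,             ST_SAVING },               /* direct stop-and-copy */
};

/* True if (from -> to) is one of the transitions listed above. */
static bool transition_discussed(uint32_t from, uint32_t to)
{
	for (size_t i = 0; i < sizeof(discussed) / sizeof(discussed[0]); i++)
		if (discussed[i].from == from && discussed[i].to == to)
			return true;
	return false;
}
```

Whether 0/0/0 and 0/0/1 should additionally be reachable from every state, as
suggested above, is left out of the table because it is still an open
question.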

Thanks,
Kirti

> Alex
> 


* Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.
  2019-11-12 22:30   ` Alex Williamson
@ 2019-11-13 19:37     ` Kirti Wankhede
  2019-11-13 20:07       ` Alex Williamson
  0 siblings, 1 reply; 46+ messages in thread
From: Kirti Wankhede @ 2019-11-13 19:37 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm



On 11/13/2019 4:00 AM, Alex Williamson wrote:
> On Tue, 12 Nov 2019 22:33:37 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> All pages pinned by vendor driver through vfio_pin_pages API should be
>> considered as dirty during migration. IOMMU container maintains a list of
>> all such pinned pages. Added an ioctl defination to get bitmap of such
> 
> definition
> 
>> pinned pages for requested IO virtual address range.
> 
> Additionally, all mapped pages are considered dirty when physically
> mapped through to an IOMMU, modulo we discussed devices opting in to
> per page pinning to indicate finer granularity with a TBD mechanism to
> figure out if any non-opt-in devices remain.
> 

You mean, in case of device direct assignment (device pass through)?

>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   include/uapi/linux/vfio.h | 23 +++++++++++++++++++++++
>>   1 file changed, 23 insertions(+)
>>
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 35b09427ad9f..6fd3822aa610 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -902,6 +902,29 @@ struct vfio_iommu_type1_dma_unmap {
>>   #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
>>   #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
>>   
>> +/**
>> + * VFIO_IOMMU_GET_DIRTY_BITMAP - _IOWR(VFIO_TYPE, VFIO_BASE + 17,
>> + *                                     struct vfio_iommu_type1_dirty_bitmap)
>> + *
>> + * IOCTL to get dirty pages bitmap for IOMMU container during migration.
>> + * Get dirty pages bitmap of given IO virtual addresses range using
>> + * struct vfio_iommu_type1_dirty_bitmap. Caller sets argsz, which is size of
>> + * struct vfio_iommu_type1_dirty_bitmap. User should allocate memory to get
>> + * bitmap and should set size of allocated memory in bitmap_size field.
>> + * One bit is used to represent per page consecutively starting from iova
>> + * offset. Bit set indicates page at that offset from iova is dirty.
>> + */
>> +struct vfio_iommu_type1_dirty_bitmap {
>> +	__u32        argsz;
>> +	__u32        flags;
>> +	__u64        iova;                      /* IO virtual address */
>> +	__u64        size;                      /* Size of iova range */
>> +	__u64        bitmap_size;               /* in bytes */
> 
> This seems redundant.  We can calculate the size of the bitmap based on
> the iova size.
>

But in kernel space, we need to validate the size of memory allocated by 
user instead of assuming user is always correct, right?

>> +	void __user *bitmap;                    /* one bit per page */
> 
> Should we define that as a __u64* to (a) help with the size
> calculation, and (b) assure that we can use 8-byte ops on it?
> 
> However, who defines page size?  Is it necessarily the processor page
> size?  A physical IOMMU may support page sizes other than the CPU page
> size.  It might be more important to indicate the expected page size
> than the bitmap size.  Thanks,
>

As far as I can see, in QEMU and in the vfio_iommu_type1 module the page size
considered for mapping is the CPU page size, 4K. Do we still need such an
argument?
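
To make the size question concrete, the relation between the iova range, the
page size and the bitmap size could be sketched as below. This is an
illustration only: the function names are hypothetical, the explicit page
size argument does not exist in this version of the structure, and rounding
up to whole 64-bit words is an assumption taken from the __u64 suggestion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Bytes needed for one bit per page over a range of `size` bytes,
 * rounded up to whole 64-bit words.
 */
static uint64_t dirty_bitmap_bytes(uint64_t size, uint64_t pgsize)
{
	uint64_t pages, words;

	assert(pgsize != 0);
	pages = size / pgsize + (size % pgsize != 0);
	words = pages / 64 + (pages % 64 != 0);
	return words * sizeof(uint64_t);
}

/*
 * Kernel-side style validation: the buffer supplied by the user must be
 * large enough, and the page size must be a non-zero power of two.
 */
static bool dirty_bitmap_size_ok(uint64_t size, uint64_t pgsize,
				 uint64_t bitmap_size)
{
	if (pgsize == 0 || (pgsize & (pgsize - 1)))
		return false;
	return bitmap_size >= dirty_bitmap_bytes(size, pgsize);
}
```

With a 4K page size, a 1 GiB range needs 262144 bits, that is 32768 bytes of
bitmap, so the size can be derived from the range while the kernel still
checks the user's allocation.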

Thanks,
Kirti

> Alex
> 
>> +};
>> +
>> +#define VFIO_IOMMU_GET_DIRTY_BITMAP             _IO(VFIO_TYPE, VFIO_BASE + 17)
>> +
>>   /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>>   
>>   /*
> 


* Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
  2019-11-13 19:29         ` Kirti Wankhede
@ 2019-11-13 19:48           ` Alex Williamson
  2019-11-13 20:17             ` Kirti Wankhede
  0 siblings, 1 reply; 46+ messages in thread
From: Alex Williamson @ 2019-11-13 19:48 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Cornelia Huck, cjia, kevin.tian, ziye.yang, changpeng.liu,
	yi.l.liu, mlevitsk, eskultet, dgilbert, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	zhi.a.wang, yan.y.zhao, qemu-devel, kvm

On Thu, 14 Nov 2019 00:59:52 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 11/13/2019 11:57 PM, Alex Williamson wrote:
> > On Wed, 13 Nov 2019 11:24:17 +0100
> > Cornelia Huck <cohuck@redhat.com> wrote:
> >   
> >> On Tue, 12 Nov 2019 15:30:05 -0700
> >> Alex Williamson <alex.williamson@redhat.com> wrote:
> >>  
> >>> On Tue, 12 Nov 2019 22:33:36 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>      
> >>>> - Defined MIGRATION region type and sub-type.
> >>>> - Used 3 bits to define VFIO device states.
> >>>>      Bit 0 => _RUNNING
> >>>>      Bit 1 => _SAVING
> >>>>      Bit 2 => _RESUMING
> >>>>      Combination of these bits defines VFIO device's state during migration
> >>>>      _RUNNING => Normal VFIO device running state. When its reset, it
> >>>> 		indicates _STOPPED state. when device is changed to
> >>>> 		_STOPPED, driver should stop device before write()
> >>>> 		returns.
> >>>>      _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
> >>>>                            start saving state of device i.e. pre-copy state
> >>>>      _SAVING  => vCPUs are stopped, VFIO device should be stopped, and  
> >>>
> >>> s/should/must/
> >>>      
> >>>>                  save device state,i.e. stop-n-copy state
> >>>>      _RESUMING => VFIO device resuming state.
> >>>>      _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states  
> >>>
> >>> A table might be useful here and in the uapi header to indicate valid
> >>> states:  
> >>
> >> I like that.
> >>  
> >>>
> >>> | _RESUMING | _SAVING | _RUNNING | Description
> >>> +-----------+---------+----------+------------------------------------------
> >>> |     0     |    0    |     0    | Stopped, not saving or resuming (a)
> >>> +-----------+---------+----------+------------------------------------------
> >>> |     0     |    0    |     1    | Running, default state
> >>> +-----------+---------+----------+------------------------------------------
> >>> |     0     |    1    |     0    | Stopped, migration interface in save mode
> >>> +-----------+---------+----------+------------------------------------------
> >>> |     0     |    1    |     1    | Running, save mode interface, iterative
> >>> +-----------+---------+----------+------------------------------------------
> >>> |     1     |    0    |     0    | Stopped, migration resume interface active
> >>> +-----------+---------+----------+------------------------------------------
> >>> |     1     |    0    |     1    | Invalid (b)
> >>> +-----------+---------+----------+------------------------------------------
> >>> |     1     |    1    |     0    | Invalid (c)
> >>> +-----------+---------+----------+------------------------------------------
> >>> |     1     |    1    |     1    | Invalid (d)
> >>>
> >>> I think we need to consider whether we define (a) as generally
> >>> available, for instance we might want to use it for diagnostics or a
> >>> fatal error condition outside of migration.
> >>>
> >>> Are there hidden assumptions between state transitions here or are
> >>> there specific next possible state diagrams that we need to include as
> >>> well?  
> >>
> >> Some kind of state-change diagram might be useful in addition to the
> >> textual description anyway. Let me try, just to make sure I understand
> >> this correctly:
> >>  
> 
> During User application initialization, there is one more state change:
> 
> 0) 0/0/0 ---- stop to running -----> 0/0/1

0/0/0 cannot be the initial state of the device, that would imply that
a device supporting this migration interface breaks backwards
compatibility with all existing vfio userspace code and that code needs
to learn to set the device running as part of its initialization.
That's absolutely unacceptable.  The initial device state must be 0/0/1.

> >> 1) 0/0/1 ---(trigger driver to start gathering state info)---> 0/1/1  
> 
> not just gathering state info, but also copy device state to be 
> transferred during pre-copy phase.
> 
> Below 2 state are not just to tell driver to stop, those 2 differ.
> 2) is device state changed from running to stop, this is when VM 
> shutdowns cleanly, no need to save device state

Userspace is under no obligation to perform this state change though,
backwards compatibility dictates this.
 
> >> 2) 0/0/1 ---(tell driver to stop)---> 0/0/0   
> 
> >> 3) 0/1/1 ---(tell driver to stop)---> 0/1/0  
> 
> above is transition from pre-copy phase to stop-and-copy phase, where 
> device data should be made available to user to transfer to destination 
> or to save it to file in case of save VM or suspend.
> 
> 
> >> 4) 0/0/1 ---(tell driver to resume with provided info)---> 1/0/0  
> > 
> > I think this is to switch into resuming mode, the data will follow
> >
> >> 5) 1/0/0 ---(driver is ready)---> 0/0/1
> >> 6) 0/1/1 ---(tell driver to stop saving)---> 0/0/1  
> >  
> 
> above can occur on migration cancelled or failed.
> 
> 
> > I think also:
> > 
> > 0/0/1 --> 0/1/0 If user chooses to go directly to stop and copy  
> 
> that's right, this happens in case of save VM or suspend VM.
> 
> > 
> > 0/0/0 and 0/0/1 should be reachable from any state, though I could see
> > that a vendor driver could fail transition from 1/0/0 -> 0/0/1 if the
> > received state is incomplete.  Somehow though a user always needs to
> > return the device to the initial state, so how does device_state
> > interact with the reset ioctl?  Would this automatically manipulate
> > device_state back to 0/0/1?  
> 
> why would reset occur on 1/0/0 -> 0/0/1 failure?

The question is whether the reset ioctl automatically puts the device
back into the initial state, 0/0/1.  A reset from 1/0/0 -> 0/0/1
presumably discards much of the device state we just restored, so
clearly that would be undesirable.
 
> 1/0/0 -> 0/0/1 fails, then user should convey that to source that 
> migration has failed, then resume at source.

In the scheme of the migration yes, but as far as the vfio interface is
concerned the user should have a path to make use of a device after
this point without closing it and starting over.  Thus, if a 1/0/0 ->
0/0/1 transition fails, would we define the device reset ioctl as a
mechanism to flush the bogus state and place the device into the 0/0/1
initial state?
 
> >     
> >> Not sure about the usefulness of 2).  
> 
> I explained this above.
> 
> >> Also, is 4) the only way to
> >> trigger resuming?   
> Yes.
> 
> >> And is the change in 5) performed by the driver, or
> >> by userspace?
> >>  
> By userspace.
> 
> >> Are any other state transitions valid?
> >>
> >> (...)
> >>  
> >>>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> >>>> + * and for _SAVING device state or stop-and-copy phase:
> >>>> + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
> >>>> + * b. read data_offset, indicates kernel driver to write data to staging buffer.
> >>>> + *    Kernel driver should return this read operation only after writing data to
> >>>> + *    staging buffer is done.  
> >>>
> >>> "staging buffer" implies a vendor driver implementation, perhaps we
> >>> could just state that data is available from (region + data_offset) to
> >>> (region + data_offset + data_size) upon return of this read operation.
> >>>      
> >>>> + * c. read data_size, amount of data in bytes written by vendor driver in
> >>>> + *    migration region.
> >>>> + * d. read data_size bytes of data from data_offset in the migration region.
> >>>> + * e. process data.
> >>>> + * f. Loop through a to e. Next read on pending_bytes indicates that read data
> >>>> + *    operation from migration region for previous iteration is done.  
> >>>
> >>> I think this indicates that step (f) should be to read pending_bytes, the
> >>> read sequence is not complete until this step.  Optionally the user can
> >>> then proceed to step (b).  There are no read side-effects of (a) afaict.
> >>>
> >>> Is the user required to reach pending_bytes == 0 before changing
> >>> device_state, particularly transitioning to !_RUNNING?  Presumably the
> >>> user can exit this sequence at any time by clearing _SAVING.  
> >>
> >> That would be transition 6) above (abort saving and continue). I think
> >> it makes sense not to forbid this.
> >>  
> >>>      
> >>>> + *
> >>>> + * Sequence to be followed while _RESUMING device state:
> >>>> + * While data for this device is available, repeat below steps:
> >>>> + * a. read data_offset from where user application should write data.
> >>>> + * b. write data of data_size to migration region from data_offset.
> >>>> + * c. write data_size which indicates vendor driver that data is written in
> >>>> + *    staging buffer. Vendor driver should read this data from migration
> >>>> + *    region and resume device's state.  
> >>>
> >>> The device defaults to _RUNNING state, so a prerequisite is to set
> >>> _RESUMING and clear _RUNNING, right?  
> >>  
> 
> Sorry, I replied yes in my previous reply, but no. Default device state 
> is _STOPPED. During resume _STOPPED -> _RESUMING

Nope, it can't be, it must be _RUNNING.

> >> Transition 4) above. Do we need  
> 
> I think, its not required.

But above we say it's the only way to trigger resuming (4 was 0/0/1 ->
1/0/0).

> >> 7) 0/0/0 ---(tell driver to resume with provided info)---> 1/0/0
> >> as well? (Probably depends on how sensible the 0/0/0 state is.)  
> > 
> > I think we must unless we require the user to transition from 0/0/1 to
> > 1/0/0 in a single operation, but I'd prefer to make 0/0/0 generally
> > available.  Thanks,
> >   
> 
> its 0/0/0 -> 1/0/0 while resuming.

I think we're starting with different initial states, IMO there is
absolutely no way around 0/0/1 being the initial device state.
Anything otherwise means that we cannot add migration support to an
existing device and maintain compatibility with existing userspace.
Thanks,

Alex
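
For readers following the x/y/z notation, the triplets are
_RESUMING/_SAVING/_RUNNING. Below is a minimal sketch of the validity rule
and of the initial state argued for above, assuming the bit positions from
the patch description (bit 0 _RUNNING, bit 1 _SAVING, bit 2 _RESUMING). The
DEV_STATE_* names, dev_state_valid() and dev_state_set() are made up for
illustration and are not taken from the proposed header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit layout from the patch description; the names are hypothetical. */
#define DEV_STATE_RUNNING  (1u << 0)
#define DEV_STATE_SAVING   (1u << 1)
#define DEV_STATE_RESUMING (1u << 2)
#define DEV_STATE_MASK \
	(DEV_STATE_RUNNING | DEV_STATE_SAVING | DEV_STATE_RESUMING)

/* 0/0/1: the state existing userspace implicitly expects after open. */
#define DEV_STATE_INITIAL  DEV_STATE_RUNNING

/* _RESUMING may not be combined with _SAVING or _RUNNING. */
static bool dev_state_valid(uint32_t state)
{
	if (state & ~DEV_STATE_MASK)
		return false;
	if ((state & DEV_STATE_RESUMING) &&
	    (state & (DEV_STATE_SAVING | DEV_STATE_RUNNING)))
		return false;
	return true;
}

/* Apply a requested state; an invalid request leaves the old state alone. */
static int dev_state_set(uint32_t *cur, uint32_t next)
{
	assert(dev_state_valid(*cur));	/* the current state is always valid */
	if (!dev_state_valid(next))
		return -1;
	*cur = next;
	return 0;
}
```

Under this sketch a freshly opened device would start at DEV_STATE_INITIAL,
so userspace that never touches device_state keeps seeing a running device.
Which transitions between valid states are allowed is a separate question
that this check does not answer.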



* Re: [PATCH v9 Kernel 3/5] vfio iommu: Add ioctl defination to unmap IOVA and return dirty bitmap
  2019-11-12 22:30   ` Alex Williamson
@ 2019-11-13 19:52     ` Kirti Wankhede
  2019-11-13 20:22       ` Alex Williamson
  0 siblings, 1 reply; 46+ messages in thread
From: Kirti Wankhede @ 2019-11-13 19:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm



On 11/13/2019 4:00 AM, Alex Williamson wrote:
> On Tue, 12 Nov 2019 22:33:38 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> With vIOMMU, during pre-copy phase of migration, while CPUs are still
>> running, IO virtual address unmap can happen while device still keeping
>> reference of guest pfns. Those pages should be reported as dirty before
>> unmap, so that VFIO user space application can copy content of those pages
>> from source to destination.
>>
>> IOCTL defination added here add bitmap pointer, size and flag. If flag
> 
> definition, adds
> 
>> VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP is set and bitmap memory is allocated
>> and bitmap_size of set, then ioctl will create bitmap of pinned pages and
> 
> s/of/is/
> 
>> then unmap those.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   include/uapi/linux/vfio.h | 33 +++++++++++++++++++++++++++++++++
>>   1 file changed, 33 insertions(+)
>>
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 6fd3822aa610..72fd297baf52 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -925,6 +925,39 @@ struct vfio_iommu_type1_dirty_bitmap {
>>   
>>   #define VFIO_IOMMU_GET_DIRTY_BITMAP             _IO(VFIO_TYPE, VFIO_BASE + 17)
>>   
>> +/**
>> + * VFIO_IOMMU_UNMAP_DMA_GET_BITMAP - _IOWR(VFIO_TYPE, VFIO_BASE + 18,
>> + *				      struct vfio_iommu_type1_dma_unmap_bitmap)
>> + *
>> + * Unmap IO virtual addresses using the provided struct
>> + * vfio_iommu_type1_dma_unmap_bitmap.  Caller sets argsz.
>> + * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get dirty bitmap
>> + * before unmapping IO virtual addresses. If this flag is not set, only IO
>> + * virtual address are unmapped without creating pinned pages bitmap, that
>> + * is, behave same as VFIO_IOMMU_UNMAP_DMA ioctl.
>> + * User should allocate memory to get bitmap and should set size of allocated
>> + * memory in bitmap_size field. One bit in bitmap is used to represent per page
>> + * consecutively starting from iova offset. Bit set indicates page at that
>> + * offset from iova is dirty.
>> + * The actual unmapped size is returned in the size field and bitmap of pages
>> + * in the range of unmapped size is returned in bitmap if flag
>> + * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP is set.
>> + *
>> + * No guarantee is made to the user that arbitrary unmaps of iova or size
>> + * different from those used in the original mapping call will succeed.
>> + */
>> +struct vfio_iommu_type1_dma_unmap_bitmap {
>> +	__u32        argsz;
>> +	__u32        flags;
>> +#define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
>> +	__u64        iova;                        /* IO virtual address */
>> +	__u64        size;                        /* Size of mapping (bytes) */
>> +	__u64        bitmap_size;                 /* in bytes */
>> +	void __user *bitmap;                      /* one bit per page */
>> +};
>> +
>> +#define VFIO_IOMMU_UNMAP_DMA_GET_BITMAP _IO(VFIO_TYPE, VFIO_BASE + 18)
>> +
> 
> Why not extend VFIO_IOMMU_UNMAP_DMA to support this rather than add an
> ioctl that duplicates the functionality and extends it?? 

We do want old userspace applications to work with new kernel and 
vice-versa, right?

If I change the existing VFIO_IOMMU_UNMAP_DMA ioctl structure, say by 
adding 'bitmap_size' and 'bitmap' after 'size', then with the code below 
in the old kernel, old kernel & new userspace will work.

         minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);

         if (copy_from_user(&unmap, (void __user *)arg, minsz))
                 return -EFAULT;

         if (unmap.argsz < minsz || unmap.flags)
                 return -EINVAL;


With new kernel it would change to:
         minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, bitmap);

         if (copy_from_user(&unmap, (void __user *)arg, minsz))
                 return -EFAULT;

         if (unmap.argsz < minsz || unmap.flags)
                 return -EINVAL;

Then an old userspace app will fail because unmap.argsz < minsz, and 
copy_from_user might fault because the old userspace structure doesn't 
contain the new member variables.
We can't reorder the fields to keep 'size' as the last member, because then 
an old kernel will misinterpret the structure passed by a new userspace app.

> Otherwise
> same comments as previous, in fact it's too bad we can't use this ioctl
> for both, but a DONT_UNMAP flag on the UNMAP_DMA ioctl seems a bit
> absurd.
> 
> I suspect we also want a flags bit in VFIO_IOMMU_GET_INFO to indicate
> these capabilities are supported.
> 

Ok. I'll add that.

> Maybe for both ioctls we also want to define it as the user's
> responsibility to zero the bitmap, requiring the kernel to only set
> bits as necessary. 

Ok. Updating comment.

Thanks,
Kirti

> Thanks,
> 
> Alex
> 
>>   /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>>   
>>   /*
> 


* Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.
  2019-11-13 19:37     ` Kirti Wankhede
@ 2019-11-13 20:07       ` Alex Williamson
  2019-11-14 18:56         ` Kirti Wankhede
  0 siblings, 1 reply; 46+ messages in thread
From: Alex Williamson @ 2019-11-13 20:07 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Thu, 14 Nov 2019 01:07:21 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 11/13/2019 4:00 AM, Alex Williamson wrote:
> > On Tue, 12 Nov 2019 22:33:37 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> All pages pinned by vendor driver through vfio_pin_pages API should be
> >> considered as dirty during migration. IOMMU container maintains a list of
> >> all such pinned pages. Added an ioctl defination to get bitmap of such  
> > 
> > definition
> >   
> >> pinned pages for requested IO virtual address range.  
> > 
> > Additionally, all mapped pages are considered dirty when physically
> > mapped through to an IOMMU, modulo we discussed devices opting in to
> > per page pinning to indicate finer granularity with a TBD mechanism to
> > figure out if any non-opt-in devices remain.
> >   
> 
> You mean, in case of device direct assignment (device pass through)?

Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
pinned and mapped, then the correct dirty page set is all mapped pages.
We discussed using the vpfn list as a mechanism for vendor drivers to
reduce their migration footprint, but we also discussed that we would
need a way to determine that all participants in the container have
explicitly pinned their working pages or else we must consider the
entire potential working set as dirty.

> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>   include/uapi/linux/vfio.h | 23 +++++++++++++++++++++++
> >>   1 file changed, 23 insertions(+)
> >>
> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index 35b09427ad9f..6fd3822aa610 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h
> >> @@ -902,6 +902,29 @@ struct vfio_iommu_type1_dma_unmap {
> >>   #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
> >>   #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
> >>   
> >> +/**
> >> + * VFIO_IOMMU_GET_DIRTY_BITMAP - _IOWR(VFIO_TYPE, VFIO_BASE + 17,
> >> + *                                     struct vfio_iommu_type1_dirty_bitmap)
> >> + *
> >> + * IOCTL to get dirty pages bitmap for IOMMU container during migration.
> >> + * Get dirty pages bitmap of given IO virtual addresses range using
> >> + * struct vfio_iommu_type1_dirty_bitmap. Caller sets argsz, which is size of
> >> + * struct vfio_iommu_type1_dirty_bitmap. User should allocate memory to get
> >> + * bitmap and should set size of allocated memory in bitmap_size field.
> >> + * One bit is used to represent per page consecutively starting from iova
> >> + * offset. Bit set indicates page at that offset from iova is dirty.
> >> + */
> >> +struct vfio_iommu_type1_dirty_bitmap {
> >> +	__u32        argsz;
> >> +	__u32        flags;
> >> +	__u64        iova;                      /* IO virtual address */
> >> +	__u64        size;                      /* Size of iova range */
> >> +	__u64        bitmap_size;               /* in bytes */  
> > 
> > This seems redundant.  We can calculate the size of the bitmap based on
> > the iova size.
> >  
> 
> But in kernel space, we need to validate the size of memory allocated by 
> user instead of assuming user is always correct, right?

What does it buy us for the user to tell us the size?  They could be
wrong, they could be malicious.  The argsz field on the ioctl is mostly
for the handshake that the user is competent, we should get faults from
the copy-user operation if it's incorrect.
 
> >> +	void __user *bitmap;                    /* one bit per page */  
> > 
> > Should we define that as a __u64* to (a) help with the size
> > calculation, and (b) assure that we can use 8-byte ops on it?
> > 
> > However, who defines page size?  Is it necessarily the processor page
> > size?  A physical IOMMU may support page sizes other than the CPU page
> > size.  It might be more important to indicate the expected page size
> > than the bitmap size.  Thanks,
> >  
> 
> I see in QEMU and in vfio_iommu_type1 module, page sizes considered for 
> mapping are CPU page size, 4K. Do we still need to have such argument?

That assumption exists for backwards compatibility prior to supporting
the iova_pgsizes field in vfio_iommu_type1_info.  AFAIK the current
interface has no page size assumptions and we should not add any.
Thanks,

Alex
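
To make the "we can calculate the size of the bitmap" point concrete, here
is a small sketch of that calculation, assuming a __u64-based bitmap as
suggested above. How the page size is conveyed is still an open question in
this thread, so pgsize is simply a parameter; both function names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pages of size pgsize needed to cover size bytes, rounding up. */
static uint64_t dirty_bitmap_pages(uint64_t size, uint64_t pgsize)
{
	assert(pgsize != 0);
	return size / pgsize + (size % pgsize != 0);
}

/* Bitmap bytes for that many pages, padded to whole 64-bit words. */
static uint64_t dirty_bitmap_bytes(uint64_t size, uint64_t pgsize)
{
	uint64_t pages = dirty_bitmap_pages(size, pgsize);

	return (pages / 64 + (pages % 64 != 0)) * sizeof(uint64_t);
}
```

For example, a 1 MiB range with 4 KiB pages covers 256 pages and needs 32
bytes of bitmap, while the same range with 64 KiB pages needs a single
8-byte word, so the result depends on the page size at least as much as on
the range.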



* Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
  2019-11-13 19:48           ` Alex Williamson
@ 2019-11-13 20:17             ` Kirti Wankhede
  2019-11-13 20:40               ` Alex Williamson
  0 siblings, 1 reply; 46+ messages in thread
From: Kirti Wankhede @ 2019-11-13 20:17 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, cjia, kevin.tian, ziye.yang, changpeng.liu,
	yi.l.liu, mlevitsk, eskultet, dgilbert, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	zhi.a.wang, yan.y.zhao, qemu-devel, kvm



On 11/14/2019 1:18 AM, Alex Williamson wrote:
> On Thu, 14 Nov 2019 00:59:52 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 11/13/2019 11:57 PM, Alex Williamson wrote:
>>> On Wed, 13 Nov 2019 11:24:17 +0100
>>> Cornelia Huck <cohuck@redhat.com> wrote:
>>>    
>>>> On Tue, 12 Nov 2019 15:30:05 -0700
>>>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>>>   
>>>>> On Tue, 12 Nov 2019 22:33:36 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>       
>>>>>> - Defined MIGRATION region type and sub-type.
>>>>>> - Used 3 bits to define VFIO device states.
>>>>>>       Bit 0 => _RUNNING
>>>>>>       Bit 1 => _SAVING
>>>>>>       Bit 2 => _RESUMING
>>>>>>       Combination of these bits defines VFIO device's state during migration
>>>>>>       _RUNNING => Normal VFIO device running state. When its reset, it
>>>>>> 		indicates _STOPPED state. when device is changed to
>>>>>> 		_STOPPED, driver should stop device before write()
>>>>>> 		returns.
>>>>>>       _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
>>>>>>                             start saving state of device i.e. pre-copy state
>>>>>>       _SAVING  => vCPUs are stopped, VFIO device should be stopped, and
>>>>>
>>>>> s/should/must/
>>>>>       
>>>>>>                   save device state,i.e. stop-n-copy state
>>>>>>       _RESUMING => VFIO device resuming state.
>>>>>>       _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states
>>>>>
>>>>> A table might be useful here and in the uapi header to indicate valid
>>>>> states:
>>>>
>>>> I like that.
>>>>   
>>>>>
>>>>> | _RESUMING | _SAVING | _RUNNING | Description
>>>>> +-----------+---------+----------+------------------------------------------
>>>>> |     0     |    0    |     0    | Stopped, not saving or resuming (a)
>>>>> +-----------+---------+----------+------------------------------------------
>>>>> |     0     |    0    |     1    | Running, default state
>>>>> +-----------+---------+----------+------------------------------------------
>>>>> |     0     |    1    |     0    | Stopped, migration interface in save mode
>>>>> +-----------+---------+----------+------------------------------------------
>>>>> |     0     |    1    |     1    | Running, save mode interface, iterative
>>>>> +-----------+---------+----------+------------------------------------------
>>>>> |     1     |    0    |     0    | Stopped, migration resume interface active
>>>>> +-----------+---------+----------+------------------------------------------
>>>>> |     1     |    0    |     1    | Invalid (b)
>>>>> +-----------+---------+----------+------------------------------------------
>>>>> |     1     |    1    |     0    | Invalid (c)
>>>>> +-----------+---------+----------+------------------------------------------
>>>>> |     1     |    1    |     1    | Invalid (d)
>>>>>
>>>>> I think we need to consider whether we define (a) as generally
>>>>> available, for instance we might want to use it for diagnostics or a
>>>>> fatal error condition outside of migration.
>>>>>
>>>>> Are there hidden assumptions between state transitions here or are
>>>>> there specific next possible state diagrams that we need to include as
>>>>> well?
>>>>
>>>> Some kind of state-change diagram might be useful in addition to the
>>>> textual description anyway. Let me try, just to make sure I understand
>>>> this correctly:
>>>>   
>>
>> During User application initialization, there is one more state change:
>>
>> 0) 0/0/0 ---- stop to running -----> 0/0/1
> 
> 0/0/0 cannot be the initial state of the device, that would imply that
> a device supporting this migration interface breaks backwards
> compatibility with all existing vfio userspace code and that code needs
> to learn to set the device running as part of its initialization.
> That's absolutely unacceptable.  The initial device state must be 0/0/1.
> 

Existing vfio userspace code has no notion of device state right now, so 
by default the device is assumed to be always running.

With migration support, device states are added explicitly. For example, 
in QEMU, while the device is being initialized, i.e. from vfio_realize(), 
device_state is set to 0/0/0, but this is not required to be conveyed to 
the vendor driver. Then, with the vfio_vmstate_change() notifier, the 
device state is changed to 0/0/1 when the VM/vCPUs transition to running; 
at this moment the device state is conveyed to the vendor driver. So the 
vendor driver doesn't see the 0/0/0 state.

While resuming, for userspace such as QEMU, the device state change is 
from 0/0/0 to 1/0/0; the vendor driver sees 1/0/0 after basic device 
initialization is done.


>>>> 1) 0/0/1 ---(trigger driver to start gathering state info)---> 0/1/1
>>
>> not just gathering state info, but also copy device state to be
>> transferred during pre-copy phase.
>>
>> Below 2 state are not just to tell driver to stop, those 2 differ.
>> 2) is device state changed from running to stop, this is when VM
>> shutdowns cleanly, no need to save device state
> 
> Userspace is under no obligation to perform this state change though,
> backwards compatibility dictates this.
>   
>>>> 2) 0/0/1 ---(tell driver to stop)---> 0/0/0
>>
>>>> 3) 0/1/1 ---(tell driver to stop)---> 0/1/0
>>
>> above is transition from pre-copy phase to stop-and-copy phase, where
>> device data should be made available to user to transfer to destination
>> or to save it to file in case of save VM or suspend.
>>
>>
>>>> 4) 0/0/1 ---(tell driver to resume with provided info)---> 1/0/0
>>>
>>> I think this is to switch into resuming mode, the data will follow
>>>
>>>> 5) 1/0/0 ---(driver is ready)---> 0/0/1
>>>> 6) 0/1/1 ---(tell driver to stop saving)---> 0/0/1
>>>   
>>
>> above can occur on migration cancelled or failed.
>>
>>
>>> I think also:
>>>
>>> 0/0/1 --> 0/1/0 If user chooses to go directly to stop and copy
>>
>> that's right, this happens in case of save VM or suspend VM.
>>
>>>
>>> 0/0/0 and 0/0/1 should be reachable from any state, though I could see
>>> that a vendor driver could fail transition from 1/0/0 -> 0/0/1 if the
>>> received state is incomplete.  Somehow though a user always needs to
>>> return the device to the initial state, so how does device_state
>>> interact with the reset ioctl?  Would this automatically manipulate
>>> device_state back to 0/0/1?
>>
>> why would reset occur on 1/0/0 -> 0/0/1 failure?
> 
> The question is whether the reset ioctl automatically puts the device
> back into the initial state, 0/0/1.  A reset from 1/0/0 -> 0/0/1
> presumably discards much of the device state we just restored, so
> clearly that would be undesirable.
>   
>> 1/0/0 -> 0/0/1 fails, then user should convey that to source that
>> migration has failed, then resume at source.
> 
> In the scheme of the migration yes, but as far as the vfio interface is
> concerned the user should have a path to make use of a device after
> this point without closing it and starting over.  Thus, if a 1/0/0 ->
> 0/0/1 transition fails, would we define the device reset ioctl as a
> mechanism to flush the bogus state and place the device into the 0/0/1
> initial state?
>

Ok, userspace applications can be designed to do that. As of now, with 
QEMU, I don't see a way to reset the device on a 1/0/0 -> 0/0/1 failure.


>>>      
>>>> Not sure about the usefulness of 2).
>>
>> I explained this above.
>>
>>>> Also, is 4) the only way to
>>>> trigger resuming?
>> Yes.
>>
>>>> And is the change in 5) performed by the driver, or
>>>> by userspace?
>>>>   
>> By userspace.
>>
>>>> Are any other state transitions valid?
>>>>
>>>> (...)
>>>>   
>>>>>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
>>>>>> + * and for _SAVING device state or stop-and-copy phase:
>>>>>> + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
>>>>>> + * b. read data_offset, indicates kernel driver to write data to staging buffer.
>>>>>> + *    Kernel driver should return this read operation only after writing data to
>>>>>> + *    staging buffer is done.
>>>>>
>>>>> "staging buffer" implies a vendor driver implementation, perhaps we
>>>>> could just state that data is available from (region + data_offset) to
>>>>> (region + data_offset + data_size) upon return of this read operation.
>>>>>       
>>>>>> + * c. read data_size, amount of data in bytes written by vendor driver in
>>>>>> + *    migration region.
>>>>>> + * d. read data_size bytes of data from data_offset in the migration region.
>>>>>> + * e. process data.
>>>>>> + * f. Loop through a to e. Next read on pending_bytes indicates that read data
>>>>>> + *    operation from migration region for previous iteration is done.
>>>>>
>>>>> I think this indicates that step (f) should be to read pending_bytes, the
>>>>> read sequence is not complete until this step.  Optionally the user can
>>>>> then proceed to step (b).  There are no read side-effects of (a) afaict.
>>>>>
>>>>> Is the user required to reach pending_bytes == 0 before changing
>>>>> device_state, particularly transitioning to !_RUNNING?  Presumably the
>>>>> user can exit this sequence at any time by clearing _SAVING.
>>>>
>>>> That would be transition 6) above (abort saving and continue). I think
>>>> it makes sense not to forbid this.
>>>>   
>>>>>       
>>>>>> + *
>>>>>> + * Sequence to be followed while _RESUMING device state:
>>>>>> + * While data for this device is available, repeat below steps:
>>>>>> + * a. read data_offset from where user application should write data.
>>>>>> + * b. write data of data_size to migration region from data_offset.
>>>>>> + * c. write data_size which indicates vendor driver that data is written in
>>>>>> + *    staging buffer. Vendor driver should read this data from migration
>>>>>> + *    region and resume device's state.
>>>>>
>>>>> The device defaults to _RUNNING state, so a prerequisite is to set
>>>>> _RESUMING and clear _RUNNING, right?
>>>>   
>>
>> Sorry, I replied yes in my previous reply, but no. Default device state
>> is _STOPPED. During resume _STOPPED -> _RESUMING
> 
> Nope, it can't be, it must be _RUNNING.
> 
>>>> Transition 4) above. Do we need
>>
>> I think, its not required.
> 
> But above we say it's the only way to trigger resuming (4 was 0/0/1 ->
> 1/0/0).
> 
>>>> 7) 0/0/0 ---(tell driver to resume with provided info)---> 1/0/0
>>>> as well? (Probably depends on how sensible the 0/0/0 state is.)
>>>
>>> I think we must unless we require the user to transition from 0/0/1 to
>>> 1/0/0 in a single operation, but I'd prefer to make 0/0/0 generally
>>> available.  Thanks,
>>>    
>>
>> its 0/0/0 -> 1/0/0 while resuming.
> 
> I think we're starting with different initial states, IMO there is
> absolutely no way around 0/0/1 being the initial device state.
> Anything otherwise means that we cannot add migration support to an
> existing device and maintain compatibility with existing userspace.
> Thanks,
> 
Hope the above explanation helps to resolve this concern.

Thanks,
Kirti
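
The save sequence quoted above (steps a to f) can be sketched against a
mock region to show the ordering, in particular that the next read of
pending_bytes is what retires the previous chunk. This is not driver code:
struct mock_region and the read_*() helpers are hypothetical stand-ins for
reads of the pending_bytes, data_offset and data_size fields, and the fixed
64-byte data area is an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a vendor driver behind the migration region. */
struct mock_region {
	const uint8_t *dev_state;	/* device state not yet consumed */
	uint64_t pending;		/* bytes not yet consumed by the user */
	uint64_t chunk;			/* most bytes exposed per iteration */
	uint64_t data_size;		/* bytes exposed by the last data_offset read */
	uint8_t data[64];		/* data area of the region */
};

/* Steps a and f: reading pending_bytes retires the previous iteration. */
static uint64_t read_pending_bytes(struct mock_region *r)
{
	r->dev_state += r->data_size;
	r->pending -= r->data_size;
	r->data_size = 0;
	return r->pending;
}

/* Step b: the data is in place at the returned offset when this returns. */
static uint64_t read_data_offset(struct mock_region *r)
{
	uint64_t n = r->pending < r->chunk ? r->pending : r->chunk;

	assert(n <= sizeof(r->data));
	memcpy(r->data, r->dev_state, n);
	r->data_size = n;
	return 0;	/* offset within the data area, always 0 in this mock */
}

/* Step c. */
static uint64_t read_data_size(const struct mock_region *r)
{
	return r->data_size;
}

/* User side, steps a to f; the caller sizes out for the whole state. */
static size_t save_device_state(struct mock_region *r, uint8_t *out,
				size_t out_len)
{
	size_t total = 0;

	while (read_pending_bytes(r) > 0) {
		uint64_t off = read_data_offset(r);
		uint64_t len = read_data_size(r);

		assert(len <= out_len - total);
		memcpy(out + total, r->data + off, len);	/* steps d, e */
		total += len;
	}
	return total;
}
```

In this mock the loop ends only when pending_bytes reads back as 0. Whether
the user may leave the loop earlier by clearing _SAVING is the open question
raised in the review comments above.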


* Re: [PATCH v9 Kernel 3/5] vfio iommu: Add ioctl defination to unmap IOVA and return dirty bitmap
  2019-11-13 19:52     ` Kirti Wankhede
@ 2019-11-13 20:22       ` Alex Williamson
  2019-11-14 18:56         ` Kirti Wankhede
  0 siblings, 1 reply; 46+ messages in thread
From: Alex Williamson @ 2019-11-13 20:22 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Thu, 14 Nov 2019 01:22:39 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 11/13/2019 4:00 AM, Alex Williamson wrote:
> > On Tue, 12 Nov 2019 22:33:38 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> With vIOMMU, during pre-copy phase of migration, while CPUs are still
> >> running, IO virtual address unmap can happen while device still keeping
> >> reference of guest pfns. Those pages should be reported as dirty before
> >> unmap, so that VFIO user space application can copy content of those pages
> >> from source to destination.
> >>
> >> IOCTL defination added here add bitmap pointer, size and flag. If flag  
> > 
> > definition, adds
> >   
> >> VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP is set and bitmap memory is allocated
> >> and bitmap_size of set, then ioctl will create bitmap of pinned pages and  
> > 
> > s/of/is/
> >   
> >> then unmap those.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>   include/uapi/linux/vfio.h | 33 +++++++++++++++++++++++++++++++++
> >>   1 file changed, 33 insertions(+)
> >>
> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index 6fd3822aa610..72fd297baf52 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h
> >> @@ -925,6 +925,39 @@ struct vfio_iommu_type1_dirty_bitmap {
> >>   
> >>   #define VFIO_IOMMU_GET_DIRTY_BITMAP             _IO(VFIO_TYPE, VFIO_BASE + 17)
> >>   
> >> +/**
> >> + * VFIO_IOMMU_UNMAP_DMA_GET_BITMAP - _IOWR(VFIO_TYPE, VFIO_BASE + 18,
> >> + *				      struct vfio_iommu_type1_dma_unmap_bitmap)
> >> + *
> >> + * Unmap IO virtual addresses using the provided struct
> >> + * vfio_iommu_type1_dma_unmap_bitmap.  Caller sets argsz.
> >> + * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get dirty bitmap
> >> + * before unmapping IO virtual addresses. If this flag is not set, only IO
> >> + * virtual address are unmapped without creating pinned pages bitmap, that
> >> + * is, behave same as VFIO_IOMMU_UNMAP_DMA ioctl.
> >> + * User should allocate memory to get bitmap and should set size of allocated
> >> + * memory in bitmap_size field. One bit in bitmap is used to represent per page
> >> + * consecutively starting from iova offset. Bit set indicates page at that
> >> + * offset from iova is dirty.
> >> + * The actual unmapped size is returned in the size field and bitmap of pages
> >> + * in the range of unmapped size is returned in bitmap if flag
> >> + * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP is set.
> >> + *
> >> + * No guarantee is made to the user that arbitrary unmaps of iova or size
> >> + * different from those used in the original mapping call will succeed.
> >> + */
> >> +struct vfio_iommu_type1_dma_unmap_bitmap {
> >> +	__u32        argsz;
> >> +	__u32        flags;
> >> +#define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
> >> +	__u64        iova;                        /* IO virtual address */
> >> +	__u64        size;                        /* Size of mapping (bytes) */
> >> +	__u64        bitmap_size;                 /* in bytes */
> >> +	void __user *bitmap;                      /* one bit per page */
> >> +};
> >> +
> >> +#define VFIO_IOMMU_UNMAP_DMA_GET_BITMAP _IO(VFIO_TYPE, VFIO_BASE + 18)
> >> +  
> > 
> > Why not extend VFIO_IOMMU_UNMAP_DMA to support this rather than add an
> > ioctl that duplicates the functionality and extends it??   
> 
> We do want old userspace applications to work with new kernel and 
> vice-versa, right?
> 
> If I try to change existing VFIO_IOMMU_UNMAP_DMA ioctl structure, say if 
> add 'bitmap_size' and 'bitmap' after 'size', with below code in old 
> kernel, old kernel & new userspace will work.
> 
>          minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
> 
>          if (copy_from_user(&unmap, (void __user *)arg, minsz))
>                  return -EFAULT;
> 
>          if (unmap.argsz < minsz || unmap.flags)
>                  return -EINVAL;
> 
> 
> With new kernel it would change to:
>          minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, bitmap);

No, the minimum structure size still ends at size, we interpret flags
and argsz to learn if the user understands those fields and optionally
include them.  Therefore old userspace on new kernel continues to work.

>          if (copy_from_user(&unmap, (void __user *)arg, minsz))
>                  return -EFAULT;
> 
>          if (unmap.argsz < minsz || unmap.flags)
>                  return -EINVAL;
> 
> Then old userspace app will fail because unmap.argsz < minsz, and 
> copy_from_user might cause a seg fault because the userspace SDK doesn't 
> contain the new member variables.
> We can't change the sequence to keep 'size' as last member, because then 
> new userspace app on old kernel will interpret it wrong.

If we have new userspace on old kernel, that userspace needs to be able
to learn that this feature exists (new flag in the
vfio_iommu_type1_info struct as suggested below) and only make use of it
when available.  This is why the old kernel checks argsz against minsz.
So long as the user passes something at least minsz in size, we have
compatibility.  The old kernel doesn't understand the GET_DIRTY_BITMAP
flag and will return an error if the user attempts to use it.  Thanks,

Alex
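
For illustration, here is a minimal userspace sketch of the argsz/flags
handling described above. It is not the actual vfio code: the struct
name, the flag name and the u64 carrying the bitmap pointer are
assumptions made for the example, and memcpy() stands in for
copy_from_user(). The point is only that minsz still ends at 'size' and
the extra fields are read only when the flag asks for them.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical extended layout: bitmap fields appended after 'size'. */
struct unmap_args {
	uint32_t argsz;
	uint32_t flags;
#define UNMAP_FLAG_GET_DIRTY_BITMAP (1u << 0)
	uint64_t iova;
	uint64_t size;
	uint64_t bitmap_size;	/* only read when the flag is set */
	uint64_t bitmap;	/* user pointer, carried as a u64 here */
};

#define offsetofend(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/*
 * memcpy() stands in for copy_from_user(); the caller must provide at
 * least minsz readable bytes, and argsz bytes when argsz is larger.
 */
static int parse_unmap(const void *user, struct unmap_args *out)
{
	const size_t minsz = offsetofend(struct unmap_args, size);
	const size_t extsz = offsetofend(struct unmap_args, bitmap);

	memset(out, 0, sizeof(*out));
	memcpy(out, user, minsz);		/* legacy part only */

	if (out->argsz < minsz)
		return -EINVAL;
	if (out->flags & ~UNMAP_FLAG_GET_DIRTY_BITMAP)
		return -EINVAL;			/* unknown flag */

	if (out->flags & UNMAP_FLAG_GET_DIRTY_BITMAP) {
		if (out->argsz < extsz)
			return -EINVAL;		/* flag set, fields missing */
		memcpy(out, user, extsz);	/* now read the new fields */
	}
	return 0;
}
```

An old caller that passes argsz == minsz with flags == 0 is accepted
unchanged, and a caller that sets the flag without supplying the larger
structure gets -EINVAL.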
 
> > Otherwise
> > same comments as previous, in fact it's too bad we can't use this ioctl
> > for both, but a DONT_UNMAP flag on the UNMAP_DMA ioctl seems a bit
> > absurd.
> > 
> > I suspect we also want a flags bit in VFIO_IOMMU_GET_INFO to indicate
> > these capabilities are supported.
> >   
> 
> Ok. I'll add that.
> 
> > Maybe for both ioctls we also want to define it as the user's
> > responsibility to zero the bitmap, requiring the kernel to only set
> > bits as necessary.   
> 
> Ok. Updating comment.
> 
> Thanks,
> Kirti
> 
> > Thanks,
> > 
> > Alex
> >   
> >>   /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
> >>   
> >>   /*  
> >   
> 



* Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
  2019-11-13 20:17             ` Kirti Wankhede
@ 2019-11-13 20:40               ` Alex Williamson
  2019-11-14 18:49                 ` Kirti Wankhede
  0 siblings, 1 reply; 46+ messages in thread
From: Alex Williamson @ 2019-11-13 20:40 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Cornelia Huck, cjia, kevin.tian, ziye.yang, changpeng.liu,
	yi.l.liu, mlevitsk, eskultet, dgilbert, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	zhi.a.wang, yan.y.zhao, qemu-devel, kvm

On Thu, 14 Nov 2019 01:47:04 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 11/14/2019 1:18 AM, Alex Williamson wrote:
> > On Thu, 14 Nov 2019 00:59:52 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 11/13/2019 11:57 PM, Alex Williamson wrote:  
> >>> On Wed, 13 Nov 2019 11:24:17 +0100
> >>> Cornelia Huck <cohuck@redhat.com> wrote:
> >>>      
> >>>> On Tue, 12 Nov 2019 15:30:05 -0700
> >>>> Alex Williamson <alex.williamson@redhat.com> wrote:
> >>>>     
> >>>>> On Tue, 12 Nov 2019 22:33:36 +0530
> >>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>         
> >>>>>> - Defined MIGRATION region type and sub-type.
> >>>>>> - Used 3 bits to define VFIO device states.
> >>>>>>       Bit 0 => _RUNNING
> >>>>>>       Bit 1 => _SAVING
> >>>>>>       Bit 2 => _RESUMING
> >>>>>>       Combination of these bits defines VFIO device's state during migration
> >>>>>>       _RUNNING => Normal VFIO device running state. When its reset, it
> >>>>>> 		indicates _STOPPED state. when device is changed to
> >>>>>> 		_STOPPED, driver should stop device before write()
> >>>>>> 		returns.
> >>>>>>       _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
> >>>>>>                             start saving state of device i.e. pre-copy state
> >>>>>>       _SAVING  => vCPUs are stopped, VFIO device should be stopped, and  
> >>>>>
> >>>>> s/should/must/
> >>>>>         
> >>>>>>                   save device state,i.e. stop-n-copy state
> >>>>>>       _RESUMING => VFIO device resuming state.
> >>>>>>       _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states  
> >>>>>
> >>>>> A table might be useful here and in the uapi header to indicate valid
> >>>>> states:  
> >>>>
> >>>> I like that.
> >>>>     
> >>>>>
> >>>>> | _RESUMING | _SAVING | _RUNNING | Description
> >>>>> +-----------+---------+----------+------------------------------------------
> >>>>> |     0     |    0    |     0    | Stopped, not saving or resuming (a)
> >>>>> +-----------+---------+----------+------------------------------------------
> >>>>> |     0     |    0    |     1    | Running, default state
> >>>>> +-----------+---------+----------+------------------------------------------
> >>>>> |     0     |    1    |     0    | Stopped, migration interface in save mode
> >>>>> +-----------+---------+----------+------------------------------------------
> >>>>> |     0     |    1    |     1    | Running, save mode interface, iterative
> >>>>> +-----------+---------+----------+------------------------------------------
> >>>>> |     1     |    0    |     0    | Stopped, migration resume interface active
> >>>>> +-----------+---------+----------+------------------------------------------
> >>>>> |     1     |    0    |     1    | Invalid (b)
> >>>>> +-----------+---------+----------+------------------------------------------
> >>>>> |     1     |    1    |     0    | Invalid (c)
> >>>>> +-----------+---------+----------+------------------------------------------
> >>>>> |     1     |    1    |     1    | Invalid (d)
> >>>>>
> >>>>> I think we need to consider whether we define (a) as generally
> >>>>> available, for instance we might want to use it for diagnostics or a
> >>>>> fatal error condition outside of migration.
> >>>>>
> >>>>> Are there hidden assumptions between state transitions here or are
> >>>>> there specific next possible state diagrams that we need to include as
> >>>>> well?  
> >>>>
> >>>> Some kind of state-change diagram might be useful in addition to the
> >>>> textual description anyway. Let me try, just to make sure I understand
> >>>> this correctly:
> >>>>     
> >>
> >> During User application initialization, there is one more state change:
> >>
> >> 0) 0/0/0 ---- stop to running -----> 0/0/1  
> > 
> > 0/0/0 cannot be the initial state of the device, that would imply that
> > a device supporting this migration interface breaks backwards
> > compatibility with all existing vfio userspace code and that code needs
> > to learn to set the device running as part of its initialization.
> > That's absolutely unacceptable.  The initial device state must be 0/0/1.
> >   
> 
> There isn't any device state for all existing vfio userspace code right 
> now. So by default it's assumed to be always running.

Exactly, there is no representation of device state, therefore it's
assumed to be running, therefore when adding a representation of device
state it must default to running.

> With migration support, device states are explicitly getting added. For 
> example, in case of QEMU, while device is getting initialized, i.e. from 
> vfio_realize(), device_state is set to 0/0/0, but not required to convey 
> it to vendor driver.

But we have a 0/0/0 state, why would we intentionally keep an internal
state that's inconsistent with the device?

> Then with vfio_vmstate_change() notifier, device 
> state is changed to 0/0/1 when VM/vCPU are transitioned to running, at 
> this moment device state is conveyed to vendor driver. So vendor driver 
> doesn't see 0/0/0 state.

But the running state is the state of the device, not the VM or the
vCPU.  Sure we might want to stop the device if the VM/vCPU state is
stopped, but we must accept that the device is running when it's opened
and we shouldn't intentionally maintain inconsistent state.
 
> While resuming, for userspace, for example QEMU, device state change is 
> from 0/0/0 to 1/0/0, vendor driver see 1/0/0 after device basic 
> initialization is done.

I don't see why this matters, all device_state transitions are written
directly to the vendor driver.  The device is initially in 0/0/1 and
can be set to 1/0/0 for resuming with an optional transition through
0/0/0 and the vendor driver can see each state change.

> >>>> 1) 0/0/1 ---(trigger driver to start gathering state info)---> 0/1/1  
> >>
> >> not just gathering state info, but also copy device state to be
> >> transferred during pre-copy phase.
> >>
> >> Below 2 state are not just to tell driver to stop, those 2 differ.
> >> 2) is device state changed from running to stop, this is when VM
> >> shutdowns cleanly, no need to save device state  
> > 
> > Userspace is under no obligation to perform this state change though,
> > backwards compatibility dictates this.
> >     
> >>>> 2) 0/0/1 ---(tell driver to stop)---> 0/0/0  
> >>  
> >>>> 3) 0/1/1 ---(tell driver to stop)---> 0/1/0  
> >>
> >> above is transition from pre-copy phase to stop-and-copy phase, where
> >> device data should be made available to user to transfer to destination
> >> or to save it to file in case of save VM or suspend.
> >>
> >>  
> >>>> 4) 0/0/1 ---(tell driver to resume with provided info)---> 1/0/0  
> >>>
> >>> I think this is to switch into resuming mode, the data will follow
> >>>> 5) 1/0/0 ---(driver is ready)---> 0/0/1
> >>>> 6) 0/1/1 ---(tell driver to stop saving)---> 0/0/1  
> >>>     
> >>
> >> above can occur on migration cancelled or failed.
> >>
> >>  
> >>> I think also:
> >>>
> >>> 0/0/1 --> 0/1/0 If user chooses to go directly to stop and copy  
> >>
> >> that's right, this happens in case of save VM or suspend VM.
> >>  
> >>>
> >>> 0/0/0 and 0/0/1 should be reachable from any state, though I could see
> >>> that a vendor driver could fail transition from 1/0/0 -> 0/0/1 if the
> >>> received state is incomplete.  Somehow though a user always needs to
> >>> return the device to the initial state, so how does device_state
> >>> interact with the reset ioctl?  Would this automatically manipulate
> >>> device_state back to 0/0/1?  
> >>
> >> why would reset occur on 1/0/0 -> 0/0/1 failure?  
> > 
> > The question is whether the reset ioctl automatically puts the device
> > back into the initial state, 0/0/1.  A reset from 1/0/0 -> 0/0/1
> > presumably discards much of the device state we just restored, so
> > clearly that would be undesirable.
> >     
> >> 1/0/0 -> 0/0/1 fails, then user should convey that to source that
> >> migration has failed, then resume at source.  
> > 
> > In the scheme of the migration yes, but as far as the vfio interface is
> > concerned the user should have a path to make use of a device after
> > this point without closing it and starting over.  Thus, if a 1/0/0 ->
> > 0/0/1 transition fails, would we define the device reset ioctl as a
> > mechanism to flush the bogus state and place the device into the 0/0/1
> > initial state?
> >  
> 
> Ok, userspace applications can be designed to do that. As of now with 
> QEMU, I don't see a way to reset device on 1/0/0-> 0/0/1 failure.

It's simply an ioctl, we must already have access to the device file
descriptor to perform the device_state transition.  QEMU is not
necessarily the consumer of this behavior though, if transition 1/0/0
-> 0/0/1 fails in QEMU, it very well may just exit.  The vfio API
should support a defined mechanism to recover the device from this
state though, which I propose is the existing reset ioctl, which
logically implies that any device reset returns the device_state to
0/0/1.
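
To make the proposal concrete, a small model of the behavior being
argued for is sketched below. It is only a sketch: 'struct dev_model'
and the dev_*() helpers are hypothetical names, the validity rule is
taken from the state table quoted earlier, and the reset behavior is
the proposal in this mail, not something the patch implements today.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define VFIO_DEVICE_STATE_RUNNING   (1u << 0)
#define VFIO_DEVICE_STATE_SAVING    (1u << 1)
#define VFIO_DEVICE_STATE_RESUMING  (1u << 2)
#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
				     VFIO_DEVICE_STATE_SAVING |  \
				     VFIO_DEVICE_STATE_RESUMING)

/* Per the table: _RESUMING is only valid with the other two bits clear. */
static bool state_valid(uint32_t state)
{
	uint32_t s = state & VFIO_DEVICE_STATE_MASK;

	if (s & VFIO_DEVICE_STATE_RESUMING)
		return s == VFIO_DEVICE_STATE_RESUMING;
	return true;
}

/* Hypothetical model of a device exposing device_state. */
struct dev_model {
	uint32_t device_state;
};

/* A freshly opened device is running (0/0/1), as existing userspace assumes. */
static void dev_open(struct dev_model *d)
{
	d->device_state = VFIO_DEVICE_STATE_RUNNING;
}

/* A write to device_state: invalid combinations fail and change nothing. */
static int dev_set_state(struct dev_model *d, uint32_t new_state)
{
	if (!state_valid(new_state))
		return -EINVAL;
	d->device_state = new_state;
	return 0;
}

/* Proposed: any device reset returns device_state to 0/0/1. */
static void dev_reset(struct dev_model *d)
{
	d->device_state = VFIO_DEVICE_STATE_RUNNING;
}
```

With this model a user that never touches device_state sees a running
device, and a user stuck after a failed 1/0/0 -> 0/0/1 transition can
recover with a reset instead of closing the device.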

> >>>> Not sure about the usefulness of 2).  
> >>
> >> I explained this above.
> >>  
> >>>> Also, is 4) the only way to
> >>>> trigger resuming?  
> >> Yes.
> >>  
> >>>> And is the change in 5) performed by the driver, or
> >>>> by userspace?
> >>>>     
> >> By userspace.
> >>  
> >>>> Are any other state transitions valid?
> >>>>
> >>>> (...)
> >>>>     
> >>>>>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> >>>>>> + * and for _SAVING device state or stop-and-copy phase:
> >>>>>> + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
> >>>>>> + * b. read data_offset, indicates kernel driver to write data to staging buffer.
> >>>>>> + *    Kernel driver should return this read operation only after writing data to
> >>>>>> + *    staging buffer is done.  
> >>>>>
> >>>>> "staging buffer" implies a vendor driver implementation, perhaps we
> >>>>> could just state that data is available from (region + data_offset) to
> >>>>> (region + data_offset + data_size) upon return of this read operation.
> >>>>>         
> >>>>>> + * c. read data_size, amount of data in bytes written by vendor driver in
> >>>>>> + *    migration region.
> >>>>>> + * d. read data_size bytes of data from data_offset in the migration region.
> >>>>>> + * e. process data.
> >>>>>> + * f. Loop through a to e. Next read on pending_bytes indicates that read data
> >>>>>> + *    operation from migration region for previous iteration is done.  
> >>>>>
> >>>>> I think this indicates that step (f) should be to read pending_bytes, the
> >>>>> read sequence is not complete until this step.  Optionally the user can
> >>>>> then proceed to step (b).  There are no read side-effects of (a) afaict.
> >>>>>
> >>>>> Is the user required to reach pending_bytes == 0 before changing
> >>>>> device_state, particularly transitioning to !_RUNNING?  Presumably the
> >>>>> user can exit this sequence at any time by clearing _SAVING.  
> >>>>
> >>>> That would be transition 6) above (abort saving and continue). I think
> >>>> it makes sense not to forbid this.
> >>>>     
> >>>>>         
> >>>>>> + *
> >>>>>> + * Sequence to be followed while _RESUMING device state:
> >>>>>> + * While data for this device is available, repeat below steps:
> >>>>>> + * a. read data_offset from where user application should write data.
> >>>>>> + * b. write data of data_size to migration region from data_offset.
> >>>>>> + * c. write data_size which indicates vendor driver that data is written in
> >>>>>> + *    staging buffer. Vendor driver should read this data from migration
> >>>>>> + *    region and resume device's state.  
> >>>>>
> >>>>> The device defaults to _RUNNING state, so a prerequisite is to set
> >>>>> _RESUMING and clear _RUNNING, right?  
> >>>>     
> >>
> >> Sorry, I replied yes in my previous reply, but no. Default device state
> >> is _STOPPED. During resume _STOPPED -> _RESUMING  
> > 
> > Nope, it can't be, it must be _RUNNING.
> >   
> >>>> Transition 4) above. Do we need  
> >>
> >> I think, its not required.  
> > 
> > But above we say it's the only way to trigger resuming (4 was 0/0/1 ->
> > 1/0/0).
> >   
> >>>> 7) 0/0/0 ---(tell driver to resume with provided info)---> 1/0/0
> >>>> as well? (Probably depends on how sensible the 0/0/0 state is.)  
> >>>
> >>> I think we must unless we require the user to transition from 0/0/1 to
> >>> 1/0/0 in a single operation, but I'd prefer to make 0/0/0 generally
> >>> available.  Thanks,
> >>>      
> >>
> >> its 0/0/0 -> 1/0/0 while resuming.  
> > 
> > I think we're starting with different initial states, IMO there is
> > absolutely no way around 0/0/1 being the initial device state.
> > Anything otherwise means that we cannot add migration support to an
> > existing device and maintain compatibility with existing userspace.
> > Thanks,
> >   
> Hope above explanation helps to resolve this concern.

Not really, I stand by that the default state must reflect previous
assumptions and therefore it must be 0/0/1 and additionally we should
not maintain state in QEMU intentionally inconsistent with the device
state.  Thanks,

Alex



* Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
  2019-11-13 19:02       ` Kirti Wankhede
@ 2019-11-14  0:36         ` Yan Zhao
  2019-11-14 18:55           ` Kirti Wankhede
  0 siblings, 1 reply; 46+ messages in thread
From: Yan Zhao @ 2019-11-14  0:36 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Alex Williamson, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng,
	Liu, Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

On Thu, Nov 14, 2019 at 03:02:55AM +0800, Kirti Wankhede wrote:
> 
> 
> On 11/13/2019 8:53 AM, Yan Zhao wrote:
> > On Wed, Nov 13, 2019 at 06:30:05AM +0800, Alex Williamson wrote:
> >> On Tue, 12 Nov 2019 22:33:36 +0530
> >> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>
> >>> - Defined MIGRATION region type and sub-type.
> >>> - Used 3 bits to define VFIO device states.
> >>>      Bit 0 => _RUNNING
> >>>      Bit 1 => _SAVING
> >>>      Bit 2 => _RESUMING
> >>>      Combination of these bits defines VFIO device's state during migration
> >>>      _RUNNING => Normal VFIO device running state. When its reset, it
> >>> 		indicates _STOPPED state. when device is changed to
> >>> 		_STOPPED, driver should stop device before write()
> >>> 		returns.
> >>>      _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
> >>>                            start saving state of device i.e. pre-copy state
> >>>      _SAVING  => vCPUs are stopped, VFIO device should be stopped, and
> >>
> >> s/should/must/
> >>
> >>>                  save device state,i.e. stop-n-copy state
> >>>      _RESUMING => VFIO device resuming state.
> >>>      _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states
> >>
> >> A table might be useful here and in the uapi header to indicate valid
> >> states:
> >>
> >> | _RESUMING | _SAVING | _RUNNING | Description
> >> +-----------+---------+----------+------------------------------------------
> >> |     0     |    0    |     0    | Stopped, not saving or resuming (a)
> >> +-----------+---------+----------+------------------------------------------
> >> |     0     |    0    |     1    | Running, default state
> >> +-----------+---------+----------+------------------------------------------
> >> |     0     |    1    |     0    | Stopped, migration interface in save mode
> >> +-----------+---------+----------+------------------------------------------
> >> |     0     |    1    |     1    | Running, save mode interface, iterative
> >> +-----------+---------+----------+------------------------------------------
> >> |     1     |    0    |     0    | Stopped, migration resume interface active
> >> +-----------+---------+----------+------------------------------------------
> >> |     1     |    0    |     1    | Invalid (b)
> >> +-----------+---------+----------+------------------------------------------
> >> |     1     |    1    |     0    | Invalid (c)
> >> +-----------+---------+----------+------------------------------------------
> >> |     1     |    1    |     1    | Invalid (d)
> >>
> >> I think we need to consider whether we define (a) as generally
> >> available, for instance we might want to use it for diagnostics or a
> >> fatal error condition outside of migration.
> >>
> 
> We have to set it as init state. I'll add it.
> 
> >> Are there hidden assumptions between state transitions here or are
> >> there specific next possible state diagrams that we need to include as
> >> well?
> >>
> >> I'm curious if Intel agrees with the states marked invalid with their
> >> push for post-copy support.
> >>
> > hi Alex and Kirti,
> > Actually, for postcopy, I think we anyway need an extra POSTCOPY state
> > introduced. Reasons as below:
> > - in the target side, _RESUMING state is set in the beginning of
> >    migration. It cannot be used as a state to inform the device that
> >    it is currently in postcopy state and device DMAs are to be trapped and
> >    pre-faulted.
> >    we also cannot use state (_RESUMING + _RUNNING) as an indicator of
> >    postcopy, because before device & vm running in target side, some device
> >    state are already loaded (e.g. page tables, pending workloads),
> >    target side can do pre-pagefault at that period before target vm up.
> > - in the source side, after device is stopped, postcopy needs saving
> >    device state only (as compared to device state + remaining dirty
> >    pages in precopy). state (!_RUNNING + _SAVING) here again cannot
> >    differentiate precopy and postcopy here.
> > 
> >>>      Bits 3 - 31 are reserved for future use. User should perform
> >>>      read-modify-write operation on this field.
> >>> - Defined vfio_device_migration_info structure which will be placed at 0th
> >>>    offset of migration region to get/set VFIO device related information.
> >>>    Defined members of structure and usage on read/write access:
> >>>      * device_state: (read/write)
> >>>          To convey VFIO device state to be transitioned to. Only 3 bits are
> >>> 	used as of now, Bits 3 - 31 are reserved for future use.
> >>>      * pending bytes: (read only)
> >>>          To get pending bytes yet to be migrated for VFIO device.
> >>>      * data_offset: (read only)
> >>>          To get data offset in migration region from where data exist
> >>> 	during _SAVING and from where data should be written by user space
> >>> 	application during _RESUMING state.
> >>>      * data_size: (read/write)
> >>>          To get and set size in bytes of data copied in migration region
> >>> 	during _SAVING and _RESUMING state.
> >>>
> >>> Migration region looks like:
> >>>   ------------------------------------------------------------------
> >>> |vfio_device_migration_info|    data section                      |
> >>> |                          |     ///////////////////////////////  |
> >>>   ------------------------------------------------------------------
> >>>   ^                              ^
> >>>   offset 0-trapped part        data_offset
> >>>
> >>> Structure vfio_device_migration_info is always followed by data section
> >>> in the region, so data_offset will always be non-0. Offset from where data
> >>> to be copied is decided by kernel driver, data section can be trapped or
> >>> mapped depending on how kernel driver defines data section.
> >>> Data section partition can be defined as mapped by sparse mmap capability.
> >>> If mmapped, then data_offset should be page aligned, where as initial
> >>> section which contain vfio_device_migration_info structure might not end
> >>> at offset which is page aligned.
> >>> Vendor driver should decide whether to partition data section and how to
> >>> partition the data section. Vendor driver should return data_offset
> >>> accordingly.
> >>>
> >>> For user application, data is opaque. User should write data in the same
> >>> order as received.
> >>>
> >>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >>> ---
> >>>   include/uapi/linux/vfio.h | 108 ++++++++++++++++++++++++++++++++++++++++++++++
> >>>   1 file changed, 108 insertions(+)
> >>>
> >>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >>> index 9e843a147ead..35b09427ad9f 100644
> >>> --- a/include/uapi/linux/vfio.h
> >>> +++ b/include/uapi/linux/vfio.h
> >>> @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
> >>>   #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
> >>>   #define VFIO_REGION_TYPE_GFX                    (1)
> >>>   #define VFIO_REGION_TYPE_CCW			(2)
> >>> +#define VFIO_REGION_TYPE_MIGRATION              (3)
> >>>   
> >>>   /* sub-types for VFIO_REGION_TYPE_PCI_* */
> >>>   
> >>> @@ -379,6 +380,113 @@ struct vfio_region_gfx_edid {
> >>>   /* sub-types for VFIO_REGION_TYPE_CCW */
> >>>   #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
> >>>   
> >>> +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
> >>> +#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
> >>> +
> >>> +/*
> >>> + * Structure vfio_device_migration_info is placed at 0th offset of
> >>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> >>> + * information. Field accesses from this structure are only supported at their
> >>> + * native width and alignment, otherwise the result is undefined and vendor
> >>> + * drivers should return an error.
> >>> + *
> >>> + * device_state: (read/write)
> >>> + *      To indicate vendor driver the state VFIO device should be transitioned
> >>> + *      to. If device state transition fails, write on this field return error.
> >>> + *      It consists of 3 bits:
> >>> + *      - If bit 0 set, indicates _RUNNING state. When its reset, that indicates
> >>
> >> Let's use set/cleared or 1/0 to indicate bit values, 'reset' is somewhat
> >> ambiguous.
> 
> Ok. Updating it.
> 
> >>
> >>> + *        _STOPPED state. When device is changed to _STOPPED, driver should stop
> >>> + *        device before write() returns.
> >>> + *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
> >>> + *        should start gathering device state information which will be provided
> >>> + *        to VFIO user space application to save device's state.
> >>> + *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
> >>> + *        prepare to resume device, data provided through migration region
> >>> + *        should be used to resume device.
> >>> + *      Bits 3 - 31 are reserved for future use. User should perform
> >>> + *      read-modify-write operation on this field.
> >>> + *      _SAVING and _RESUMING bits set at the same time is invalid state.
> >>> + *	Similarly _RUNNING and _RESUMING bits set is invalid state.
> >>> + *
> >>> + * pending bytes: (read only)
> >>> + *      Number of pending bytes yet to be migrated from vendor driver
> >>> + *
> >>> + * data_offset: (read only)
> >>> + *      User application should read data_offset in migration region from where
> >>> + *      user application should read device data during _SAVING state or write
> >>> + *      device data during _RESUMING state. See below for detail of sequence to
> >>> + *      be followed.
> >>> + *
> >>> + * data_size: (read/write)
> >>> + *      User application should read data_size to get size of data copied in
> >>> + *      bytes in migration region during _SAVING state and write size of data
> >>> + *      copied in bytes in migration region during _RESUMING state.
> >>> + *
> >>> + * Migration region looks like:
> >>> + *  ------------------------------------------------------------------
> >>> + * |vfio_device_migration_info|    data section                      |
> >>> + * |                          |     ///////////////////////////////  |
> >>> + * ------------------------------------------------------------------
> >>> + *   ^                              ^
> >>> + *  offset 0-trapped part        data_offset
> >>> + *
> >>> + * Structure vfio_device_migration_info is always followed by data section in
> >>> + * the region, so data_offset will always be non-0. Offset from where data is
> >>> + * copied is decided by kernel driver, data section can be trapped or mapped
> >>> + * or partitioned, depending on how kernel driver defines data section.
> >>> + * Data section partition can be defined as mapped by sparse mmap capability.
> >>> + * If mmapped, then data_offset should be page aligned, where as initial section
> >>> + * which contain vfio_device_migration_info structure might not end at offset
> >>> + * which is page aligned.
> >>
> >> "The user is not required to access via mmap regardless of the
> >> region mmap capabilities."
> >>
> > But once the user decides to access via mmap, it has to read data of
> > data_size each time, otherwise the vendor driver has no idea how much
> > data has already been read by the user. Agree?
> > 
> 
> that's right.
> 
> >>> + * Vendor driver should decide whether to partition data section and how to
> >>> + * partition the data section. Vendor driver should return data_offset
> >>> + * accordingly.
> >>> + *
> >>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> >>> + * and for _SAVING device state or stop-and-copy phase:
> >>> + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
> >>> + * b. read data_offset, indicates kernel driver to write data to staging buffer.
> >>> + *    Kernel driver should return this read operation only after writing data to
> >>> + *    staging buffer is done.
> > May I know under what condition this data_offset will be changed per
> > each iteration from a-f ?
> > 
> 
> Its upto vendor driver, if vendor driver maintains multiple partitions 
> in data section.
>
So, do you mean that step a (reading pending_bytes) is mandatory each
time before step b (reading data_offset), because otherwise the vendor
driver does not know which data_offset to return?
If so, is a lock needed around steps a and b to make them atomic?

Thanks
Yan


> >>
> >> "staging buffer" implies a vendor driver implementation, perhaps we
> >> could just state that data is available from (region + data_offset) to
> >> (region + data_offset + data_size) upon return of this read operation.
> >>
> 
> Makes sense. Updating it.
> 
> >>> + * c. read data_size, amount of data in bytes written by vendor driver in
> >>> + *    migration region.
> >>> + * d. read data_size bytes of data from data_offset in the migration region.
> >>> + * e. process data.
> >>> + * f. Loop through a to e. Next read on pending_bytes indicates that read data
> >>> + *    operation from migration region for previous iteration is done.
> >>
> >> I think this indicates that step (f) should be to read pending_bytes, the
> >> read sequence is not complete until this step.  Optionally the user can
> >> then proceed to step (b).  There are no read side-effects of (a) afaict.
> >>
> 
> I tried to reword this sequence to be more specific:
> 
> * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy 
> * phase and for _SAVING device state or stop-and-copy phase:
> * a. read pending_bytes, indicates start of new iteration to get device 
> *    data. If there was previous iteration, then this read operation
> *    indicates previous iteration is done. If pending_bytes > 0, go
> *    through below steps.
> * b. read data_offset, indicates kernel driver to make data available
> *    through data section. Kernel driver should return this read
> *    operation only after data is available from (region + data_offset)
> *    to (region + data_offset + data_size).
> * c. read data_size, amount of data in bytes available through migration
> *    region.
> * d. read data of data_size bytes from (region + data_offset) from
> *    migration region.
> * e. process data.
> * f. Loop through a to e.
> 
> Hope this is more clear.
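
To check my reading of the reworded sequence, here is a sketch of the
save loop. The struct layout is the one from the patch; everything
named toy_* is a made-up in-memory stand-in for the vendor driver and
for pread() on the migration region, so the chunking and the way the
pending_bytes read closes the previous iteration are assumptions, not
real driver behavior.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout from the patch under review. */
struct vfio_device_migration_info {
	uint32_t device_state;
	uint32_t reserved;
	uint64_t pending_bytes;
	uint64_t data_offset;
	uint64_t data_size;
} __attribute__((packed));

#define MIG_OFF(field) offsetof(struct vfio_device_migration_info, field)

/* Toy driver: hands out 'total' bytes of 'state', at most 'chunk' at a time. */
struct toy_dev {
	const uint8_t *state;
	uint64_t total, done, chunk, cur;
};

/* Toy stand-in for pread() on the migration region. */
static int toy_read(struct toy_dev *d, uint64_t off, void *buf, size_t len)
{
	const uint64_t data_off = sizeof(struct vfio_device_migration_info);
	uint64_t v, left;

	if (off == MIG_OFF(pending_bytes)) {
		d->done += d->cur;	/* previous iteration is complete */
		d->cur = 0;
		v = d->total - d->done;
	} else if (off == MIG_OFF(data_offset)) {
		left = d->total - d->done;
		d->cur = left < d->chunk ? left : d->chunk; /* data now available */
		v = data_off;
	} else if (off == MIG_OFF(data_size)) {
		v = d->cur;
	} else if (off == data_off && len <= d->cur) {
		memcpy(buf, d->state + d->done, len);
		return 0;
	} else {
		return -1;
	}
	if (len != sizeof(v))
		return -1;
	memcpy(buf, &v, sizeof(v));
	return 0;
}

/* Steps a-f; appends the data to 'out' and returns the byte count, or -1. */
static int64_t save_device_state(struct toy_dev *d, uint8_t *out, size_t cap)
{
	uint64_t pending, data_offset, data_size;
	size_t used = 0;

	for (;;) {
		if (toy_read(d, MIG_OFF(pending_bytes), &pending, sizeof(pending)))
			return -1;			/* a */
		if (pending == 0)
			return (int64_t)used;
		if (toy_read(d, MIG_OFF(data_offset), &data_offset, sizeof(data_offset)))
			return -1;			/* b */
		if (toy_read(d, MIG_OFF(data_size), &data_size, sizeof(data_size)))
			return -1;			/* c */
		if (data_size > cap - used)
			return -1;
		if (toy_read(d, data_offset, out + used, data_size))
			return -1;			/* d */
		used += data_size;			/* e: "process" = append */
	}						/* f: loop back to a */
}
```

In this model the driver only learns that an iteration is finished from
the next read of pending_bytes, which is why the ordering of a and b
matters to me.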
> 
> >> Is the user required to reach pending_bytes == 0 before changing
> >> device_state, particularly transitioning to !_RUNNING?
> 
> No, its not necessary to reach till pending_bytes==0 in pre-copy phase.
> 
> >>  Presumably the
> >> user can exit this sequence at any time by clearing _SAVING.
> 
> In that case device state data is not complete, which will result in not 
> able to resume device with that data.
> In stop-and-copy phase, user should iterate till pending_bytes is 0.
> 
> >>
> >>> + *
> >>> + * Sequence to be followed while _RESUMING device state:
> >>> + * While data for this device is available, repeat below steps:
> >>> + * a. read data_offset from where user application should write data.
> > before proceed to step b, need to read data_size from vendor driver to determine
> > the max len of data to write. I think it's necessary in such a condition
> > that source vendor driver and target vendor driver do not offer data sections of
> > the same size. e.g. in source side, the data section is of size 100M,
> > but in target side, the data section is only of 50M size.
> > rather than simply fail, loop and write seems better.
> > 
> 
> Makes sense. Doing this change for next version.
> 
> > Thanks
> > Yan
> >>> + * b. write data of data_size to migration region from data_offset.
> >>> + * c. write data_size which indicates vendor driver that data is written in
> >>> + *    staging buffer. Vendor driver should read this data from migration
> >>> + *    region and resume device's state.
> >>
> >> The device defaults to _RUNNING state, so a prerequisite is to set
> >> _RESUMING and clear _RUNNING, right?
> 
> Yes.
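
For the resume side, a sketch of steps a to c as currently written in
the patch comment follows. The toy_* names, the 8-byte data section and
the caller-chosen max_chunk are all hypothetical; the code assumes
device_state has already been set to _RESUMING with _RUNNING cleared,
and it does not model the proposed read of data_size to learn the
maximum length.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout from the patch under review. */
struct vfio_device_migration_info {
	uint32_t device_state;
	uint32_t reserved;
	uint64_t pending_bytes;
	uint64_t data_offset;
	uint64_t data_size;
} __attribute__((packed));

#define MIG_OFF(field) offsetof(struct vfio_device_migration_info, field)
#define TOY_DATA_OFF   sizeof(struct vfio_device_migration_info)
#define TOY_DATA_MAX   8	/* capacity of the toy data section */

/* Toy stand-in for the destination vendor driver. */
struct toy_dst {
	uint8_t staging[TOY_DATA_MAX];
	uint8_t restored[64];
	size_t  restored_len;
};

/* Toy stand-in for pread() of data_offset. */
static int toy_read_data_offset(struct toy_dst *d, uint64_t *val)
{
	(void)d;
	*val = TOY_DATA_OFF;
	return 0;
}

/* Toy stand-in for pwrite() on the migration region. */
static int toy_write(struct toy_dst *d, uint64_t off, const void *buf, size_t len)
{
	uint64_t n;

	if (off == MIG_OFF(data_size)) {
		/* Writing data_size tells the driver to consume the data. */
		if (len != sizeof(n))
			return -1;
		memcpy(&n, buf, sizeof(n));
		if (n > TOY_DATA_MAX || d->restored_len + n > sizeof(d->restored))
			return -1;
		memcpy(d->restored + d->restored_len, d->staging, n);
		d->restored_len += n;
		return 0;
	}
	if (off == TOY_DATA_OFF && len <= TOY_DATA_MAX) {
		memcpy(d->staging, buf, len);
		return 0;
	}
	return -1;
}

/* Feed 'len' bytes of saved state to the device, max_chunk bytes at a time. */
static int resume_device_state(struct toy_dst *d, const uint8_t *src,
			       size_t len, size_t max_chunk)
{
	while (len > 0) {
		uint64_t data_offset;
		uint64_t data_size = len < max_chunk ? len : max_chunk;

		if (toy_read_data_offset(d, &data_offset))
			return -1;				/* a */
		if (toy_write(d, data_offset, src, data_size))
			return -1;				/* b */
		if (toy_write(d, MIG_OFF(data_size), &data_size, sizeof(data_size)))
			return -1;				/* c */
		src += data_size;
		len -= data_size;
	}
	return 0;
}
```

A chunk larger than the toy data section simply fails here, which is
the case the loop-and-write suggestion above is meant to handle.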
> 
> >>
> >>> + *
> >>> + * For user application, data is opaque. User should write data in the same
> >>> + * order as received.
> >>> + */
> >>> +
> >>> +struct vfio_device_migration_info {
> >>> +	__u32 device_state;         /* VFIO device state */
> >>> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> >>> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> >>> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> >>> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> >>> +				     VFIO_DEVICE_STATE_SAVING |  \
> >>> +				     VFIO_DEVICE_STATE_RESUMING)
> >>> +
> >>> +#define VFIO_DEVICE_STATE_INVALID_CASE1    (VFIO_DEVICE_STATE_SAVING | \
> >>> +					    VFIO_DEVICE_STATE_RESUMING)
> >>> +
> >>> +#define VFIO_DEVICE_STATE_INVALID_CASE2    (VFIO_DEVICE_STATE_RUNNING | \
> >>> +					    VFIO_DEVICE_STATE_RESUMING)
> >>
> >> These seem difficult to use, maybe we just need a
> >> VFIO_DEVICE_STATE_VALID macro?
> >>
> >> #define VFIO_DEVICE_STATE_VALID(state) \
> >>    (state & VFIO_DEVICE_STATE_RESUMING ? \
> >>    (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
> >>
> 
> This will not work once other bits come into use in the future.
> That's the reason I preferred to add individual invalid states which
> the user should check.
> 
> Thanks,
> Kirti
> 
> >> Thanks,
> >> Alex
> >>
> >>> +	__u32 reserved;
> >>> +	__u64 pending_bytes;
> >>> +	__u64 data_offset;
> >>> +	__u64 data_size;
> >>> +} __attribute__((packed));
> >>> +
> >>>   /*
> >>>    * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
> >>>    * which allows direct access to non-MSIX registers which happened to be within
> >>
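
For illustration, the _RESUMING sequence with Yan's suggested extra step (read data_size first to learn how much the target accepts, then loop) might look like the sketch below. This is only a sketch under stated assumptions: struct region_ops, resume_write() and the toy_* target are hypothetical stand-ins for pread()/pwrite() on the device fd and for a vendor driver, and the 4-byte window and data offset of 64 are arbitrary toy values. Only the struct layout comes from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout as proposed in the patch under review. */
struct vfio_device_migration_info {
	uint32_t device_state;
	uint32_t reserved;
	uint64_t pending_bytes;
	uint64_t data_offset;
	uint64_t data_size;
} __attribute__((packed));

#define MIG_OFF(f) offsetof(struct vfio_device_migration_info, f)

/* Hypothetical stand-ins for pread()/pwrite() on the migration region. */
struct region_ops {
	int (*rd)(void *ctx, uint64_t off, void *buf, size_t len);
	int (*wr)(void *ctx, uint64_t off, const void *buf, size_t len);
	void *ctx;
};

/* Feed 'len' bytes of opaque device state to the target, in order. */
static int resume_write(const struct region_ops *r, const uint8_t *data, size_t len)
{
	while (len) {
		uint64_t off, max, n;

		/* a. read data_offset: where to write in the region */
		if (r->rd(r->ctx, MIG_OFF(data_offset), &off, sizeof(off)))
			return -1;
		/* suggested extra step: read data_size as the max chunk length */
		if (r->rd(r->ctx, MIG_OFF(data_size), &max, sizeof(max)) || !max)
			return -1;
		n = len < max ? len : max;
		/* b. write the chunk at data_offset */
		if (r->wr(r->ctx, off, data, n))
			return -1;
		/* c. write data_size so the driver consumes the chunk */
		if (r->wr(r->ctx, MIG_OFF(data_size), &n, sizeof(n)))
			return -1;
		data += n;
		len -= n;
	}
	return 0;
}

/* Toy target "vendor driver" with a small data section (assumed values). */
#define TOY_DATA_OFF 64u
#define TOY_WINDOW   4u

struct toy_target {
	uint8_t window[TOY_WINDOW];
	uint8_t restored[32];
	size_t got;
};

static int toy_rd(void *ctx, uint64_t off, void *buf, size_t len)
{
	uint64_t v;

	(void)ctx;
	if (off == MIG_OFF(data_offset))
		v = TOY_DATA_OFF;
	else if (off == MIG_OFF(data_size))
		v = TOY_WINDOW;
	else
		return -1;
	if (len != sizeof(v))
		return -1;
	memcpy(buf, &v, sizeof(v));
	return 0;
}

static int toy_wr(void *ctx, uint64_t off, const void *buf, size_t len)
{
	struct toy_target *t = ctx;
	uint64_t n;

	if (off == MIG_OFF(data_size)) {
		/* Writing data_size commits the bytes sitting in the window. */
		if (len != sizeof(n))
			return -1;
		memcpy(&n, buf, sizeof(n));
		if (n > TOY_WINDOW || t->got + n > sizeof(t->restored))
			return -1;
		memcpy(t->restored + t->got, t->window, n);
		t->got += n;
		return 0;
	}
	if (off < TOY_DATA_OFF || off - TOY_DATA_OFF + len > TOY_WINDOW)
		return -1;
	memcpy(t->window + (off - TOY_DATA_OFF), buf, len);
	return 0;
}
```

With a source blob larger than the target's window, the loop simply takes more iterations instead of failing, which is the behaviour Yan asks for.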


* Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
  2019-11-13 20:40               ` Alex Williamson
@ 2019-11-14 18:49                 ` Kirti Wankhede
  0 siblings, 0 replies; 46+ messages in thread
From: Kirti Wankhede @ 2019-11-14 18:49 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, cjia, kevin.tian, ziye.yang, changpeng.liu,
	yi.l.liu, mlevitsk, eskultet, dgilbert, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	zhi.a.wang, yan.y.zhao, qemu-devel, kvm



On 11/14/2019 2:10 AM, Alex Williamson wrote:
> On Thu, 14 Nov 2019 01:47:04 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 11/14/2019 1:18 AM, Alex Williamson wrote:
>>> On Thu, 14 Nov 2019 00:59:52 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>    
>>>> On 11/13/2019 11:57 PM, Alex Williamson wrote:
>>>>> On Wed, 13 Nov 2019 11:24:17 +0100
>>>>> Cornelia Huck <cohuck@redhat.com> wrote:
>>>>>       
>>>>>> On Tue, 12 Nov 2019 15:30:05 -0700
>>>>>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>>>>>      
>>>>>>> On Tue, 12 Nov 2019 22:33:36 +0530
>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>          
>>>>>>>> - Defined MIGRATION region type and sub-type.
>>>>>>>> - Used 3 bits to define VFIO device states.
>>>>>>>>        Bit 0 => _RUNNING
>>>>>>>>        Bit 1 => _SAVING
>>>>>>>>        Bit 2 => _RESUMING
>>>>>>>>        Combination of these bits defines VFIO device's state during migration
>>>>>>>>        _RUNNING => Normal VFIO device running state. When its reset, it
>>>>>>>> 		indicates _STOPPED state. when device is changed to
>>>>>>>> 		_STOPPED, driver should stop device before write()
>>>>>>>> 		returns.
>>>>>>>>        _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
>>>>>>>>                              start saving state of device i.e. pre-copy state
>>>>>>>>        _SAVING  => vCPUs are stopped, VFIO device should be stopped, and
>>>>>>>
>>>>>>> s/should/must/
>>>>>>>          
>>>>>>>>                    save device state,i.e. stop-n-copy state
>>>>>>>>        _RESUMING => VFIO device resuming state.
>>>>>>>>        _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states
>>>>>>>
>>>>>>> A table might be useful here and in the uapi header to indicate valid
>>>>>>> states:
>>>>>>
>>>>>> I like that.
>>>>>>      
>>>>>>>
>>>>>>> | _RESUMING | _SAVING | _RUNNING | Description
>>>>>>> +-----------+---------+----------+------------------------------------------
>>>>>>> |     0     |    0    |     0    | Stopped, not saving or resuming (a)
>>>>>>> +-----------+---------+----------+------------------------------------------
>>>>>>> |     0     |    0    |     1    | Running, default state
>>>>>>> +-----------+---------+----------+------------------------------------------
>>>>>>> |     0     |    1    |     0    | Stopped, migration interface in save mode
>>>>>>> +-----------+---------+----------+------------------------------------------
>>>>>>> |     0     |    1    |     1    | Running, save mode interface, iterative
>>>>>>> +-----------+---------+----------+------------------------------------------
>>>>>>> |     1     |    0    |     0    | Stopped, migration resume interface active
>>>>>>> +-----------+---------+----------+------------------------------------------
>>>>>>> |     1     |    0    |     1    | Invalid (b)
>>>>>>> +-----------+---------+----------+------------------------------------------
>>>>>>> |     1     |    1    |     0    | Invalid (c)
>>>>>>> +-----------+---------+----------+------------------------------------------
>>>>>>> |     1     |    1    |     1    | Invalid (d)
>>>>>>>
>>>>>>> I think we need to consider whether we define (a) as generally
>>>>>>> available, for instance we might want to use it for diagnostics or a
>>>>>>> fatal error condition outside of migration.
>>>>>>>
>>>>>>> Are there hidden assumptions between state transitions here or are
>>>>>>> there specific next possible state diagrams that we need to include as
>>>>>>> well?
>>>>>>
>>>>>> Some kind of state-change diagram might be useful in addition to the
>>>>>> textual description anyway. Let me try, just to make sure I understand
>>>>>> this correctly:
>>>>>>      
>>>>
>>>> During User application initialization, there is one more state change:
>>>>
>>>> 0) 0/0/0 ---- stop to running -----> 0/0/1
>>>
>>> 0/0/0 cannot be the initial state of the device, that would imply that
>>> a device supporting this migration interface breaks backwards
>>> compatibility with all existing vfio userspace code and that code needs
>>> to learn to set the device running as part of its initialization.
>>> That's absolutely unacceptable.  The initial device state must be 0/0/1.
>>>    
>>
>> There isn't any device state for all existing vfio userspace code right
>> now. So by default it's assumed to be always running.
> 
> Exactly, there is no representation of device state, therefore it's
> assumed to be running, therefore when adding a representation of device
> state it must default to running.
> 
>> With migration support, device states are explicitly getting added. For
>> example, in case of QEMU, while device is getting initialized, i.e. from
>> vfio_realize(), device_state is set to 0/0/0, but not required to convey
>> it to vendor driver.
> 
> But we have a 0/0/0 state, why would we intentionally keep an internal
> state that's inconsistent with the device?
> 
>> Then with vfio_vmstate_change() notifier, device
>> state is changed to 0/0/1 when VM/vCPU are transitioned to running, at
>> this moment device state is conveyed to vendor driver. So vendor driver
>> doesn't see 0/0/0 state.
> 
> But the running state is the state of the device, not the VM or the
> vCPU.  Sure we might want to stop the device if the VM/vCPU state is
> stopped, but we must accept that the device is running when it's opened
> and we shouldn't intentionally maintain inconsistent state.
>   
>> While resuming, for userspace, for example QEMU, device state change is
>> from 0/0/0 to 1/0/0, vendor driver see 1/0/0 after device basic
>> initialization is done.
> 
> I don't see why this matters, all device_state transitions are written
> directly to the vendor driver.  The device is initially in 0/0/1 and
> can be set to 1/0/0 for resuming with an optional transition through
> 0/0/0 and the vendor driver can see each state change.
> 
>>>>>> 1) 0/0/1 ---(trigger driver to start gathering state info)---> 0/1/1
>>>>
>>>> not just gathering state info, but also copy device state to be
>>>> transferred during pre-copy phase.
>>>>
>>>> The 2 states below do not just tell the driver to stop; the 2 differ.
>>>> 2) is the device state changing from running to stopped; this is when
>>>> the VM shuts down cleanly, with no need to save device state
>>>
>>> Userspace is under no obligation to perform this state change though,
>>> backwards compatibility dictates this.
>>>      
>>>>>> 2) 0/0/1 ---(tell driver to stop)---> 0/0/0
>>>>   
>>>>>> 3) 0/1/1 ---(tell driver to stop)---> 0/1/0
>>>>
>>>> above is transition from pre-copy phase to stop-and-copy phase, where
>>>> device data should be made available to user to transfer to destination
>>>> or to save it to file in case of save VM or suspend.
>>>>
>>>>   
>>>>>> 4) 0/0/1 ---(tell driver to resume with provided info)---> 1/0/0
>>>>>
>>>>> I think this is to switch into resuming mode, the data will follow
>>>>>
>>>>>> 5) 1/0/0 ---(driver is ready)---> 0/0/1
>>>>>> 6) 0/1/1 ---(tell driver to stop saving)---> 0/0/1
>>>>>      
>>>>
>>>> above can occur on migration cancelled or failed.
>>>>
>>>>   
>>>>> I think also:
>>>>>
>>>>> 0/0/1 --> 0/1/0 If user chooses to go directly to stop and copy
>>>>
>>>> that's right, this happens in case of save VM or suspend VM.
>>>>   
>>>>>
>>>>> 0/0/0 and 0/0/1 should be reachable from any state, though I could see
>>>>> that a vendor driver could fail transition from 1/0/0 -> 0/0/1 if the
>>>>> received state is incomplete.  Somehow though a user always needs to
>>>>> return the device to the initial state, so how does device_state
>>>>> interact with the reset ioctl?  Would this automatically manipulate
>>>>> device_state back to 0/0/1?
>>>>
>>>> why would reset occur on 1/0/0 -> 0/0/1 failure?
>>>
>>> The question is whether the reset ioctl automatically puts the device
>>> back into the initial state, 0/0/1.  A reset from 1/0/0 -> 0/0/1
>>> presumably discards much of the device state we just restored, so
>>> clearly that would be undesirable.
>>>      
>>>> 1/0/0 -> 0/0/1 fails, then user should convey that to source that
>>>> migration has failed, then resume at source.
>>>
>>> In the scheme of the migration yes, but as far as the vfio interface is
>>> concerned the user should have a path to make use of a device after
>>> this point without closing it and starting over.  Thus, if a 1/0/0 ->
>>> 0/0/1 transition fails, would we define the device reset ioctl as a
>>> mechanism to flush the bogus state and place the device into the 0/0/1
>>> initial state?
>>>   
>>
>> Ok, userspace applications can be designed to do that. As of now with
>> QEMU, I don't see a way to reset device on 1/0/0-> 0/0/1 failure.
> 
> It's simply an ioctl, we must already have access to the device file
> descriptor to perform the device_state transition.  QEMU is not
> necessarily the consumer of this behavior though, if transition 1/0/0
> -> 0/0/1 fails in QEMU, it very well may just exit.  The vfio API
> should support a defined mechanism to recover the device from this
> state though, which I propose is the existing reset ioctl, which
> logically implies that any device reset returns the device_state to
> 0/0/1.
> 

Ok.
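
To make the transitions discussed so far concrete, here is a minimal sketch that encodes them as a predicate, plus the proposed rule that a device reset returns device_state to 0/0/1. Nothing here is settled ABI: the ST_* names, transition_discussed() and state_after_reset() are hypothetical, and the set of allowed transitions is just what has been listed in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* _RESUMING/_SAVING/_RUNNING encoded as bits 2/1/0, as in the patch. */
enum {
	ST_STOPPED   = 0, /* 0/0/0 */
	ST_RUNNING   = 1, /* 0/0/1, initial state */
	ST_STOP_COPY = 2, /* 0/1/0 */
	ST_PRE_COPY  = 3, /* 0/1/1 */
	ST_RESUMING  = 4, /* 1/0/0 */
};

/* 1/0/1, 1/1/0 and 1/1/1 are invalid per the table in this thread. */
static bool st_valid(uint32_t s)
{
	return s <= ST_RESUMING;
}

/* True if from -> to is one of the transitions listed in the discussion. */
static bool transition_discussed(uint32_t from, uint32_t to)
{
	if (!st_valid(from) || !st_valid(to))
		return false;

	/* "0/0/0 and 0/0/1 should be reachable from any state" */
	if (to == ST_STOPPED || to == ST_RUNNING)
		return true;

	switch (from) {
	case ST_RUNNING:
		/* 1) start pre-copy, direct stop-and-copy, 4) start resuming */
		return to == ST_PRE_COPY || to == ST_STOP_COPY || to == ST_RESUMING;
	case ST_PRE_COPY:
		/* 3) pre-copy to stop-and-copy */
		return to == ST_STOP_COPY;
	case ST_STOPPED:
		/* 7) stopped to resuming */
		return to == ST_RESUMING;
	default:
		return false;
	}
}

/* Proposed rule: any device reset puts device_state back to 0/0/1. */
static uint32_t state_after_reset(uint32_t cur)
{
	(void)cur;
	return ST_RUNNING;
}
```

A vendor driver may still fail a listed transition (for example 1/0/0 -> 0/0/1 with incomplete data); the predicate only says which requests are meaningful, not that they must succeed.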

>>>>>> Not sure about the usefulness of 2).
>>>>
>>>> I explained this above.
>>>>   
>>>>>> Also, is 4) the only way to
>>>>>> trigger resuming?
>>>> Yes.
>>>>   
>>>>>> And is the change in 5) performed by the driver, or
>>>>>> by userspace?
>>>>>>      
>>>> By userspace.
>>>>   
>>>>>> Are any other state transitions valid?
>>>>>>
>>>>>> (...)
>>>>>>      
>>>>>>>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
>>>>>>>> + * and for _SAVING device state or stop-and-copy phase:
>>>>>>>> + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
>>>>>>>> + * b. read data_offset, indicates kernel driver to write data to staging buffer.
>>>>>>>> + *    Kernel driver should return this read operation only after writing data to
>>>>>>>> + *    staging buffer is done.
>>>>>>>
>>>>>>> "staging buffer" implies a vendor driver implementation, perhaps we
>>>>>>> could just state that data is available from (region + data_offset) to
>>>>>>> (region + data_offset + data_size) upon return of this read operation.
>>>>>>>          
>>>>>>>> + * c. read data_size, amount of data in bytes written by vendor driver in
>>>>>>>> + *    migration region.
>>>>>>>> + * d. read data_size bytes of data from data_offset in the migration region.
>>>>>>>> + * e. process data.
>>>>>>>> + * f. Loop through a to e. Next read on pending_bytes indicates that read data
>>>>>>>> + *    operation from migration region for previous iteration is done.
>>>>>>>
>>>>>>> I think this indicates that step (f) should be to read pending_bytes, the
>>>>>>> read sequence is not complete until this step.  Optionally the user can
>>>>>>> then proceed to step (b).  There are no read side-effects of (a) afaict.
>>>>>>>
>>>>>>> Is the user required to reach pending_bytes == 0 before changing
>>>>>>> device_state, particularly transitioning to !_RUNNING?  Presumably the
>>>>>>> user can exit this sequence at any time by clearing _SAVING.
>>>>>>
>>>>>> That would be transition 6) above (abort saving and continue). I think
>>>>>> it makes sense not to forbid this.
>>>>>>      
>>>>>>>          
>>>>>>>> + *
>>>>>>>> + * Sequence to be followed while _RESUMING device state:
>>>>>>>> + * While data for this device is available, repeat below steps:
>>>>>>>> + * a. read data_offset from where user application should write data.
>>>>>>>> + * b. write data of data_size to migration region from data_offset.
>>>>>>>> + * c. write data_size which indicates vendor driver that data is written in
>>>>>>>> + *    staging buffer. Vendor driver should read this data from migration
>>>>>>>> + *    region and resume device's state.
>>>>>>>
>>>>>>> The device defaults to _RUNNING state, so a prerequisite is to set
>>>>>>> _RESUMING and clear _RUNNING, right?
>>>>>>      
>>>>
>>>> Sorry, I replied yes in my previous reply, but no. Default device state
>>>> is _STOPPED. During resume _STOPPED -> _RESUMING
>>>
>>> Nope, it can't be, it must be _RUNNING.
>>>    
>>>>>> Transition 4) above. Do we need
>>>>
>>>> I think it's not required.
>>>
>>> But above we say it's the only way to trigger resuming (4 was 0/0/1 ->
>>> 1/0/0).
>>>    
>>>>>> 7) 0/0/0 ---(tell driver to resume with provided info)---> 1/0/0
>>>>>> as well? (Probably depends on how sensible the 0/0/0 state is.)
>>>>>
>>>>> I think we must unless we require the user to transition from 0/0/1 to
>>>>> 1/0/0 in a single operation, but I'd prefer to make 0/0/0 generally
>>>>> available.  Thanks,
>>>>>       
>>>>
>>>> its 0/0/0 -> 1/0/0 while resuming.
>>>
>>> I think we're starting with different initial states, IMO there is
>>> absolutely no way around 0/0/1 being the initial device state.
>>> Anything otherwise means that we cannot add migration support to an
>>> existing device and maintain compatibility with existing userspace.
>>> Thanks,
>>>    
>> Hope above explanation helps to resolve this concern.
> 
> Not really, I stand by that the default state must reflect previous
> assumptions and therefore it must be 0/0/1 and additionally we should
> not maintain state in QEMU intentionally inconsistent with the device
> state.  Thanks,
> 

Ok. Will change that

Thanks,
Kirti
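
As a sketch of what the agreed points could look like in code (0/0/1 as the initial state, and the valid-state table from earlier in this mail), something along these lines would do. The macro values are from the patch; TOY_INITIAL_STATE and device_state_valid() are hypothetical names, and rejecting bits outside the mask is an assumption, not something decided here.

```c
#include <stdbool.h>
#include <stdint.h>

#define VFIO_DEVICE_STATE_RUNNING   (1u << 0)
#define VFIO_DEVICE_STATE_SAVING    (1u << 1)
#define VFIO_DEVICE_STATE_RESUMING  (1u << 2)
#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
				     VFIO_DEVICE_STATE_SAVING |  \
				     VFIO_DEVICE_STATE_RESUMING)

/* The device must come up running so existing userspace keeps working. */
#define TOY_INITIAL_STATE VFIO_DEVICE_STATE_RUNNING

static bool device_state_valid(uint32_t state)
{
	/* Assumption: bits outside the 3 defined ones are rejected. */
	if (state & ~VFIO_DEVICE_STATE_MASK)
		return false;

	/* _RESUMING is only valid on its own (1/0/0). */
	if (state & VFIO_DEVICE_STATE_RESUMING)
		return state == VFIO_DEVICE_STATE_RESUMING;

	/* 0/0/0, 0/0/1, 0/1/0 and 0/1/1 are all valid. */
	return true;
}
```

If more bits get defined later, only the mask and the _RESUMING rule need revisiting; callers keep using the one predicate.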


* Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
  2019-11-14  0:36         ` Yan Zhao
@ 2019-11-14 18:55           ` Kirti Wankhede
  0 siblings, 0 replies; 46+ messages in thread
From: Kirti Wankhede @ 2019-11-14 18:55 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Alex Williamson, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng,
	Liu, Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm


>>>>> + * Vendor driver should decide whether to partition data section and how to
>>>>> + * partition the data section. Vendor driver should return data_offset
>>>>> + * accordingly.
>>>>> + *
>>>>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
>>>>> + * and for _SAVING device state or stop-and-copy phase:
>>>>> + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
>>>>> + * b. read data_offset, indicates kernel driver to write data to staging buffer.
>>>>> + *    Kernel driver should return this read operation only after writing data to
>>>>> + *    staging buffer is done.
>>> May I know under what condition this data_offset will be changed per
>>> each iteration from a-f ?
>>>
>>
>> It's up to the vendor driver, if the vendor driver maintains multiple
>> partitions in the data section.
>>
> So, do you mean each time before doing b (reading data_offset), step a
> (reading pending_bytes) is mandatory, otherwise the vendor driver does
> not know which data_offset is?

pending_bytes only reports the bytes the vendor driver still has to
transfer. On a read of data_offset, the vendor driver should decide
what to send, depending on where it is making data available to userspace.
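
A rough sketch of the userspace side of steps a to f, to show that data_offset is re-read on every iteration and may differ each time. Assumptions: region_read_fn stands in for pread() on the device fd at the migration region offset, and toy_dev/toy_read are a made-up in-memory vendor driver (chunk size 4, data section at offset 64) used only so the loop can be exercised. The struct layout is the one from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct vfio_device_migration_info {
	uint32_t device_state;
	uint32_t reserved;
	uint64_t pending_bytes;
	uint64_t data_offset;
	uint64_t data_size;
} __attribute__((packed));

#define MIG_OFF(f) offsetof(struct vfio_device_migration_info, f)

/* Hypothetical stand-in for pread() on the migration region. */
typedef int (*region_read_fn)(void *ctx, uint64_t off, void *buf, size_t len);

/* Run steps a-f until pending_bytes is 0; returns bytes saved or -1. */
static int64_t save_iterate(region_read_fn rd, void *ctx, uint8_t *out, size_t cap)
{
	size_t total = 0;

	for (;;) {
		uint64_t pending, off, size;

		/* a. read pending_bytes (also completes the previous iteration) */
		if (rd(ctx, MIG_OFF(pending_bytes), &pending, sizeof(pending)))
			return -1;
		if (pending == 0)
			break;
		/* b. read data_offset, c. read data_size */
		if (rd(ctx, MIG_OFF(data_offset), &off, sizeof(off)) ||
		    rd(ctx, MIG_OFF(data_size), &size, sizeof(size)))
			return -1;
		/* Defensive assumption: treat an empty or oversized chunk as an error. */
		if (size == 0 || size > cap - total)
			return -1;
		/* d. read data_size bytes from (region + data_offset) */
		if (rd(ctx, off, out + total, size))
			return -1;
		/* e. "process" the data: here it is just accumulated in 'out' */
		total += size;
	}
	return (int64_t)total;
}

/* Toy vendor driver handing out its state in small chunks (assumed values). */
#define TOY_DATA_OFF 64u
#define TOY_CHUNK    4u

struct toy_dev {
	const uint8_t *state; /* full device state */
	size_t len, pos;      /* total size, bytes already consumed */
	size_t staged;        /* bytes made available by the last data_offset read */
};

static int toy_read(void *ctx, uint64_t off, void *buf, size_t len)
{
	struct toy_dev *d = ctx;
	uint64_t v;

	if (off == MIG_OFF(pending_bytes)) {
		d->pos += d->staged; /* previous chunk has been read */
		d->staged = 0;
		v = d->len - d->pos;
	} else if (off == MIG_OFF(data_offset)) {
		size_t left = d->len - d->pos;

		d->staged = left < TOY_CHUNK ? left : TOY_CHUNK;
		v = TOY_DATA_OFF;
	} else if (off == MIG_OFF(data_size)) {
		v = d->staged;
	} else if (off >= TOY_DATA_OFF && off - TOY_DATA_OFF + len <= d->staged) {
		memcpy(buf, d->state + d->pos + (off - TOY_DATA_OFF), len);
		return 0;
	} else {
		return -1;
	}
	if (len != sizeof(v))
		return -1;
	memcpy(buf, &v, sizeof(v));
	return 0;
}
```

Because each field read has side effects in the driver, a multi-threaded caller would have to serialize the whole a-to-d sequence, not the individual reads.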

> Then, any lock to wrap step a and b to ensure atomic?

With the current QEMU implementation, where migration is single-threaded,
there is no need for a lock yet. But if multi-threaded support is added
in the future, locks will be required on the userspace side.

Thanks,
Kirti


* Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.
  2019-11-13 20:07       ` Alex Williamson
@ 2019-11-14 18:56         ` Kirti Wankhede
  2019-11-14 21:06           ` Alex Williamson
  0 siblings, 1 reply; 46+ messages in thread
From: Kirti Wankhede @ 2019-11-14 18:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm



On 11/14/2019 1:37 AM, Alex Williamson wrote:
> On Thu, 14 Nov 2019 01:07:21 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 11/13/2019 4:00 AM, Alex Williamson wrote:
>>> On Tue, 12 Nov 2019 22:33:37 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>    
>>>> All pages pinned by vendor driver through vfio_pin_pages API should be
>>>> considered as dirty during migration. IOMMU container maintains a list of
>>>> all such pinned pages. Added an ioctl defination to get bitmap of such
>>>
>>> definition
>>>    
>>>> pinned pages for requested IO virtual address range.
>>>
>>> Additionally, all mapped pages are considered dirty when physically
>>> mapped through to an IOMMU, modulo we discussed devices opting in to
>>> per page pinning to indicate finer granularity with a TBD mechanism to
>>> figure out if any non-opt-in devices remain.
>>>    
>>
>> You mean, in case of device direct assignment (device pass through)?
> 
> Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
> pinned and mapped, then the correct dirty page set is all mapped pages.
> We discussed using the vpfn list as a mechanism for vendor drivers to
> reduce their migration footprint, but we also discussed that we would
> need a way to determine that all participants in the container have
> explicitly pinned their working pages or else we must consider the
> entire potential working set as dirty.
> 

How can the vendor driver advertise this capability to the iommu module? Any suggestions?

>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>>> ---
>>>>    include/uapi/linux/vfio.h | 23 +++++++++++++++++++++++
>>>>    1 file changed, 23 insertions(+)
>>>>
>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>> index 35b09427ad9f..6fd3822aa610 100644
>>>> --- a/include/uapi/linux/vfio.h
>>>> +++ b/include/uapi/linux/vfio.h
>>>> @@ -902,6 +902,29 @@ struct vfio_iommu_type1_dma_unmap {
>>>>    #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
>>>>    #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
>>>>    
>>>> +/**
>>>> + * VFIO_IOMMU_GET_DIRTY_BITMAP - _IOWR(VFIO_TYPE, VFIO_BASE + 17,
>>>> + *                                     struct vfio_iommu_type1_dirty_bitmap)
>>>> + *
>>>> + * IOCTL to get dirty pages bitmap for IOMMU container during migration.
>>>> + * Get dirty pages bitmap of given IO virtual addresses range using
>>>> + * struct vfio_iommu_type1_dirty_bitmap. Caller sets argsz, which is size of
>>>> + * struct vfio_iommu_type1_dirty_bitmap. User should allocate memory to get
>>>> + * bitmap and should set size of allocated memory in bitmap_size field.
>>>> + * One bit is used to represent per page consecutively starting from iova
>>>> + * offset. Bit set indicates page at that offset from iova is dirty.
>>>> + */
>>>> +struct vfio_iommu_type1_dirty_bitmap {
>>>> +	__u32        argsz;
>>>> +	__u32        flags;
>>>> +	__u64        iova;                      /* IO virtual address */
>>>> +	__u64        size;                      /* Size of iova range */
>>>> +	__u64        bitmap_size;               /* in bytes */
>>>
>>> This seems redundant.  We can calculate the size of the bitmap based on
>>> the iova size.
>>>   
>>
>> But in kernel space, we need to validate the size of memory allocated by
>> user instead of assuming user is always correct, right?
> 
> What does it buy us for the user to tell us the size?  They could be
> wrong, they could be malicious.  The argsz field on the ioctl is mostly
> for the handshake that the user is competent, we should get faults from
> the copy-user operation if it's incorrect.
>

It is mainly a fail-safe.

>>>> +	void __user *bitmap;                    /* one bit per page */
>>>
>>> Should we define that as a __u64* to (a) help with the size
>>> calculation, and (b) assure that we can use 8-byte ops on it?
>>>
>>> However, who defines page size?  Is it necessarily the processor page
>>> size?  A physical IOMMU may support page sizes other than the CPU page
>>> size.  It might be more important to indicate the expected page size
>>> than the bitmap size.  Thanks,
>>>   
>>
>> I see in QEMU and in vfio_iommu_type1 module, page sizes considered for
>> mapping are CPU page size, 4K. Do we still need to have such argument?
> 
> That assumption exists for backwards compatibility prior to supporting
> the iova_pgsizes field in vfio_iommu_type1_info.  AFAIK the current
> interface has no page size assumptions and we should not add any.

So userspace has the iova_pgsizes information, which can be an input to this
ioctl. The bitmap should be built for the smallest supported page size. Does
that make sense?
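
To illustrate the arithmetic, a small sketch of how the bitmap size would follow from the iova range and the smallest page size in iova_pgsizes. The helper names are hypothetical, and rounding the bitmap up to whole 64-bit words is an assumption based on the __u64* suggestion above, not something the patch does today.

```c
#include <stdint.h>

/* Smallest page size advertised in an iova_pgsizes bitmask (0 if none). */
static uint64_t min_pgsize(uint64_t iova_pgsizes)
{
	/* Isolate the lowest set bit. */
	return iova_pgsizes & (~iova_pgsizes + 1);
}

/*
 * Bytes needed for one bit per page over 'size' bytes of iova space,
 * rounded up to whole 64-bit words. Returns 0 for an invalid page size.
 */
static uint64_t dirty_bitmap_bytes(uint64_t size, uint64_t pgsize)
{
	uint64_t npages;

	if (!pgsize)
		return 0;
	npages = (size + pgsize - 1) / pgsize;
	return (npages + 63) / 64 * 8;
}
```

With this, the kernel can derive the expected bitmap length from iova size and page size alone, which is the point made above about bitmap_size being redundant.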


Thanks,
Kirti

> Thanks,
> 
> Alex
> 


* Re: [PATCH v9 Kernel 3/5] vfio iommu: Add ioctl defination to unmap IOVA and return dirty bitmap
  2019-11-13 20:22       ` Alex Williamson
@ 2019-11-14 18:56         ` Kirti Wankhede
  2019-11-14 21:08           ` Alex Williamson
  0 siblings, 1 reply; 46+ messages in thread
From: Kirti Wankhede @ 2019-11-14 18:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm



On 11/14/2019 1:52 AM, Alex Williamson wrote:
> On Thu, 14 Nov 2019 01:22:39 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 11/13/2019 4:00 AM, Alex Williamson wrote:
>>> On Tue, 12 Nov 2019 22:33:38 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>    
>>>> With vIOMMU, during pre-copy phase of migration, while CPUs are still
>>>> running, IO virtual address unmap can happen while device still keeping
>>>> reference of guest pfns. Those pages should be reported as dirty before
>>>> unmap, so that VFIO user space application can copy content of those pages
>>>> from source to destination.
>>>>
>>>> IOCTL defination added here add bitmap pointer, size and flag. If flag
>>>
>>> definition, adds
>>>    
>>>> VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP is set and bitmap memory is allocated
>>>> and bitmap_size of set, then ioctl will create bitmap of pinned pages and
>>>
>>> s/of/is/
>>>    
>>>> then unmap those.
>>>>
>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>>> ---
>>>>    include/uapi/linux/vfio.h | 33 +++++++++++++++++++++++++++++++++
>>>>    1 file changed, 33 insertions(+)
>>>>
>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>> index 6fd3822aa610..72fd297baf52 100644
>>>> --- a/include/uapi/linux/vfio.h
>>>> +++ b/include/uapi/linux/vfio.h
>>>> @@ -925,6 +925,39 @@ struct vfio_iommu_type1_dirty_bitmap {
>>>>    
>>>>    #define VFIO_IOMMU_GET_DIRTY_BITMAP             _IO(VFIO_TYPE, VFIO_BASE + 17)
>>>>    
>>>> +/**
>>>> + * VFIO_IOMMU_UNMAP_DMA_GET_BITMAP - _IOWR(VFIO_TYPE, VFIO_BASE + 18,
>>>> + *				      struct vfio_iommu_type1_dma_unmap_bitmap)
>>>> + *
>>>> + * Unmap IO virtual addresses using the provided struct
>>>> + * vfio_iommu_type1_dma_unmap_bitmap.  Caller sets argsz.
>>>> + * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get dirty bitmap
>>>> + * before unmapping IO virtual addresses. If this flag is not set, only IO
>>>> + * virtual address are unmapped without creating pinned pages bitmap, that
>>>> + * is, behave same as VFIO_IOMMU_UNMAP_DMA ioctl.
>>>> + * User should allocate memory to get bitmap and should set size of allocated
>>>> + * memory in bitmap_size field. One bit in bitmap is used to represent per page
>>>> + * consecutively starting from iova offset. Bit set indicates page at that
>>>> + * offset from iova is dirty.
>>>> + * The actual unmapped size is returned in the size field and bitmap of pages
>>>> + * in the range of unmapped size is returned in bitmap if flag
>>>> + * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP is set.
>>>> + *
>>>> + * No guarantee is made to the user that arbitrary unmaps of iova or size
>>>> + * different from those used in the original mapping call will succeed.
>>>> + */
>>>> +struct vfio_iommu_type1_dma_unmap_bitmap {
>>>> +	__u32        argsz;
>>>> +	__u32        flags;
>>>> +#define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
>>>> +	__u64        iova;                        /* IO virtual address */
>>>> +	__u64        size;                        /* Size of mapping (bytes) */
>>>> +	__u64        bitmap_size;                 /* in bytes */
>>>> +	void __user *bitmap;                      /* one bit per page */
>>>> +};
>>>> +
>>>> +#define VFIO_IOMMU_UNMAP_DMA_GET_BITMAP _IO(VFIO_TYPE, VFIO_BASE + 18)
>>>> +
>>>
>>> Why not extend VFIO_IOMMU_UNMAP_DMA to support this rather than add an
>>> ioctl that duplicates the functionality and extends it??
>>
>> We do want old userspace applications to work with new kernel and
>> vice-versa, right?
>>
>> If I try to change existing VFIO_IOMMU_UNMAP_DMA ioctl structure, say if
>> add 'bitmap_size' and 'bitmap' after 'size', with below code in old
>> kernel, old kernel & new userspace will work.
>>
>>           minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
>>
>>           if (copy_from_user(&unmap, (void __user *)arg, minsz))
>>                   return -EFAULT;
>>
>>           if (unmap.argsz < minsz || unmap.flags)
>>                   return -EINVAL;
>>
>>
>> With new kernel it would change to:
>>           minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, bitmap);
> 
> No, the minimum structure size still ends at size, we interpret flags
> and argsz to learn if the user understands those fields and optionally
> include them.  Therefore old userspace on new kernel continues to work.
> 
>>           if (copy_from_user(&unmap, (void __user *)arg, minsz))
>>                   return -EFAULT;
>>
>>           if (unmap.argsz < minsz || unmap.flags)
>>                   return -EINVAL;
>>
>> Then old userspace app will fail because unmap.argsz < minsz, and
>> copy_from_user might fault because the userspace SDK doesn't
>> contain the new member variables.
>> We can't change the sequence to keep 'size' as last member, because then
>> new userspace app on old kernel will interpret it wrong.
> 
> If we have new userspace on old kernel, that userspace needs to be able
> to learn that this feature exists (new flag in the
> vfio_iommu_type1_info struct as suggested below) and only make use of it
> when available.  This is why the old kernel checks argsz against minsz.
> So long as the user passes something at least minsz in size, we have
> compatibility.  The old kernel doesn't understand the GET_DIRTY_BITMAP
> flag and will return an error if the user attempts to use it.  Thanks,
> 

Ok. So then the VFIO_IOMMU_UNMAP_DMA_GET_BITMAP ioctl is not needed. I'll make
the change. Again, the bitmap will be created for the smallest page size
in iova_pgsizes.

But the VFIO_IOMMU_GET_DIRTY_BITMAP ioctl will still be required, right?
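
To check my understanding of the argsz/flags handshake, here is a userspace model of how the extended unmap could be parsed. It is a sketch only: struct toy_dma_unmap, the TOY_ flag name and toy_unmap_parse() are made up, memcpy() stands in for copy_from_user(), and the bitmap pointer is carried as a u64 purely for the model. The point is that minsz still ends at 'size' and the extra fields are only read when the flag says they are there.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_UNMAP_FLAG_GET_DIRTY_BITMAP (1u << 0)

/* Existing fields first; hypothetical extension fields appended after 'size'. */
struct toy_dma_unmap {
	uint32_t argsz;
	uint32_t flags;
	uint64_t iova;
	uint64_t size;
	uint64_t bitmap_size; /* extension: in bytes */
	uint64_t bitmap;      /* extension: user pointer, as u64 in this model */
};

#define offsetofend(T, m) (offsetof(T, m) + sizeof(((T *)0)->m))

/* Returns 0 and fills *out, or -EINVAL. 'uarg' models the user buffer. */
static int toy_unmap_parse(const void *uarg, struct toy_dma_unmap *out)
{
	const size_t minsz = offsetofend(struct toy_dma_unmap, size);
	const size_t extsz = offsetofend(struct toy_dma_unmap, bitmap);

	memset(out, 0, sizeof(*out));
	memcpy(out, uarg, minsz); /* only the part old userspace provides */

	if (out->argsz < minsz ||
	    (out->flags & ~TOY_UNMAP_FLAG_GET_DIRTY_BITMAP))
		return -EINVAL;

	if (out->flags & TOY_UNMAP_FLAG_GET_DIRTY_BITMAP) {
		/* The flag promises the extended fields, so argsz must cover them. */
		if (out->argsz < extsz)
			return -EINVAL;
		memcpy(out, uarg, extsz);
	}
	return 0;
}
```

Old userspace passes argsz == minsz with flags == 0 and keeps working; an old kernel sees a non-zero flags value from new userspace and rejects it, so new userspace has to probe for support first.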

Thanks,
Kirti

> Alex
>   
>>> Otherwise
>>> same comments as previous, in fact it's too bad we can't use this ioctl
>>> for both, but a DONT_UNMAP flag on the UNMAP_DMA ioctl seems a bit
>>> absurd.
>>>
>>> I suspect we also want a flags bit in VFIO_IOMMU_GET_INFO to indicate
>>> these capabilities are supported.
>>>    
>>
>> Ok. I'll add that.
>>
>>> Maybe for both ioctls we also want to define it as the user's
>>> responsibility to zero the bitmap, requiring the kernel to only set
>>> bits as necessary.
>>
>> Ok. Updating comment.
>>
>> Thanks,
>> Kirti
>>
>>> Thanks,
>>>
>>> Alex
>>>    
>>>>    /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>>>>    
>>>>    /*
>>>    
>>
> 


* Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.
  2019-11-14 18:56         ` Kirti Wankhede
@ 2019-11-14 21:06           ` Alex Williamson
  2019-11-15  2:40             ` Yan Zhao
  2019-11-26  0:57             ` Yan Zhao
  0 siblings, 2 replies; 46+ messages in thread
From: Alex Williamson @ 2019-11-14 21:06 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Fri, 15 Nov 2019 00:26:07 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 11/14/2019 1:37 AM, Alex Williamson wrote:
> > On Thu, 14 Nov 2019 01:07:21 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 11/13/2019 4:00 AM, Alex Williamson wrote:  
> >>> On Tue, 12 Nov 2019 22:33:37 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>      
> >>>> All pages pinned by vendor driver through vfio_pin_pages API should be
> >>>> considered as dirty during migration. IOMMU container maintains a list of
> >>>> all such pinned pages. Added an ioctl defination to get bitmap of such  
> >>>
> >>> definition
> >>>      
> >>>> pinned pages for requested IO virtual address range.  
> >>>
> >>> Additionally, all mapped pages are considered dirty when physically
> >>> mapped through to an IOMMU, modulo we discussed devices opting in to
> >>> per page pinning to indicate finer granularity with a TBD mechanism to
> >>> figure out if any non-opt-in devices remain.
> >>>      
> >>
> >> You mean, in case of device direct assignment (device pass through)?  
> > 
> > Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
> > pinned and mapped, then the correct dirty page set is all mapped pages.
> > We discussed using the vpfn list as a mechanism for vendor drivers to
> > reduce their migration footprint, but we also discussed that we would
> > need a way to determine that all participants in the container have
> > explicitly pinned their working pages or else we must consider the
> > entire potential working set as dirty.
> >   
> 
> How can vendor driver tell this capability to iommu module? Any suggestions?

I think it does so by pinning pages.  Is it acceptable that if the
vendor driver pins any pages, then from that point forward we consider
the IOMMU group dirty page scope to be limited to pinned pages?  There
are complications around non-singleton IOMMU groups, but I think we're
already leaning towards that being a non-worthwhile problem to solve.
So if we require that only singleton IOMMU groups can pin pages and we
pass the IOMMU group as a parameter to
vfio_iommu_driver_ops.pin_pages(), then the type1 backend can set a
flag on its local vfio_group struct to indicate dirty page scope is
limited to pinned pages.  We might want to keep a flag on the
vfio_iommu struct to indicate whether all of the vfio_groups for each
vfio_domain in the vfio_iommu.domain_list have their dirty page scope
limited to pinned pages, as an optimization to avoid walking lists too
often.  Then, if vfio_iommu.domain_list is not empty and this new flag
does not limit the dirty page scope, everything within each vfio_dma
is considered dirty.
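
A minimal user-space sketch of the flag scheme described above, for
illustration only.  The model_* structures and function names are
hypothetical stand-ins, not the actual type1 backend definitions; the
sketch only shows how a cached container-wide flag could be derived
from per-group flags and then tested.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the type1 backend structures. */
struct model_group {
	bool pinned_page_dirty_scope;	/* set once the group's driver pins pages */
};

struct model_domain {
	struct model_group *groups;
	size_t nr_groups;
};

struct model_iommu {
	struct model_domain *domains;	/* models vfio_iommu.domain_list */
	size_t nr_domains;
	bool pinned_page_dirty_scope;	/* cached: true if every group is limited */
};

/* Recompute the cached container-wide flag by walking every group once. */
static void model_update_dirty_scope(struct model_iommu *iommu)
{
	size_t d, g;

	iommu->pinned_page_dirty_scope = true;
	for (d = 0; d < iommu->nr_domains; d++) {
		for (g = 0; g < iommu->domains[d].nr_groups; g++) {
			if (!iommu->domains[d].groups[g].pinned_page_dirty_scope) {
				iommu->pinned_page_dirty_scope = false;
				return;
			}
		}
	}
}

/*
 * True when everything within each mapping must be reported dirty: there
 * is at least one IOMMU backed domain and the cached flag does not limit
 * the dirty page scope to pinned pages.
 */
static bool model_all_mapped_dirty(const struct model_iommu *iommu)
{
	return iommu->nr_domains != 0 && !iommu->pinned_page_dirty_scope;
}
```

The cached flag would need to be recomputed whenever a group is
attached or detached, or when a group pins its first page; that
bookkeeping is left out here.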
 
> >>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >>>> ---
> >>>>    include/uapi/linux/vfio.h | 23 +++++++++++++++++++++++
> >>>>    1 file changed, 23 insertions(+)
> >>>>
> >>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >>>> index 35b09427ad9f..6fd3822aa610 100644
> >>>> --- a/include/uapi/linux/vfio.h
> >>>> +++ b/include/uapi/linux/vfio.h
> >>>> @@ -902,6 +902,29 @@ struct vfio_iommu_type1_dma_unmap {
> >>>>    #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
> >>>>    #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
> >>>>    
> >>>> +/**
> >>>> + * VFIO_IOMMU_GET_DIRTY_BITMAP - _IOWR(VFIO_TYPE, VFIO_BASE + 17,
> >>>> + *                                     struct vfio_iommu_type1_dirty_bitmap)
> >>>> + *
> >>>> + * IOCTL to get dirty pages bitmap for IOMMU container during migration.
> >>>> + * Get dirty pages bitmap of given IO virtual addresses range using
> >>>> + * struct vfio_iommu_type1_dirty_bitmap. Caller sets argsz, which is size of
> >>>> + * struct vfio_iommu_type1_dirty_bitmap. User should allocate memory to get
> >>>> + * bitmap and should set size of allocated memory in bitmap_size field.
> >>>> + * One bit is used to represent per page consecutively starting from iova
> >>>> + * offset. Bit set indicates page at that offset from iova is dirty.
> >>>> + */
> >>>> +struct vfio_iommu_type1_dirty_bitmap {
> >>>> +	__u32        argsz;
> >>>> +	__u32        flags;
> >>>> +	__u64        iova;                      /* IO virtual address */
> >>>> +	__u64        size;                      /* Size of iova range */
> >>>> +	__u64        bitmap_size;               /* in bytes */  
> >>>
> >>> This seems redundant.  We can calculate the size of the bitmap based on
> >>> the iova size.
> >>>     
> >>
> >> But in kernel space, we need to validate the size of memory allocated by
> >> user instead of assuming user is always correct, right?  
> > 
> > What does it buy us for the user to tell us the size?  They could be
> > wrong, they could be malicious.  The argsz field on the ioctl is mostly
> > for the handshake that the user is competent, we should get faults from
> > the copy-user operation if it's incorrect.
> >  
> 
> It is to mainly fail safe.
> 
> >>>> +	void __user *bitmap;                    /* one bit per page */  
> >>>
> >>> Should we define that as a __u64* to (a) help with the size
> >>> calculation, and (b) assure that we can use 8-byte ops on it?
> >>>
> >>> However, who defines page size?  Is it necessarily the processor page
> >>> size?  A physical IOMMU may support page sizes other than the CPU page
> >>> size.  It might be more important to indicate the expected page size
> >>> than the bitmap size.  Thanks,
> >>>     
> >>
> >> I see in QEMU and in vfio_iommu_type1 module, page sizes considered for
> >> mapping are CPU page size, 4K. Do we still need to have such argument?  
> > 
> > That assumption exists for backwards compatibility prior to supporting
> > the iova_pgsizes field in vfio_iommu_type1_info.  AFAIK the current
> > interface has no page size assumptions and we should not add any.  
> 
> So userspace has iova_pgsizes information, which can be input to this 
> ioctl. Bitmap should be considering smallest page size. Does that makes 
> sense?

I'm not sure.  I thought I had an argument that iova_pgsizes could
indicate support for sizes smaller than the processor page size, which
would make the user responsible for using a different base for their
page size, but vfio_pgsize_bitmap() already masks out sub-page sizes.
Clearly the vendor driver is pinning based on processor sized pages,
but that's independent of an IOMMU and not part of a user ABI.

I'm tempted to say your bitmap_size field has a use here, but it seems
to fail in validating the user page size at the low extremes.  For
example, if we have a single page mapping, the user can specify the
iova size as 4K, but the minimum bitmap_size they can indicate is 1
byte.  Would we therefore assume the user's bitmap page size is 512
bytes (i.e. they provided us with 8 bits to describe a 4K range)?  We'd
need to be careful to specify that the minimum page size indicated by
iova_pgsizes is our lower bound as well.  But then what do we do if the
user provides us with a smaller buffer than we expect?  For example, a
128MB iova range and only an 8-byte buffer.  Do we go ahead and assume
a 2MB page size and fill the bitmap accordingly, or do we generate an
error?  If the latter, might we support that at some point in time, and
is it sufficient to let the user perform trial and error to test if
that support exists?  Thanks,

Alex
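
The arithmetic behind the two examples above can be sketched as
follows.  Both helper names are hypothetical, and rounding the bitmap
up to whole 64-bit words is an assumption taken from the __u64*
suggestion earlier in the thread, not something the patch defines.

```c
#include <stdint.h>

/*
 * Bytes needed for one bit per page over an iova range, rounded up to
 * whole 64-bit words.  Returns 0 for a zero page size.
 */
static uint64_t bitmap_bytes_needed(uint64_t iova_size, uint64_t pgsize)
{
	uint64_t pages;

	if (pgsize == 0)
		return 0;
	pages = (iova_size + pgsize - 1) / pgsize;
	return ((pages + 63) / 64) * 8;
}

/*
 * Page size implied if every bit of the user's buffer is taken to
 * describe an equal share of the iova range.  Returns 0 for an empty
 * buffer.
 */
static uint64_t implied_pgsize(uint64_t iova_size, uint64_t bitmap_bytes)
{
	if (bitmap_bytes == 0)
		return 0;
	return iova_size / (bitmap_bytes * 8);
}
```

With these, a 4K range and a 1-byte buffer imply 512-byte pages, and a
128MB range with an 8-byte buffer implies 2MB pages, while a 4K page
size over 128MB needs a 4096-byte bitmap.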


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 Kernel 3/5] vfio iommu: Add ioctl defination to unmap IOVA and return dirty bitmap
  2019-11-14 18:56         ` Kirti Wankhede
@ 2019-11-14 21:08           ` Alex Williamson
  0 siblings, 0 replies; 46+ messages in thread
From: Alex Williamson @ 2019-11-14 21:08 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Fri, 15 Nov 2019 00:26:26 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 11/14/2019 1:52 AM, Alex Williamson wrote:
> > On Thu, 14 Nov 2019 01:22:39 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 11/13/2019 4:00 AM, Alex Williamson wrote:  
> >>> On Tue, 12 Nov 2019 22:33:38 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>      
> >>>> With vIOMMU, during pre-copy phase of migration, while CPUs are still
> >>>> running, IO virtual address unmap can happen while device still keeping
> >>>> reference of guest pfns. Those pages should be reported as dirty before
> >>>> unmap, so that VFIO user space application can copy content of those pages
> >>>> from source to destination.
> >>>>
> >>>> IOCTL defination added here add bitmap pointer, size and flag. If flag  
> >>>
> >>> definition, adds
> >>>      
> >>>> VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP is set and bitmap memory is allocated
> >>>> and bitmap_size of set, then ioctl will create bitmap of pinned pages and  
> >>>
> >>> s/of/is/
> >>>      
> >>>> then unmap those.
> >>>>
> >>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >>>> ---
> >>>>    include/uapi/linux/vfio.h | 33 +++++++++++++++++++++++++++++++++
> >>>>    1 file changed, 33 insertions(+)
> >>>>
> >>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >>>> index 6fd3822aa610..72fd297baf52 100644
> >>>> --- a/include/uapi/linux/vfio.h
> >>>> +++ b/include/uapi/linux/vfio.h
> >>>> @@ -925,6 +925,39 @@ struct vfio_iommu_type1_dirty_bitmap {
> >>>>    
> >>>>    #define VFIO_IOMMU_GET_DIRTY_BITMAP             _IO(VFIO_TYPE, VFIO_BASE + 17)
> >>>>    
> >>>> +/**
> >>>> + * VFIO_IOMMU_UNMAP_DMA_GET_BITMAP - _IOWR(VFIO_TYPE, VFIO_BASE + 18,
> >>>> + *				      struct vfio_iommu_type1_dma_unmap_bitmap)
> >>>> + *
> >>>> + * Unmap IO virtual addresses using the provided struct
> >>>> + * vfio_iommu_type1_dma_unmap_bitmap.  Caller sets argsz.
> >>>> + * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get dirty bitmap
> >>>> + * before unmapping IO virtual addresses. If this flag is not set, only IO
> >>>> + * virtual address are unmapped without creating pinned pages bitmap, that
> >>>> + * is, behave same as VFIO_IOMMU_UNMAP_DMA ioctl.
> >>>> + * User should allocate memory to get bitmap and should set size of allocated
> >>>> + * memory in bitmap_size field. One bit in bitmap is used to represent per page
> >>>> + * consecutively starting from iova offset. Bit set indicates page at that
> >>>> + * offset from iova is dirty.
> >>>> + * The actual unmapped size is returned in the size field and bitmap of pages
> >>>> + * in the range of unmapped size is returned in bitmap if flag
> >>>> + * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP is set.
> >>>> + *
> >>>> + * No guarantee is made to the user that arbitrary unmaps of iova or size
> >>>> + * different from those used in the original mapping call will succeed.
> >>>> + */
> >>>> +struct vfio_iommu_type1_dma_unmap_bitmap {
> >>>> +	__u32        argsz;
> >>>> +	__u32        flags;
> >>>> +#define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
> >>>> +	__u64        iova;                        /* IO virtual address */
> >>>> +	__u64        size;                        /* Size of mapping (bytes) */
> >>>> +	__u64        bitmap_size;                 /* in bytes */
> >>>> +	void __user *bitmap;                      /* one bit per page */
> >>>> +};
> >>>> +
> >>>> +#define VFIO_IOMMU_UNMAP_DMA_GET_BITMAP _IO(VFIO_TYPE, VFIO_BASE + 18)
> >>>> +  
> >>>
> >>> Why not extend VFIO_IOMMU_UNMAP_DMA to support this rather than add an
> >>> ioctl that duplicates the functionality and extends it??  
> >>
> >> We do want old userspace applications to work with new kernel and
> >> vice-versa, right?
> >>
> >> If I try to change existing VFIO_IOMMU_UNMAP_DMA ioctl structure, say if
> >> add 'bitmap_size' and 'bitmap' after 'size', with below code in old
> >> kernel, old kernel & new userspace will work.
> >>
> >>           minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
> >>
> >>           if (copy_from_user(&unmap, (void __user *)arg, minsz))
> >>                   return -EFAULT;
> >>
> >>           if (unmap.argsz < minsz || unmap.flags)
> >>                   return -EINVAL;
> >>
> >>
> >> With new kernel it would change to:
> >>           minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, bitmap);  
> > 
> > No, the minimum structure size still ends at size, we interpret flags
> > and argsz to learn if the user understands those fields and optionally
> > include them.  Therefore old userspace on new kernel continues to work.
> >   
> >>           if (copy_from_user(&unmap, (void __user *)arg, minsz))
> >>                   return -EFAULT;
> >>
> >>           if (unmap.argsz < minsz || unmap.flags)
> >>                   return -EINVAL;
> >>
> >> Then old userspace app will fail because unmap.argsz < minsz and might
> >> be copy_from_user would cause seg fault because userspace sdk doesn't
> >> contain new member variables.
> >> We can't change the sequence to keep 'size' as last member, because then
> >> new userspace app on old kernel will interpret it wrong.  
> > 
> > If we have new userspace on old kernel, that userspace needs to be able
> > to learn that this feature exists (new flag in the
> > vfio_iommu_type1_info struct as suggested below) and only make use of it
> > when available.  This is why the old kernel checks argsz against minsz.
> > So long as the user passes something at least minsz in size, we have
> > compatibility.  The old kernel doesn't understand the GET_DIRTY_BITMAP
> > flag and will return an error if the user attempts to use it.  Thanks,
> >   
> 
> Ok. So then VFIO_IOMMU_UNMAP_DMA_GET_BITMAP ioctl is not needed. I'll do 
> the change. Again bitmap will be created considering smallest page size 
> of iova_pgsizes
> 
> But VFIO_IOMMU_GET_DIRTY_BITMAP ioctl will still required, right?

Yes, I'm not willing to suggest a flag on an unmap ioctl that
eliminates the unmap just so we can re-use it for retrieving a dirty
page bitmap.  That'd be ugly.  Thanks,

Alex
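
A user-space sketch of the argsz/flags handshake discussed in the
quoted text, under stated assumptions: the model_* names are
hypothetical, memcpy() stands in for copy_from_user(), and the layout
of the extended fields is only one possible choice.  The point is that
minsz still ends at 'size', and the trailing fields are read only when
the flag says the user provided them.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical extended unmap structure; the first four fields are the old ABI. */
struct model_dma_unmap {
	uint32_t argsz;
	uint32_t flags;
#define MODEL_UNMAP_FLAG_GET_DIRTY_BITMAP (1u << 0)
	uint64_t iova;
	uint64_t size;
	/* Valid only if the flag is set and argsz covers these fields. */
	uint64_t bitmap_size;
	uint64_t bitmap;	/* user pointer carried as a 64-bit value */
};

#define model_offsetofend(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/* Parse a user argument; returns 0 on success or a negative errno value. */
static int model_unmap_parse(const void *uarg, struct model_dma_unmap *out)
{
	size_t minsz = model_offsetofend(struct model_dma_unmap, size);
	size_t fullsz = model_offsetofend(struct model_dma_unmap, bitmap);

	memset(out, 0, sizeof(*out));
	memcpy(out, uarg, minsz);	/* only the old-ABI part is read first */

	if (out->argsz < minsz)
		return -EINVAL;
	if (out->flags & ~MODEL_UNMAP_FLAG_GET_DIRTY_BITMAP)
		return -EINVAL;		/* unknown flags are rejected */

	if (out->flags & MODEL_UNMAP_FLAG_GET_DIRTY_BITMAP) {
		if (out->argsz < fullsz)
			return -EINVAL;	/* flag set but fields not provided */
		memcpy(out, uarg, fullsz);
	}
	return 0;
}
```

Old userspace passes the short structure with flags == 0 and never has
the trailing fields touched.  An old kernel, which rejects any non-zero
flags, returns an error to new userspace that sets the flag without
first checking for the feature.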


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.
  2019-11-14 21:06           ` Alex Williamson
@ 2019-11-15  2:40             ` Yan Zhao
  2019-11-15  3:21               ` Alex Williamson
  2019-11-26  0:57             ` Yan Zhao
  1 sibling, 1 reply; 46+ messages in thread
From: Yan Zhao @ 2019-11-15  2:40 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng,
	Liu, Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote:
> On Fri, 15 Nov 2019 00:26:07 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > On 11/14/2019 1:37 AM, Alex Williamson wrote:
> > > On Thu, 14 Nov 2019 01:07:21 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >   
> > >> On 11/13/2019 4:00 AM, Alex Williamson wrote:  
> > >>> On Tue, 12 Nov 2019 22:33:37 +0530
> > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >>>      
> > >>>> All pages pinned by vendor driver through vfio_pin_pages API should be
> > >>>> considered as dirty during migration. IOMMU container maintains a list of
> > >>>> all such pinned pages. Added an ioctl defination to get bitmap of such  
> > >>>
> > >>> definition
> > >>>      
> > >>>> pinned pages for requested IO virtual address range.  
> > >>>
> > >>> Additionally, all mapped pages are considered dirty when physically
> > >>> mapped through to an IOMMU, modulo we discussed devices opting in to
> > >>> per page pinning to indicate finer granularity with a TBD mechanism to
> > >>> figure out if any non-opt-in devices remain.
> > >>>      
> > >>
> > >> You mean, in case of device direct assignment (device pass through)?  
> > > 
> > > Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
> > > pinned and mapped, then the correct dirty page set is all mapped pages.
> > > We discussed using the vpfn list as a mechanism for vendor drivers to
> > > reduce their migration footprint, but we also discussed that we would
> > > need a way to determine that all participants in the container have
> > > explicitly pinned their working pages or else we must consider the
> > > entire potential working set as dirty.
> > >   
> > 
> > How can vendor driver tell this capability to iommu module? Any suggestions?
> 
> I think it does so by pinning pages.  Is it acceptable that if the
> vendor driver pins any pages, then from that point forward we consider
> the IOMMU group dirty page scope to be limited to pinned pages?  There
> are complications around non-singleton IOMMU groups, but I think we're
> already leaning towards that being a non-worthwhile problem to solve.
> So if we require that only singleton IOMMU groups can pin pages and we
> pass the IOMMU group as a parameter to
> vfio_iommu_driver_ops.pin_pages(), then the type1 backend can set a
> flag on its local vfio_group struct to indicate dirty page scope is
> limited to pinned pages.  We might want to keep a flag on the
> vfio_iommu struct to indicate if all of the vfio_groups for each
> vfio_domain in the vfio_iommu.domain_list dirty page scope limited to
> pinned pages as an optimization to avoid walking lists too often.  Then
> we could test if vfio_iommu.domain_list is not empty and this new flag
> does not limit the dirty page scope, then everything within each
> vfio_dma is considered dirty.
>

Hi Alex,
could you help clarify whether my understanding below is right?
In future,
1. for mdevs and for passthrough devices without hardware ability to track
dirty pages, the vendor driver has to explicitly call
vfio_pin_pages()/vfio_unpin_pages() + a flag to tell vfio its dirty page set.

2. for those devices with hardware ability to track dirty pages, there will
still be a callback to the vendor driver to get dirty pages (as for those
devices, it is hard to explicitly call vfio_pin_pages()/vfio_unpin_pages()).

3. for devices relying on dirty bit info in the physical IOMMU, there
will be a callback from vfio to the physical IOMMU driver to get the
dirty page set.

Thanks
Yan

> > >>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > >>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> > >>>> ---
> > >>>>    include/uapi/linux/vfio.h | 23 +++++++++++++++++++++++
> > >>>>    1 file changed, 23 insertions(+)
> > >>>>
> > >>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > >>>> index 35b09427ad9f..6fd3822aa610 100644
> > >>>> --- a/include/uapi/linux/vfio.h
> > >>>> +++ b/include/uapi/linux/vfio.h
> > >>>> @@ -902,6 +902,29 @@ struct vfio_iommu_type1_dma_unmap {
> > >>>>    #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
> > >>>>    #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
> > >>>>    
> > >>>> +/**
> > >>>> + * VFIO_IOMMU_GET_DIRTY_BITMAP - _IOWR(VFIO_TYPE, VFIO_BASE + 17,
> > >>>> + *                                     struct vfio_iommu_type1_dirty_bitmap)
> > >>>> + *
> > >>>> + * IOCTL to get dirty pages bitmap for IOMMU container during migration.
> > >>>> + * Get dirty pages bitmap of given IO virtual addresses range using
> > >>>> + * struct vfio_iommu_type1_dirty_bitmap. Caller sets argsz, which is size of
> > >>>> + * struct vfio_iommu_type1_dirty_bitmap. User should allocate memory to get
> > >>>> + * bitmap and should set size of allocated memory in bitmap_size field.
> > >>>> + * One bit is used to represent per page consecutively starting from iova
> > >>>> + * offset. Bit set indicates page at that offset from iova is dirty.
> > >>>> + */
> > >>>> +struct vfio_iommu_type1_dirty_bitmap {
> > >>>> +	__u32        argsz;
> > >>>> +	__u32        flags;
> > >>>> +	__u64        iova;                      /* IO virtual address */
> > >>>> +	__u64        size;                      /* Size of iova range */
> > >>>> +	__u64        bitmap_size;               /* in bytes */  
> > >>>
> > >>> This seems redundant.  We can calculate the size of the bitmap based on
> > >>> the iova size.
> > >>>     
> > >>
> > >> But in kernel space, we need to validate the size of memory allocated by
> > >> user instead of assuming user is always correct, right?  
> > > 
> > > What does it buy us for the user to tell us the size?  They could be
> > > wrong, they could be malicious.  The argsz field on the ioctl is mostly
> > > for the handshake that the user is competent, we should get faults from
> > > the copy-user operation if it's incorrect.
> > >  
> > 
> > It is to mainly fail safe.
> > 
> > >>>> +	void __user *bitmap;                    /* one bit per page */  
> > >>>
> > >>> Should we define that as a __u64* to (a) help with the size
> > >>> calculation, and (b) assure that we can use 8-byte ops on it?
> > >>>
> > >>> However, who defines page size?  Is it necessarily the processor page
> > >>> size?  A physical IOMMU may support page sizes other than the CPU page
> > >>> size.  It might be more important to indicate the expected page size
> > >>> than the bitmap size.  Thanks,
> > >>>     
> > >>
> > >> I see in QEMU and in vfio_iommu_type1 module, page sizes considered for
> > >> mapping are CPU page size, 4K. Do we still need to have such argument?  
> > > 
> > > That assumption exists for backwards compatibility prior to supporting
> > > the iova_pgsizes field in vfio_iommu_type1_info.  AFAIK the current
> > > interface has no page size assumptions and we should not add any.  
> > 
> > So userspace has iova_pgsizes information, which can be input to this 
> > ioctl. Bitmap should be considering smallest page size. Does that makes 
> > sense?
> 
> I'm not sure.  I thought I had an argument that the iova_pgsize could
> indicate support for sizes smaller than the processor page size, which
> would make the user responsible for using a different base for their
> page size, but vfio_pgsize_bitmap() already masks out sub-page sizes.
> Clearly the vendor driver is pinning based on processor sized pages,
> but that's independent of an IOMMU and not part of a user ABI.
> 
> I'm tempted to say your bitmap_size field has a use here, but it seems
> to fail in validating the user page size at the low extremes.  For
> example if we have a single page mapping, the user can specify the iova
> size as 4K (for example), but the minimum bitmap_size they can indicate
> is 1 byte, would we therefore assume the user's bitmap page size is 512
> bytes (ie. they provided us with 8 bits to describe a 4K range)?  We'd
> need to be careful to specify that the minimum iova_pgsize indicated
> page size is our lower bound as well.  But then what do we do if the
> user provides us with a smaller buffer than we expect?  For example, a
> 128MB iova range and only an 8-byte buffer.  Do we go ahead and assume
> a 2MB page size and fill the bitmap accordingly or do we generate an
> error?  If the latter, might we support that at some point in time and
> is it sufficient to let the user perform trial and error to test if that
> exists?  Thanks,
> 
> Alex
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.
  2019-11-15  2:40             ` Yan Zhao
@ 2019-11-15  3:21               ` Alex Williamson
  2019-11-15  5:10                 ` Tian, Kevin
  2019-11-20  1:51                 ` Yan Zhao
  0 siblings, 2 replies; 46+ messages in thread
From: Alex Williamson @ 2019-11-15  3:21 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Kirti Wankhede, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng,
	Liu, Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

On Thu, 14 Nov 2019 21:40:35 -0500
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote:
> > On Fri, 15 Nov 2019 00:26:07 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> > > On 11/14/2019 1:37 AM, Alex Williamson wrote:  
> > > > On Thu, 14 Nov 2019 01:07:21 +0530
> > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > >     
> > > >> On 11/13/2019 4:00 AM, Alex Williamson wrote:    
> > > >>> On Tue, 12 Nov 2019 22:33:37 +0530
> > > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > >>>        
> > > >>>> All pages pinned by vendor driver through vfio_pin_pages API should be
> > > >>>> considered as dirty during migration. IOMMU container maintains a list of
> > > >>>> all such pinned pages. Added an ioctl defination to get bitmap of such    
> > > >>>
> > > >>> definition
> > > >>>        
> > > >>>> pinned pages for requested IO virtual address range.    
> > > >>>
> > > >>> Additionally, all mapped pages are considered dirty when physically
> > > >>> mapped through to an IOMMU, modulo we discussed devices opting in to
> > > >>> per page pinning to indicate finer granularity with a TBD mechanism to
> > > >>> figure out if any non-opt-in devices remain.
> > > >>>        
> > > >>
> > > >> You mean, in case of device direct assignment (device pass through)?    
> > > > 
> > > > Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
> > > > pinned and mapped, then the correct dirty page set is all mapped pages.
> > > > We discussed using the vpfn list as a mechanism for vendor drivers to
> > > > reduce their migration footprint, but we also discussed that we would
> > > > need a way to determine that all participants in the container have
> > > > explicitly pinned their working pages or else we must consider the
> > > > entire potential working set as dirty.
> > > >     
> > > 
> > > How can vendor driver tell this capability to iommu module? Any suggestions?  
> > 
> > I think it does so by pinning pages.  Is it acceptable that if the
> > vendor driver pins any pages, then from that point forward we consider
> > the IOMMU group dirty page scope to be limited to pinned pages?  There
> > are complications around non-singleton IOMMU groups, but I think we're
> > already leaning towards that being a non-worthwhile problem to solve.
> > So if we require that only singleton IOMMU groups can pin pages and we
> > pass the IOMMU group as a parameter to
> > vfio_iommu_driver_ops.pin_pages(), then the type1 backend can set a
> > flag on its local vfio_group struct to indicate dirty page scope is
> > limited to pinned pages.  We might want to keep a flag on the
> > vfio_iommu struct to indicate if all of the vfio_groups for each
> > vfio_domain in the vfio_iommu.domain_list dirty page scope limited to
> > pinned pages as an optimization to avoid walking lists too often.  Then
> > we could test if vfio_iommu.domain_list is not empty and this new flag
> > does not limit the dirty page scope, then everything within each
> > vfio_dma is considered dirty.
> >  
> 
> hi Alex
> could you help clarify whether my understandings below are right?
> In future,
> 1. for mdev and for passthrough device withoug hardware ability to track
> dirty pages, the vendor driver has to explicitly call
> vfio_pin_pages()/vfio_unpin_pages() + a flag to tell vfio its dirty page set.

For non-IOMMU backed mdevs without hardware dirty page tracking,
there's no change to the vendor driver currently.  Pages pinned by the
vendor driver are marked as dirty.

For any IOMMU backed device, mdev or direct assignment, all mapped
memory would be considered dirty unless there are explicit calls to pin
pages on top of the IOMMU page pinning and mapping.  These would likely
be enabled only when the device is in the _SAVING device_state.

> 2. for those devices with hardware ability to track dirty pages, will still
> provide a callback to vendor driver to get dirty pages. (as for those devices,
> it is hard to explicitly call vfio_pin_pages()/vfio_unpin_pages())
>
> 3. for devices relying on dirty bit info in physical IOMMU, there
> will be a callback to physical IOMMU driver to get dirty page set from
> vfio.

The proposal here does not cover exactly how these would be
implemented; it only establishes the container as the point of user
interaction with the dirty bitmap and hopefully allows us to maintain
that interface regardless of whether we have dirty tracking at the
device or the system IOMMU.  Ideally devices with dirty tracking would
make use of page pinning and we'd extend the interface to allow vendor
drivers the ability to indicate the clean/dirty state of those pinned
pages.  For system IOMMU dirty page tracking, that potentially might
mean that we support IOMMU page faults and the container manages those
faults such that the container is the central record of dirty pages.
Until these interfaces are designed, we can only speculate, but the
goal is to design a user interface compatible with how those features
might evolve.  If you identify something that can't work, please raise
the issue.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.
  2019-11-15  3:21               ` Alex Williamson
@ 2019-11-15  5:10                 ` Tian, Kevin
  2019-11-19 23:16                   ` Alex Williamson
  2019-11-20  1:51                 ` Yan Zhao
  1 sibling, 1 reply; 46+ messages in thread
From: Tian, Kevin @ 2019-11-15  5:10 UTC (permalink / raw)
  To: Alex Williamson, Zhao, Yan Y
  Cc: Kirti Wankhede, cjia, Yang, Ziye, Liu, Changpeng, Liu, Yi L,
	mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, Wang,
	Zhi A, qemu-devel, kvm

> From: Alex Williamson
> Sent: Friday, November 15, 2019 11:22 AM
> 
> On Thu, 14 Nov 2019 21:40:35 -0500
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote:
> > > On Fri, 15 Nov 2019 00:26:07 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >
> > > > On 11/14/2019 1:37 AM, Alex Williamson wrote:
> > > > > On Thu, 14 Nov 2019 01:07:21 +0530
> > > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > >
> > > > >> On 11/13/2019 4:00 AM, Alex Williamson wrote:
> > > > >>> On Tue, 12 Nov 2019 22:33:37 +0530
> > > > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > >>>
> > > > > >>>> All pages pinned by vendor driver through vfio_pin_pages API should be
> > > > > >>>> considered as dirty during migration. IOMMU container maintains a list of
> > > > > >>>> all such pinned pages. Added an ioctl defination to get bitmap of such
> > > > >>>
> > > > >>> definition
> > > > >>>
> > > > >>>> pinned pages for requested IO virtual address range.
> > > > >>>
> > > > > >>> Additionally, all mapped pages are considered dirty when physically
> > > > > >>> mapped through to an IOMMU, modulo we discussed devices opting in to
> > > > > >>> per page pinning to indicate finer granularity with a TBD mechanism to
> > > > > >>> figure out if any non-opt-in devices remain.
> > > > >>>
> > > > >>
> > > > > >> You mean, in case of device direct assignment (device pass through)?
> > > > >
> > > > > > Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
> > > > > > pinned and mapped, then the correct dirty page set is all mapped pages.
> > > > > > We discussed using the vpfn list as a mechanism for vendor drivers to
> > > > > > reduce their migration footprint, but we also discussed that we would
> > > > > need a way to determine that all participants in the container have
> > > > > explicitly pinned their working pages or else we must consider the
> > > > > entire potential working set as dirty.
> > > > >
> > > >
> > > > How can vendor driver tell this capability to iommu module? Any suggestions?
> > >
> > > I think it does so by pinning pages.  Is it acceptable that if the
> > > vendor driver pins any pages, then from that point forward we consider
> > > the IOMMU group dirty page scope to be limited to pinned pages?  There
> > > are complications around non-singleton IOMMU groups, but I think we're
> > > already leaning towards that being a non-worthwhile problem to solve.
> > > So if we require that only singleton IOMMU groups can pin pages and we
> > > pass the IOMMU group as a parameter to
> > > vfio_iommu_driver_ops.pin_pages(), then the type1 backend can set a
> > > flag on its local vfio_group struct to indicate dirty page scope is
> > > limited to pinned pages.  We might want to keep a flag on the
> > > vfio_iommu struct to indicate if all of the vfio_groups for each
> > > vfio_domain in the vfio_iommu.domain_list dirty page scope limited to
> > > pinned pages as an optimization to avoid walking lists too often.  Then
> > > we could test if vfio_iommu.domain_list is not empty and this new flag
> > > does not limit the dirty page scope, then everything within each
> > > vfio_dma is considered dirty.
> > >
> >
> > hi Alex
> > could you help clarify whether my understandings below are right?
> > In future,
> > 1. for mdev and for passthrough device withoug hardware ability to track
> > dirty pages, the vendor driver has to explicitly call
> > vfio_pin_pages()/vfio_unpin_pages() + a flag to tell vfio its dirty page set.
> 
> For non-IOMMU backed mdevs without hardware dirty page tracking,
> there's no change to the vendor driver currently.  Pages pinned by the
> vendor driver are marked as dirty.

What if the vendor driver can figure out, by software means, which
pinned pages are actually dirty? In that case, would a separate mark_dirty
interface make more sense? Or we could introduce a read/write flag to the
pin_pages interface, similar to the DMA API. Existing drivers would always
set both r/w flags, but a specific driver could set read-only or write-only...

> 
> For any IOMMU backed device, mdev or direct assignment, all mapped
> memory would be considered dirty unless there are explicit calls to pin
> pages on top of the IOMMU page pinning and mapping.  These would likely
> be enabled only when the device is in the _SAVING device_state.
> 
> > 2. for those devices with hardware ability to track dirty pages, will still
> > provide a callback to vendor driver to get dirty pages. (as for those devices,
> > it is hard to explicitly call vfio_pin_pages()/vfio_unpin_pages())
> >
> > 3. for devices relying on dirty bit info in physical IOMMU, there
> > will be a callback to physical IOMMU driver to get dirty page set from
> > vfio.
> 
> The proposal here does not cover exactly how these would be
> implemented, it only establishes the container as the point of user
> interaction with the dirty bitmap and hopefully allows us to maintain
> that interface regardless of whether we have dirty tracking at the
> device or the system IOMMU.  Ideally devices with dirty tracking would
> make use of page pinning and we'd extend the interface to allow vendor
> drivers the ability to indicate the clean/dirty state of those pinned

I don't think "dirty tracking" == "page pinning". It's possible that a device
supports tracking/logging dirty page info into a driver-registered buffer,
so the host vendor driver doesn't need to mediate fast-path operations.
In such a case, the entire guest memory is always pinned and we just need
a log-sync like interface for the vendor driver to fill the dirty bitmap.

> pages.  For system IOMMU dirty page tracking, that potentially might
> mean that we support IOMMU page faults and the container manages those
> faults such that the container is the central record of dirty pages.

An IOMMU dirty bit is not equivalent to an IOMMU page fault. The latter
is much more complex and requires support both in the IOMMU and in the
device. Here, similar to the device-dirty-tracking case above, we just need
a log-sync interface calling into the iommu driver to get dirty info filled
for the requested address range.

> Until these interfaces are designed, we can only speculate, but the
> goal is to design a user interface compatible with how those features
> might evolve.  If you identify something that can't work, please raise
> the issue.  Thanks,
> 
> Alex

Here is the desired scheme in my mind. Feel free to correct me. :-)

1. iommu log-buf callback is preferred if underlying IOMMU reports
such capability. The iommu driver walks IOMMU page table to find
dirty pages for requested address range;
2. otherwise vendor driver log-buf callback is preferred if the vendor
driver reports such capability when registering mdev types. The
vendor driver calls device-specific interface to fill dirty info;
3. otherwise pages pinned by vfio_pin_pages (with WRITE flag) are
considered dirty. This covers normal mediated devices or using
fast-path mediation for migrating passthrough device;
4. otherwise all mapped pages are considered dirty;

Currently we're working on 1) based on VT-d rev3.0. I know some
vendors implement 2) in their own code base. 3) has real usages 
already. 4) is the fall-back.
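
To make the ordering concrete, here is a minimal sketch of the selection
logic. It is only an illustration: the enum, struct, and function names are
hypothetical and are not existing VFIO or IOMMU interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for the four cases in the scheme above. */
enum dirty_source {
	DIRTY_SRC_IOMMU_LOG,	/* 1) IOMMU reports a dirty-log capability */
	DIRTY_SRC_VENDOR_LOG,	/* 2) vendor driver reports a log-buf callback */
	DIRTY_SRC_PINNED_PAGES,	/* 3) pages pinned through vfio_pin_pages */
	DIRTY_SRC_ALL_MAPPED,	/* 4) fall-back: every mapped page is dirty */
};

/* Hypothetical capability summary for one container. */
struct dirty_caps {
	bool iommu_has_dirty_log;
	bool vendor_has_dirty_log;
	bool vendor_pins_pages;
};

/* Walk the cases in priority order and return the first one that applies. */
enum dirty_source pick_dirty_source(const struct dirty_caps *caps)
{
	assert(caps != NULL);

	if (caps->iommu_has_dirty_log)
		return DIRTY_SRC_IOMMU_LOG;
	if (caps->vendor_has_dirty_log)
		return DIRTY_SRC_VENDOR_LOG;
	if (caps->vendor_pins_pages)
		return DIRTY_SRC_PINNED_PAGES;
	return DIRTY_SRC_ALL_MAPPED;
}
```

The point of the sketch is only that each case is checked after the more
precise one before it, so case 4) is reached only when nothing better is
reported.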

Alex, are you willing to have all the interfaces ready in one batch,
or support them based on available usages? I'm fine with either
way, but even if we just do 3/4 in this series, I'd prefer to have the
above scheme included in a code comment, to give the whole
picture of all possible situations. :-)

Thanks
Kevin


* Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.
  2019-11-15  5:10                 ` Tian, Kevin
@ 2019-11-19 23:16                   ` Alex Williamson
  2019-11-20  1:04                     ` Tian, Kevin
  0 siblings, 1 reply; 46+ messages in thread
From: Alex Williamson @ 2019-11-19 23:16 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Zhao, Yan Y, Kirti Wankhede, cjia, Yang, Ziye, Liu, Changpeng,
	Liu, Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

On Fri, 15 Nov 2019 05:10:53 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson
> > Sent: Friday, November 15, 2019 11:22 AM
> > 
> > On Thu, 14 Nov 2019 21:40:35 -0500
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> > > On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote:  
> > > > On Fri, 15 Nov 2019 00:26:07 +0530
> > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > >  
> > > > > On 11/14/2019 1:37 AM, Alex Williamson wrote:  
> > > > > > On Thu, 14 Nov 2019 01:07:21 +0530
> > > > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > > >  
> > > > > >> On 11/13/2019 4:00 AM, Alex Williamson wrote:  
> > > > > >>> On Tue, 12 Nov 2019 22:33:37 +0530
> > > > > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > > >>>  
> > > > > > >>>> All pages pinned by vendor driver through vfio_pin_pages API should be
> > > > > > >>>> considered as dirty during migration. IOMMU container maintains a list of
> > > > > > >>>> all such pinned pages. Added an ioctl defination to get bitmap of such
> > > > > >>>
> > > > > >>> definition
> > > > > >>>  
> > > > > >>>> pinned pages for requested IO virtual address range.  
> > > > > >>>
> > > > > > >>> Additionally, all mapped pages are considered dirty when physically
> > > > > > >>> mapped through to an IOMMU, modulo we discussed devices opting in to
> > > > > > >>> per page pinning to indicate finer granularity with a TBD mechanism to
> > > > > > >>> figure out if any non-opt-in devices remain.
> > > > > >>>  
> > > > > >>
> > > > > > >> You mean, in case of device direct assignment (device pass through)?
> > > > > >
> > > > > > > Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
> > > > > > > pinned and mapped, then the correct dirty page set is all mapped pages.
> > > > > > > We discussed using the vpfn list as a mechanism for vendor drivers to
> > > > > > > reduce their migration footprint, but we also discussed that we would
> > > > > > need a way to determine that all participants in the container have
> > > > > > explicitly pinned their working pages or else we must consider the
> > > > > > entire potential working set as dirty.
> > > > > >  
> > > > >
> > > > > > How can vendor driver tell this capability to iommu module? Any suggestions?
> > > >
> > > > I think it does so by pinning pages.  Is it acceptable that if the
> > > > vendor driver pins any pages, then from that point forward we consider
> > > > > the IOMMU group dirty page scope to be limited to pinned pages?  There
> > > > > are complications around non-singleton IOMMU groups, but I think we're
> > > > > already leaning towards that being a non-worthwhile problem to solve.
> > > > > So if we require that only singleton IOMMU groups can pin pages and we
> > > > pass the IOMMU group as a parameter to
> > > > vfio_iommu_driver_ops.pin_pages(), then the type1 backend can set a
> > > > flag on its local vfio_group struct to indicate dirty page scope is
> > > > limited to pinned pages.  We might want to keep a flag on the
> > > > vfio_iommu struct to indicate if all of the vfio_groups for each
> > > > vfio_domain in the vfio_iommu.domain_list dirty page scope limited to
> > > > pinned pages as an optimization to avoid walking lists too often.  Then
> > > > we could test if vfio_iommu.domain_list is not empty and this new flag
> > > > does not limit the dirty page scope, then everything within each
> > > > vfio_dma is considered dirty.
> > > >  
> > >
> > > hi Alex
> > > could you help clarify whether my understandings below are right?
> > > In future,
> > > > 1. for mdev and for passthrough device without hardware ability to track
> > > dirty pages, the vendor driver has to explicitly call
> > > vfio_pin_pages()/vfio_unpin_pages() + a flag to tell vfio its dirty page set.  
> > 
> > For non-IOMMU backed mdevs without hardware dirty page tracking,
> > there's no change to the vendor driver currently.  Pages pinned by the
> > vendor driver are marked as dirty.  
> 
> What about the vendor driver can figure out, in software means, which
> pinned pages are actually dirty? In that case, would a separate mark_dirty
> interface make more sense? Or introduce read/write flag to the pin_pages
> interface similar to DMA API? Existing drivers always set both r/w flags but
> just in case then a specific driver may set read-only or write-only...

You're jumping ahead to 2. below, where my reply is that we need to
extend the interface to allow the vendor driver to manipulate
clean/dirty state.  I don't know exactly what those interfaces should
look like, but yes, something should exist to allow that control.  If
the default is to mark pinned pages dirty, then we might need a
separate pin_pages_clean callback.

> > For any IOMMU backed device, mdev or direct assignment, all mapped
> > memory would be considered dirty unless there are explicit calls to pin
> > pages on top of the IOMMU page pinning and mapping.  These would likely
> > be enabled only when the device is in the _SAVING device_state.
> >   
> > > 2. for those devices with hardware ability to track dirty pages, will still
> > > > provide a callback to vendor driver to get dirty pages. (as for those devices,
> > > it is hard to explicitly call vfio_pin_pages()/vfio_unpin_pages())
> > >
> > > 3. for devices relying on dirty bit info in physical IOMMU, there
> > > will be a callback to physical IOMMU driver to get dirty page set from
> > > vfio.  
> > 
> > The proposal here does not cover exactly how these would be
> > implemented, it only establishes the container as the point of user
> > interaction with the dirty bitmap and hopefully allows us to maintain
> > that interface regardless of whether we have dirty tracking at the
> > device or the system IOMMU.  Ideally devices with dirty tracking would
> > make use of page pinning and we'd extend the interface to allow vendor
> > drivers the ability to indicate the clean/dirty state of those pinned  
> 
> I don't think "dirty tracking" == "page pinning". It's possible that a device
> support tracking/logging dirty page info into a driver-registered buffer, 
> then the host vendor driver doesn't need to mediate fast-path operations. 
> In such case, the entire guest memory is always pinned and we just need 
> a log-sync like interface for vendor driver to fill dirty bitmap.

An mdev device only has access to the pages that have been pinned on
behalf of the device, so just as we assume that any page pinned and
mapped through the IOMMU might be dirtied by a device, we can by
default assume that any page pinned for an mdev device is dirty.  This
maps fairly well to our existing mdev devices that don't seem to have
finer granularity dirty page tracking.  As I state above though, I
certainly don't expect this to be the final extent of dirty page
tracking.  I'm reluctant to commit to a log-sync interface though as
that seems to put the responsibility on the container to poll every
attached device whereas I was rather hoping that making the container
the central interface for dirty tracking would have devices marking
dirty pages in the container asynchronous to the user polling.
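
A rough userspace sketch of that push model, where a device marks pages in a
container-owned bitmap and the user's query just reads and clears it. All
names are hypothetical, the fixed-size bitmap and 4K page size are
assumptions for illustration, and locking is deliberately omitted.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SHIFT 12	/* assumed 4K pages */
#define SKETCH_NPAGES     256	/* assumed size of the tracked range */
#define SKETCH_WORD_BITS  (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_NWORDS     (SKETCH_NPAGES / SKETCH_WORD_BITS)

/* Hypothetical container state: one dirty bit per page. */
struct container_sketch {
	unsigned long dirty[SKETCH_NWORDS];
};

/* Called by a (hypothetical) vendor driver whenever it knows a page was written. */
void sketch_mark_dirty(struct container_sketch *c, unsigned long iova)
{
	unsigned long pfn = iova >> SKETCH_PAGE_SHIFT;

	assert(pfn < SKETCH_NPAGES);
	c->dirty[pfn / SKETCH_WORD_BITS] |= 1UL << (pfn % SKETCH_WORD_BITS);
}

/*
 * Called on behalf of the user: copy the bitmap out, clear it, and return
 * the number of dirty pages.  No device is polled here.
 */
size_t sketch_get_dirty(struct container_sketch *c, unsigned long *out)
{
	size_t ndirty = 0;

	for (size_t i = 0; i < SKETCH_NWORDS; i++) {
		unsigned long w = c->dirty[i];

		out[i] = w;
		for (; w; w &= w - 1)	/* count set bits */
			ndirty++;
	}
	memset(c->dirty, 0, sizeof(c->dirty));
	return ndirty;
}
```

In this model the query path never calls into a device; the cost is that
every device must report writes as they are detected.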

> > pages.  For system IOMMU dirty page tracking, that potentially might
> > mean that we support IOMMU page faults and the container manages those
> > faults such that the container is the central record of dirty pages.  
> 
> IOMMU dirty-bit is not equivalent to IOMMU page fault. The latter
> is much more complex which requires support both in IOMMU and in
> device. Here similar to above device-dirty-tracking case, we just need a
> log-sync interface calling into iommu driver to get dirty info filled for
> requested address range.
> 
> > Until these interfaces are designed, we can only speculate, but the
> > goal is to design a user interface compatible with how those features
> > might evolve.  If you identify something that can't work, please raise
> > the issue.  Thanks,
> > 
> > Alex  
> 
> Here is the desired scheme in my mind. Feel free to correct me. :-)
> 
> 1. iommu log-buf callback is preferred if underlying IOMMU reports
> such capability. The iommu driver walks IOMMU page table to find
> dirty pages for requested address range;
> 2. otherwise vendor driver log-buf callback is preferred if the vendor
> driver reports such capability when registering mdev types. The
> vendor driver calls device-specific interface to fill dirty info;
> 3. otherwise pages pinned by vfio_pin_pages (with WRITE flag) are
> considered dirty. This covers normal mediated devices or using
> fast-path mediation for migrating passthrough device;
> 4. otherwise all mapped pages are considered dirty;
> 
> Currently we're working on 1) based on VT-d rev3.0. I know some
> vendors implement 2) in their own code base. 3) has real usages 
> already. 4) is the fall-back.
> 
> Alex, are you willing to have all the interfaces ready in one batch,
> or support them based on available usages? I'm fine with either
> way, but even just doing 3/4 in this series, I'd prefer to having
> above scheme included in the code comment, to give the whole 
> picture of all possible situations. :-)

My intention was to cover 3 & 4 given the current state of device and
IOMMU dirty tracking.  I expect the user interface to remain unchanged
for 1 & 2 but the API between vfio, the vendor drivers, and the IOMMU
is internal to the kernel and more flexible.  Thanks,

Alex



* RE: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.
  2019-11-19 23:16                   ` Alex Williamson
@ 2019-11-20  1:04                     ` Tian, Kevin
  0 siblings, 0 replies; 46+ messages in thread
From: Tian, Kevin @ 2019-11-20  1:04 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhao, Yan Y, Kirti Wankhede, cjia, Yang, Ziye, Liu, Changpeng,
	Liu, Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Wednesday, November 20, 2019 7:17 AM
> 
> On Fri, 15 Nov 2019 05:10:53 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson
> > > Sent: Friday, November 15, 2019 11:22 AM
> > >
> > > On Thu, 14 Nov 2019 21:40:35 -0500
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > > On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote:
> > > > > On Fri, 15 Nov 2019 00:26:07 +0530
> > > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > >
> > > > > > On 11/14/2019 1:37 AM, Alex Williamson wrote:
> > > > > > > On Thu, 14 Nov 2019 01:07:21 +0530
> > > > > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > > > >
> > > > > > >> On 11/13/2019 4:00 AM, Alex Williamson wrote:
> > > > > > >>> On Tue, 12 Nov 2019 22:33:37 +0530
> > > > > > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > > > >>>
> > > > > > > >>>> All pages pinned by vendor driver through vfio_pin_pages API should be
> > > > > > > >>>> considered as dirty during migration. IOMMU container maintains a list of
> > > > > > > >>>> all such pinned pages. Added an ioctl defination to get bitmap of such
> > > > > > >>>
> > > > > > >>> definition
> > > > > > >>>
> > > > > > >>>> pinned pages for requested IO virtual address range.
> > > > > > >>>
> > > > > > > >>> Additionally, all mapped pages are considered dirty when physically
> > > > > > > >>> mapped through to an IOMMU, modulo we discussed devices opting in to
> > > > > > > >>> per page pinning to indicate finer granularity with a TBD mechanism to
> > > > > > > >>> figure out if any non-opt-in devices remain.
> > > > > > >>>
> > > > > > >>
> > > > > > > >> You mean, in case of device direct assignment (device pass through)?
> > > > > > >
> > > > > > > > Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
> > > > > > > > pinned and mapped, then the correct dirty page set is all mapped pages.
> > > > > > > > We discussed using the vpfn list as a mechanism for vendor drivers to
> > > > > > > > reduce their migration footprint, but we also discussed that we would
> > > > > > > > need a way to determine that all participants in the container have
> > > > > > > > explicitly pinned their working pages or else we must consider the
> > > > > > > > entire potential working set as dirty.
> > > > > > >
> > > > > >
> > > > > > > How can vendor driver tell this capability to iommu module? Any suggestions?
> > > > >
> > > > > > I think it does so by pinning pages.  Is it acceptable that if the
> > > > > > vendor driver pins any pages, then from that point forward we consider
> > > > > > the IOMMU group dirty page scope to be limited to pinned pages?  There
> > > > > > are complications around non-singleton IOMMU groups, but I think we're
> > > > > > already leaning towards that being a non-worthwhile problem to solve.
> > > > > > So if we require that only singleton IOMMU groups can pin pages and we
> > > > > > pass the IOMMU group as a parameter to
> > > > > > vfio_iommu_driver_ops.pin_pages(), then the type1 backend can set a
> > > > > > flag on its local vfio_group struct to indicate dirty page scope is
> > > > > > limited to pinned pages.  We might want to keep a flag on the
> > > > > > vfio_iommu struct to indicate if all of the vfio_groups for each
> > > > > > vfio_domain in the vfio_iommu.domain_list dirty page scope limited to
> > > > > > pinned pages as an optimization to avoid walking lists too often.  Then
> > > > > > we could test if vfio_iommu.domain_list is not empty and this new flag
> > > > > > does not limit the dirty page scope, then everything within each
> > > > > > vfio_dma is considered dirty.
> > > > >
> > > >
> > > > hi Alex
> > > > could you help clarify whether my understandings below are right?
> > > > In future,
> > > > > 1. for mdev and for passthrough device without hardware ability to track
> > > > > dirty pages, the vendor driver has to explicitly call
> > > > > vfio_pin_pages()/vfio_unpin_pages() + a flag to tell vfio its dirty page set.
> > >
> > > For non-IOMMU backed mdevs without hardware dirty page tracking,
> > > there's no change to the vendor driver currently.  Pages pinned by the
> > > vendor driver are marked as dirty.
> >
> > What about the vendor driver can figure out, in software means, which
> > pinned pages are actually dirty? In that case, would a separate mark_dirty
> > > interface make more sense? Or introduce read/write flag to the pin_pages
> > > interface similar to DMA API? Existing drivers always set both r/w flags but
> > > just in case then a specific driver may set read-only or write-only...
> 
> You're jumping ahead to 2. below, where my reply is that we need to

They are different. 2) is about hardware support which may collect
dirty page info in a log buffer and then report it to the driver when the
latter requests. Here I'm talking about a software approach, i.e. when the
vendor driver intercepts certain guest operations, it may figure out
whether the captured DMA pages are used for write or read. The
hardware approach is log-based, which needs a driver callback so the
container can notify the vendor driver to report. The software approach is
trap-based, which needs a VFIO API to update dirty status synchronously.

> extend the interface to allow the vendor driver to manipulate
> clean/dirty state.  I don't know exactly what those interfaces should
> look like, but yes, something should exist to allow that control.  If
> the default is to mark pinned pages dirty, then we might need a
> separate pin_pages_clean callback.
> 
> > > For any IOMMU backed device, mdev or direct assignment, all mapped
> > > memory would be considered dirty unless there are explicit calls to pin
> > > > pages on top of the IOMMU page pinning and mapping.  These would likely
> > > be enabled only when the device is in the _SAVING device_state.
> > >
> > > > 2. for those devices with hardware ability to track dirty pages, will still
> > > > > provide a callback to vendor driver to get dirty pages. (as for those devices,
> > > > it is hard to explicitly call vfio_pin_pages()/vfio_unpin_pages())
> > > >
> > > > 3. for devices relying on dirty bit info in physical IOMMU, there
> > > > will be a callback to physical IOMMU driver to get dirty page set from
> > > > vfio.
> > >
> > > The proposal here does not cover exactly how these would be
> > > implemented, it only establishes the container as the point of user
> > > interaction with the dirty bitmap and hopefully allows us to maintain
> > > that interface regardless of whether we have dirty tracking at the
> > > device or the system IOMMU.  Ideally devices with dirty tracking would
> > > > make use of page pinning and we'd extend the interface to allow vendor
> > > drivers the ability to indicate the clean/dirty state of those pinned
> >
> > I don't think "dirty tracking" == "page pinning". It's possible that a device
> > support tracking/logging dirty page info into a driver-registered buffer,
> > then the host vendor driver doesn't need to mediate fast-path operations.
> > In such case, the entire guest memory is always pinned and we just need
> > a log-sync like interface for vendor driver to fill dirty bitmap.
> 
> An mdev device only has access to the pages that have been pinned on
> behalf of the device, so just as we assume that any page pinned and
> mapped through the IOMMU might be dirtied by a device, we can by
> default assume that any page pinned for an mdev device is dirty.  This
> maps fairly well to our existing mdev devices that don't seem to have
> finer granularity dirty page tracking.  As I state above though, I
> certainly don't expect this to be the final extent of dirty page
> tracking.  I'm reluctant to commit to a log-sync interface though as
> that seems to put the responsibility on the container to poll every
> attached device whereas I was rather hoping that making the container
> the central interface for dirty tracking would have devices marking
> dirty pages in the container asynchronous to the user polling.

Having the device mark dirty pages asynchronously only applies to the
software approach, which tracks dirty pages by mediating fast-path
guest operations. In case the device logs dirty info into a buffer,
we need a log-sync interface so the vendor driver can be notified to
collect the hardware-logged information. Whether to have the vendor driver
directly update the container's dirty bitmap, or have the vendor driver
call mark_dirty for every recorded dirty page, is just an implementation
choice and you make the decision. :-) This is actually similar to IOMMU
dirty-bit collection, where we also need an interface to notify the IOMMU
driver to collect its dirty bits.
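
As a sketch of the log-sync (pull) model for the hardware-logging case: the
container calls a per-device callback, and the callback drains a log buffer
into the bitmap for the requested range. Everything here is hypothetical;
the log buffer is simulated as an array of IOVAs, 4K pages are assumed, and
none of these names are real VFIO interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SHIFT 12	/* assumed 4K pages */

/* Hypothetical per-device callback invoked when the user asks for dirty pages. */
typedef void (*log_sync_fn)(void *priv, uint64_t start, size_t npages,
			    uint64_t *bitmap);

struct device_sketch {
	log_sync_fn log_sync;	/* NULL if the device has no dirty log */
	void *priv;
};

/* Simulated hardware log: IOVAs the device has written to. */
struct hw_log_sketch {
	uint64_t iovas[8];
	size_t count;
};

/* Drain in-range log entries into the bitmap; keep out-of-range ones. */
void hw_log_sync(void *priv, uint64_t start, size_t npages, uint64_t *bitmap)
{
	struct hw_log_sketch *log = priv;
	size_t kept = 0;

	for (size_t i = 0; i < log->count; i++) {
		uint64_t iova = log->iovas[i];
		uint64_t pfn = (iova - start) >> SKETCH_PAGE_SHIFT;

		if (iova < start || pfn >= npages) {
			log->iovas[kept++] = iova;
			continue;
		}
		bitmap[pfn / 64] |= UINT64_C(1) << (pfn % 64);
	}
	log->count = kept;
}

/* Container side: notify every device that has a log_sync callback. */
void container_log_sync(struct device_sketch *devs, size_t ndevs,
			uint64_t start, size_t npages, uint64_t *bitmap)
{
	assert(bitmap != NULL);
	memset(bitmap, 0, ((npages + 63) / 64) * sizeof(uint64_t));

	for (size_t i = 0; i < ndevs; i++)
		if (devs[i].log_sync)
			devs[i].log_sync(devs[i].priv, start, npages, bitmap);
}
```

Whether the callback fills the bitmap directly, as here, or calls a
mark_dirty helper per page is the implementation choice mentioned above.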

> 
> > > pages.  For system IOMMU dirty page tracking, that potentially might
> > > > mean that we support IOMMU page faults and the container manages those
> > > faults such that the container is the central record of dirty pages.
> >
> > IOMMU dirty-bit is not equivalent to IOMMU page fault. The latter
> > is much more complex which requires support both in IOMMU and in
> > device. Here similar to above device-dirty-tracking case, we just need a
> > log-sync interface calling into iommu driver to get dirty info filled for
> > requested address range.
> >
> > > Until these interfaces are designed, we can only speculate, but the
> > > goal is to design a user interface compatible with how those features
> > > might evolve.  If you identify something that can't work, please raise
> > > the issue.  Thanks,
> > >
> > > Alex
> >
> > Here is the desired scheme in my mind. Feel free to correct me. :-)
> >
> > 1. iommu log-buf callback is preferred if underlying IOMMU reports
> > such capability. The iommu driver walks IOMMU page table to find
> > dirty pages for requested address range;
> > 2. otherwise vendor driver log-buf callback is preferred if the vendor
> > driver reports such capability when registering mdev types. The
> > vendor driver calls device-specific interface to fill dirty info;
> > 3. otherwise pages pinned by vfio_pin_pages (with WRITE flag) are
> > considered dirty. This covers normal mediated devices or using
> > fast-path mediation for migrating passthrough device;
> > 4. otherwise all mapped pages are considered dirty;
> >
> > Currently we're working on 1) based on VT-d rev3.0. I know some
> > vendors implement 2) in their own code base. 3) has real usages
> > already. 4) is the fall-back.
> >
> > Alex, are you willing to have all the interfaces ready in one batch,
> > or support them based on available usages? I'm fine with either
> > way, but even just doing 3/4 in this series, I'd prefer to having
> > above scheme included in the code comment, to give the whole
> > picture of all possible situations. :-)
> 
> My intention was to cover 3 & 4 given the current state of device and
> IOMMU dirty tracking.  I expect the user interface to remain unchanged
> for 1 & 2 but the API between vfio, the vendor drivers, and the IOMMU
> is internal to the kernel and more flexible.  Thanks,
> 

Sure. I also don't expect any change to the user space API when extending
to 1 and 2. Here our discussion is purely about kernel internal APIs.

Thanks
Kevin


* Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.
  2019-11-15  3:21               ` Alex Williamson
  2019-11-15  5:10                 ` Tian, Kevin
@ 2019-11-20  1:51                 ` Yan Zhao
  1 sibling, 0 replies; 46+ messages in thread
From: Yan Zhao @ 2019-11-20  1:51 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng,
	Liu, Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

On Fri, Nov 15, 2019 at 11:21:33AM +0800, Alex Williamson wrote:
> On Thu, 14 Nov 2019 21:40:35 -0500
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote:
> > > On Fri, 15 Nov 2019 00:26:07 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >   
> > > > On 11/14/2019 1:37 AM, Alex Williamson wrote:  
> > > > > On Thu, 14 Nov 2019 01:07:21 +0530
> > > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > >     
> > > > >> On 11/13/2019 4:00 AM, Alex Williamson wrote:    
> > > > >>> On Tue, 12 Nov 2019 22:33:37 +0530
> > > > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > >>>        
> > > > >>>> All pages pinned by vendor driver through vfio_pin_pages API should be
> > > > >>>> considered as dirty during migration. IOMMU container maintains a list of
> > > > >>>> all such pinned pages. Added an ioctl defination to get bitmap of such    
> > > > >>>
> > > > >>> definition
> > > > >>>        
> > > > >>>> pinned pages for requested IO virtual address range.    
> > > > >>>
> > > > >>> Additionally, all mapped pages are considered dirty when physically
> > > > >>> mapped through to an IOMMU, modulo we discussed devices opting in to
> > > > >>> per page pinning to indicate finer granularity with a TBD mechanism to
> > > > >>> figure out if any non-opt-in devices remain.
> > > > >>>        
> > > > >>
> > > > >> You mean, in case of device direct assignment (device pass through)?    
> > > > > 
> > > > > Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
> > > > > pinned and mapped, then the correct dirty page set is all mapped pages.
> > > > > We discussed using the vpfn list as a mechanism for vendor drivers to
> > > > > reduce their migration footprint, but we also discussed that we would
> > > > > need a way to determine that all participants in the container have
> > > > > explicitly pinned their working pages or else we must consider the
> > > > > entire potential working set as dirty.
> > > > >     
> > > > 
> > > > How can vendor driver tell this capability to iommu module? Any suggestions?  
> > > 
> > > I think it does so by pinning pages.  Is it acceptable that if the
> > > vendor driver pins any pages, then from that point forward we consider
> > > the IOMMU group dirty page scope to be limited to pinned pages?  There
> > > are complications around non-singleton IOMMU groups, but I think we're
> > > already leaning towards that being a non-worthwhile problem to solve.
> > > So if we require that only singleton IOMMU groups can pin pages and we
> > > pass the IOMMU group as a parameter to
> > > vfio_iommu_driver_ops.pin_pages(), then the type1 backend can set a
> > > flag on its local vfio_group struct to indicate dirty page scope is
> > > limited to pinned pages.  We might want to keep a flag on the
> > > vfio_iommu struct to indicate if all of the vfio_groups for each
> > > vfio_domain in the vfio_iommu.domain_list dirty page scope limited to
> > > pinned pages as an optimization to avoid walking lists too often.  Then
> > > we could test if vfio_iommu.domain_list is not empty and this new flag
> > > does not limit the dirty page scope, then everything within each
> > > vfio_dma is considered dirty.
> > >  
> > 
> > hi Alex
> > could you help clarify whether my understandings below are right?
> > In future,
> > > 1. for mdev and for passthrough device without hardware ability to track
> > dirty pages, the vendor driver has to explicitly call
> > vfio_pin_pages()/vfio_unpin_pages() + a flag to tell vfio its dirty page set.
> 
> For non-IOMMU backed mdevs without hardware dirty page tracking,
> there's no change to the vendor driver currently.  Pages pinned by the
> vendor driver are marked as dirty.
> 
> For any IOMMU backed device, mdev or direct assignment, all mapped
> memory would be considered dirty unless there are explicit calls to pin
> pages on top of the IOMMU page pinning and mapping.  These would likely
> be enabled only when the device is in the _SAVING device_state.
> 
> > 2. for those devices with hardware ability to track dirty pages, will still
> > provide a callback to vendor driver to get dirty pages. (as for those devices,
> > it is hard to explicitly call vfio_pin_pages()/vfio_unpin_pages())
> >
> > 3. for devices relying on dirty bit info in physical IOMMU, there
> > will be a callback to physical IOMMU driver to get dirty page set from
> > vfio.
> 
> The proposal here does not cover exactly how these would be
> implemented, it only establishes the container as the point of user
> interaction with the dirty bitmap and hopefully allows us to maintain
> that interface regardless of whether we have dirty tracking at the
> device or the system IOMMU.  Ideally devices with dirty tracking would
> make use of page pinning and we'd extend the interface to allow vendor
> drivers the ability to indicate the clean/dirty state of those pinned
> pages.  For system IOMMU dirty page tracking, that potentially might
> mean that we support IOMMU page faults and the container manages those
> faults such that the container is the central record of dirty pages.
> Until these interfaces are designed, we can only speculate, but the
> goal is to design a user interface compatible with how those features
> might evolve.  If you identify something that can't work, please raise
> the issue.  Thanks,
> 
> Alex
>
Hi Alex,
I think there are two downsides to centralizing dirty page tracking in the
vfio container.
1. The vendor driver has to report dirty pages to the vfio container
immediately after it detects a dirty page.
It loses the freedom to record dirty pages in whatever way it likes and to
query them only on receiving a log_sync call.
E.g. it could record dirty page info in its internal hardware buffer, or
record it in the hardware IOMMU and query that by itself.

2. The vfio container, if based on pin_pages, only knows which pages are
dirty or not dirty, but doesn't have incremental information. That's why
Kirti's QEMU implementation only queries dirty pages after stopping the
device, right? But done that way, QEMU migration may generate a wrong
downtime expectation and cause a longer downtime. E.g. before stopping the
device, QEMU thinks there's only 100M of data left and that it can limit
migration downtime to a certain value. But after stopping the device and
querying dirty pages again, it finds there is actually 1000M more...

Those are my additional concerns about it.

Thanks
Yan


* Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.
  2019-11-14 21:06           ` Alex Williamson
  2019-11-15  2:40             ` Yan Zhao
@ 2019-11-26  0:57             ` Yan Zhao
  2019-12-03 18:04               ` Alex Williamson
  1 sibling, 1 reply; 46+ messages in thread
From: Yan Zhao @ 2019-11-26  0:57 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng,
	Liu, Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote:
> On Fri, 15 Nov 2019 00:26:07 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > On 11/14/2019 1:37 AM, Alex Williamson wrote:
> > > On Thu, 14 Nov 2019 01:07:21 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >   
> > >> On 11/13/2019 4:00 AM, Alex Williamson wrote:  
> > >>> On Tue, 12 Nov 2019 22:33:37 +0530
> > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >>>      
> > >>>> All pages pinned by vendor driver through vfio_pin_pages API should be
> > >>>> considered as dirty during migration. IOMMU container maintains a list of
> > >>>> all such pinned pages. Added an ioctl defination to get bitmap of such  
> > >>>
> > >>> definition
> > >>>      
> > >>>> pinned pages for requested IO virtual address range.  
> > >>>
> > >>> Additionally, all mapped pages are considered dirty when physically
> > >>> mapped through to an IOMMU, modulo we discussed devices opting in to
> > >>> per page pinning to indicate finer granularity with a TBD mechanism to
> > >>> figure out if any non-opt-in devices remain.
> > >>>      
> > >>
> > >> You mean, in case of device direct assignment (device pass through)?  
> > > 
> > > Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
> > > pinned and mapped, then the correct dirty page set is all mapped pages.
> > > We discussed using the vpfn list as a mechanism for vendor drivers to
> > > reduce their migration footprint, but we also discussed that we would
> > > need a way to determine that all participants in the container have
> > > explicitly pinned their working pages or else we must consider the
> > > entire potential working set as dirty.
> > >   
> > 
> > How can vendor driver tell this capability to iommu module? Any suggestions?
> 
> I think it does so by pinning pages.  Is it acceptable that if the
> vendor driver pins any pages, then from that point forward we consider
> the IOMMU group dirty page scope to be limited to pinned pages?  There
We should also be aware that the dirty page scope is pinned pages + unpinned
pages, which means that once a page has been pinned, it should be regarded as
dirty no matter whether it is unpinned later. Only after log_sync is called
and the dirty info retrieved should its dirty state be cleared.

> are complications around non-singleton IOMMU groups, but I think we're
> already leaning towards that being a non-worthwhile problem to solve.
> So if we require that only singleton IOMMU groups can pin pages and we
> pass the IOMMU group as a parameter to
> vfio_iommu_driver_ops.pin_pages(), then the type1 backend can set a
> flag on its local vfio_group struct to indicate dirty page scope is
> limited to pinned pages.  We might want to keep a flag on the
> vfio_iommu struct to indicate if all of the vfio_groups for each
> vfio_domain in the vfio_iommu.domain_list dirty page scope limited to
> pinned pages as an optimization to avoid walking lists too often.  Then
> we could test if vfio_iommu.domain_list is not empty and this new flag
> does not limit the dirty page scope, then everything within each
> vfio_dma is considered dirty.
>  
> > >>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > >>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> > >>>> ---
> > >>>>    include/uapi/linux/vfio.h | 23 +++++++++++++++++++++++
> > >>>>    1 file changed, 23 insertions(+)
> > >>>>
> > >>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > >>>> index 35b09427ad9f..6fd3822aa610 100644
> > >>>> --- a/include/uapi/linux/vfio.h
> > >>>> +++ b/include/uapi/linux/vfio.h
> > >>>> @@ -902,6 +902,29 @@ struct vfio_iommu_type1_dma_unmap {
> > >>>>    #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
> > >>>>    #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
> > >>>>    
> > >>>> +/**
> > >>>> + * VFIO_IOMMU_GET_DIRTY_BITMAP - _IOWR(VFIO_TYPE, VFIO_BASE + 17,
> > >>>> + *                                     struct vfio_iommu_type1_dirty_bitmap)
> > >>>> + *
> > >>>> + * IOCTL to get dirty pages bitmap for IOMMU container during migration.
> > >>>> + * Get dirty pages bitmap of given IO virtual addresses range using
> > >>>> + * struct vfio_iommu_type1_dirty_bitmap. Caller sets argsz, which is size of
> > >>>> + * struct vfio_iommu_type1_dirty_bitmap. User should allocate memory to get
> > >>>> + * bitmap and should set size of allocated memory in bitmap_size field.
> > >>>> + * One bit is used to represent per page consecutively starting from iova
> > >>>> + * offset. Bit set indicates page at that offset from iova is dirty.
> > >>>> + */
> > >>>> +struct vfio_iommu_type1_dirty_bitmap {
> > >>>> +	__u32        argsz;
> > >>>> +	__u32        flags;
> > >>>> +	__u64        iova;                      /* IO virtual address */
> > >>>> +	__u64        size;                      /* Size of iova range */
> > >>>> +	__u64        bitmap_size;               /* in bytes */  
> > >>>
> > >>> This seems redundant.  We can calculate the size of the bitmap based on
> > >>> the iova size.
> > >>>     
> > >>
> > >> But in kernel space, we need to validate the size of memory allocated by
> > >> user instead of assuming user is always correct, right?  
> > > 
> > > What does it buy us for the user to tell us the size?  They could be
> > > wrong, they could be malicious.  The argsz field on the ioctl is mostly
> > > for the handshake that the user is competent, we should get faults from
> > > the copy-user operation if it's incorrect.
> > >  
> > 
> > It is to mainly fail safe.
> > 
> > >>>> +	void __user *bitmap;                    /* one bit per page */  
> > >>>
> > >>> Should we define that as a __u64* to (a) help with the size
> > >>> calculation, and (b) assure that we can use 8-byte ops on it?
> > >>>
> > >>> However, who defines page size?  Is it necessarily the processor page
> > >>> size?  A physical IOMMU may support page sizes other than the CPU page
> > >>> size.  It might be more important to indicate the expected page size
> > >>> than the bitmap size.  Thanks,
> > >>>     
> > >>
> > >> I see in QEMU and in vfio_iommu_type1 module, page sizes considered for
> > >> mapping are CPU page size, 4K. Do we still need to have such argument?  
> > > 
> > > That assumption exists for backwards compatibility prior to supporting
> > > the iova_pgsizes field in vfio_iommu_type1_info.  AFAIK the current
> > > interface has no page size assumptions and we should not add any.  
> > 
> > So userspace has iova_pgsizes information, which can be input to this 
> > ioctl. Bitmap should be considering smallest page size. Does that makes 
> > sense?
> 
> I'm not sure.  I thought I had an argument that the iova_pgsize could
> indicate support for sizes smaller than the processor page size, which
> would make the user responsible for using a different base for their
> page size, but vfio_pgsize_bitmap() already masks out sub-page sizes.
> Clearly the vendor driver is pinning based on processor sized pages,
> but that's independent of an IOMMU and not part of a user ABI.
> 
> I'm tempted to say your bitmap_size field has a use here, but it seems
> to fail in validating the user page size at the low extremes.  For
> example if we have a single page mapping, the user can specify the iova
> size as 4K (for example), but the minimum bitmap_size they can indicate
> is 1 byte, would we therefore assume the user's bitmap page size is 512
> bytes (ie. they provided us with 8 bits to describe a 4K range)?  We'd
> need to be careful to specify that the minimum iova_pgsize indicated
> page size is our lower bound as well.  But then what do we do if the
> user provides us with a smaller buffer than we expect?  For example, a
> 128MB iova range and only an 8-byte buffer.  Do we go ahead and assume
> a 2MB page size and fill the bitmap accordingly or do we generate an
> error?  If the latter, might we support that at some point in time and
> is it sufficient to let the user perform trial and error to test if that
> exists?  Thanks,
> 
> Alex
> 


* Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.
  2019-11-26  0:57             ` Yan Zhao
@ 2019-12-03 18:04               ` Alex Williamson
  2019-12-04 18:10                 ` Kirti Wankhede
  0 siblings, 1 reply; 46+ messages in thread
From: Alex Williamson @ 2019-12-03 18:04 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Kirti Wankhede, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng,
	Liu, Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

On Mon, 25 Nov 2019 19:57:39 -0500
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote:
> > On Fri, 15 Nov 2019 00:26:07 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> > > On 11/14/2019 1:37 AM, Alex Williamson wrote:  
> > > > On Thu, 14 Nov 2019 01:07:21 +0530
> > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > >     
> > > >> On 11/13/2019 4:00 AM, Alex Williamson wrote:    
> > > >>> On Tue, 12 Nov 2019 22:33:37 +0530
> > > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > >>>        
> > > >>>> All pages pinned by vendor driver through vfio_pin_pages API should be
> > > >>>> considered as dirty during migration. IOMMU container maintains a list of
> > > >>>> all such pinned pages. Added an ioctl defination to get bitmap of such    
> > > >>>
> > > >>> definition
> > > >>>        
> > > >>>> pinned pages for requested IO virtual address range.    
> > > >>>
> > > >>> Additionally, all mapped pages are considered dirty when physically
> > > >>> mapped through to an IOMMU, modulo we discussed devices opting in to
> > > >>> per page pinning to indicate finer granularity with a TBD mechanism to
> > > >>> figure out if any non-opt-in devices remain.
> > > >>>        
> > > >>
> > > >> You mean, in case of device direct assignment (device pass through)?    
> > > > 
> > > > Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
> > > > pinned and mapped, then the correct dirty page set is all mapped pages.
> > > > We discussed using the vpfn list as a mechanism for vendor drivers to
> > > > reduce their migration footprint, but we also discussed that we would
> > > > need a way to determine that all participants in the container have
> > > > explicitly pinned their working pages or else we must consider the
> > > > entire potential working set as dirty.
> > > >     
> > > 
> > > How can vendor driver tell this capability to iommu module? Any suggestions?  
> > 
> > I think it does so by pinning pages.  Is it acceptable that if the
> > vendor driver pins any pages, then from that point forward we consider
> > the IOMMU group dirty page scope to be limited to pinned pages?  There  
> We should also be aware that the dirty page scope is pinned pages + unpinned
> pages, which means that once a page has been pinned, it should be regarded as
> dirty no matter whether it is unpinned later. Only after log_sync is called
> and the dirty info retrieved should its dirty state be cleared.

Yes, good point.  We can't just remove a vpfn when a page is unpinned
or else we'd lose information that the page potentially had been
dirtied while it was pinned.  Maybe that vpfn needs to move to a dirty
list and both the currently pinned vpfns and the dirty vpfns are walked
on a log_sync.  The dirty vpfns list would be cleared after a log_sync.
The container would need to know that dirty tracking is enabled and
only manage the dirty vpfns list when necessary.  Thanks,

Alex
 
> > are complications around non-singleton IOMMU groups, but I think we're
> > already leaning towards that being a non-worthwhile problem to solve.
> > So if we require that only singleton IOMMU groups can pin pages and we
> > pass the IOMMU group as a parameter to
> > vfio_iommu_driver_ops.pin_pages(), then the type1 backend can set a
> > flag on its local vfio_group struct to indicate dirty page scope is
> > limited to pinned pages.  We might want to keep a flag on the
> > vfio_iommu struct to indicate if all of the vfio_groups for each
> > vfio_domain in the vfio_iommu.domain_list dirty page scope limited to
> > pinned pages as an optimization to avoid walking lists too often.  Then
> > we could test if vfio_iommu.domain_list is not empty and this new flag
> > does not limit the dirty page scope, then everything within each
> > vfio_dma is considered dirty.
> >    
> > > >>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > >>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> > > >>>> ---
> > > >>>>    include/uapi/linux/vfio.h | 23 +++++++++++++++++++++++
> > > >>>>    1 file changed, 23 insertions(+)
> > > >>>>
> > > >>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > >>>> index 35b09427ad9f..6fd3822aa610 100644
> > > >>>> --- a/include/uapi/linux/vfio.h
> > > >>>> +++ b/include/uapi/linux/vfio.h
> > > >>>> @@ -902,6 +902,29 @@ struct vfio_iommu_type1_dma_unmap {
> > > >>>>    #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
> > > >>>>    #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
> > > >>>>    
> > > >>>> +/**
> > > >>>> + * VFIO_IOMMU_GET_DIRTY_BITMAP - _IOWR(VFIO_TYPE, VFIO_BASE + 17,
> > > >>>> + *                                     struct vfio_iommu_type1_dirty_bitmap)
> > > >>>> + *
> > > >>>> + * IOCTL to get dirty pages bitmap for IOMMU container during migration.
> > > >>>> + * Get dirty pages bitmap of given IO virtual addresses range using
> > > >>>> + * struct vfio_iommu_type1_dirty_bitmap. Caller sets argsz, which is size of
> > > >>>> + * struct vfio_iommu_type1_dirty_bitmap. User should allocate memory to get
> > > >>>> + * bitmap and should set size of allocated memory in bitmap_size field.
> > > >>>> + * One bit is used to represent per page consecutively starting from iova
> > > >>>> + * offset. Bit set indicates page at that offset from iova is dirty.
> > > >>>> + */
> > > >>>> +struct vfio_iommu_type1_dirty_bitmap {
> > > >>>> +	__u32        argsz;
> > > >>>> +	__u32        flags;
> > > >>>> +	__u64        iova;                      /* IO virtual address */
> > > >>>> +	__u64        size;                      /* Size of iova range */
> > > >>>> +	__u64        bitmap_size;               /* in bytes */    
> > > >>>
> > > >>> This seems redundant.  We can calculate the size of the bitmap based on
> > > >>> the iova size.
> > > >>>       
> > > >>
> > > >> But in kernel space, we need to validate the size of memory allocated by
> > > >> user instead of assuming user is always correct, right?    
> > > > 
> > > > What does it buy us for the user to tell us the size?  They could be
> > > > wrong, they could be malicious.  The argsz field on the ioctl is mostly
> > > > for the handshake that the user is competent, we should get faults from
> > > > the copy-user operation if it's incorrect.
> > > >    
> > > 
> > > It is to mainly fail safe.
> > >   
> > > >>>> +	void __user *bitmap;                    /* one bit per page */    
> > > >>>
> > > >>> Should we define that as a __u64* to (a) help with the size
> > > >>> calculation, and (b) assure that we can use 8-byte ops on it?
> > > >>>
> > > >>> However, who defines page size?  Is it necessarily the processor page
> > > >>> size?  A physical IOMMU may support page sizes other than the CPU page
> > > >>> size.  It might be more important to indicate the expected page size
> > > >>> than the bitmap size.  Thanks,
> > > >>>       
> > > >>
> > > >> I see in QEMU and in vfio_iommu_type1 module, page sizes considered for
> > > >> mapping are CPU page size, 4K. Do we still need to have such argument?    
> > > > 
> > > > That assumption exists for backwards compatibility prior to supporting
> > > > the iova_pgsizes field in vfio_iommu_type1_info.  AFAIK the current
> > > > interface has no page size assumptions and we should not add any.    
> > > 
> > > So userspace has iova_pgsizes information, which can be input to this 
> > > ioctl. Bitmap should be considering smallest page size. Does that makes 
> > > sense?  
> > 
> > I'm not sure.  I thought I had an argument that the iova_pgsize could
> > indicate support for sizes smaller than the processor page size, which
> > would make the user responsible for using a different base for their
> > page size, but vfio_pgsize_bitmap() already masks out sub-page sizes.
> > Clearly the vendor driver is pinning based on processor sized pages,
> > but that's independent of an IOMMU and not part of a user ABI.
> > 
> > I'm tempted to say your bitmap_size field has a use here, but it seems
> > to fail in validating the user page size at the low extremes.  For
> > example if we have a single page mapping, the user can specify the iova
> > size as 4K (for example), but the minimum bitmap_size they can indicate
> > is 1 byte, would we therefore assume the user's bitmap page size is 512
> > bytes (ie. they provided us with 8 bits to describe a 4K range)?  We'd
> > need to be careful to specify that the minimum iova_pgsize indicated
> > page size is our lower bound as well.  But then what do we do if the
> > user provides us with a smaller buffer than we expect?  For example, a
> > 128MB iova range and only an 8-byte buffer.  Do we go ahead and assume
> > a 2MB page size and fill the bitmap accordingly or do we generate an
> > error?  If the latter, might we support that at some point in time and
> > is it sufficient to let the user perform trial and error to test if that
> > exists?  Thanks,
> > 
> > Alex
> >   
> 



* Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.
  2019-12-03 18:04               ` Alex Williamson
@ 2019-12-04 18:10                 ` Kirti Wankhede
  2019-12-04 18:34                   ` Alex Williamson
  0 siblings, 1 reply; 46+ messages in thread
From: Kirti Wankhede @ 2019-12-04 18:10 UTC (permalink / raw)
  To: Alex Williamson, Yan Zhao
  Cc: cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng, Liu, Yi L,
	mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, Wang,
	Zhi A, qemu-devel, kvm



On 12/3/2019 11:34 PM, Alex Williamson wrote:
> On Mon, 25 Nov 2019 19:57:39 -0500
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
>> On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote:
>>> On Fri, 15 Nov 2019 00:26:07 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>    
>>>> On 11/14/2019 1:37 AM, Alex Williamson wrote:
>>>>> On Thu, 14 Nov 2019 01:07:21 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>      
>>>>>> On 11/13/2019 4:00 AM, Alex Williamson wrote:
>>>>>>> On Tue, 12 Nov 2019 22:33:37 +0530
>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>         
>>>>>>>> All pages pinned by vendor driver through vfio_pin_pages API should be
>>>>>>>> considered as dirty during migration. IOMMU container maintains a list of
>>>>>>>> all such pinned pages. Added an ioctl defination to get bitmap of such
>>>>>>>
>>>>>>> definition
>>>>>>>         
>>>>>>>> pinned pages for requested IO virtual address range.
>>>>>>>
>>>>>>> Additionally, all mapped pages are considered dirty when physically
>>>>>>> mapped through to an IOMMU, modulo we discussed devices opting in to
>>>>>>> per page pinning to indicate finer granularity with a TBD mechanism to
>>>>>>> figure out if any non-opt-in devices remain.
>>>>>>>         
>>>>>>
>>>>>> You mean, in case of device direct assignment (device pass through)?
>>>>>
>>>>> Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
>>>>> pinned and mapped, then the correct dirty page set is all mapped pages.
>>>>> We discussed using the vpfn list as a mechanism for vendor drivers to
>>>>> reduce their migration footprint, but we also discussed that we would
>>>>> need a way to determine that all participants in the container have
>>>>> explicitly pinned their working pages or else we must consider the
>>>>> entire potential working set as dirty.
>>>>>      
>>>>
>>>> How can vendor driver tell this capability to iommu module? Any suggestions?
>>>
>>> I think it does so by pinning pages.  Is it acceptable that if the
>>> vendor driver pins any pages, then from that point forward we consider
>>> the IOMMU group dirty page scope to be limited to pinned pages?  There
>> We should also be aware that the dirty page scope is pinned pages + unpinned
>> pages, which means that once a page has been pinned, it should be regarded as
>> dirty no matter whether it is unpinned later. Only after log_sync is called
>> and the dirty info retrieved should its dirty state be cleared.
> 
> Yes, good point.  We can't just remove a vpfn when a page is unpinned
> or else we'd lose information that the page potentially had been
> dirtied while it was pinned.  Maybe that vpfn needs to move to a dirty
> list and both the currently pinned vpfns and the dirty vpfns are walked
> on a log_sync.  The dirty vpfns list would be cleared after a log_sync.
> The container would need to know that dirty tracking is enabled and
> only manage the dirty vpfns list when necessary.  Thanks,
> 

If a page is unpinned, then that page is available in the free page pool for
others to use, so how can we say that the unpinned page has valid data?

Suppose driver A unpins a page, and then driver B of some other device gets
that page, pins it, uses it, and unpins it. How can we say that the page
has valid data for driver A?

Can you give one example where unpinned page data is considered reliable
and valid?

Thanks,
Kirti

> Alex
>   
>>> are complications around non-singleton IOMMU groups, but I think we're
>>> already leaning towards that being a non-worthwhile problem to solve.
>>> So if we require that only singleton IOMMU groups can pin pages and we
>>> pass the IOMMU group as a parameter to
>>> vfio_iommu_driver_ops.pin_pages(), then the type1 backend can set a
>>> flag on its local vfio_group struct to indicate dirty page scope is
>>> limited to pinned pages.  We might want to keep a flag on the
>>> vfio_iommu struct to indicate if all of the vfio_groups for each
>>> vfio_domain in the vfio_iommu.domain_list dirty page scope limited to
>>> pinned pages as an optimization to avoid walking lists too often.  Then
>>> we could test if vfio_iommu.domain_list is not empty and this new flag
>>> does not limit the dirty page scope, then everything within each
>>> vfio_dma is considered dirty.
>>>     
>>>>>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>>>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>>>>>>> ---
>>>>>>>>     include/uapi/linux/vfio.h | 23 +++++++++++++++++++++++
>>>>>>>>     1 file changed, 23 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>>>>>> index 35b09427ad9f..6fd3822aa610 100644
>>>>>>>> --- a/include/uapi/linux/vfio.h
>>>>>>>> +++ b/include/uapi/linux/vfio.h
>>>>>>>> @@ -902,6 +902,29 @@ struct vfio_iommu_type1_dma_unmap {
>>>>>>>>     #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
>>>>>>>>     #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
>>>>>>>>     
>>>>>>>> +/**
>>>>>>>> + * VFIO_IOMMU_GET_DIRTY_BITMAP - _IOWR(VFIO_TYPE, VFIO_BASE + 17,
>>>>>>>> + *                                     struct vfio_iommu_type1_dirty_bitmap)
>>>>>>>> + *
>>>>>>>> + * IOCTL to get dirty pages bitmap for IOMMU container during migration.
>>>>>>>> + * Get dirty pages bitmap of given IO virtual addresses range using
>>>>>>>> + * struct vfio_iommu_type1_dirty_bitmap. Caller sets argsz, which is size of
>>>>>>>> + * struct vfio_iommu_type1_dirty_bitmap. User should allocate memory to get
>>>>>>>> + * bitmap and should set size of allocated memory in bitmap_size field.
>>>>>>>> + * One bit is used to represent per page consecutively starting from iova
>>>>>>>> + * offset. Bit set indicates page at that offset from iova is dirty.
>>>>>>>> + */
>>>>>>>> +struct vfio_iommu_type1_dirty_bitmap {
>>>>>>>> +	__u32        argsz;
>>>>>>>> +	__u32        flags;
>>>>>>>> +	__u64        iova;                      /* IO virtual address */
>>>>>>>> +	__u64        size;                      /* Size of iova range */
>>>>>>>> +	__u64        bitmap_size;               /* in bytes */
>>>>>>>
>>>>>>> This seems redundant.  We can calculate the size of the bitmap based on
>>>>>>> the iova size.
>>>>>>>        
>>>>>>
>>>>>> But in kernel space, we need to validate the size of memory allocated by
>>>>>> user instead of assuming user is always correct, right?
>>>>>
>>>>> What does it buy us for the user to tell us the size?  They could be
>>>>> wrong, they could be malicious.  The argsz field on the ioctl is mostly
>>>>> for the handshake that the user is competent, we should get faults from
>>>>> the copy-user operation if it's incorrect.
>>>>>     
>>>>
>>>> It is to mainly fail safe.
>>>>    
>>>>>>>> +	void __user *bitmap;                    /* one bit per page */
>>>>>>>
>>>>>>> Should we define that as a __u64* to (a) help with the size
>>>>>>> calculation, and (b) assure that we can use 8-byte ops on it?
>>>>>>>
>>>>>>> However, who defines page size?  Is it necessarily the processor page
>>>>>>> size?  A physical IOMMU may support page sizes other than the CPU page
>>>>>>> size.  It might be more important to indicate the expected page size
>>>>>>> than the bitmap size.  Thanks,
>>>>>>>        
>>>>>>
>>>>>> I see in QEMU and in vfio_iommu_type1 module, page sizes considered for
>>>>>> mapping are CPU page size, 4K. Do we still need to have such argument?
>>>>>
>>>>> That assumption exists for backwards compatibility prior to supporting
>>>>> the iova_pgsizes field in vfio_iommu_type1_info.  AFAIK the current
>>>>> interface has no page size assumptions and we should not add any.
>>>>
>>>> So userspace has iova_pgsizes information, which can be input to this
>>>> ioctl. Bitmap should be considering smallest page size. Does that makes
>>>> sense?
>>>
>>> I'm not sure.  I thought I had an argument that the iova_pgsize could
>>> indicate support for sizes smaller than the processor page size, which
>>> would make the user responsible for using a different base for their
>>> page size, but vfio_pgsize_bitmap() already masks out sub-page sizes.
>>> Clearly the vendor driver is pinning based on processor sized pages,
>>> but that's independent of an IOMMU and not part of a user ABI.
>>>
>>> I'm tempted to say your bitmap_size field has a use here, but it seems
>>> to fail in validating the user page size at the low extremes.  For
>>> example if we have a single page mapping, the user can specify the iova
>>> size as 4K (for example), but the minimum bitmap_size they can indicate
>>> is 1 byte, would we therefore assume the user's bitmap page size is 512
>>> bytes (ie. they provided us with 8 bits to describe a 4K range)?  We'd
>>> need to be careful to specify that the minimum iova_pgsize indicated
>>> page size is our lower bound as well.  But then what do we do if the
>>> user provides us with a smaller buffer than we expect?  For example, a
>>> 128MB iova range and only an 8-byte buffer.  Do we go ahead and assume
>>> a 2MB page size and fill the bitmap accordingly or do we generate an
>>> error?  If the latter, might we support that at some point in time and
>>> is it sufficient to let the user perform trial and error to test if that
>>> exists?  Thanks,
>>>
>>> Alex
>>>    
>>
> 


* Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.
  2019-12-04 18:10                 ` Kirti Wankhede
@ 2019-12-04 18:34                   ` Alex Williamson
  2019-12-05  1:28                     ` Yan Zhao
  0 siblings, 1 reply; 46+ messages in thread
From: Alex Williamson @ 2019-12-04 18:34 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Yan Zhao, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng, Liu,
	Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

On Wed, 4 Dec 2019 23:40:25 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 12/3/2019 11:34 PM, Alex Williamson wrote:
> > On Mon, 25 Nov 2019 19:57:39 -0500
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> >> On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote:  
> >>> On Fri, 15 Nov 2019 00:26:07 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>      
> >>>> On 11/14/2019 1:37 AM, Alex Williamson wrote:  
> >>>>> On Thu, 14 Nov 2019 01:07:21 +0530
> >>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>        
> >>>>>> On 11/13/2019 4:00 AM, Alex Williamson wrote:  
> >>>>>>> On Tue, 12 Nov 2019 22:33:37 +0530
> >>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>>>           
> >>>>>>>> All pages pinned by vendor driver through vfio_pin_pages API should be
> >>>>>>>> considered as dirty during migration. IOMMU container maintains a list of
> >>>>>>>> all such pinned pages. Added an ioctl defination to get bitmap of such  
> >>>>>>>
> >>>>>>> definition
> >>>>>>>           
> >>>>>>>> pinned pages for requested IO virtual address range.  
> >>>>>>>
> >>>>>>> Additionally, all mapped pages are considered dirty when physically
> >>>>>>> mapped through to an IOMMU, modulo we discussed devices opting in to
> >>>>>>> per page pinning to indicate finer granularity with a TBD mechanism to
> >>>>>>> figure out if any non-opt-in devices remain.
> >>>>>>>           
> >>>>>>
> >>>>>> You mean, in case of device direct assignment (device pass through)?  
> >>>>>
> >>>>> Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
> >>>>> pinned and mapped, then the correct dirty page set is all mapped pages.
> >>>>> We discussed using the vpfn list as a mechanism for vendor drivers to
> >>>>> reduce their migration footprint, but we also discussed that we would
> >>>>> need a way to determine that all participants in the container have
> >>>>> explicitly pinned their working pages or else we must consider the
> >>>>> entire potential working set as dirty.
> >>>>>        
> >>>>
> >>>> How can vendor driver tell this capability to iommu module? Any suggestions?  
> >>>
> >>> I think it does so by pinning pages.  Is it acceptable that if the
> >>> vendor driver pins any pages, then from that point forward we consider
> >>> the IOMMU group dirty page scope to be limited to pinned pages?  There  
> >> we should also be aware of that dirty page scope is pinned pages + unpinned pages,
> >> which means ever since a page is pinned, it should be regarded as dirty
> >> no matter whether it's unpinned later. only after log_sync is called and
> >> dirty info retrieved, its dirty state should be cleared.  
> > 
> > Yes, good point.  We can't just remove a vpfn when a page is unpinned
> > or else we'd lose information that the page potentially had been
> > dirtied while it was pinned.  Maybe that vpfn needs to move to a dirty
> > list and both the currently pinned vpfns and the dirty vpfns are walked
> > on a log_sync.  The dirty vpfns list would be cleared after a log_sync.
> > The container would need to know that dirty tracking is enabled and
> > only manage the dirty vpfns list when necessary.  Thanks,
> >   
> 
> If page is unpinned, then that page is available in free page pool for 
> others to use, then how can we say that unpinned page has valid data?
> 
> If suppose, one driver A unpins a page and when driver B of some other 
> device gets that page and he pins it, uses it, and then unpins it, then 
> how can we say that page has valid data for driver A?
> 
> Can you give one example where unpinned page data is considered reliable 
> and valid?

We can only pin pages that the user has already allocated* and mapped
through the vfio DMA API.  The pinning of the page simply locks the
page for the vendor driver to access it and unpinning that page only
indicates that access is complete.  Pages are not freed when a vendor
driver unpins them; they still exist, and at this point we're now
assuming the device dirtied the page while it was pinned.  Thanks,

Alex

* An exception here is that the page might be demand allocated and the
  act of pinning the page could actually allocate the backing page for
  the user if they have not faulted the page to trigger that allocation
  previously.  That page remains mapped for the user's virtual address
  space even after the unpinning though.
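
As an illustration of the pin/unpin semantics described above, here is a
minimal sketch in C. It is not code from this patch series or from the
kernel: struct model_page, page_pin() and page_unpin() are hypothetical
names, and the model assumes pinning is a simple reference count on a page
the user has already mapped.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one user page mapped through the vfio DMA API. */
struct model_page {
	bool mapped;          /* still present in the user's address space */
	unsigned pin_count;   /* outstanding vendor-driver pins */
	bool maybe_dirty;     /* device may have written while it was pinned */
};

/* Pinning is only possible for a page the user has mapped. */
static bool page_pin(struct model_page *p)
{
	if (!p->mapped)
		return false;
	p->pin_count++;
	return true;
}

/*
 * Unpinning only signals that device access is complete. The page is not
 * freed and stays mapped, so anything the device wrote is still live data
 * and has to be assumed dirty.
 */
static void page_unpin(struct model_page *p)
{
	assert(p->pin_count > 0);
	p->pin_count--;
	p->maybe_dirty = true;
}
```

In this model, a page that was pinned once and then unpinned ends up with
pin_count == 0, mapped still true and maybe_dirty set, which is the state
the dirty tracking has to report.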



* Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.
  2019-12-04 18:34                   ` Alex Williamson
@ 2019-12-05  1:28                     ` Yan Zhao
  2019-12-05  5:42                       ` Kirti Wankhede
  0 siblings, 1 reply; 46+ messages in thread
From: Yan Zhao @ 2019-12-05  1:28 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng,
	Liu, Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

On Thu, Dec 05, 2019 at 02:34:57AM +0800, Alex Williamson wrote:
> On Wed, 4 Dec 2019 23:40:25 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > On 12/3/2019 11:34 PM, Alex Williamson wrote:
> > > On Mon, 25 Nov 2019 19:57:39 -0500
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >   
> > >> On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote:  
> > >>> On Fri, 15 Nov 2019 00:26:07 +0530
> > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >>>      
> > >>>> On 11/14/2019 1:37 AM, Alex Williamson wrote:  
> > >>>>> On Thu, 14 Nov 2019 01:07:21 +0530
> > >>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >>>>>        
> > >>>>>> On 11/13/2019 4:00 AM, Alex Williamson wrote:  
> > >>>>>>> On Tue, 12 Nov 2019 22:33:37 +0530
> > >>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >>>>>>>           
> > >>>>>>>> All pages pinned by vendor driver through vfio_pin_pages API should be
> > >>>>>>>> considered as dirty during migration. IOMMU container maintains a list of
> > >>>>>>>> all such pinned pages. Added an ioctl defination to get bitmap of such  
> > >>>>>>>
> > >>>>>>> definition
> > >>>>>>>           
> > >>>>>>>> pinned pages for requested IO virtual address range.  
> > >>>>>>>
> > >>>>>>> Additionally, all mapped pages are considered dirty when physically
> > >>>>>>> mapped through to an IOMMU, modulo we discussed devices opting in to
> > >>>>>>> per page pinning to indicate finer granularity with a TBD mechanism to
> > >>>>>>> figure out if any non-opt-in devices remain.
> > >>>>>>>           
> > >>>>>>
> > >>>>>> You mean, in case of device direct assignment (device pass through)?  
> > >>>>>
> > >>>>> Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
> > >>>>> pinned and mapped, then the correct dirty page set is all mapped pages.
> > >>>>> We discussed using the vpfn list as a mechanism for vendor drivers to
> > >>>>> reduce their migration footprint, but we also discussed that we would
> > >>>>> need a way to determine that all participants in the container have
> > >>>>> explicitly pinned their working pages or else we must consider the
> > >>>>> entire potential working set as dirty.
> > >>>>>        
> > >>>>
> > >>>> How can vendor driver tell this capability to iommu module? Any suggestions?  
> > >>>
> > >>> I think it does so by pinning pages.  Is it acceptable that if the
> > >>> vendor driver pins any pages, then from that point forward we consider
> > >>> the IOMMU group dirty page scope to be limited to pinned pages?  There  
> > >> we should also be aware of that dirty page scope is pinned pages + unpinned pages,
> > >> which means ever since a page is pinned, it should be regarded as dirty
> > >> no matter whether it's unpinned later. only after log_sync is called and
> > >> dirty info retrieved, its dirty state should be cleared.  
> > > 
> > > Yes, good point.  We can't just remove a vpfn when a page is unpinned
> > > or else we'd lose information that the page potentially had been
> > > dirtied while it was pinned.  Maybe that vpfn needs to move to a dirty
> > > list and both the currently pinned vpfns and the dirty vpfns are walked
> > > on a log_sync.  The dirty vpfns list would be cleared after a log_sync.
> > > The container would need to know that dirty tracking is enabled and
> > > only manage the dirty vpfns list when necessary.  Thanks,
> > >   
> > 
> > If page is unpinned, then that page is available in free page pool for 
> > others to use, then how can we say that unpinned page has valid data?
> > 
> > If suppose, one driver A unpins a page and when driver B of some other 
> > device gets that page and he pins it, uses it, and then unpins it, then 
> > how can we say that page has valid data for driver A?
> > 
> > Can you give one example where unpinned page data is considered reliable 
> > and valid?
> 
> We can only pin pages that the user has already allocated* and mapped
> through the vfio DMA API.  The pinning of the page simply locks the
> page for the vendor driver to access it and unpinning that page only
> indicates that access is complete.  Pages are not freed when a vendor
> driver unpins them, they still exist and at this point we're now
> assuming the device dirtied the page while it was pinned.  Thanks,
> 
> Alex
> 
> * An exception here is that the page might be demand allocated and the
>   act of pinning the page could actually allocate the backing page for
>   the user if they have not faulted the page to trigger that allocation
>   previously.  That page remains mapped for the user's virtual address
>   space even after the unpinning though.
>

Yes, I can give an example in GVT.
When a gem_object is allocated in the guest, before it is submitted to the
guest vGPU, the gfx cmds in its ring buffer need to be pinned into GGTT to
get a global graphics address for hardware access. At that time, we shadow
those cmds, pin the pages through vfio pin_pages(), and submit the shadow
gem_object to the physical hardware.
After the guest driver thinks the submitted gem_object has completed hardware
DMA, it unpins those pinned GGTT graphics memory addresses. Then, in the
host, we unpin the shadow pages through vfio unpin_pages.
But at this point the guest driver is still free to access the gem_object
through vCPUs, and guest user space is probably still mapping an object
into the gem_object in the guest driver.
So missing the dirty page tracking for unpinned pages would cause
data inconsistency.

Thanks
Yan


* Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.
  2019-12-05  1:28                     ` Yan Zhao
@ 2019-12-05  5:42                       ` Kirti Wankhede
  2019-12-05  5:47                         ` Yan Zhao
  2019-12-05  5:56                         ` Alex Williamson
  0 siblings, 2 replies; 46+ messages in thread
From: Kirti Wankhede @ 2019-12-05  5:42 UTC (permalink / raw)
  To: Yan Zhao, Alex Williamson
  Cc: cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng, Liu, Yi L,
	mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, Wang,
	Zhi A, qemu-devel, kvm



On 12/5/2019 6:58 AM, Yan Zhao wrote:
> On Thu, Dec 05, 2019 at 02:34:57AM +0800, Alex Williamson wrote:
>> On Wed, 4 Dec 2019 23:40:25 +0530
>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>
>>> On 12/3/2019 11:34 PM, Alex Williamson wrote:
>>>> On Mon, 25 Nov 2019 19:57:39 -0500
>>>> Yan Zhao <yan.y.zhao@intel.com> wrote:
>>>>    
>>>>> On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote:
>>>>>> On Fri, 15 Nov 2019 00:26:07 +0530
>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>       
>>>>>>> On 11/14/2019 1:37 AM, Alex Williamson wrote:
>>>>>>>> On Thu, 14 Nov 2019 01:07:21 +0530
>>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>>         
>>>>>>>>> On 11/13/2019 4:00 AM, Alex Williamson wrote:
>>>>>>>>>> On Tue, 12 Nov 2019 22:33:37 +0530
>>>>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>>>>            
>>>>>>>>>>> All pages pinned by vendor driver through vfio_pin_pages API should be
>>>>>>>>>>> considered as dirty during migration. IOMMU container maintains a list of
>>>>>>>>>>> all such pinned pages. Added an ioctl defination to get bitmap of such
>>>>>>>>>>
>>>>>>>>>> definition
>>>>>>>>>>            
>>>>>>>>>>> pinned pages for requested IO virtual address range.
>>>>>>>>>>
>>>>>>>>>> Additionally, all mapped pages are considered dirty when physically
>>>>>>>>>> mapped through to an IOMMU, modulo we discussed devices opting in to
>>>>>>>>>> per page pinning to indicate finer granularity with a TBD mechanism to
>>>>>>>>>> figure out if any non-opt-in devices remain.
>>>>>>>>>>            
>>>>>>>>>
>>>>>>>>> You mean, in case of device direct assignment (device pass through)?
>>>>>>>>
>>>>>>>> Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
>>>>>>>> pinned and mapped, then the correct dirty page set is all mapped pages.
>>>>>>>> We discussed using the vpfn list as a mechanism for vendor drivers to
>>>>>>>> reduce their migration footprint, but we also discussed that we would
>>>>>>>> need a way to determine that all participants in the container have
>>>>>>>> explicitly pinned their working pages or else we must consider the
>>>>>>>> entire potential working set as dirty.
>>>>>>>>         
>>>>>>>
>>>>>>> How can vendor driver tell this capability to iommu module? Any suggestions?
>>>>>>
>>>>>> I think it does so by pinning pages.  Is it acceptable that if the
>>>>>> vendor driver pins any pages, then from that point forward we consider
>>>>>> the IOMMU group dirty page scope to be limited to pinned pages?  There
>>>>> we should also be aware of that dirty page scope is pinned pages + unpinned pages,
>>>>> which means ever since a page is pinned, it should be regarded as dirty
>>>>> no matter whether it's unpinned later. only after log_sync is called and
>>>>> dirty info retrieved, its dirty state should be cleared.
>>>>
>>>> Yes, good point.  We can't just remove a vpfn when a page is unpinned
>>>> or else we'd lose information that the page potentially had been
>>>> dirtied while it was pinned.  Maybe that vpfn needs to move to a dirty
>>>> list and both the currently pinned vpfns and the dirty vpfns are walked
>>>> on a log_sync.  The dirty vpfns list would be cleared after a log_sync.
>>>> The container would need to know that dirty tracking is enabled and
>>>> only manage the dirty vpfns list when necessary.  Thanks,
>>>>    
>>>
>>> If page is unpinned, then that page is available in free page pool for
>>> others to use, then how can we say that unpinned page has valid data?
>>>
>>> If suppose, one driver A unpins a page and when driver B of some other
>>> device gets that page and he pins it, uses it, and then unpins it, then
>>> how can we say that page has valid data for driver A?
>>>
>>> Can you give one example where unpinned page data is considered reliable
>>> and valid?
>>
>> We can only pin pages that the user has already allocated* and mapped
>> through the vfio DMA API.  The pinning of the page simply locks the
>> page for the vendor driver to access it and unpinning that page only
>> indicates that access is complete.  Pages are not freed when a vendor
>> driver unpins them, they still exist and at this point we're now
>> assuming the device dirtied the page while it was pinned.  Thanks,
>>
>> Alex
>>
>> * An exception here is that the page might be demand allocated and the
>>    act of pinning the page could actually allocate the backing page for
>>    the user if they have not faulted the page to trigger that allocation
>>    previously.  That page remains mapped for the user's virtual address
>>    space even after the unpinning though.
>>
> 
> Yes, I can give an example in GVT.
> when a gem_object is allocated in guest, before submitting it to guest
> vGPU, gfx cmds in its ring buffer need to be pinned into GGTT to get a
> global graphics address for hardware access. At that time, we shadow
> those cmds and pin pages through vfio pin_pages(), and submit the shadow
> gem_object to physical hardware.
> After guest driver thinks the submitted gem_object has completed hardware
> DMA, it unpins those pinned GGTT graphics memory addresses. Then in
> host, we unpin the shadow pages through vfio unpin_pages.
> But, at this point, guest driver is still free to access the gem_object
> through vCPUs, and guest user space is probably still mapping an object
> into the gem_object in guest driver.
> So, missing the dirty page tracking for unpinned pages would cause
> data inconsistency.
> 

If pages are accessed by the guest through vCPUs, then the RAM module in
QEMU will take care of tracking those pages as dirty.

Not all unpinned pages are necessarily in use, so tracking all unpinned
pages during the VM or application lifetime would also lead to tracking
lots of stale pages, even though they are not being used. A growing number
of unneeded pages could also increase the amount of migration data, leading
to an increase in migration downtime.

Thanks,
Kirti


* Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.
  2019-12-05  5:42                       ` Kirti Wankhede
@ 2019-12-05  5:47                         ` Yan Zhao
  2019-12-05  5:56                         ` Alex Williamson
  1 sibling, 0 replies; 46+ messages in thread
From: Yan Zhao @ 2019-12-05  5:47 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Alex Williamson, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng,
	Liu, Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

On Thu, Dec 05, 2019 at 01:42:23PM +0800, Kirti Wankhede wrote:
> 
> 
> On 12/5/2019 6:58 AM, Yan Zhao wrote:
> > On Thu, Dec 05, 2019 at 02:34:57AM +0800, Alex Williamson wrote:
> >> On Wed, 4 Dec 2019 23:40:25 +0530
> >> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>
> >>> On 12/3/2019 11:34 PM, Alex Williamson wrote:
> >>>> On Mon, 25 Nov 2019 19:57:39 -0500
> >>>> Yan Zhao <yan.y.zhao@intel.com> wrote:
> >>>>    
> >>>>> On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote:
> >>>>>> On Fri, 15 Nov 2019 00:26:07 +0530
> >>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>>       
> >>>>>>> On 11/14/2019 1:37 AM, Alex Williamson wrote:
> >>>>>>>> On Thu, 14 Nov 2019 01:07:21 +0530
> >>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>>>>         
> >>>>>>>>> On 11/13/2019 4:00 AM, Alex Williamson wrote:
> >>>>>>>>>> On Tue, 12 Nov 2019 22:33:37 +0530
> >>>>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>>>>>>            
> >>>>>>>>>>> All pages pinned by vendor driver through vfio_pin_pages API should be
> >>>>>>>>>>> considered as dirty during migration. IOMMU container maintains a list of
> >>>>>>>>>>> all such pinned pages. Added an ioctl defination to get bitmap of such
> >>>>>>>>>>
> >>>>>>>>>> definition
> >>>>>>>>>>            
> >>>>>>>>>>> pinned pages for requested IO virtual address range.
> >>>>>>>>>>
> >>>>>>>>>> Additionally, all mapped pages are considered dirty when physically
> >>>>>>>>>> mapped through to an IOMMU, modulo we discussed devices opting in to
> >>>>>>>>>> per page pinning to indicate finer granularity with a TBD mechanism to
> >>>>>>>>>> figure out if any non-opt-in devices remain.
> >>>>>>>>>>            
> >>>>>>>>>
> >>>>>>>>> You mean, in case of device direct assignment (device pass through)?
> >>>>>>>>
> >>>>>>>> Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
> >>>>>>>> pinned and mapped, then the correct dirty page set is all mapped pages.
> >>>>>>>> We discussed using the vpfn list as a mechanism for vendor drivers to
> >>>>>>>> reduce their migration footprint, but we also discussed that we would
> >>>>>>>> need a way to determine that all participants in the container have
> >>>>>>>> explicitly pinned their working pages or else we must consider the
> >>>>>>>> entire potential working set as dirty.
> >>>>>>>>         
> >>>>>>>
> >>>>>>> How can vendor driver tell this capability to iommu module? Any suggestions?
> >>>>>>
> >>>>>> I think it does so by pinning pages.  Is it acceptable that if the
> >>>>>> vendor driver pins any pages, then from that point forward we consider
> >>>>>> the IOMMU group dirty page scope to be limited to pinned pages?  There
> >>>>> we should also be aware of that dirty page scope is pinned pages + unpinned pages,
> >>>>> which means ever since a page is pinned, it should be regarded as dirty
> >>>>> no matter whether it's unpinned later. only after log_sync is called and
> >>>>> dirty info retrieved, its dirty state should be cleared.
> >>>>
> >>>> Yes, good point.  We can't just remove a vpfn when a page is unpinned
> >>>> or else we'd lose information that the page potentially had been
> >>>> dirtied while it was pinned.  Maybe that vpfn needs to move to a dirty
> >>>> list and both the currently pinned vpfns and the dirty vpfns are walked
> >>>> on a log_sync.  The dirty vpfns list would be cleared after a log_sync.
> >>>> The container would need to know that dirty tracking is enabled and
> >>>> only manage the dirty vpfns list when necessary.  Thanks,
> >>>>    
> >>>
> >>> If page is unpinned, then that page is available in free page pool for
> >>> others to use, then how can we say that unpinned page has valid data?
> >>>
> >>> If suppose, one driver A unpins a page and when driver B of some other
> >>> device gets that page and he pins it, uses it, and then unpins it, then
> >>> how can we say that page has valid data for driver A?
> >>>
> >>> Can you give one example where unpinned page data is considered reliable
> >>> and valid?
> >>
> >> We can only pin pages that the user has already allocated* and mapped
> >> through the vfio DMA API.  The pinning of the page simply locks the
> >> page for the vendor driver to access it and unpinning that page only
> >> indicates that access is complete.  Pages are not freed when a vendor
> >> driver unpins them, they still exist and at this point we're now
> >> assuming the device dirtied the page while it was pinned.  Thanks,
> >>
> >> Alex
> >>
> >> * An exception here is that the page might be demand allocated and the
> >>    act of pinning the page could actually allocate the backing page for
> >>    the user if they have not faulted the page to trigger that allocation
> >>    previously.  That page remains mapped for the user's virtual address
> >>    space even after the unpinning though.
> >>
> > 
> > Yes, I can give an example in GVT.
> > when a gem_object is allocated in guest, before submitting it to guest
> > vGPU, gfx cmds in its ring buffer need to be pinned into GGTT to get a
> > global graphics address for hardware access. At that time, we shadow
> > those cmds and pin pages through vfio pin_pages(), and submit the shadow
> > gem_object to physical hardware.
> > After guest driver thinks the submitted gem_object has completed hardware
> > DMA, it unpins those pinned GGTT graphics memory addresses. Then in
> > host, we unpin the shadow pages through vfio unpin_pages.
> > But, at this point, guest driver is still free to access the gem_object
> > through vCPUs, and guest user space is probably still mapping an object
> > into the gem_object in guest driver.
> > So, missing the dirty page tracking for unpinned pages would cause
> > data inconsistency.
> > 
> 
> If pages are accessed by guest through vCPUs, then RAM module in QEMU 
> will take care of tracking those pages as dirty.
> 
> All unpinned pages might not be used, so tracking all unpinned pages 
> during VM or application life time would also lead to tracking lots of 
> stale pages, even though they are not being used. Increasing number of 
> not needed pages could also lead to increasing migration data leading 
> increase in migration downtime.
> 
> Thanks,
> Kirti

Those are pages dirtied by the vGPU between pin and unpin. They are not
dirtied by vCPUs, so the RAM module in QEMU has no knowledge of them.
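
A minimal sketch of why both sources are needed, assuming each side reports
dirty pages as a bitmap with one bit per page. The helper names
merge_dirty_bitmaps() and test_page_bit() are hypothetical and do not come
from QEMU or from this patch series; the sketch only shows that device-side
dirty bits have to be ORed into whatever CPU dirty logging already found.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * OR the device-side dirty bits into the CPU-side dirty bitmap.
 * A page written only by the device has its bit set in dev_dirty but not
 * in cpu_dirty, so skipping this merge would silently drop that page.
 */
static void merge_dirty_bitmaps(uint64_t *cpu_dirty,
				const uint64_t *dev_dirty,
				size_t nwords)
{
	assert(cpu_dirty && dev_dirty);
	for (size_t i = 0; i < nwords; i++)
		cpu_dirty[i] |= dev_dirty[i];
}

/* Return 1 if the bit for the given page index is set, 0 otherwise. */
static int test_page_bit(const uint64_t *bitmap, size_t page)
{
	return (int)((bitmap[page / 64] >> (page % 64)) & 1);
}
```

For example, if the CPU log has only page 0 set and the device log has only
page 2 set, the merged bitmap reports pages 0 and 2 and leaves page 1 clean.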


* Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.
  2019-12-05  5:42                       ` Kirti Wankhede
  2019-12-05  5:47                         ` Yan Zhao
@ 2019-12-05  5:56                         ` Alex Williamson
  2019-12-05  6:19                           ` Kirti Wankhede
  1 sibling, 1 reply; 46+ messages in thread
From: Alex Williamson @ 2019-12-05  5:56 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Yan Zhao, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng, Liu,
	Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

On Thu, 5 Dec 2019 11:12:23 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 12/5/2019 6:58 AM, Yan Zhao wrote:
> > On Thu, Dec 05, 2019 at 02:34:57AM +0800, Alex Williamson wrote:  
> >> On Wed, 4 Dec 2019 23:40:25 +0530
> >> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>  
> >>> On 12/3/2019 11:34 PM, Alex Williamson wrote:  
> >>>> On Mon, 25 Nov 2019 19:57:39 -0500
> >>>> Yan Zhao <yan.y.zhao@intel.com> wrote:
> >>>>      
> >>>>> On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote:  
> >>>>>> On Fri, 15 Nov 2019 00:26:07 +0530
> >>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>>         
> >>>>>>> On 11/14/2019 1:37 AM, Alex Williamson wrote:  
> >>>>>>>> On Thu, 14 Nov 2019 01:07:21 +0530
> >>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>>>>           
> >>>>>>>>> On 11/13/2019 4:00 AM, Alex Williamson wrote:  
> >>>>>>>>>> On Tue, 12 Nov 2019 22:33:37 +0530
> >>>>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>>>>>>              
> >>>>>>>>>>> All pages pinned by vendor driver through vfio_pin_pages API should be
> >>>>>>>>>>> considered as dirty during migration. IOMMU container maintains a list of
> >>>>>>>>>>> all such pinned pages. Added an ioctl defination to get bitmap of such  
> >>>>>>>>>>
> >>>>>>>>>> definition
> >>>>>>>>>>              
> >>>>>>>>>>> pinned pages for requested IO virtual address range.  
> >>>>>>>>>>
> >>>>>>>>>> Additionally, all mapped pages are considered dirty when physically
> >>>>>>>>>> mapped through to an IOMMU, modulo we discussed devices opting in to
> >>>>>>>>>> per page pinning to indicate finer granularity with a TBD mechanism to
> >>>>>>>>>> figure out if any non-opt-in devices remain.
> >>>>>>>>>>              
> >>>>>>>>>
> >>>>>>>>> You mean, in case of device direct assignment (device pass through)?  
> >>>>>>>>
> >>>>>>>> Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
> >>>>>>>> pinned and mapped, then the correct dirty page set is all mapped pages.
> >>>>>>>> We discussed using the vpfn list as a mechanism for vendor drivers to
> >>>>>>>> reduce their migration footprint, but we also discussed that we would
> >>>>>>>> need a way to determine that all participants in the container have
> >>>>>>>> explicitly pinned their working pages or else we must consider the
> >>>>>>>> entire potential working set as dirty.
> >>>>>>>>           
> >>>>>>>
> >>>>>>> How can vendor driver tell this capability to iommu module? Any suggestions?  
> >>>>>>
> >>>>>> I think it does so by pinning pages.  Is it acceptable that if the
> >>>>>> vendor driver pins any pages, then from that point forward we consider
> >>>>>> the IOMMU group dirty page scope to be limited to pinned pages?  There  
> >>>>> we should also be aware of that dirty page scope is pinned pages + unpinned pages,
> >>>>> which means ever since a page is pinned, it should be regarded as dirty
> >>>>> no matter whether it's unpinned later. only after log_sync is called and
> >>>>> dirty info retrieved, its dirty state should be cleared.  
> >>>>
> >>>> Yes, good point.  We can't just remove a vpfn when a page is unpinned
> >>>> or else we'd lose information that the page potentially had been
> >>>> dirtied while it was pinned.  Maybe that vpfn needs to move to a dirty
> >>>> list and both the currently pinned vpfns and the dirty vpfns are walked
> >>>> on a log_sync.  The dirty vpfns list would be cleared after a log_sync.
> >>>> The container would need to know that dirty tracking is enabled and
> >>>> only manage the dirty vpfns list when necessary.  Thanks,
> >>>>      
> >>>
> >>> If page is unpinned, then that page is available in free page pool for
> >>> others to use, then how can we say that unpinned page has valid data?
> >>>
> >>> If suppose, one driver A unpins a page and when driver B of some other
> >>> device gets that page and he pins it, uses it, and then unpins it, then
> >>> how can we say that page has valid data for driver A?
> >>>
> >>> Can you give one example where unpinned page data is considered reliable
> >>> and valid?  
> >>
> >> We can only pin pages that the user has already allocated* and mapped
> >> through the vfio DMA API.  The pinning of the page simply locks the
> >> page for the vendor driver to access it and unpinning that page only
> >> indicates that access is complete.  Pages are not freed when a vendor
> >> driver unpins them, they still exist and at this point we're now
> >> assuming the device dirtied the page while it was pinned.  Thanks,
> >>
> >> Alex
> >>
> >> * An exception here is that the page might be demand allocated and the
> >>    act of pinning the page could actually allocate the backing page for
> >>    the user if they have not faulted the page to trigger that allocation
> >>    previously.  That page remains mapped for the user's virtual address
> >>    space even after the unpinning though.
> >>  
> > 
> > Yes, I can give an example in GVT.
> > when a gem_object is allocated in guest, before submitting it to guest
> > vGPU, gfx cmds in its ring buffer need to be pinned into GGTT to get a
> > global graphics address for hardware access. At that time, we shadow
> > those cmds and pin pages through vfio pin_pages(), and submit the shadow
> > gem_object to physical hardware.
> > After guest driver thinks the submitted gem_object has completed hardware
> > DMA, it unpins those pinned GGTT graphics memory addresses. Then in
> > host, we unpin the shadow pages through vfio unpin_pages.
> > But, at this point, guest driver is still free to access the gem_object
> > through vCPUs, and guest user space is probably still mapping an object
> > into the gem_object in guest driver.
> > So, missing the dirty page tracking for unpinned pages would cause
> > data inconsistency.
> >   
> 
> If pages are accessed by guest through vCPUs, then RAM module in QEMU 
> will take care of tracking those pages as dirty.
> 
> All unpinned pages might not be used, so tracking all unpinned pages 
> during VM or application life time would also lead to tracking lots of 
> stale pages, even though they are not being used. Increasing number of 
> not needed pages could also lead to increasing migration data leading 
> increase in migration downtime.

We can't rely on the vCPU also dirtying a page; the overhead is
unavoidable.  It doesn't matter if the migration is fast if it's
incorrect.  We only need to track unpinned dirty pages while the
migration is active and the tracking is flushed on each log_sync
callback.  Thanks,

Alex
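
The scheme described above could be sketched roughly as follows. This is an
illustration under stated assumptions, not the implementation from this
series: struct model_container and the container_*() helpers are
hypothetical, a single 64-bit word stands in for the pinned and dirty vpfn
lists, and the "tracking" flag stands in for "migration is active".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NPAGES 64	/* one bit per page in a single 64-bit word */

/* Hypothetical stand-in for the per-container pinned/dirty vpfn lists. */
struct model_container {
	bool tracking;            /* true only while migration is active */
	uint64_t pinned;          /* currently pinned pages */
	uint64_t unpinned_dirty;  /* unpinned since the last log_sync */
};

static void container_pin(struct model_container *c, unsigned int pfn)
{
	assert(pfn < MODEL_NPAGES);
	c->pinned |= UINT64_C(1) << pfn;
}

/* On unpin, remember the page as dirty only if tracking is enabled. */
static void container_unpin(struct model_container *c, unsigned int pfn)
{
	uint64_t bit;

	assert(pfn < MODEL_NPAGES);
	bit = UINT64_C(1) << pfn;
	if (!(c->pinned & bit))
		return;
	c->pinned &= ~bit;
	if (c->tracking)
		c->unpinned_dirty |= bit;
}

/*
 * log_sync reports the currently pinned pages plus the pages unpinned
 * since the previous sync, then flushes the unpinned-dirty set.
 */
static uint64_t container_log_sync(struct model_container *c)
{
	uint64_t report = c->pinned | c->unpinned_dirty;

	c->unpinned_dirty = 0;
	return report;
}
```

With tracking disabled, a pin followed by an unpin leaves nothing to report.
With tracking enabled, a page that was unpinned shows up in exactly one
log_sync result, while a page that stays pinned is reported on every sync.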



* Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.
  2019-12-05  5:56                         ` Alex Williamson
@ 2019-12-05  6:19                           ` Kirti Wankhede
  2019-12-05  6:40                             ` Alex Williamson
  0 siblings, 1 reply; 46+ messages in thread
From: Kirti Wankhede @ 2019-12-05  6:19 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yan Zhao, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng, Liu,
	Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm



On 12/5/2019 11:26 AM, Alex Williamson wrote:
> On Thu, 5 Dec 2019 11:12:23 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 12/5/2019 6:58 AM, Yan Zhao wrote:
>>> On Thu, Dec 05, 2019 at 02:34:57AM +0800, Alex Williamson wrote:
>>>> On Wed, 4 Dec 2019 23:40:25 +0530
>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>   
>>>>> On 12/3/2019 11:34 PM, Alex Williamson wrote:
>>>>>> On Mon, 25 Nov 2019 19:57:39 -0500
>>>>>> Yan Zhao <yan.y.zhao@intel.com> wrote:
>>>>>>       
>>>>>>> On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote:
>>>>>>>> On Fri, 15 Nov 2019 00:26:07 +0530
>>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>>          
>>>>>>>>> On 11/14/2019 1:37 AM, Alex Williamson wrote:
>>>>>>>>>> On Thu, 14 Nov 2019 01:07:21 +0530
>>>>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>>>>            
>>>>>>>>>>> On 11/13/2019 4:00 AM, Alex Williamson wrote:
>>>>>>>>>>>> On Tue, 12 Nov 2019 22:33:37 +0530
>>>>>>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>>>>>>               
>>>>>>>>>>>>> All pages pinned by vendor driver through vfio_pin_pages API should be
>>>>>>>>>>>>> considered as dirty during migration. IOMMU container maintains a list of
>>>>>>>>>>>>> all such pinned pages. Added an ioctl defination to get bitmap of such
>>>>>>>>>>>>
>>>>>>>>>>>> definition
>>>>>>>>>>>>               
>>>>>>>>>>>>> pinned pages for requested IO virtual address range.
>>>>>>>>>>>>
>>>>>>>>>>>> Additionally, all mapped pages are considered dirty when physically
>>>>>>>>>>>> mapped through to an IOMMU, modulo we discussed devices opting in to
>>>>>>>>>>>> per page pinning to indicate finer granularity with a TBD mechanism to
>>>>>>>>>>>> figure out if any non-opt-in devices remain.
>>>>>>>>>>>>               
>>>>>>>>>>>
>>>>>>>>>>> You mean, in case of device direct assignment (device pass through)?
>>>>>>>>>>
>>>>>>>>>> Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
>>>>>>>>>> pinned and mapped, then the correct dirty page set is all mapped pages.
>>>>>>>>>> We discussed using the vpfn list as a mechanism for vendor drivers to
>>>>>>>>>> reduce their migration footprint, but we also discussed that we would
>>>>>>>>>> need a way to determine that all participants in the container have
>>>>>>>>>> explicitly pinned their working pages or else we must consider the
>>>>>>>>>> entire potential working set as dirty.
>>>>>>>>>>            
>>>>>>>>>
>>>>>>>>> How can vendor driver tell this capability to iommu module? Any suggestions?
>>>>>>>>
>>>>>>>> I think it does so by pinning pages.  Is it acceptable that if the
>>>>>>>> vendor driver pins any pages, then from that point forward we consider
>>>>>>>> the IOMMU group dirty page scope to be limited to pinned pages?  There
>>>>>>> we should also be aware of that dirty page scope is pinned pages + unpinned pages,
>>>>>>> which means ever since a page is pinned, it should be regarded as dirty
>>>>>>> no matter whether it's unpinned later. only after log_sync is called and
>>>>>>> dirty info retrieved, its dirty state should be cleared.
>>>>>>
>>>>>> Yes, good point.  We can't just remove a vpfn when a page is unpinned
>>>>>> or else we'd lose information that the page potentially had been
>>>>>> dirtied while it was pinned.  Maybe that vpfn needs to move to a dirty
>>>>>> list and both the currently pinned vpfns and the dirty vpfns are walked
>>>>>> on a log_sync.  The dirty vpfns list would be cleared after a log_sync.
>>>>>> The container would need to know that dirty tracking is enabled and
>>>>>> only manage the dirty vpfns list when necessary.  Thanks,
>>>>>>       
>>>>>
>>>>> If page is unpinned, then that page is available in free page pool for
>>>>> others to use, then how can we say that unpinned page has valid data?
>>>>>
>>>>> If suppose, one driver A unpins a page and when driver B of some other
>>>>> device gets that page and he pins it, uses it, and then unpins it, then
>>>>> how can we say that page has valid data for driver A?
>>>>>
>>>>> Can you give one example where unpinned page data is considered reliable
>>>>> and valid?
>>>>
>>>> We can only pin pages that the user has already allocated* and mapped
>>>> through the vfio DMA API.  The pinning of the page simply locks the
>>>> page for the vendor driver to access it and unpinning that page only
>>>> indicates that access is complete.  Pages are not freed when a vendor
>>>> driver unpins them, they still exist and at this point we're now
>>>> assuming the device dirtied the page while it was pinned.  Thanks,
>>>>
>>>> Alex
>>>>
>>>> * An exception here is that the page might be demand allocated and the
>>>>     act of pinning the page could actually allocate the backing page for
>>>>     the user if they have not faulted the page to trigger that allocation
>>>>     previously.  That page remains mapped for the user's virtual address
>>>>     space even after the unpinning though.
>>>>   
>>>
>>> Yes, I can give an example in GVT.
>>> when a gem_object is allocated in guest, before submitting it to guest
>>> vGPU, gfx cmds in its ring buffer need to be pinned into GGTT to get a
>>> global graphics address for hardware access. At that time, we shadow
>>> those cmds and pin pages through vfio pin_pages(), and submit the shadow
>>> gem_object to physical hardware.
>>> After guest driver thinks the submitted gem_object has completed hardware
>>> DMA, it unpins those pinned GGTT graphics memory addresses. Then in
>>> host, we unpin the shadow pages through vfio unpin_pages.
>>> But, at this point, guest driver is still free to access the gem_object
>>> through vCPUs, and guest user space is probably still mapping an object
>>> into the gem_object in guest driver.
>>> So, missing the dirty page tracking for unpinned pages would cause
>>> data inconsistency.
>>>    
>>
>> If pages are accessed by guest through vCPUs, then RAM module in QEMU
>> will take care of tracking those pages as dirty.
>>
>> All unpinned pages might not be used, so tracking all unpinned pages
>> during VM or application life time would also lead to tracking lots of
>> stale pages, even though they are not being used. An increasing number of
>> unneeded pages could also increase the migration data, leading to an
>> increase in migration downtime.
> 
> We can't rely on the vCPU also dirtying a page, the overhead is
> unavoidable.  It doesn't matter if the migration is fast if it's
> incorrect.  We only need to track unpinned dirty pages while the
> migration is active and the tracking is flushed on each log_sync
> callback.  Thanks,
> 

From Yan's comment, pasted below, I thought we need to track all unpinned
pages during the application's or VM's lifetime.

 > There we should also be aware of that dirty page scope is pinned
 > pages + unpinned pages, which means ever since a page is pinned, it
 > should be regarded as dirty no matter whether it's unpinned later.

But if it's about tracking pages which are unpinned "while the migration
is active", then that set would be smaller. I will make this change.

Thanks,
Kirti


* Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.
  2019-12-05  6:19                           ` Kirti Wankhede
@ 2019-12-05  6:40                             ` Alex Williamson
  0 siblings, 0 replies; 46+ messages in thread
From: Alex Williamson @ 2019-12-05  6:40 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Yan Zhao, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng, Liu,
	Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

On Thu, 5 Dec 2019 11:49:00 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 12/5/2019 11:26 AM, Alex Williamson wrote:
> > On Thu, 5 Dec 2019 11:12:23 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 12/5/2019 6:58 AM, Yan Zhao wrote:  
> >>> On Thu, Dec 05, 2019 at 02:34:57AM +0800, Alex Williamson wrote:  
> >>>> On Wed, 4 Dec 2019 23:40:25 +0530
> >>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>     
> >>>>> On 12/3/2019 11:34 PM, Alex Williamson wrote:  
> >>>>>> On Mon, 25 Nov 2019 19:57:39 -0500
> >>>>>> Yan Zhao <yan.y.zhao@intel.com> wrote:
> >>>>>>         
> >>>>>>> On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote:  
> >>>>>>>> On Fri, 15 Nov 2019 00:26:07 +0530
> >>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>>>>            
> >>>>>>>>> On 11/14/2019 1:37 AM, Alex Williamson wrote:  
> >>>>>>>>>> On Thu, 14 Nov 2019 01:07:21 +0530
> >>>>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>>>>>>              
> >>>>>>>>>>> On 11/13/2019 4:00 AM, Alex Williamson wrote:  
> >>>>>>>>>>>> On Tue, 12 Nov 2019 22:33:37 +0530
> >>>>>>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>>>>>>>>                 
> >>>>>>>>>>>>> All pages pinned by vendor driver through vfio_pin_pages API should be
> >>>>>>>>>>>>> considered as dirty during migration. IOMMU container maintains a list of
> >>>>>>>>>>>>> all such pinned pages. Added an ioctl defination to get bitmap of such  
> >>>>>>>>>>>>
> >>>>>>>>>>>> definition
> >>>>>>>>>>>>                 
> >>>>>>>>>>>>> pinned pages for requested IO virtual address range.  
> >>>>>>>>>>>>
> >>>>>>>>>>>> Additionally, all mapped pages are considered dirty when physically
> >>>>>>>>>>>> mapped through to an IOMMU, modulo we discussed devices opting in to
> >>>>>>>>>>>> per page pinning to indicate finer granularity with a TBD mechanism to
> >>>>>>>>>>>> figure out if any non-opt-in devices remain.
> >>>>>>>>>>>>                 
> >>>>>>>>>>>
> >>>>>>>>>>> You mean, in case of device direct assignment (device pass through)?  
> >>>>>>>>>>
> >>>>>>>>>> Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
> >>>>>>>>>> pinned and mapped, then the correct dirty page set is all mapped pages.
> >>>>>>>>>> We discussed using the vpfn list as a mechanism for vendor drivers to
> >>>>>>>>>> reduce their migration footprint, but we also discussed that we would
> >>>>>>>>>> need a way to determine that all participants in the container have
> >>>>>>>>>> explicitly pinned their working pages or else we must consider the
> >>>>>>>>>> entire potential working set as dirty.
> >>>>>>>>>>              
> >>>>>>>>>
> >>>>>>>>> How can vendor driver tell this capability to iommu module? Any suggestions?  
> >>>>>>>>
> >>>>>>>> I think it does so by pinning pages.  Is it acceptable that if the
> >>>>>>>> vendor driver pins any pages, then from that point forward we consider
> >>>>>>>> the IOMMU group dirty page scope to be limited to pinned pages?  There  
> >>>>>>> we should also be aware of that dirty page scope is pinned pages + unpinned pages,
> >>>>>>> which means ever since a page is pinned, it should be regarded as dirty
> >>>>>>> no matter whether it's unpinned later. only after log_sync is called and
> >>>>>>> dirty info retrieved, its dirty state should be cleared.  
> >>>>>>
> >>>>>> Yes, good point.  We can't just remove a vpfn when a page is unpinned
> >>>>>> or else we'd lose information that the page potentially had been
> >>>>>> dirtied while it was pinned.  Maybe that vpfn needs to move to a dirty
> >>>>>> list and both the currently pinned vpfns and the dirty vpfns are walked
> >>>>>> on a log_sync.  The dirty vpfns list would be cleared after a log_sync.
> >>>>>> The container would need to know that dirty tracking is enabled and
> >>>>>> only manage the dirty vpfns list when necessary.  Thanks,
> >>>>>>         
> >>>>>
> >>>>> If page is unpinned, then that page is available in free page pool for
> >>>>> others to use, then how can we say that unpinned page has valid data?
> >>>>>
> >>>>> If suppose, one driver A unpins a page and when driver B of some other
> >>>>> device gets that page and he pins it, uses it, and then unpins it, then
> >>>>> how can we say that page has valid data for driver A?
> >>>>>
> >>>>> Can you give one example where unpinned page data is considered reliable
> >>>>> and valid?  
> >>>>
> >>>> We can only pin pages that the user has already allocated* and mapped
> >>>> through the vfio DMA API.  The pinning of the page simply locks the
> >>>> page for the vendor driver to access it and unpinning that page only
> >>>> indicates that access is complete.  Pages are not freed when a vendor
> >>>> driver unpins them, they still exist and at this point we're now
> >>>> assuming the device dirtied the page while it was pinned.  Thanks,
> >>>>
> >>>> Alex
> >>>>
> >>>> * An exception here is that the page might be demand allocated and the
> >>>>     act of pinning the page could actually allocate the backing page for
> >>>>     the user if they have not faulted the page to trigger that allocation
> >>>>     previously.  That page remains mapped for the user's virtual address
> >>>>     space even after the unpinning though.
> >>>>     
> >>>
> >>> Yes, I can give an example in GVT.
> >>> when a gem_object is allocated in guest, before submitting it to guest
> >>> vGPU, gfx cmds in its ring buffer need to be pinned into GGTT to get a
> >>> global graphics address for hardware access. At that time, we shadow
> >>> those cmds and pin pages through vfio pin_pages(), and submit the shadow
> >>> gem_object to physical hardware.
> >>> After guest driver thinks the submitted gem_object has completed hardware
> >>> DMA, it unpins those pinned GGTT graphics memory addresses. Then in
> >>> host, we unpin the shadow pages through vfio unpin_pages.
> >>> But, at this point, guest driver is still free to access the gem_object
> >>> through vCPUs, and guest user space is probably still mapping an object
> >>> into the gem_object in guest driver.
> >>> So, missing the dirty page tracking for unpinned pages would cause
> >>> data inconsistency.
> >>>      
> >>
> >> If pages are accessed by guest through vCPUs, then RAM module in QEMU
> >> will take care of tracking those pages as dirty.
> >>
> >> All unpinned pages might not be used, so tracking all unpinned pages
> >> during VM or application life time would also lead to tracking lots of
> >> stale pages, even though they are not being used. An increasing number of
> >> unneeded pages could also increase the migration data, leading to an
> >> increase in migration downtime.
> > 
> > We can't rely on the vCPU also dirtying a page, the overhead is
> > unavoidable.  It doesn't matter if the migration is fast if it's
> > incorrect.  We only need to track unpinned dirty pages while the
> > migration is active and the tracking is flushed on each log_sync
> > callback.  Thanks,
> >   
> 
> From Yan's comment, pasted below, I thought we need to track all unpinned
> pages during the application's or VM's lifetime.
> 
>  > There we should also be aware of that dirty page scope is pinned
>  > pages + unpinned pages, which means ever since a page is pinned, it
>  > should be regarded as dirty no matter whether it's unpinned later.
> 
> But if it's about tracking pages which are unpinned "while the migration
> is active", then that set would be smaller. I will make this change.

When we first start a pre-copy, all RAM (including anything previously
pinned) is dirty anyway.  I believe the log_sync callback is only
intended to report pages dirtied since the migration was started, or
since the last log_sync callback.  We then assume that currently pinned
pages are constantly dirty and previously pinned pages are dirty until
we've reported them through log_sync.  Thanks,

Alex
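
The rule described above can be illustrated with a small user-space model.
This is only a sketch under stated assumptions, not the vfio_iommu_type1
implementation: struct dirty_tracker, NPAGES and the tracker_*() helpers are
hypothetical names invented for this example, and a plain array stands in for
the container's vpfn list. It shows currently pinned pages being reported on
every sync, while pages unpinned during the migration are reported once and
then cleared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NPAGES 64 /* hypothetical size of the tracked range, in pages */

/* Hypothetical model of per-container dirty state (not a kernel structure). */
struct dirty_tracker {
	bool tracking;                  /* true while a migration is active */
	unsigned int pin_count[NPAGES]; /* outstanding pins per page */
	bool unpinned_dirty[NPAGES];    /* unpinned since the last sync */
};

static void tracker_init(struct dirty_tracker *t)
{
	memset(t, 0, sizeof(*t));
}

/* Migration start: only unpins from this point on need to be remembered. */
static void tracker_start(struct dirty_tracker *t)
{
	t->tracking = true;
	memset(t->unpinned_dirty, 0, sizeof(t->unpinned_dirty));
}

static void tracker_pin(struct dirty_tracker *t, size_t pfn)
{
	assert(pfn < NPAGES);
	t->pin_count[pfn]++;
}

/* An unpinned page may have been written by the device while pinned. */
static void tracker_unpin(struct dirty_tracker *t, size_t pfn)
{
	assert(pfn < NPAGES && t->pin_count[pfn] > 0);
	t->pin_count[pfn]--;
	if (t->tracking)
		t->unpinned_dirty[pfn] = true;
}

/*
 * Model of a log_sync: a page is dirty if it is pinned right now, or if it
 * was unpinned since the previous sync. The second set is cleared once it
 * has been reported. Returns the number of dirty pages.
 */
static size_t tracker_log_sync(struct dirty_tracker *t, bool dirty[NPAGES])
{
	size_t n = 0;

	for (size_t i = 0; i < NPAGES; i++) {
		dirty[i] = t->pin_count[i] > 0 || t->unpinned_dirty[i];
		t->unpinned_dirty[i] = false;
		if (dirty[i])
			n++;
	}
	return n;
}
```

In this model a page that was pinned and then unpinned during pre-copy shows
up in exactly one sync, whereas a page that stays pinned shows up in every
sync until it is released.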




Thread overview: 46+ messages
2019-11-12 17:03 [PATCH v9 Kernel 0/5] Add KABIs to support migration for VFIO devices Kirti Wankhede
2019-11-12 17:03 ` [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state Kirti Wankhede
2019-11-12 22:30   ` Alex Williamson
2019-11-13  3:23     ` Yan Zhao
2019-11-13 19:02       ` Kirti Wankhede
2019-11-14  0:36         ` Yan Zhao
2019-11-14 18:55           ` Kirti Wankhede
2019-11-13 10:24     ` Cornelia Huck
2019-11-13 18:27       ` Alex Williamson
2019-11-13 19:29         ` Kirti Wankhede
2019-11-13 19:48           ` Alex Williamson
2019-11-13 20:17             ` Kirti Wankhede
2019-11-13 20:40               ` Alex Williamson
2019-11-14 18:49                 ` Kirti Wankhede
2019-11-12 17:03 ` [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap Kirti Wankhede
2019-11-12 22:30   ` Alex Williamson
2019-11-13 19:37     ` Kirti Wankhede
2019-11-13 20:07       ` Alex Williamson
2019-11-14 18:56         ` Kirti Wankhede
2019-11-14 21:06           ` Alex Williamson
2019-11-15  2:40             ` Yan Zhao
2019-11-15  3:21               ` Alex Williamson
2019-11-15  5:10                 ` Tian, Kevin
2019-11-19 23:16                   ` Alex Williamson
2019-11-20  1:04                     ` Tian, Kevin
2019-11-20  1:51                 ` Yan Zhao
2019-11-26  0:57             ` Yan Zhao
2019-12-03 18:04               ` Alex Williamson
2019-12-04 18:10                 ` Kirti Wankhede
2019-12-04 18:34                   ` Alex Williamson
2019-12-05  1:28                     ` Yan Zhao
2019-12-05  5:42                       ` Kirti Wankhede
2019-12-05  5:47                         ` Yan Zhao
2019-12-05  5:56                         ` Alex Williamson
2019-12-05  6:19                           ` Kirti Wankhede
2019-12-05  6:40                             ` Alex Williamson
2019-11-12 17:03 ` [PATCH v9 Kernel 3/5] vfio iommu: Add ioctl defination to unmap IOVA and return dirty bitmap Kirti Wankhede
2019-11-12 22:30   ` Alex Williamson
2019-11-13 19:52     ` Kirti Wankhede
2019-11-13 20:22       ` Alex Williamson
2019-11-14 18:56         ` Kirti Wankhede
2019-11-14 21:08           ` Alex Williamson
2019-11-12 17:03 ` [PATCH v9 Kernel 4/5] vfio iommu: Implementation of ioctl to get dirty pages bitmap Kirti Wankhede
2019-11-12 22:30   ` Alex Williamson
2019-11-12 17:03 ` [PATCH v9 Kernel 5/5] vfio iommu: Implementation of ioctl to get dirty bitmap before unmap Kirti Wankhede
2019-11-12 22:30   ` Alex Williamson
