* [PATCH v12 Kernel 0/7] KABIs to support migration for VFIO devices
@ 2020-02-07 19:42 ` Kirti Wankhede
  0 siblings, 0 replies; 62+ messages in thread
From: Kirti Wankhede @ 2020-02-07 19:42 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

Hi,

This patch set adds:
* New IOCTL VFIO_IOMMU_DIRTY_PAGES to get the dirty pages bitmap with
  respect to the IOMMU container rather than per device. All pages pinned
  by a vendor driver through the vfio_pin_pages external API have to be
  marked as dirty during migration. When an IOMMU capable device is
  present in the container and all pages are pinned and mapped, then all
  pages are marked dirty.
  When there are CPU writes, CPU dirty page tracking can identify dirtied
  pages, but any page pinned by a vendor driver can also be written by the
  device. As of now there is no device which has hardware support for
  dirty page tracking, so all pages which are pinned should be considered
  dirty.
  This ioctl is also used to start/stop dirty pages tracking for pinned and
  unpinned pages while migration is active.

* Updated IOCTL VFIO_IOMMU_UNMAP_DMA to get the dirty pages bitmap before
  unmapping an IO virtual address range.
  With vIOMMU, during the pre-copy phase of migration, while CPUs are still
  running, an IO virtual address unmap can happen while the device still
  holds references to guest pfns. Those pages should be reported as dirty
  before unmap, so that the VFIO user space application can copy the
  content of those pages from source to destination.

* Patch 7 is a proposed change to detect whether an IOMMU capable device
  driver is smart enough to report pages to be marked dirty by pinning
  them using the vfio_pin_pages() API.
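
The "every pinned page is dirty" rule from the first bullet can be
illustrated with a small user-space sketch. This is not the kernel code
from this series: the names dirty_bitmap_alloc(), mark_pinned_dirty(),
mark_all_dirty() and the 4 KiB page size are assumptions made only for
illustration. It keeps one bit per page of an IOVA range, sets the bit for
each pinned page, and sets every bit for the fully pinned and mapped case.

```c
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* One bit per page of the tracked IOVA range (illustrative only). */
struct dirty_bitmap {
	uint64_t iova;		/* start of the tracked range */
	uint64_t npages;	/* number of pages in the range */
	uint64_t *bits;		/* npages bits, rounded up to 64-bit words */
};

static int dirty_bitmap_alloc(struct dirty_bitmap *b, uint64_t iova,
			      uint64_t size)
{
	b->iova = iova;
	b->npages = size >> SKETCH_PAGE_SHIFT;
	b->bits = calloc((b->npages + 63) / 64, sizeof(uint64_t));
	return b->bits ? 0 : -1;
}

/* A pinned page may be written by the device, so report it as dirty. */
static int mark_pinned_dirty(struct dirty_bitmap *b, uint64_t iova)
{
	uint64_t pg;

	if (iova < b->iova)
		return -1;
	pg = (iova - b->iova) >> SKETCH_PAGE_SHIFT;
	if (pg >= b->npages)
		return -1;
	b->bits[pg / 64] |= 1ULL << (pg % 64);
	return 0;
}

/* Everything pinned and mapped: every page has to be treated as dirty. */
static void mark_all_dirty(struct dirty_bitmap *b)
{
	for (uint64_t pg = 0; pg < b->npages; pg++)
		b->bits[pg / 64] |= 1ULL << (pg % 64);
}

static int page_is_dirty(const struct dirty_bitmap *b, uint64_t pg)
{
	return (int)((b->bits[pg / 64] >> (pg % 64)) & 1);
}

static uint64_t count_dirty(const struct dirty_bitmap *b)
{
	uint64_t n = 0;

	for (uint64_t pg = 0; pg < b->npages; pg++)
		n += (uint64_t)page_is_dirty(b, pg);
	return n;
}
```

A per-page bitmap keeps the cost of reporting proportional to the size of
the range, at the price of over-reporting whenever the device did not
actually write to a pinned page.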


Yet TODO:
Since there is no device which has hardware support for system memory
dirty bitmap tracking, right now there is no other API from vendor driver
to VFIO IOMMU module to report dirty pages. In future, when such hardware
support is implemented, an API will be required so that the vendor
driver can report dirty pages to the VFIO module during migration phases.

Adding revision history from the previous QEMU patch set to help
understand the KABI changes done so far:

v11 -> v12:
- Changed bitmap allocation in vfio_iommu_type1.
- Remove atomicity of ref_count.
- Updated comments for migration device state structure about error
  reporting.
- Nit picks from v11 reviews

v10 -> v11:
- Fix pin pages API to free vpfn if it is marked as unpinned tracking page.
- Added proposal to detect if IOMMU capable device calls external pin pages
  API to mark pages dirty.
- Nit picks from v10 reviews

v9 -> v10:
- Updated existing VFIO_IOMMU_UNMAP_DMA ioctl to get dirty pages bitmap
  during unmap while migration is active
- Added flag in VFIO_IOMMU_GET_INFO to indicate that the driver supports
  dirty page tracking.
- If iommu_mapped, mark all pages dirty.
- Added unpinned pages tracking while migration is active.
- Updated comments for migration device state structure with bit
  combination table and state transition details.

v8 -> v9:
- Split patch set in 2 sets, Kernel and QEMU.
- Dirty pages bitmap is queried from the IOMMU container rather than from
  the vendor driver per device. Added 2 ioctls to achieve this.

v7 -> v8:
- Updated comments for KABI
- Added BAR address validation check during PCI device's config space load
  as suggested by Dr. David Alan Gilbert.
- Changed vfio_migration_set_state() to set or clear device state flags.
- Some nit fixes.

v6 -> v7:
- Fix build failures.

v5 -> v6:
- Fix build failure.

v4 -> v5:
- Added descriptive comment about the sequence of access of members of
  structure vfio_device_migration_info to be followed, based on Alex's
  suggestion.
- Updated get dirty pages sequence.
- As per Cornelia Huck's suggestion, added callbacks to VFIODeviceOps to
  get_object, save_config and load_config.
- Fixed multiple nit picks.
- Tested live migration with multiple vfio devices assigned to a VM.

v3 -> v4:
- Added one more bit for _RESUMING flag to be set explicitly.
- data_offset field is read-only for user space application.
- data_size is read in every iteration before reading data from the
  migration region; this removes the assumption that data extends to the
  end of the migration region.
- If vendor driver supports mappable sparse regions, map those regions
  during the setup state of save/load and unmap them from the cleanup
  routines.
- Handles race condition that causes data corruption in migration region
  during save device state by adding a mutex and serializing the
  save_buffer and get_dirty_pages routines.
- Skip calling the get_dirty_pages routine for mapped MMIO regions of the
  device.
- Added trace events.
- Split into multiple functional patches.

v2 -> v3:
- Removed enum of VFIO device states. Defined VFIO device state with 2
  bits.
- Re-structured vfio_device_migration_info to keep it minimal and defined
  action on read and write access on its members.

v1 -> v2:
- Defined MIGRATION region type and sub-type which should be used with
  region type capability.
- Re-structured vfio_device_migration_info. This structure will be placed
  at 0th offset of migration region.
- Replaced ioctl with read/write for trapped part of migration region.
- Added support for both types of access, trapped or mmapped, for the data
  section of the region.
- Moved PCI device functions to pci file.
- Added iteration to get dirty page bitmap until bitmap for all requested
  pages are copied.

Thanks,
Kirti


Kirti Wankhede (7):
  vfio: KABI for migration interface for device state
  vfio iommu: Remove atomicity of ref_count of pinned pages
  vfio iommu: Add ioctl definition for dirty pages tracking.
  vfio iommu: Implementation of ioctl to for dirty pages tracking.
  vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap
  vfio iommu: Adds flag to indicate dirty pages tracking capability
    support
  vfio: Selective dirty page tracking if IOMMU backed device pins pages

 drivers/vfio/vfio.c             |  13 +-
 drivers/vfio/vfio_iommu_type1.c | 435 +++++++++++++++++++++++++++++++++++++---
 include/linux/vfio.h            |   4 +-
 include/uapi/linux/vfio.h       | 267 +++++++++++++++++++++++-
 4 files changed, 692 insertions(+), 27 deletions(-)

-- 
2.7.0


* [PATCH v12 Kernel 1/7] vfio: KABI for migration interface for device state
  2020-02-07 19:42 ` Kirti Wankhede
@ 2020-02-07 19:42   ` Kirti Wankhede
  -1 siblings, 0 replies; 62+ messages in thread
From: Kirti Wankhede @ 2020-02-07 19:42 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

- Defined MIGRATION region type and sub-type.

- Defined vfio_device_migration_info structure which will be placed at 0th
  offset of migration region to get/set VFIO device related information.
  Defined members of structure and usage on read/write access.

- Defined device states and state transition details.

- Defined sequence to be followed while saving and resuming VFIO device.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 include/uapi/linux/vfio.h | 208 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 208 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9e843a147ead..572242620ce9 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
 #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
 #define VFIO_REGION_TYPE_GFX                    (1)
 #define VFIO_REGION_TYPE_CCW			(2)
+#define VFIO_REGION_TYPE_MIGRATION              (3)
 
 /* sub-types for VFIO_REGION_TYPE_PCI_* */
 
@@ -379,6 +380,213 @@ struct vfio_region_gfx_edid {
 /* sub-types for VFIO_REGION_TYPE_CCW */
 #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
 
+/* sub-types for VFIO_REGION_TYPE_MIGRATION */
+#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
+
+/*
+ * Structure vfio_device_migration_info is placed at 0th offset of
+ * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
+ * information. Field accesses from this structure are only supported at their
+ * native width and alignment, otherwise the result is undefined and vendor
+ * drivers should return an error.
+ *
+ * device_state: (read/write)
+ *      - User application writes this field to inform vendor driver about the
+ *        device state to be transitioned to.
+ *      - Vendor driver should take necessary actions to change device state.
+ *        On successful transition to given state, vendor driver should return
+ *        success on write(device_state, state) system call. If device state
+ *        transition fails, vendor driver should return error, -EFAULT.
+ *      - On user application side, if device state transition fails, i.e. if
+ *        write(device_state, state) returns error, read device_state again to
+ *        determine the current state of the device from vendor driver.
+ *      - Vendor driver should return previous state of the device unless vendor
+ *        driver has encountered an internal error, in which case vendor driver
+ *        may report the device_state VFIO_DEVICE_STATE_ERROR.
+ *	- User application must use the device reset ioctl in order to recover
+ *	  the device from VFIO_DEVICE_STATE_ERROR state. If the device is
+ *	  indicated in a valid device state via reading device_state, the user
+ *	  application may decide to attempt to transition the device to any
+ *	  valid state reachable from the current state or terminate itself.
+ *
+ *      device_state consists of 3 bits:
+ *      - If bit 0 set, indicates _RUNNING state. When it's clear, that
+ *	  indicates _STOP state. When device is changed to _STOP, driver should
+ *	  stop device before write() returns.
+ *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
+ *        should start gathering device state information which will be provided
+ *        to VFIO user application to save device's state.
+ *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
+ *        prepare to resume device, data provided through migration region
+ *        should be used to resume device.
+ *      Bits 3 - 31 are reserved for future use. In order to preserve them,
+ *	user application should perform read-modify-write operation on this
+ *	field when modifying the specified bits.
+ *
+ *  +------- _RESUMING
+ *  |+------ _SAVING
+ *  ||+----- _RUNNING
+ *  |||
+ *  000b => Device Stopped, not saving or resuming
+ *  001b => Device running state, default state
+ *  010b => Stop Device & save device state, stop-and-copy state
+ *  011b => Device running and save device state, pre-copy state
+ *  100b => Device stopped and device state is resuming
+ *  101b => Invalid state
+ *  110b => Error state
+ *  111b => Invalid state
+ *
+ * State transitions:
+ *
+ *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
+ *                (100b)     (001b)     (011b)        (010b)       (000b)
+ * 0. Running or Default state
+ *                             |
+ *
+ * 1. Normal Shutdown (optional)
+ *                             |------------------------------------->|
+ *
+ * 2. Save state or Suspend
+ *                             |------------------------->|---------->|
+ *
+ * 3. Save state during live migration
+ *                             |----------->|------------>|---------->|
+ *
+ * 4. Resuming
+ *                  |<---------|
+ *
+ * 5. Resumed
+ *                  |--------->|
+ *
+ * 0. Default state of VFIO device is _RUNNING when user application starts.
+ * 1. During normal user application shutdown, vfio device state changes
+ *    from _RUNNING to _STOP. This is optional, user application may or may not
+ *    perform this state transition and vendor driver may not need it.
+ * 2. When user application saves state or suspends, device state
+ *    transitions from _RUNNING to stop-and-copy state and then to _STOP.
+ *    On state transition from _RUNNING to stop-and-copy, driver must
+ *    stop device, save device state and send it to application through
+ *    migration region. Sequence to be followed for such transition is given
+ *    below.
+ * 3. In user application live migration, state transitions from _RUNNING
+ *    to pre-copy to stop-and-copy to _STOP.
+ *    On state transition from _RUNNING to pre-copy, driver should start
+ *    gathering device state while application is still running and send device
+ *    state data to application through migration region.
+ *    On state transition from pre-copy to stop-and-copy, driver must stop
+ *    device, save device state and send it to user application through
+ *    migration region.
+ *    Sequence to be followed for above two transitions is given below.
+ * 4. To start resuming phase, device state should be transitioned from
+ *    _RUNNING to _RESUMING state.
+ *    In _RESUMING state, driver should use received device state data through
+ *    migration region to resume device.
+ * 5. On providing saved device data to driver, application should change state
+ *    from _RESUMING to _RUNNING.
+ *
+ * pending_bytes: (read only)
+ *      Number of pending bytes yet to be migrated from vendor driver.
+ *
+ * data_offset: (read only)
+ *      User application should read data_offset in migration region from where
+ *      user application should read device data during _SAVING state or write
+ *      device data during _RESUMING state. See below for detail of sequence to
+ *      be followed.
+ *
+ * data_size: (read/write)
+ *      User application should read data_size to get size of data copied in
+ *      bytes in migration region during _SAVING state and write size of data
+ *      copied in bytes in migration region during _RESUMING state.
+ *
+ * Migration region looks like:
+ *  ------------------------------------------------------------------
+ * |vfio_device_migration_info|    data section                      |
+ * |                          |     ///////////////////////////////  |
+ * ------------------------------------------------------------------
+ *   ^                              ^
+ *  offset 0-trapped part        data_offset
+ *
+ * Structure vfio_device_migration_info is always followed by data section in
+ * the region, so data_offset will always be non-0. Offset from where data is
+ * copied is decided by kernel driver, data section can be trapped or mapped
+ * or partitioned, depending on how kernel driver defines data section.
+ * Data section partition can be defined as mapped by sparse mmap capability.
+ * If mmapped, then data_offset should be page aligned, whereas initial section
+ * which contains vfio_device_migration_info structure might not end at offset
+ * which is page aligned. The user is not required to access via mmap regardless
+ * of the region mmap capabilities.
+ * Vendor driver should decide whether to partition data section and how to
+ * partition the data section. Vendor driver should return data_offset
+ * accordingly.
+ *
+ * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
+ * and for _SAVING device state or stop-and-copy phase:
+ * a. read pending_bytes, indicates start of new iteration to get device data.
+ *    Repeated reads of pending_bytes at this stage should have no side
+ *    effect.
+ *    If pending_bytes == 0, user application should not iterate to get data
+ *    for that device.
+ *    If pending_bytes > 0, go through below steps.
+ * b. read data_offset, indicates to vendor driver that it should make data
+ *    available through data section. Vendor driver should return from this
+ *    read operation only after data is available from (region + data_offset)
+ *    to (region + data_offset + data_size).
+ * c. read data_size, amount of data in bytes available through migration
+ *    region.
+ *    Read on data_offset and data_size should return offset and size of current
+ *    buffer if user application reads those more than once here.
+ * d. read data of data_size bytes from (region + data_offset) from migration
+ *    region.
+ * e. process data.
+ * f. read pending_bytes, this read operation indicates data from previous
+ *    iteration has been read. If pending_bytes > 0, goto step b.
+ *
+ * If there is any error during the above sequence, vendor driver can return
+ * error code for next read()/write() operation, that will terminate the loop
+ * and user should take next necessary action, for example, fail migration or
+ * terminate user application.
+ *
+ * User application can transition from _SAVING|_RUNNING (pre-copy state) to
+ * _SAVING (stop-and-copy) state regardless of pending bytes.
+ * User application should iterate in _SAVING (stop-and-copy) until
+ * pending_bytes is 0.
+ *
+ * Sequence to be followed while _RESUMING device state:
+ * While data for this device is available, repeat below steps:
+ * a. read data_offset from where user application should write data.
+ * b. write data of data_size to migration region from data_offset. Data size
+ *    should be data packet size at source during _SAVING.
+ * c. write data_size which indicates to vendor driver that data is written in
+ *    migration region. Vendor driver should read this data from migration
+ *    region and resume device's state.
+ *
+ * For user application, data is opaque. User application should write data in
+ * the same order as received and in the same transaction sizes as at source.
+ */
+
+struct vfio_device_migration_info {
+	__u32 device_state;         /* VFIO device state */
+#define VFIO_DEVICE_STATE_STOP      (0)
+#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
+#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
+#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
+#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
+				     VFIO_DEVICE_STATE_SAVING |  \
+				     VFIO_DEVICE_STATE_RESUMING)
+
+#define VFIO_DEVICE_STATE_VALID(state) \
+	(state & VFIO_DEVICE_STATE_RESUMING ? \
+	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
+
+#define VFIO_DEVICE_STATE_ERROR			\
+		(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING)
+
+	__u32 reserved;
+	__u64 pending_bytes;
+	__u64 data_offset;
+	__u64 data_size;
+} __attribute__((packed));
+
 /*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within
-- 
2.7.0
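
The device_state bit table in the comment above can be checked with a
short user-space sketch. The macros are copied from the patch; the helpers
device_state_name() and device_state_set() are hypothetical names added
only for illustration and are not part of the proposed KABI. The first
maps the low three bits to the rows of the table, the second shows the
read-modify-write that the comment asks for so that reserved bits 3 - 31
are preserved.

```c
#include <stdint.h>

/* Bit definitions copied from the patch above. */
#define VFIO_DEVICE_STATE_STOP      (0)
#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
				     VFIO_DEVICE_STATE_SAVING |  \
				     VFIO_DEVICE_STATE_RESUMING)

#define VFIO_DEVICE_STATE_VALID(state) \
	(state & VFIO_DEVICE_STATE_RESUMING ? \
	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)

#define VFIO_DEVICE_STATE_ERROR			\
		(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING)

/* Hypothetical helper: name of the state, following the 3-bit table. */
static const char *device_state_name(uint32_t state)
{
	switch (state & VFIO_DEVICE_STATE_MASK) {
	case VFIO_DEVICE_STATE_STOP:
		return "stopped";
	case VFIO_DEVICE_STATE_RUNNING:
		return "running";
	case VFIO_DEVICE_STATE_SAVING:
		return "stop-and-copy";
	case VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING:
		return "pre-copy";
	case VFIO_DEVICE_STATE_RESUMING:
		return "resuming";
	case VFIO_DEVICE_STATE_ERROR:
		return "error";
	default:
		return "invalid";	/* 101b and 111b */
	}
}

/*
 * Hypothetical helper: replace only the three defined bits of a value that
 * was read from device_state, keeping the reserved bits as they were.
 */
static uint32_t device_state_set(uint32_t cur, uint32_t new_bits)
{
	return (cur & ~(uint32_t)VFIO_DEVICE_STATE_MASK) |
	       (new_bits & VFIO_DEVICE_STATE_MASK);
}
```

Note that VFIO_DEVICE_STATE_VALID() as posted evaluates to false for 110b,
so the error state is reported but is not a state the user may request. The
macro also uses its argument unparenthesised, so it is safest to pass a
plain variable or constant.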

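The save sequence (steps a to f in the comment above) can be walked
through against a mock of the migration region. This is only a sketch
under stated assumptions: struct mock_region, mock_read_u64() and
save_device_state() are hypothetical names, the "vendor driver" is
simulated in user space, and a real application would issue reads on the
VFIO device fd at the region offset instead. The mock follows the side
effects the comment describes: reading data_offset stages the next buffer,
and the next read of pending_bytes marks that buffer as consumed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Field order and packing copied from the patch above. */
struct vfio_device_migration_info {
	uint32_t device_state;
	uint32_t reserved;
	uint64_t pending_bytes;
	uint64_t data_offset;
	uint64_t data_size;
} __attribute__((packed));

#define INFO_OFF(f) offsetof(struct vfio_device_migration_info, f)

/* Hypothetical stand-in for a vendor driver's migration region. */
struct mock_region {
	uint8_t mem[4096];	/* region bytes, data section at data_off */
	uint64_t data_off;	/* non-zero offset of the data section */
	const uint8_t *src;	/* device state not yet handed out */
	uint64_t remaining;	/* bytes left in src */
	uint64_t chunk;		/* most bytes staged per iteration */
	uint64_t staged;	/* bytes currently staged in the data section */
};

/* Simulated trapped read of one 64-bit field at its native width. */
static int mock_read_u64(struct mock_region *r, size_t off, uint64_t *val)
{
	switch (off) {
	case INFO_OFF(pending_bytes):
		/* Steps a/f: the previous buffer, if any, has been read. */
		r->src += r->staged;
		r->remaining -= r->staged;
		r->staged = 0;
		*val = r->remaining;
		return 0;
	case INFO_OFF(data_offset):
		/* Step b: stage data once; repeated reads see the same buffer. */
		if (r->staged == 0) {
			r->staged = r->remaining < r->chunk ?
				    r->remaining : r->chunk;
			memcpy(r->mem + r->data_off, r->src, r->staged);
		}
		*val = r->data_off;
		return 0;
	case INFO_OFF(data_size):
		*val = r->staged;	/* step c */
		return 0;
	default:
		return -1;
	}
}

/* Returns the number of bytes saved into out, or -1 on error. */
static int64_t save_device_state(struct mock_region *r, uint8_t *out,
				 size_t cap)
{
	uint64_t pending, off, size;
	size_t total = 0;

	if (mock_read_u64(r, INFO_OFF(pending_bytes), &pending))	/* a */
		return -1;
	while (pending > 0) {
		if (mock_read_u64(r, INFO_OFF(data_offset), &off) ||	/* b */
		    mock_read_u64(r, INFO_OFF(data_size), &size))	/* c */
			return -1;
		if (size > cap - total)
			return -1;
		memcpy(out + total, r->mem + off, size);		/* d */
		total += size;		/* e: "process" here means append */
		if (mock_read_u64(r, INFO_OFF(pending_bytes), &pending))	/* f */
			return -1;
	}
	return (int64_t)total;
}
```

The same loop would serve both the pre-copy and the stop-and-copy phase;
only the device_state written beforehand differs.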

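The _RESUMING sequence (steps a to c in the comment above) is the mirror
image and can be sketched the same way. Again the destination "vendor
driver" is a user-space mock, and struct mock_dst_region,
mock_read_data_offset(), mock_write_data_size() and resume_device_state()
are hypothetical names used only for illustration. The packets are replayed
in the order and with the sizes they had at the source, as the comment
requires.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the destination's migration region. */
struct mock_dst_region {
	uint8_t mem[256];	/* region bytes, data section at data_off */
	uint64_t data_off;	/* non-zero offset of the data section */
	uint8_t restored[64];	/* what the mock "device" has consumed */
	size_t restored_len;
};

/* Step a: simulated trapped read of data_offset. */
static uint64_t mock_read_data_offset(const struct mock_dst_region *r)
{
	return r->data_off;
}

/*
 * Step c: simulated trapped write of data_size. The mock driver reads that
 * many bytes from the data section and applies them to the device.
 */
static int mock_write_data_size(struct mock_dst_region *r, uint64_t size)
{
	if (size > sizeof(r->mem) - r->data_off ||
	    size > sizeof(r->restored) - r->restored_len)
		return -1;
	memcpy(r->restored + r->restored_len, r->mem + r->data_off, size);
	r->restored_len += size;
	return 0;
}

/* Replays npkts packets from stream; pkt_sizes are the sizes at source. */
static int resume_device_state(struct mock_dst_region *r,
			       const uint8_t *stream,
			       const uint64_t *pkt_sizes, size_t npkts)
{
	for (size_t i = 0; i < npkts; i++) {
		uint64_t off = mock_read_data_offset(r);	/* a */

		if (pkt_sizes[i] > sizeof(r->mem) - off)
			return -1;
		memcpy(r->mem + off, stream, pkt_sizes[i]);	/* b */
		if (mock_write_data_size(r, pkt_sizes[i]))	/* c */
			return -1;
		stream += pkt_sizes[i];
	}
	return 0;
}
```

Since the data is opaque to the user application, the sketch never looks
inside a packet; it only preserves packet boundaries and order.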

* [PATCH v12 Kernel 1/7] vfio: KABI for migration interface for device state
@ 2020-02-07 19:42   ` Kirti Wankhede
  0 siblings, 0 replies; 62+ messages in thread
From: Kirti Wankhede @ 2020-02-07 19:42 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, yan.y.zhao, kvm, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue

- Defined MIGRATION region type and sub-type.

- Defined vfio_device_migration_info structure which will be placed at 0th
  offset of migration region to get/set VFIO device related information.
  Defined members of structure and usage on read/write access.

- Defined device states and state transition details.

- Defined sequence to be followed while saving and resuming VFIO device.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 include/uapi/linux/vfio.h | 208 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 208 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9e843a147ead..572242620ce9 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
 #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
 #define VFIO_REGION_TYPE_GFX                    (1)
 #define VFIO_REGION_TYPE_CCW			(2)
+#define VFIO_REGION_TYPE_MIGRATION              (3)
 
 /* sub-types for VFIO_REGION_TYPE_PCI_* */
 
@@ -379,6 +380,213 @@ struct vfio_region_gfx_edid {
 /* sub-types for VFIO_REGION_TYPE_CCW */
 #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
 
+/* sub-types for VFIO_REGION_TYPE_MIGRATION */
+#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
+
+/*
+ * Structure vfio_device_migration_info is placed at 0th offset of
+ * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
+ * information. Field accesses from this structure are only supported at their
+ * native width and alignment, otherwise the result is undefined and vendor
+ * drivers should return an error.
+ *
+ * device_state: (read/write)
+ *      - User application writes this field to inform vendor driver about the
+ *        device state to be transitioned to.
+ *      - Vendor driver should take necessary actions to change device state.
+ *        On successful transition to given state, vendor driver should return
+ *        success on write(device_state, state) system call. If device state
+ *        transition fails, vendor driver should return error, -EFAULT.
+ *      - On user application side, if device state transition fails, i.e. if
+ *        write(device_state, state) returns error, read device_state again to
+ *        determine the current state of the device from vendor driver.
+ *      - Vendor driver should return previous state of the device unless vendor
+ *        driver has encountered an internal error, in which case vendor driver
+ *        may report the device_state VFIO_DEVICE_STATE_ERROR.
+ *	- User application must use the device reset ioctl in order to recover
+ *	  the device from VFIO_DEVICE_STATE_ERROR state. If the device is
+ *	  indicated in a valid device state via reading device_state, the user
+ *	  application may decide to attempt to transition the device to any
+ *	  valid state reachable from the current state or terminate itself.
+ *
+ *      device_state consists of 3 bits:
+ *      - If bit 0 set, indicates _RUNNING state. When it's clear, that
+ *	  indicates _STOP state. When device is changed to _STOP, driver should
+ *	  stop device before write() returns.
+ *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
+ *        should start gathering device state information which will be provided
+ *        to VFIO user application to save device's state.
+ *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
+ *        prepare to resume device, data provided through migration region
+ *        should be used to resume device.
+ *      Bits 3 - 31 are reserved for future use. In order to preserve them,
+ *	user application should perform read-modify-write operation on this
+ *	field when modifying the specified bits.
+ *
+ *  +------- _RESUMING
+ *  |+------ _SAVING
+ *  ||+----- _RUNNING
+ *  |||
+ *  000b => Device Stopped, not saving or resuming
+ *  001b => Device running state, default state
+ *  010b => Stop Device & save device state, stop-and-copy state
+ *  011b => Device running and save device state, pre-copy state
+ *  100b => Device stopped and device state is resuming
+ *  101b => Invalid state
+ *  110b => Error state
+ *  111b => Invalid state
+ *
+ * State transitions:
+ *
+ *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
+ *                (100b)     (001b)     (011b)        (010b)       (000b)
+ * 0. Running or Default state
+ *                             |
+ *
+ * 1. Normal Shutdown (optional)
+ *                             |------------------------------------->|
+ *
+ * 2. Save state or Suspend
+ *                             |------------------------->|---------->|
+ *
+ * 3. Save state during live migration
+ *                             |----------->|------------>|---------->|
+ *
+ * 4. Resuming
+ *                  |<---------|
+ *
+ * 5. Resumed
+ *                  |--------->|
+ *
+ * 0. Default state of VFIO device is _RUNNING when user application starts.
+ * 1. During normal user application shutdown, vfio device state changes
+ *    from _RUNNING to _STOP. This is optional, user application may or may not
+ *    perform this state transition and vendor driver may not need it.
+ * 2. When user application saves state or suspends, device state
+ *    transitions from _RUNNING to stop-and-copy state and then to _STOP.
+ *    On state transition from _RUNNING to stop-and-copy, driver must
+ *    stop device, save device state and send it to application through
+ *    migration region. Sequence to be followed for such transition is given
+ *    below.
+ * 3. In user application live migration, state transitions from _RUNNING
+ *    to pre-copy to stop-and-copy to _STOP.
+ *    On state transition from _RUNNING to pre-copy, driver should start
+ *    gathering device state while application is still running and send device
+ *    state data to application through migration region.
+ *    On state transition from pre-copy to stop-and-copy, driver must stop
+ *    device, save device state and send it to user application through
+ *    migration region.
+ *    Sequence to be followed for above two transitions is given below.
+ * 4. To start resuming phase, device state should be transitioned from
+ *    _RUNNING to _RESUMING state.
+ *    In _RESUMING state, driver should use received device state data through
+ *    migration region to resume device.
+ * 5. On providing saved device data to driver, application should change state
+ *    from _RESUMING to _RUNNING.
+ *
+ * pending_bytes: (read only)
+ *      Number of pending bytes yet to be migrated from vendor driver
+ *
+ * data_offset: (read only)
+ *      User application should read data_offset in migration region from where
+ *      user application should read device data during _SAVING state or write
+ *      device data during _RESUMING state. See below for detail of sequence to
+ *      be followed.
+ *
+ * data_size: (read/write)
+ *      User application should read data_size to get size of data copied in
+ *      bytes in migration region during _SAVING state and write size of data
+ *      copied in bytes in migration region during _RESUMING state.
+ *
+ * Migration region looks like:
+ *  ------------------------------------------------------------------
+ * |vfio_device_migration_info|    data section                      |
+ * |                          |     ///////////////////////////////  |
+ * ------------------------------------------------------------------
+ *   ^                              ^
+ *  offset 0-trapped part        data_offset
+ *
+ * Structure vfio_device_migration_info is always followed by data section in
+ * the region, so data_offset will always be non-0. Offset from where data is
+ * copied is decided by kernel driver, data section can be trapped or mapped
+ * or partitioned, depending on how kernel driver defines data section.
+ * Data section partition can be defined as mapped by sparse mmap capability.
+ * If mmapped, then data_offset should be page aligned, whereas initial section
+ * which contains vfio_device_migration_info structure might not end at offset
+ * which is page aligned. The user is not required to access via mmap regardless
+ * of the region mmap capabilities.
+ * Vendor driver should decide whether to partition data section and how to
+ * partition the data section. Vendor driver should return data_offset
+ * accordingly.
+ *
+ * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
+ * and for _SAVING device state or stop-and-copy phase:
+ * a. read pending_bytes, indicates start of new iteration to get device data.
+ *    Repeated read on pending_bytes at this stage should have no side
+ *    effect.
+ *    If pending_bytes == 0, user application should not iterate to get data
+ *    for that device.
+ *    If pending_bytes > 0, go through below steps.
+ * b. read data_offset, telling vendor driver to make data available through
+ *    data section. Vendor driver should return from this read only after
+ *    data is available from (region + data_offset) to (region + data_offset +
+ *    data_size).
+ * c. read data_size, amount of data in bytes available through migration
+ *    region.
+ *    Read on data_offset and data_size should return offset and size of current
+ *    buffer if user application reads those more than once here.
+ * d. read data of data_size bytes from (region + data_offset) from migration
+ *    region.
+ * e. process data.
+ * f. read pending_bytes, this read operation indicates data from previous
+ *    iteration has been read. If pending_bytes > 0, goto step b.
+ *
+ * If there is any error during the above sequence, vendor driver can return
+ * error code for next read()/write() operation, that will terminate the loop
+ * and user should take next necessary action, for example, fail migration or
+ * terminate user application.
+ *
+ * User application can transition from _SAVING|_RUNNING (pre-copy state) to
+ * _SAVING (stop-and-copy) state regardless of pending bytes.
+ * User application should iterate in _SAVING (stop-and-copy) until
+ * pending_bytes is 0.
+ *
+ * Sequence to be followed while _RESUMING device state:
+ * While data for this device is available, repeat below steps:
+ * a. read data_offset from where user application should write data.
+ * b. write data of data_size to migration region from data_offset. Data size
+ *    should be data packet size at source during _SAVING.
+ * c. write data_size which indicates to vendor driver that data is written in
+ *    migration region. Vendor driver should read this data from migration
+ *    region and resume device's state.
+ *
+ * For user application, data is opaque. User application should write data in
+ * the same order as received and with the same transaction size as at source.
+ */
+
+struct vfio_device_migration_info {
+	__u32 device_state;         /* VFIO device state */
+#define VFIO_DEVICE_STATE_STOP      (0)
+#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
+#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
+#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
+#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
+				     VFIO_DEVICE_STATE_SAVING |  \
+				     VFIO_DEVICE_STATE_RESUMING)
+
+#define VFIO_DEVICE_STATE_VALID(state) \
+	(state & VFIO_DEVICE_STATE_RESUMING ? \
+	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
+
+#define VFIO_DEVICE_STATE_ERROR			\
+		(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING)
+
+	__u32 reserved;
+	__u64 pending_bytes;
+	__u64 data_offset;
+	__u64 data_size;
+} __attribute__((packed));
+
 /*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within
-- 
2.7.0
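
Editor's note, not part of the patch: the device_state encoding documented
above can be exercised in plain user space. The sketch below copies the bit
values from the proposed struct vfio_device_migration_info and mirrors the
logic of VFIO_DEVICE_STATE_VALID. The STATE_* names and the state_valid() and
state_name() helpers are hypothetical names chosen for this illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit values copied from the proposed struct vfio_device_migration_info. */
#define STATE_STOP     0u
#define STATE_RUNNING  (1u << 0)
#define STATE_SAVING   (1u << 1)
#define STATE_RESUMING (1u << 2)
#define STATE_MASK     (STATE_RUNNING | STATE_SAVING | STATE_RESUMING)
#define STATE_ERROR    (STATE_SAVING | STATE_RESUMING)

static_assert(STATE_MASK == 7u, "device_state uses the low three bits");

/* Same decision as VFIO_DEVICE_STATE_VALID: _RESUMING must be set alone. */
static bool state_valid(uint32_t state)
{
	if (state & STATE_RESUMING)
		return (state & STATE_MASK) == STATE_RESUMING;
	return true;
}

/* Hypothetical helper: name of the state encoded in the low three bits. */
static const char *state_name(uint32_t state)
{
	switch (state & STATE_MASK) {
	case STATE_STOP:                   return "_STOP";
	case STATE_RUNNING:                return "_RUNNING";
	case STATE_SAVING:                 return "stop-and-copy";
	case STATE_RUNNING | STATE_SAVING: return "pre-copy";
	case STATE_RESUMING:               return "_RESUMING";
	case STATE_ERROR:                  return "error";
	default:                           return "invalid";
	}
}
```

Note that, as the macro is written, the error state 110b is also rejected by
the validity check, so a caller that wants to tell "error" from "invalid" has
to compare against VFIO_DEVICE_STATE_ERROR separately.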


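Editor's note, not part of the patch: the save sequence (steps a to f in the
comment above) can be sketched against an in-memory mock. Everything here is
an assumption made for illustration: struct mock_region stands in for a
vendor driver's migration region, and the read_*() helpers stand in for
pread() calls on the pending_bytes, data_offset and data_size fields. A real
user application would read those fields from the device fd instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory stand-in for a vendor driver's migration region. */
struct mock_region {
	const uint8_t *src;   /* device state not yet handed to the user */
	uint64_t remaining;   /* value a pending_bytes read would return */
	uint64_t data_size;   /* size of the buffer currently staged */
	uint8_t data[64];     /* data section */
};

/* Steps a and f: a read also retires the buffer consumed last iteration. */
static uint64_t read_pending_bytes(struct mock_region *r)
{
	r->src += r->data_size;
	r->remaining -= r->data_size;
	r->data_size = 0;
	return r->remaining;
}

/* Step b: stage the next buffer; re-reading returns the same buffer. */
static uint64_t read_data_offset(struct mock_region *r)
{
	uint64_t n = r->remaining < sizeof(r->data) ? r->remaining
						     : sizeof(r->data);

	memcpy(r->data, r->src, n);
	r->data_size = n;
	return offsetof(struct mock_region, data);
}

/* Step c: size of the staged buffer. */
static uint64_t read_data_size(const struct mock_region *r)
{
	return r->data_size;
}

/* Steps a to f; returns bytes saved, or -1 if 'out' is too small. */
static long save_device_state(struct mock_region *r, uint8_t *out, size_t cap)
{
	size_t total = 0;
	uint64_t pending = read_pending_bytes(r);                    /* a */

	while (pending > 0) {
		uint64_t off = read_data_offset(r);                  /* b */
		uint64_t size = read_data_size(r);                   /* c */

		assert(size > 0 && size <= pending);
		if (size > cap - total)
			return -1;
		memcpy(out + total, (const uint8_t *)r + off, size); /* d */
		total += size;                                       /* e */
		pending = read_pending_bytes(r);                     /* f */
	}
	return (long)total;
}
```

In the mock, a second read of pending_bytes without an intervening
data_offset read changes nothing, which matches the "no side effect"
requirement in step a.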

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v12 Kernel 2/7] vfio iommu: Remove atomicity of ref_count of pinned pages
  2020-02-07 19:42 ` Kirti Wankhede
@ 2020-02-07 19:42   ` Kirti Wankhede
  -1 siblings, 0 replies; 62+ messages in thread
From: Kirti Wankhede @ 2020-02-07 19:42 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

vfio_pfn.ref_count is always updated while holding iommu->lock, so using an
atomic variable is overkill.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 drivers/vfio/vfio_iommu_type1.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index a177bf2c6683..d386461e5d11 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -111,7 +111,7 @@ struct vfio_pfn {
 	struct rb_node		node;
 	dma_addr_t		iova;		/* Device address */
 	unsigned long		pfn;		/* Host pfn */
-	atomic_t		ref_count;
+	unsigned int		ref_count;
 };
 
 struct vfio_regions {
@@ -232,7 +232,7 @@ static int vfio_add_to_pfn_list(struct vfio_dma *dma, dma_addr_t iova,
 
 	vpfn->iova = iova;
 	vpfn->pfn = pfn;
-	atomic_set(&vpfn->ref_count, 1);
+	vpfn->ref_count = 1;
 	vfio_link_pfn(dma, vpfn);
 	return 0;
 }
@@ -250,7 +250,7 @@ static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
 	struct vfio_pfn *vpfn = vfio_find_vpfn(dma, iova);
 
 	if (vpfn)
-		atomic_inc(&vpfn->ref_count);
+		vpfn->ref_count++;
 	return vpfn;
 }
 
@@ -258,7 +258,8 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
 {
 	int ret = 0;
 
-	if (atomic_dec_and_test(&vpfn->ref_count)) {
+	vpfn->ref_count--;
+	if (!vpfn->ref_count) {
 		ret = put_pfn(vpfn->pfn, dma->prot);
 		vfio_remove_from_pfn_list(dma, vpfn);
 	}
-- 
2.7.0
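
Editor's note, not part of the patch: the argument of this change is that a
counter only ever touched under one lock does not need atomic operations.
The single-threaded sketch below illustrates that invariant. struct mock_lock
is a hypothetical stand-in for iommu->lock (it only records whether the lock
is held, in the spirit of a lock-held assertion), and vpfn_get() and
vpfn_put() are illustrative names, not functions from the driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for iommu->lock and struct vfio_pfn. */
struct mock_lock { bool held; };
struct mock_vpfn { unsigned int ref_count; };

static void mock_lock_acquire(struct mock_lock *l)
{
	assert(!l->held);
	l->held = true;
}

static void mock_lock_release(struct mock_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Take a reference; the caller must hold the lock. */
static void vpfn_get(struct mock_lock *l, struct mock_vpfn *v)
{
	assert(l->held);
	v->ref_count++;
}

/* Drop a reference; returns true when the last reference is gone. */
static bool vpfn_put(struct mock_lock *l, struct mock_vpfn *v)
{
	assert(l->held && v->ref_count > 0);
	return --v->ref_count == 0;
}
```

The decrement followed by a zero test in vpfn_put() corresponds to the two
lines that replace atomic_dec_and_test() in the diff above.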


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v12 Kernel 3/7] vfio iommu: Add ioctl definition for dirty pages tracking.
  2020-02-07 19:42 ` Kirti Wankhede
@ 2020-02-07 19:42   ` Kirti Wankhede
  -1 siblings, 0 replies; 62+ messages in thread
From: Kirti Wankhede @ 2020-02-07 19:42 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

IOMMU container maintains a list of all pages pinned by vfio_pin_pages API.
All pages pinned by vendor driver through this API should be considered as
dirty during migration. When container consists of IOMMU capable device and
all pages are pinned and mapped, then all pages are marked dirty.
Added support to start/stop unpinned pages tracking and to get bitmap of
all dirtied pages for requested IO virtual address range.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 include/uapi/linux/vfio.h | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 572242620ce9..b1b03c720749 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1002,6 +1002,48 @@ struct vfio_iommu_type1_dma_unmap {
 #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
 #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
 
+/**
+ * VFIO_IOMMU_DIRTY_PAGES - _IOWR(VFIO_TYPE, VFIO_BASE + 17,
+ *                                     struct vfio_iommu_type1_dirty_bitmap)
+ * IOCTL is used for dirty pages tracking. Caller sets argsz, which is size of
+ * struct vfio_iommu_type1_dirty_bitmap. Caller sets a flag depending on the
+ * operation to perform, as detailed below:
+ *
+ * When IOCTL is called with VFIO_IOMMU_DIRTY_PAGES_FLAG_START set, indicates
+ * migration is active and IOMMU module should track pages which are being
+ * unpinned. Unpinned pages are tracked until tracking is stopped by user
+ * application by setting VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP flag.
+ *
+ * When IOCTL is called with VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP set, indicates
+ * IOMMU should stop tracking unpinned pages and also free previously tracked
+ * unpinned pages data.
+ *
+ * When IOCTL is called with VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP flag set,
+ * IOCTL returns dirty pages bitmap for IOMMU container during migration for
+ * given IOVA range. User must allocate memory to get bitmap, zero the bitmap
+ * memory and set size of allocated memory in bitmap_size field. One bit is
+ * used to represent one page consecutively starting from iova offset. User
+ * should provide page size in 'pgsize'. Bit set in bitmap indicates page at
+ * that offset from iova is dirty.
+ *
+ * Only one flag should be set at a time.
+ *
+ */
+struct vfio_iommu_type1_dirty_bitmap {
+	__u32        argsz;
+	__u32        flags;
+#define VFIO_IOMMU_DIRTY_PAGES_FLAG_START	(1 << 0)
+#define VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP	(1 << 1)
+#define VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP	(1 << 2)
+	__u64        iova;		/* IO virtual address */
+	__u64        size;		/* Size of iova range */
+	__u64	     pgsize;		/* page size for bitmap */
+	__u64        bitmap_size;	/* in bytes */
+	void __user *bitmap;		/* one bit per page */
+};
+
+#define VFIO_IOMMU_DIRTY_PAGES             _IO(VFIO_TYPE, VFIO_BASE + 17)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.7.0
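
Editor's note, not part of the patch: for the GET_BITMAP operation the caller
has to size and zero a bitmap with one bit per page, then interpret the bits
that come back. The sketch below shows that arithmetic only. Both helper
names are hypothetical, the rounding to a multiple of 64 bits is an
assumption that mirrors the kernel-side ALIGN to BITS_PER_LONG on a 64-bit
host, and the bit numbering (bit n lives in byte n / 8, position n % 8)
assumes a little-endian host.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for one bit per page, rounded up to a multiple of 64 bits. */
static uint64_t dirty_bitmap_size(uint64_t size, uint64_t pgsize)
{
	uint64_t npages;

	assert(pgsize && (pgsize & (pgsize - 1)) == 0);   /* power of two */
	npages = size / pgsize;
	return (npages + 63) / 64 * 8;
}

/* True if the page containing 'addr' is marked dirty in 'bitmap'. */
static bool page_is_dirty(const uint8_t *bitmap, uint64_t iova,
			  uint64_t pgsize, uint64_t addr)
{
	uint64_t bit;

	assert(addr >= iova);
	bit = (addr - iova) / pgsize;
	return (bitmap[bit / 8] & (1u << (bit % 8))) != 0;
}
```

For a 1 MiB range with 4 KiB pages this gives 256 bits, so bitmap_size would
be 32 bytes.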


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2020-02-07 19:42 ` Kirti Wankhede
@ 2020-02-07 19:42   ` Kirti Wankhede
  -1 siblings, 0 replies; 62+ messages in thread
From: Kirti Wankhede @ 2020-02-07 19:42 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
- Start pinned and unpinned pages tracking while migration is active.
- Stop pinned and unpinned dirty pages tracking. This is also used to
  stop dirty pages tracking if migration fails or is cancelled.
- Get dirty pages bitmap. This ioctl returns the bitmap of dirty pages; it
  is the user space application's responsibility to copy the content of
  dirty pages from source to destination during migration.

To prevent DoS attack, memory for bitmap is allocated per vfio_dma
structure. Bitmap size is calculated considering smallest supported page
size. Bitmap is allocated when dirty logging is enabled for those
vfio_dmas whose vpfn list is not empty or whole range is mapped, in
case of pass-through device.

There are multiple options as to when the bitmap could be populated:
* Populate the bitmap for already pinned pages when the bitmap is allocated
  for a vfio_dma, using the smallest supported page size. Update the bitmap
  from the page pinning and unpinning functions. When the user application
  queries the bitmap, check whether the requested page size is the same as
  the page size used to populate the bitmap. If it is equal, copy the
  bitmap. If not, re-populate the bitmap according to the requested page
  size and then copy it to the user.
  Pros: Bitmap gets populated on the fly after dirty tracking has
        started.
  Cons: If the requested page size differs from the smallest supported
        page size, the bitmap has to be re-populated, with the
        additional overhead of allocating bitmap memory again for
        the re-population.

* Populate the bitmap when it is queried by the user application.
  Pros: Bitmap is populated with the requested page size. This eliminates
        the need to re-populate the bitmap if the requested page size
        differs from the smallest supported page size.
  Cons: There is a one-time processing cost when the bitmap is queried.

I prefer the latter option for its simpler logic and because it eliminates
the overhead of bitmap repopulation in case of different page sizes. The
latter option is implemented in this patch.
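
Editor's note, not part of the patch: the "populate when queried" option can
be illustrated in user space. In the sketch below the bitmap is built only at
query time, directly at the requested page size, from a list of pinned
addresses. The function name, the flat array of pinned IOVAs (the driver
keeps an rb-tree of vfio_pfn entries) and the byte-wise little-endian bit
layout are all assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Set one bit per requested-size page that contains a pinned address.
 * 'pinned' holds pinned IOVAs; 'bitmap' must hold (size / pgsize) bits.
 */
static void populate_dirty_bitmap(const uint64_t *pinned, size_t npinned,
				  uint64_t iova, uint64_t size,
				  uint64_t pgsize, uint8_t *bitmap)
{
	uint64_t npages;
	size_t i;

	assert(pgsize && (pgsize & (pgsize - 1)) == 0);   /* power of two */
	npages = size / pgsize;
	memset(bitmap, 0, (npages + 7) / 8);

	for (i = 0; i < npinned; i++) {
		uint64_t bit;

		if (pinned[i] < iova || pinned[i] - iova >= size)
			continue;       /* outside the queried range */
		bit = (pinned[i] - iova) / pgsize;
		bitmap[bit / 8] |= (uint8_t)(1u << (bit % 8));
	}
}
```

Because the page size is an input to the query, the same pinned list yields a
4 KiB or an 8 KiB granularity bitmap without any stored bitmap having to be
converted.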

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 drivers/vfio/vfio_iommu_type1.c | 299 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 287 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index d386461e5d11..df358dc1c85b 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -70,6 +70,7 @@ struct vfio_iommu {
 	unsigned int		dma_avail;
 	bool			v2;
 	bool			nesting;
+	bool			dirty_page_tracking;
 };
 
 struct vfio_domain {
@@ -90,6 +91,7 @@ struct vfio_dma {
 	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
 	struct task_struct	*task;
 	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
+	unsigned long		*bitmap;
 };
 
 struct vfio_group {
@@ -125,6 +127,7 @@ struct vfio_regions {
 					(!list_empty(&iommu->domain_list))
 
 static int put_pfn(unsigned long pfn, int prot);
+static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu);
 
 /*
  * This code handles mapping and unmapping of user data buffers
@@ -174,6 +177,57 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
 	rb_erase(&old->node, &iommu->dma_list);
 }
 
+static inline unsigned long dirty_bitmap_bytes(unsigned int npages)
+{
+	if (!npages)
+		return 0;
+
+	return ALIGN(npages, BITS_PER_LONG) / BITS_PER_BYTE;
+}
+
+static int vfio_dma_bitmap_alloc(struct vfio_iommu *iommu,
+				 struct vfio_dma *dma, unsigned long pgsizes)
+{
+	unsigned long pgshift = __ffs(pgsizes);
+
+	if (!RB_EMPTY_ROOT(&dma->pfn_list) || dma->iommu_mapped) {
+		unsigned long npages = dma->size >> pgshift;
+		unsigned long bsize = dirty_bitmap_bytes(npages);
+
+		dma->bitmap = kvzalloc(bsize, GFP_KERNEL);
+		if (!dma->bitmap)
+			return -ENOMEM;
+	}
+	return 0;
+}
+
+static int vfio_dma_all_bitmap_alloc(struct vfio_iommu *iommu,
+				     unsigned long pgsizes)
+{
+	struct rb_node *n = rb_first(&iommu->dma_list);
+	int ret;
+
+	for (; n; n = rb_next(n)) {
+		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
+
+		ret = vfio_dma_bitmap_alloc(iommu, dma, pgsizes);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
+static void vfio_dma_all_bitmap_free(struct vfio_iommu *iommu)
+{
+	struct rb_node *n = rb_first(&iommu->dma_list);
+
+	for (; n; n = rb_next(n)) {
+		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
+
+		kvfree(dma->bitmap);
+	}
+}
+
 /*
  * Helper Functions for host iova-pfn list
  */
@@ -244,6 +298,29 @@ static void vfio_remove_from_pfn_list(struct vfio_dma *dma,
 	kfree(vpfn);
 }
 
+static void vfio_remove_unpinned_from_pfn_list(struct vfio_dma *dma)
+{
+	struct rb_node *n = rb_first(&dma->pfn_list);
+
+	for (; n; n = rb_next(n)) {
+		struct vfio_pfn *vpfn = rb_entry(n, struct vfio_pfn, node);
+
+		if (!vpfn->ref_count)
+			vfio_remove_from_pfn_list(dma, vpfn);
+	}
+}
+
+static void vfio_remove_unpinned_from_dma_list(struct vfio_iommu *iommu)
+{
+	struct rb_node *n = rb_first(&iommu->dma_list);
+
+	for (; n; n = rb_next(n)) {
+		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
+
+		vfio_remove_unpinned_from_pfn_list(dma);
+	}
+}
+
 static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
 					       unsigned long iova)
 {
@@ -261,7 +338,8 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
 	vpfn->ref_count--;
 	if (!vpfn->ref_count) {
 		ret = put_pfn(vpfn->pfn, dma->prot);
-		vfio_remove_from_pfn_list(dma, vpfn);
+		if (!dma->bitmap)
+			vfio_remove_from_pfn_list(dma, vpfn);
 	}
 	return ret;
 }
@@ -483,13 +561,14 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
 	return ret;
 }
 
-static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
+static int vfio_unpin_page_external(struct vfio_iommu *iommu,
+				    struct vfio_dma *dma, dma_addr_t iova,
 				    bool do_accounting)
 {
 	int unlocked;
 	struct vfio_pfn *vpfn = vfio_find_vpfn(dma, iova);
 
-	if (!vpfn)
+	if (!vpfn || !vpfn->ref_count)
 		return 0;
 
 	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn);
@@ -510,6 +589,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 	unsigned long remote_vaddr;
 	struct vfio_dma *dma;
 	bool do_accounting;
+	unsigned long iommu_pgsizes = vfio_pgsize_bitmap(iommu);
 
 	if (!iommu || !user_pfn || !phys_pfn)
 		return -EINVAL;
@@ -551,8 +631,10 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 
 		vpfn = vfio_iova_get_vfio_pfn(dma, iova);
 		if (vpfn) {
-			phys_pfn[i] = vpfn->pfn;
-			continue;
+			if (vpfn->ref_count > 1) {
+				phys_pfn[i] = vpfn->pfn;
+				continue;
+			}
 		}
 
 		remote_vaddr = dma->vaddr + iova - dma->iova;
@@ -560,11 +642,23 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 					     do_accounting);
 		if (ret)
 			goto pin_unwind;
-
-		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
-		if (ret) {
-			vfio_unpin_page_external(dma, iova, do_accounting);
-			goto pin_unwind;
+		if (!vpfn) {
+			ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
+			if (ret) {
+				vfio_unpin_page_external(iommu, dma, iova,
+							 do_accounting);
+				goto pin_unwind;
+			}
+		} else
+			vpfn->pfn = phys_pfn[i];
+
+		if (iommu->dirty_page_tracking && !dma->bitmap) {
+			ret = vfio_dma_bitmap_alloc(iommu, dma, iommu_pgsizes);
+			if (ret) {
+				vfio_unpin_page_external(iommu, dma, iova,
+							 do_accounting);
+				goto pin_unwind;
+			}
 		}
 	}
 
@@ -578,7 +672,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 
 		iova = user_pfn[j] << PAGE_SHIFT;
 		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
-		vfio_unpin_page_external(dma, iova, do_accounting);
+		vfio_unpin_page_external(iommu, dma, iova, do_accounting);
 		phys_pfn[j] = 0;
 	}
 pin_done:
@@ -612,7 +706,7 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
 		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
 		if (!dma)
 			goto unpin_exit;
-		vfio_unpin_page_external(dma, iova, do_accounting);
+		vfio_unpin_page_external(iommu, dma, iova, do_accounting);
 	}
 
 unpin_exit:
@@ -830,6 +924,113 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
 	return bitmap;
 }
 
+static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
+				  size_t size, uint64_t pgsize,
+				  unsigned char __user *bitmap)
+{
+	struct vfio_dma *dma;
+	dma_addr_t i = iova, iova_limit;
+	unsigned int bsize, nbits = 0, l = 0;
+	unsigned long pgshift = __ffs(pgsize);
+
+	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
+		int ret, j;
+		unsigned int npages = 0, shift = 0;
+		unsigned char temp = 0;
+
+		/* mark all pages dirty if all pages are pinned and mapped. */
+		if (dma->iommu_mapped) {
+			iova_limit = min(dma->iova + dma->size, iova + size);
+			npages = iova_limit/pgsize;
+			bitmap_set(dma->bitmap, 0, npages);
+		} else if (dma->bitmap) {
+			struct rb_node *n = rb_first(&dma->pfn_list);
+			bool found = false;
+
+			for (; n; n = rb_next(n)) {
+				struct vfio_pfn *vpfn = rb_entry(n,
+						struct vfio_pfn, node);
+				if (vpfn->iova >= i) {
+					found = true;
+					break;
+				}
+			}
+
+			if (!found) {
+				i += dma->size;
+				continue;
+			}
+
+			for (; n; n = rb_next(n)) {
+				unsigned int s;
+				struct vfio_pfn *vpfn = rb_entry(n,
+						struct vfio_pfn, node);
+
+				if (vpfn->iova >= iova + size)
+					break;
+
+				s = (vpfn->iova - dma->iova) >> pgshift;
+				bitmap_set(dma->bitmap, s, 1);
+
+				iova_limit = vpfn->iova + pgsize;
+			}
+			npages = iova_limit/pgsize;
+		}
+
+		bsize = dirty_bitmap_bytes(npages);
+		shift = nbits % BITS_PER_BYTE;
+
+		if (npages && shift) {
+			l--;
+			if (!access_ok((void __user *)bitmap + l,
+					sizeof(unsigned char)))
+				return -EINVAL;
+
+			ret = __get_user(temp, bitmap + l);
+			if (ret)
+				return ret;
+		}
+
+		for (j = 0; j < bsize; j++, l++) {
+			temp = temp |
+			       (*((unsigned char *)dma->bitmap + j) << shift);
+			if (!access_ok((void __user *)bitmap + l,
+					sizeof(unsigned char)))
+				return -EINVAL;
+
+			ret = __put_user(temp, bitmap + l);
+			if (ret)
+				return ret;
+			if (shift) {
+				temp = *((unsigned char *)dma->bitmap + j) >>
+					(BITS_PER_BYTE - shift);
+			}
+		}
+
+		nbits += npages;
+
+		i = min(dma->iova + dma->size, iova + size);
+		if (i >= iova + size)
+			break;
+	}
+	return 0;
+}
+
+static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
+{
+	long bsize;
+
+	if (!bitmap_size || bitmap_size > SIZE_MAX)
+		return -EINVAL;
+
+	bsize = dirty_bitmap_bytes(npages);
+
+	if (bitmap_size < bsize)
+		return -EINVAL;
+
+	return bsize;
+}
+
 static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 			     struct vfio_iommu_type1_dma_unmap *unmap)
 {
@@ -2277,6 +2478,80 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 
 		return copy_to_user((void __user *)arg, &unmap, minsz) ?
 			-EFAULT : 0;
+	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
+		struct vfio_iommu_type1_dirty_bitmap range;
+		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
+				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
+				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
+		int ret;
+
+		if (!iommu->v2)
+			return -EACCES;
+
+		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
+				    bitmap);
+
+		if (copy_from_user(&range, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (range.argsz < minsz || range.flags & ~mask)
+			return -EINVAL;
+
+		/* only one flag should be set at a time */
+		if (__ffs(range.flags) != __fls(range.flags))
+			return -EINVAL;
+
+		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
+			unsigned long iommu_pgsizes = vfio_pgsize_bitmap(iommu);
+
+			mutex_lock(&iommu->lock);
+			iommu->dirty_page_tracking = true;
+			ret = vfio_dma_all_bitmap_alloc(iommu, iommu_pgsizes);
+			mutex_unlock(&iommu->lock);
+			return ret;
+		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
+			mutex_lock(&iommu->lock);
+			iommu->dirty_page_tracking = false;
+			vfio_dma_all_bitmap_free(iommu);
+			vfio_remove_unpinned_from_dma_list(iommu);
+			mutex_unlock(&iommu->lock);
+			return 0;
+		} else if (range.flags &
+				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
+			long bsize;
+			unsigned long pgshift = __ffs(range.pgsize);
+			uint64_t iommu_pgsizes = vfio_pgsize_bitmap(iommu);
+			uint64_t iommu_pgmask =
+				 ((uint64_t)1 << __ffs(iommu_pgsizes)) - 1;
+
+			if ((range.pgsize & iommu_pgsizes) != range.pgsize)
+				return -EINVAL;
+			if (range.iova & iommu_pgmask)
+				return -EINVAL;
+			if (!range.size || range.size & iommu_pgmask)
+				return -EINVAL;
+			if (range.iova + range.size < range.iova)
+				return -EINVAL;
+			if (!access_ok((void __user *)range.bitmap,
+				       range.bitmap_size))
+				return -EINVAL;
+
+			bsize = verify_bitmap_size(range.size >> pgshift,
+						   range.bitmap_size);
+			if (bsize < 0)
+				return bsize;
+
+			mutex_lock(&iommu->lock);
+			if (iommu->dirty_page_tracking)
+				ret = vfio_iova_dirty_bitmap(iommu, range.iova,
+					 range.size, range.pgsize,
+					 (unsigned char __user *)range.bitmap);
+			else
+				ret = -EINVAL;
+			mutex_unlock(&iommu->lock);
+
+			return ret;
+		}
 	}
 
 	return -ENOTTY;
-- 
2.7.0
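
Editor's note, not part of the patch: the ioctl handler above rejects unknown
flags and requires that exactly one flag be set. The sketch below shows one
common way to express that check in portable C. It deliberately uses a
different technique from the patch (the power-of-two test flags & (flags - 1)
instead of comparing __ffs() with __fls()), and the FLAG_* names and
flags_valid() are hypothetical names for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same bit positions as the proposed VFIO_IOMMU_DIRTY_PAGES_FLAG_* values. */
#define FLAG_START      (1u << 0)
#define FLAG_STOP       (1u << 1)
#define FLAG_GET_BITMAP (1u << 2)
#define FLAG_MASK       (FLAG_START | FLAG_STOP | FLAG_GET_BITMAP)

static_assert(FLAG_MASK == 7u, "three operation flags are defined");

/* True only if 'flags' holds exactly one known flag. */
static bool flags_valid(uint32_t flags)
{
	if (flags & ~FLAG_MASK)
		return false;           /* unknown bit set */
	/* Non-zero and a power of two means exactly one bit is set. */
	return flags != 0 && (flags & (flags - 1)) == 0;
}
```

This form also gives a defined answer for flags == 0, a case in which the
result of a find-first-set style helper is generally not defined.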


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v12 Kernel 5/7] vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap
  2020-02-07 19:42 ` Kirti Wankhede
@ 2020-02-07 19:42   ` Kirti Wankhede
  -1 siblings, 0 replies; 62+ messages in thread
From: Kirti Wankhede @ 2020-02-07 19:42 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

Pages pinned through the external interface for a requested IO virtual
address range might get unpinned and unmapped while migration is active
and the device is still running, that is, in the pre-copy phase while the
guest driver can still access those pages. The host device can write to
these pages while they are mapped. Such pages should be marked dirty so
that, after migration, the guest driver is still able to complete the
operation.

To get the bitmap during unmap, the user should set the flag
VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP. The bitmap memory should be
allocated and zeroed by the user space application, which should also set
the bitmap size and page size.
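
As a rough illustration of the user-space side, the helpers below size a
one-bit-per-page bitmap for an unmap range and test a bit in the result.
This is only a sketch: the helper names are hypothetical, rounding the
allocation up to whole 64-bit words is an assumption, and the byte-wise
bit order (bit 0 of byte 0 is the page at iova) assumes a little-endian
host.

```c
#include <assert.h>
#include <stdint.h>

/* Pages of size pgsize needed to cover size bytes (hypothetical helper). */
static uint64_t range_npages(uint64_t size, uint64_t pgsize)
{
	/* The page size must be a non-zero power of two. */
	assert(pgsize && (pgsize & (pgsize - 1)) == 0);
	return size / pgsize + (size % pgsize != 0);
}

/* Bytes for one bit per page, rounded up to whole 64-bit words (assumption). */
static uint64_t bitmap_bytes(uint64_t npages)
{
	return ((npages + 63) / 64) * 8;
}

/*
 * Return 1 if the page containing addr is marked dirty.
 * Bit 0 corresponds to the page starting at iova; addr must be >= iova.
 */
static int page_is_dirty(const uint8_t *bitmap, uint64_t iova,
			 uint64_t pgsize, uint64_t addr)
{
	uint64_t bit = (addr - iova) / pgsize;

	return (bitmap[bit / 8] >> (bit % 8)) & 1;
}
```

A caller would zero a buffer of bitmap_bytes(range_npages(size, pgsize))
bytes, pass it with the unmap request, and then walk the range with
page_is_dirty() to decide which pages to copy.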

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 drivers/vfio/vfio_iommu_type1.c | 56 +++++++++++++++++++++++++++++++++++++----
 include/uapi/linux/vfio.h       | 12 +++++++++
 2 files changed, 63 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index df358dc1c85b..4e6ad0513932 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1032,7 +1032,8 @@ static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
 }
 
 static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
-			     struct vfio_iommu_type1_dma_unmap *unmap)
+			     struct vfio_iommu_type1_dma_unmap *unmap,
+			     unsigned long *bitmap)
 {
 	uint64_t mask;
 	struct vfio_dma *dma, *dma_last = NULL;
@@ -1107,6 +1108,15 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 		if (dma->task->mm != current->mm)
 			break;
 
+		if ((unmap->flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
+		    (dma_last != dma))
+			vfio_iova_dirty_bitmap(iommu, dma->iova, dma->size,
+					       unmap->bitmap_pgsize,
+					      (unsigned char __user *) bitmap);
+		else
+			vfio_remove_unpinned_from_pfn_list(dma);
+
+
 		if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
 			struct vfio_iommu_type1_dma_unmap nb_unmap;
 
@@ -1132,6 +1142,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 						    &nb_unmap);
 			goto again;
 		}
+
 		unmapped += dma->size;
 		vfio_remove_dma(iommu, dma);
 	}
@@ -2462,22 +2473,57 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 
 	} else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
 		struct vfio_iommu_type1_dma_unmap unmap;
-		long ret;
+		unsigned long *bitmap = NULL;
+		long ret, bsize;
 
 		minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
 
 		if (copy_from_user(&unmap, (void __user *)arg, minsz))
 			return -EFAULT;
 
-		if (unmap.argsz < minsz || unmap.flags)
+		if (unmap.argsz < minsz ||
+		    unmap.flags & ~VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)
 			return -EINVAL;
 
-		ret = vfio_dma_do_unmap(iommu, &unmap);
+		if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
+			unsigned long pgshift;
+			uint64_t iommu_pgsizes = vfio_pgsize_bitmap(iommu);
+			uint64_t iommu_pgmask =
+				 ((uint64_t)1 << __ffs(iommu_pgsizes)) - 1;
+
+			if (copy_from_user(&unmap, (void __user *)arg,
+					   sizeof(unmap)))
+				return -EFAULT;
+
+			pgshift = __ffs(unmap.bitmap_pgsize);
+
+			if (((unmap.bitmap_pgsize - 1) & iommu_pgmask) !=
+			     (unmap.bitmap_pgsize - 1))
+				return -EINVAL;
+
+			if ((unmap.bitmap_pgsize & iommu_pgsizes) !=
+			     unmap.bitmap_pgsize)
+				return -EINVAL;
+			if (unmap.iova + unmap.size < unmap.iova)
+				return -EINVAL;
+			if (!access_ok((void __user *)unmap.bitmap,
+				       unmap.bitmap_size))
+				return -EINVAL;
+
+			bsize = verify_bitmap_size(unmap.size >> pgshift,
+						   unmap.bitmap_size);
+			if (bsize < 0)
+				return bsize;
+			bitmap = unmap.bitmap;
+		}
+
+		ret = vfio_dma_do_unmap(iommu, &unmap, bitmap);
 		if (ret)
 			return ret;
 
-		return copy_to_user((void __user *)arg, &unmap, minsz) ?
+		ret = copy_to_user((void __user *)arg, &unmap, minsz) ?
 			-EFAULT : 0;
+		return ret;
 	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
 		struct vfio_iommu_type1_dirty_bitmap range;
 		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index b1b03c720749..a852e729b5a2 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -985,12 +985,24 @@ struct vfio_iommu_type1_dma_map {
  * field.  No guarantee is made to the user that arbitrary unmaps of iova
  * or size different from those used in the original mapping call will
  * succeed.
+ * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get the dirty bitmap
+ * before unmapping IO virtual addresses. When this flag is set, the user must
+ * allocate memory for the bitmap, zero it, and set the size of the allocated
+ * memory in the bitmap_size field. Each bit in the bitmap represents one page
+ * of the user-provided page size 'bitmap_pgsize', consecutively starting from
+ * the iova offset. A set bit indicates that the page at that offset from iova
+ * is dirty. The bitmap covering the pages in the unmapped range is returned
+ * in bitmap.
  */
 struct vfio_iommu_type1_dma_unmap {
 	__u32	argsz;
 	__u32	flags;
+#define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
 	__u64	iova;				/* IO virtual address */
 	__u64	size;				/* Size of mapping (bytes) */
+	__u64        bitmap_pgsize;		/* page size for bitmap */
+	__u64        bitmap_size;               /* in bytes */
+	void __user *bitmap;                    /* one bit per page */
 };
 
 #define VFIO_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
-- 
2.7.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v12 Kernel 6/7] vfio iommu: Adds flag to indicate dirty pages tracking capability support
  2020-02-07 19:42 ` Kirti Wankhede
@ 2020-02-07 19:42   ` Kirti Wankhede
  -1 siblings, 0 replies; 62+ messages in thread
From: Kirti Wankhede @ 2020-02-07 19:42 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

The flag VFIO_IOMMU_INFO_DIRTY_PGS in VFIO_IOMMU_GET_INFO indicates that
the driver supports dirty page tracking.
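
For illustration, a user-space check of this flag could look like the
sketch below. The struct and macro names are hypothetical stand-ins; the
bit values are copied from this patch, and installed <linux/vfio.h>
headers may not define the flag at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit values as proposed in this patch (stand-in names, not uapi). */
#define SKETCH_INFO_PGSIZES   (1u << 0)
#define SKETCH_INFO_CAPS      (1u << 1)
#define SKETCH_INFO_DIRTY_PGS (1u << 2)

/* Minimal stand-in for the leading fields of the GET_INFO result. */
struct sketch_iommu_info {
	uint32_t argsz;
	uint32_t flags;
	uint64_t iova_pgsizes;
};

/* Return 1 if the reported flags advertise dirty page tracking. */
static int iommu_supports_dirty_tracking(const struct sketch_iommu_info *info)
{
	assert(info != NULL);
	return (info->flags & SKETCH_INFO_DIRTY_PGS) != 0;
}
```

The caller would fill the structure from the VFIO_IOMMU_GET_INFO ioctl
and only issue dirty page requests when this check succeeds.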

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 drivers/vfio/vfio_iommu_type1.c | 3 ++-
 include/uapi/linux/vfio.h       | 5 +++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 4e6ad0513932..f748a3dbe9f9 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2426,7 +2426,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 			info.cap_offset = 0; /* output, no-recopy necessary */
 		}
 
-		info.flags = VFIO_IOMMU_INFO_PGSIZES;
+		info.flags = VFIO_IOMMU_INFO_PGSIZES |
+			     VFIO_IOMMU_INFO_DIRTY_PGS;
 
 		info.iova_pgsizes = vfio_pgsize_bitmap(iommu);
 
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index a852e729b5a2..8528e835541d 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -928,8 +928,9 @@ struct vfio_device_ioeventfd {
 struct vfio_iommu_type1_info {
 	__u32	argsz;
 	__u32	flags;
-#define VFIO_IOMMU_INFO_PGSIZES (1 << 0)	/* supported page sizes info */
-#define VFIO_IOMMU_INFO_CAPS	(1 << 1)	/* Info supports caps */
+#define VFIO_IOMMU_INFO_PGSIZES   (1 << 0) /* supported page sizes info */
+#define VFIO_IOMMU_INFO_CAPS      (1 << 1) /* Info supports caps */
+#define VFIO_IOMMU_INFO_DIRTY_PGS (1 << 2) /* supports dirty page tracking */
 	__u64	iova_pgsizes;	/* Bitmap of supported page sizes */
 	__u32   cap_offset;	/* Offset within info struct of first cap */
 };
-- 
2.7.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v12 Kernel 7/7] vfio: Selective dirty page tracking if IOMMU backed device pins pages
  2020-02-07 19:42 ` Kirti Wankhede
@ 2020-02-07 19:42   ` Kirti Wankhede
  -1 siblings, 0 replies; 62+ messages in thread
From: Kirti Wankhede @ 2020-02-07 19:42 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

Add a check so that only singleton IOMMU groups can pin pages.
From the point when the vendor driver pins any pages, consider the dirty
page scope of that IOMMU group to be limited to pinned pages.

To avoid walking the list often, add the flag pinned_page_dirty_scope,
which indicates whether the dirty page scope of every vfio_group in every
vfio_domain in the domain_list is limited to pinned pages. This flag is
updated on the first pin-pages request for an IOMMU group and when a
group is attached or detached.
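
The rule for the cached flag can be modelled as below. This is a
simplified sketch, not the code in this patch: the struct is a
hypothetical stand-in, the groups are kept in a flat array instead of
per-domain lists, and an empty container yields true only because the
check is vacuous (an assumption of this model).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-group state added by this patch. */
struct sketch_group {
	bool has_pinned_pages;
};

/*
 * The container-wide dirty scope is limited to pinned pages only if
 * every attached group has pinned pages at least once.
 */
static bool pinned_page_dirty_scope(const struct sketch_group *groups,
				    size_t ngroups)
{
	size_t i;

	assert(groups != NULL || ngroups == 0);

	for (i = 0; i < ngroups; i++) {
		if (!groups[i].has_pinned_pages)
			return false;
	}
	return true;
}
```

Caching the result means it only has to be recomputed on the events that
can change it: the first pin request of a group, and group attach or
detach.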

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 drivers/vfio/vfio.c             | 13 +++++++-
 drivers/vfio/vfio_iommu_type1.c | 72 +++++++++++++++++++++++++++++++++++++++--
 include/linux/vfio.h            |  4 ++-
 3 files changed, 84 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index c8482624ca34..a941c860b440 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -87,6 +87,7 @@ struct vfio_group {
 	bool				noiommu;
 	struct kvm			*kvm;
 	struct blocking_notifier_head	notifier;
+	bool				is_singleton;
 };
 
 struct vfio_device {
@@ -838,6 +839,12 @@ int vfio_add_group_dev(struct device *dev,
 		return PTR_ERR(device);
 	}
 
+	mutex_lock(&group->device_lock);
+	group->is_singleton = false;
+	if (list_is_singular(&group->device_list))
+		group->is_singleton = true;
+	mutex_unlock(&group->device_lock);
+
 	/*
 	 * Drop all but the vfio_device reference.  The vfio_device holds
 	 * a reference to the vfio_group, which holds a reference to the
@@ -1895,6 +1902,9 @@ int vfio_pin_pages(struct device *dev, unsigned long *user_pfn, int npage,
 	if (!group)
 		return -ENODEV;
 
+	if (!group->is_singleton)
+		return -EINVAL;
+
 	ret = vfio_group_add_container_user(group);
 	if (ret)
 		goto err_pin_pages;
@@ -1902,7 +1912,8 @@ int vfio_pin_pages(struct device *dev, unsigned long *user_pfn, int npage,
 	container = group->container;
 	driver = container->iommu_driver;
 	if (likely(driver && driver->ops->pin_pages))
-		ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
+		ret = driver->ops->pin_pages(container->iommu_data,
+					     group->iommu_group, user_pfn,
 					     npage, prot, phys_pfn);
 	else
 		ret = -ENOTTY;
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index f748a3dbe9f9..a787a2bcd757 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -71,6 +71,7 @@ struct vfio_iommu {
 	bool			v2;
 	bool			nesting;
 	bool			dirty_page_tracking;
+	bool			pinned_page_dirty_scope;
 };
 
 struct vfio_domain {
@@ -98,6 +99,7 @@ struct vfio_group {
 	struct iommu_group	*iommu_group;
 	struct list_head	next;
 	bool			mdev_group;	/* An mdev group */
+	bool			has_pinned_pages;
 };
 
 struct vfio_iova {
@@ -129,6 +131,10 @@ struct vfio_regions {
 static int put_pfn(unsigned long pfn, int prot);
 static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu);
 
+static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
+					       struct iommu_group *iommu_group);
+
+static void update_pinned_page_dirty_scope(struct vfio_iommu *iommu);
 /*
  * This code handles mapping and unmapping of user data buffers
  * into DMA'ble space using the IOMMU
@@ -580,11 +586,13 @@ static int vfio_unpin_page_external(struct vfio_iommu *iommu,
 }
 
 static int vfio_iommu_type1_pin_pages(void *iommu_data,
+				      struct iommu_group *iommu_group,
 				      unsigned long *user_pfn,
 				      int npage, int prot,
 				      unsigned long *phys_pfn)
 {
 	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_group *group;
 	int i, j, ret;
 	unsigned long remote_vaddr;
 	struct vfio_dma *dma;
@@ -661,8 +669,14 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 			}
 		}
 	}
-
 	ret = i;
+
+	group = vfio_iommu_find_iommu_group(iommu, iommu_group);
+	if (!group->has_pinned_pages) {
+		group->has_pinned_pages = true;
+		update_pinned_page_dirty_scope(iommu);
+	}
+
 	goto pin_done;
 
 pin_unwind:
@@ -938,8 +952,11 @@ static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
 		unsigned int npages = 0, shift = 0;
 		unsigned char temp = 0;
 
-		/* mark all pages dirty if all pages are pinned and mapped. */
-		if (dma->iommu_mapped) {
+		/*
+		 * mark all pages dirty if any IOMMU capable device is not able
+		 * to report dirty pages and all pages are pinned and mapped.
+		 */
+		if (!iommu->pinned_page_dirty_scope && dma->iommu_mapped) {
 			iova_limit = min(dma->iova + dma->size, iova + size);
 			npages = iova_limit/pgsize;
 			bitmap_set(dma->bitmap, 0, npages);
@@ -1479,6 +1496,51 @@ static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
 	return NULL;
 }
 
+static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
+					       struct iommu_group *iommu_group)
+{
+	struct vfio_domain *domain;
+	struct vfio_group *group = NULL;
+
+	list_for_each_entry(domain, &iommu->domain_list, next) {
+		group = find_iommu_group(domain, iommu_group);
+		if (group)
+			return group;
+	}
+
+	if (iommu->external_domain)
+		group = find_iommu_group(iommu->external_domain, iommu_group);
+
+	return group;
+}
+
+static void update_pinned_page_dirty_scope(struct vfio_iommu *iommu)
+{
+	struct vfio_domain *domain;
+	struct vfio_group *group;
+
+	list_for_each_entry(domain, &iommu->domain_list, next) {
+		list_for_each_entry(group, &domain->group_list, next) {
+			if (!group->has_pinned_pages) {
+				iommu->pinned_page_dirty_scope = false;
+				return;
+			}
+		}
+	}
+
+	if (iommu->external_domain) {
+		domain = iommu->external_domain;
+		list_for_each_entry(group, &domain->group_list, next) {
+			if (!group->has_pinned_pages) {
+				iommu->pinned_page_dirty_scope = false;
+				return;
+			}
+		}
+	}
+
+	iommu->pinned_page_dirty_scope = true;
+}
+
 static bool vfio_iommu_has_sw_msi(struct list_head *group_resv_regions,
 				  phys_addr_t *base)
 {
@@ -1885,6 +1947,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 
 			list_add(&group->next,
 				 &iommu->external_domain->group_list);
+			update_pinned_page_dirty_scope(iommu);
 			mutex_unlock(&iommu->lock);
 
 			return 0;
@@ -2007,6 +2070,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 done:
 	/* Delete the old one and insert new iova list */
 	vfio_iommu_iova_insert_copy(iommu, &iova_copy);
+	update_pinned_page_dirty_scope(iommu);
 	mutex_unlock(&iommu->lock);
 	vfio_iommu_resv_free(&group_resv_regions);
 
@@ -2021,6 +2085,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 out_free:
 	kfree(domain);
 	kfree(group);
+	update_pinned_page_dirty_scope(iommu);
 	mutex_unlock(&iommu->lock);
 	return ret;
 }
@@ -2225,6 +2290,7 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 		vfio_iommu_iova_free(&iova_copy);
 
 detach_group_done:
+	update_pinned_page_dirty_scope(iommu);
 	mutex_unlock(&iommu->lock);
 }
 
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index e42a711a2800..da29802d6276 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -72,7 +72,9 @@ struct vfio_iommu_driver_ops {
 					struct iommu_group *group);
 	void		(*detach_group)(void *iommu_data,
 					struct iommu_group *group);
-	int		(*pin_pages)(void *iommu_data, unsigned long *user_pfn,
+	int		(*pin_pages)(void *iommu_data,
+				     struct iommu_group *group,
+				     unsigned long *user_pfn,
 				     int npage, int prot,
 				     unsigned long *phys_pfn);
 	int		(*unpin_pages)(void *iommu_data,
-- 
2.7.0

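The flag handling described in the commit message can be modelled in
user space roughly as below. This is only a sketch: the scope_* types
and functions are hypothetical stand-ins for vfio_group, vfio_domain and
vfio_iommu, not the kernel code. The patch as posted dereferences the
result of the group lookup without a NULL check; the sketch rejects a
NULL group, which is an assumption about the desired behaviour.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct scope_group {
	bool has_pinned_pages;
};

struct scope_domain {
	struct scope_group *groups;
	size_t ngroups;
};

struct scope_iommu {
	struct scope_domain *domains;
	size_t ndomains;
	bool pinned_page_dirty_scope;
};

/* Recompute the cached flag: true only if every group has pinned pages. */
static void scope_update(struct scope_iommu *iommu)
{
	size_t d, g;

	for (d = 0; d < iommu->ndomains; d++) {
		for (g = 0; g < iommu->domains[d].ngroups; g++) {
			if (!iommu->domains[d].groups[g].has_pinned_pages) {
				iommu->pinned_page_dirty_scope = false;
				return;
			}
		}
	}
	iommu->pinned_page_dirty_scope = true;
}

/* Called on a pin request; group is NULL if the lookup found nothing. */
static int scope_note_pin(struct scope_iommu *iommu, struct scope_group *group)
{
	if (!group)
		return -1;	/* assumption: an unknown group is an error */

	/* Only the first pin for a group triggers the list walk. */
	if (!group->has_pinned_pages) {
		group->has_pinned_pages = true;
		scope_update(iommu);
	}
	return 0;
}

/* Mirrors the condition used when building the dirty bitmap. */
static bool scope_mark_all_dirty(const struct scope_iommu *iommu,
				 bool iommu_mapped)
{
	return !iommu->pinned_page_dirty_scope && iommu_mapped;
}
```

With two groups, the whole mapped range stops being reported as dirty
only after both groups have pinned at least once.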

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2020-02-07 19:42   ` Kirti Wankhede
@ 2020-02-10  9:49     ` Yan Zhao
  -1 siblings, 0 replies; 62+ messages in thread
From: Yan Zhao @ 2020-02-10  9:49 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng,
	Liu, Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

On Sat, Feb 08, 2020 at 03:42:31AM +0800, Kirti Wankhede wrote:
> VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
> - Start pinned and unpinned pages tracking while migration is active.
> - Stop pinned and unpinned dirty pages tracking. This is also used to
>   stop dirty pages tracking if migration fails or is cancelled.
> - Get dirty pages bitmap. This ioctl returns a bitmap of dirty pages; it
>   is the user space application's responsibility to copy the content of
>   dirty pages from source to destination during migration.
> 
> To prevent DoS attack, memory for bitmap is allocated per vfio_dma
> structure. Bitmap size is calculated considering smallest supported page
> size. Bitmap is allocated when dirty logging is enabled for those
> vfio_dmas whose vpfn list is not empty or whole range is mapped, in
> case of pass-through device.
> 
> There are multiple options as to when the bitmap should be populated:
> * Populate the bitmap for already pinned pages when the bitmap is
>   allocated for a vfio_dma with the smallest supported page size. Update
>   the bitmap from the page pinning and unpinning functions. When the user
>   application queries the bitmap, check whether the requested page size
>   is the same as the page size used to populate the bitmap. If it is,
>   copy the bitmap. If not, re-populate the bitmap according to the
>   requested page size and then copy it to the user.
>   Pros: Bitmap gets populated on the fly after dirty tracking has
>         started.
>   Cons: If the requested page size differs from the smallest supported
>         page size, the bitmap has to be re-populated, with the
>         additional overhead of allocating bitmap memory again for
>         the re-population.
> 
> * Populate the bitmap when it is queried by the user application.
>   Pros: Bitmap is populated with the requested page size. This eliminates
>         the need to re-populate the bitmap if the requested page size
>         differs from the smallest supported page size.
>   Cons: There is a one-time processing cost when the bitmap is queried.
> 
> I prefer the latter option for its simple logic and because it eliminates
> the overhead of bitmap re-population for different page sizes. The
> latter option is implemented in this patch.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 299 ++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 287 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index d386461e5d11..df358dc1c85b 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -70,6 +70,7 @@ struct vfio_iommu {
>  	unsigned int		dma_avail;
>  	bool			v2;
>  	bool			nesting;
> +	bool			dirty_page_tracking;
>  };
>  
>  struct vfio_domain {
> @@ -90,6 +91,7 @@ struct vfio_dma {
>  	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
>  	struct task_struct	*task;
>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
> +	unsigned long		*bitmap;
>  };
>  
>  struct vfio_group {
> @@ -125,6 +127,7 @@ struct vfio_regions {
>  					(!list_empty(&iommu->domain_list))
>  
>  static int put_pfn(unsigned long pfn, int prot);
> +static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu);
>  
>  /*
>   * This code handles mapping and unmapping of user data buffers
> @@ -174,6 +177,57 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
>  	rb_erase(&old->node, &iommu->dma_list);
>  }
>  
> +static inline unsigned long dirty_bitmap_bytes(unsigned int npages)
> +{
> +	if (!npages)
> +		return 0;
> +
> +	return ALIGN(npages, BITS_PER_LONG) / sizeof(unsigned long);
> +}
> +
> +static int vfio_dma_bitmap_alloc(struct vfio_iommu *iommu,
> +				 struct vfio_dma *dma, unsigned long pgsizes)
> +{
> +	unsigned long pgshift = __ffs(pgsizes);
> +
> +	if (!RB_EMPTY_ROOT(&dma->pfn_list) || dma->iommu_mapped) {
> +		unsigned long npages = dma->size >> pgshift;
> +		unsigned long bsize = dirty_bitmap_bytes(npages);
> +
> +		dma->bitmap = kvzalloc(bsize, GFP_KERNEL);
> +		if (!dma->bitmap)
> +			return -ENOMEM;
> +	}
> +	return 0;
> +}
> +
> +static int vfio_dma_all_bitmap_alloc(struct vfio_iommu *iommu,
> +				     unsigned long pgsizes)
> +{
> +	struct rb_node *n = rb_first(&iommu->dma_list);
> +	int ret;
> +
> +	for (; n; n = rb_next(n)) {
> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> +
> +		ret = vfio_dma_bitmap_alloc(iommu, dma, pgsizes);
> +		if (ret)
> +			return ret;
> +	}
> +	return 0;
> +}
> +
> +static void vfio_dma_all_bitmap_free(struct vfio_iommu *iommu)
> +{
> +	struct rb_node *n = rb_first(&iommu->dma_list);
> +
> +	for (; n; n = rb_next(n)) {
> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> +
> +		kfree(dma->bitmap);
> +	}
> +}
> +
>  /*
>   * Helper Functions for host iova-pfn list
>   */
> @@ -244,6 +298,29 @@ static void vfio_remove_from_pfn_list(struct vfio_dma *dma,
>  	kfree(vpfn);
>  }
>  
> +static void vfio_remove_unpinned_from_pfn_list(struct vfio_dma *dma)
> +{
> +	struct rb_node *n = rb_first(&dma->pfn_list);
> +
> +	for (; n; n = rb_next(n)) {
> +		struct vfio_pfn *vpfn = rb_entry(n, struct vfio_pfn, node);
> +
> +		if (!vpfn->ref_count)
> +			vfio_remove_from_pfn_list(dma, vpfn);
> +	}
> +}
> +
> +static void vfio_remove_unpinned_from_dma_list(struct vfio_iommu *iommu)
> +{
> +	struct rb_node *n = rb_first(&iommu->dma_list);
> +
> +	for (; n; n = rb_next(n)) {
> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> +
> +		vfio_remove_unpinned_from_pfn_list(dma);
> +	}
> +}
> +
>  static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
>  					       unsigned long iova)
>  {
> @@ -261,7 +338,8 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
>  	vpfn->ref_count--;
>  	if (!vpfn->ref_count) {
>  		ret = put_pfn(vpfn->pfn, dma->prot);
> -		vfio_remove_from_pfn_list(dma, vpfn);
> +		if (!dma->bitmap)
> +			vfio_remove_from_pfn_list(dma, vpfn);
>  	}
>  	return ret;
>  }
> @@ -483,13 +561,14 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
>  	return ret;
>  }
>  
> -static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
> +static int vfio_unpin_page_external(struct vfio_iommu *iommu,
> +				    struct vfio_dma *dma, dma_addr_t iova,
>  				    bool do_accounting)
>  {
>  	int unlocked;
>  	struct vfio_pfn *vpfn = vfio_find_vpfn(dma, iova);
>  
> -	if (!vpfn)
> +	if (!vpfn || !vpfn->ref_count)
>  		return 0;
>  
>  	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn);
> @@ -510,6 +589,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  	unsigned long remote_vaddr;
>  	struct vfio_dma *dma;
>  	bool do_accounting;
> +	unsigned long iommu_pgsizes = vfio_pgsize_bitmap(iommu);
>  
>  	if (!iommu || !user_pfn || !phys_pfn)
>  		return -EINVAL;
> @@ -551,8 +631,10 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  
>  		vpfn = vfio_iova_get_vfio_pfn(dma, iova);
>  		if (vpfn) {
> -			phys_pfn[i] = vpfn->pfn;
> -			continue;
> +			if (vpfn->ref_count > 1) {
> +				phys_pfn[i] = vpfn->pfn;
> +				continue;
> +			}
>  		}
>  
>  		remote_vaddr = dma->vaddr + iova - dma->iova;
> @@ -560,11 +642,23 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  					     do_accounting);
>  		if (ret)
>  			goto pin_unwind;
> -
> -		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> -		if (ret) {
> -			vfio_unpin_page_external(dma, iova, do_accounting);
> -			goto pin_unwind;
> +		if (!vpfn) {
> +			ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> +			if (ret) {
> +				vfio_unpin_page_external(iommu, dma, iova,
> +							 do_accounting);
> +				goto pin_unwind;
> +			}
> +		} else
> +			vpfn->pfn = phys_pfn[i];
> +
> +		if (iommu->dirty_page_tracking && !dma->bitmap) {
> +			ret = vfio_dma_bitmap_alloc(iommu, dma, iommu_pgsizes);
> +			if (ret) {
> +				vfio_unpin_page_external(iommu, dma, iova,
> +							 do_accounting);
> +				goto pin_unwind;
> +			}
>  		}
>  	}
>  
> @@ -578,7 +672,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  
>  		iova = user_pfn[j] << PAGE_SHIFT;
>  		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
> -		vfio_unpin_page_external(dma, iova, do_accounting);
> +		vfio_unpin_page_external(iommu, dma, iova, do_accounting);
>  		phys_pfn[j] = 0;
>  	}
>  pin_done:
> @@ -612,7 +706,7 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
>  		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
>  		if (!dma)
>  			goto unpin_exit;
> -		vfio_unpin_page_external(dma, iova, do_accounting);
> +		vfio_unpin_page_external(iommu, dma, iova, do_accounting);
>  	}
>  
>  unpin_exit:
> @@ -830,6 +924,113 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
>  	return bitmap;
>  }
>  
> +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> +				  size_t size, uint64_t pgsize,
> +				  unsigned char __user *bitmap)
> +{
> +	struct vfio_dma *dma;
> +	dma_addr_t i = iova, iova_limit;
> +	unsigned int bsize, nbits = 0, l = 0;
> +	unsigned long pgshift = __ffs(pgsize);
> +
> +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
> +		int ret, j;
> +		unsigned int npages = 0, shift = 0;
> +		unsigned char temp = 0;
> +
> +		/* mark all pages dirty if all pages are pinned and mapped. */
> +		if (dma->iommu_mapped) {
> +			iova_limit = min(dma->iova + dma->size, iova + size);
> +			npages = iova_limit/pgsize;
> +			bitmap_set(dma->bitmap, 0, npages);
For pass-through devices, it is not good to always report all pinned
pages as dirty. Could such a device also call vfio_pin_pages() to track
dirty pages, or is any other interface provided to do that?
> +		} else if (dma->bitmap) {
> +			struct rb_node *n = rb_first(&dma->pfn_list);
> +			bool found = false;
> +
> +			for (; n; n = rb_next(n)) {
> +				struct vfio_pfn *vpfn = rb_entry(n,
> +						struct vfio_pfn, node);
> +				if (vpfn->iova >= i) {
> +					found = true;
> +					break;
> +				}
> +			}
> +
> +			if (!found) {
> +				i += dma->size;
> +				continue;
> +			}
> +
> +			for (; n; n = rb_next(n)) {
> +				unsigned int s;
> +				struct vfio_pfn *vpfn = rb_entry(n,
> +						struct vfio_pfn, node);
> +
> +				if (vpfn->iova >= iova + size)
> +					break;
> +
> +				s = (vpfn->iova - dma->iova) >> pgshift;
> +				bitmap_set(dma->bitmap, s, 1);
> +
dma->bitmap should not be populated when user space asks for the dirty
bitmap. Instead, set the bits for all vpfns when dirty page tracking
starts, and clear them after copying the bitmap to user space. While
dirty page tracking is active, dma->bitmap would then be set whenever
vfio_pin_pages() is called.
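
A rough user-space model of the scheme suggested above, in case it helps
the discussion. Everything named trk_* is hypothetical and only stands
in for vfio_dma and its vpfn list. The sketch deliberately leaves open
whether pages that remain pinned after a query need to be reported again
in the next round.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TRK_NPAGES 128
#define TRK_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define TRK_NLONGS ((TRK_NPAGES + TRK_BITS_PER_LONG - 1) / TRK_BITS_PER_LONG)

/* Hypothetical stand-in for one vfio_dma range and its pinned pages. */
struct trk_dma {
	bool tracking;
	bool pinned[TRK_NPAGES];
	unsigned long bitmap[TRK_NLONGS];
};

static void trk_set_bit(unsigned long *map, size_t bit)
{
	map[bit / TRK_BITS_PER_LONG] |= 1UL << (bit % TRK_BITS_PER_LONG);
}

static bool trk_test_bit(const unsigned long *map, size_t bit)
{
	return (map[bit / TRK_BITS_PER_LONG] >> (bit % TRK_BITS_PER_LONG)) & 1UL;
}

/* Start tracking: every page that is already pinned starts out dirty. */
static void trk_start(struct trk_dma *dma)
{
	size_t i;

	memset(dma->bitmap, 0, sizeof(dma->bitmap));
	for (i = 0; i < TRK_NPAGES; i++)
		if (dma->pinned[i])
			trk_set_bit(dma->bitmap, i);
	dma->tracking = true;
}

/* Pin: record the page and, while tracking, mark it dirty right away. */
static void trk_pin(struct trk_dma *dma, size_t page)
{
	dma->pinned[page] = true;
	if (dma->tracking)
		trk_set_bit(dma->bitmap, page);
}

/* Query: hand the bitmap to the caller, then clear it for the next round. */
static void trk_get_and_clear(struct trk_dma *dma, unsigned long *out)
{
	memcpy(out, dma->bitmap, sizeof(dma->bitmap));
	memset(dma->bitmap, 0, sizeof(dma->bitmap));
}
```

The query path then only copies and clears, so its cost no longer
depends on walking the vpfn list.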
> +				iova_limit = vpfn->iova + pgsize;
> +			}
> +			npages = iova_limit/pgsize;
> +		}
> +
> +		bsize = dirty_bitmap_bytes(npages);
> +		shift = nbits % BITS_PER_BYTE;
> +
> +		if (npages && shift) {
> +			l--;
> +			if (!access_ok((void __user *)bitmap + l,
> +					sizeof(unsigned char)))
> +				return -EINVAL;
> +
> +			ret = __get_user(temp, bitmap + l);
> +			if (ret)
> +				return ret;
> +		}
> +
> +		for (j = 0; j < bsize; j++, l++) {
> +			temp = temp |
> +			       (*((unsigned char *)dma->bitmap + j) << shift);
> +			if (!access_ok((void __user *)bitmap + l,
> +					sizeof(unsigned char)))
> +				return -EINVAL;
> +
> +			ret = __put_user(temp, bitmap + l);
> +			if (ret)
> +				return ret;
> +			if (shift) {
> +				temp = *((unsigned char *)dma->bitmap + j) >>
> +					(BITS_PER_BYTE - shift);
> +			}
> +		}
> +
> +		nbits += npages;
> +
> +		i = min(dma->iova + dma->size, iova + size);
> +		if (i >= iova + size)
> +			break;
> +	}
> +	return 0;
> +}
> +
> +static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
> +{
> +	long bsize;
> +
> +	if (!bitmap_size || bitmap_size > SIZE_MAX)
> +		return -EINVAL;
> +
> +	bsize = dirty_bitmap_bytes(npages);
> +
> +	if (bitmap_size < bsize)
> +		return -EINVAL;
> +
> +	return bsize;
> +}
> +
>  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  			     struct vfio_iommu_type1_dma_unmap *unmap)
>  {
> @@ -2277,6 +2478,80 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  
>  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>  			-EFAULT : 0;
> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
> +		struct vfio_iommu_type1_dirty_bitmap range;
> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> +		int ret;
> +
> +		if (!iommu->v2)
> +			return -EACCES;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
> +				    bitmap);
> +
> +		if (copy_from_user(&range, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (range.argsz < minsz || range.flags & ~mask)
> +			return -EINVAL;
> +
> +		/* only one flag should be set at a time */
> +		if (__ffs(range.flags) != __fls(range.flags))
> +			return -EINVAL;
> +
> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
> +			unsigned long iommu_pgsizes = vfio_pgsize_bitmap(iommu);
> +
> +			mutex_lock(&iommu->lock);
> +			iommu->dirty_page_tracking = true;
iommu->dirty_page_tracking should be set to true only after the bitmap
allocation succeeds.
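
One possible shape for that ordering, sketched in user space. The
start_* names, the allocator hook and the failure-injecting helper are
all hypothetical; the sketch also assumes that bitmaps allocated before
a failure should be freed again, which the patch as posted does not do.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for vfio_dma and vfio_iommu. */
struct start_dma {
	size_t bitmap_bytes;
	unsigned long *bitmap;
};

struct start_iommu {
	struct start_dma *dmas;
	size_t ndmas;
	bool dirty_page_tracking;
};

/* Allocator hook so that an allocation failure can be injected. */
typedef void *(*start_alloc_fn)(size_t nmemb, size_t size);

/* Helper for experiments: succeeds start_allocs_left times, then fails. */
static int start_allocs_left;

static void *start_limited_calloc(size_t nmemb, size_t size)
{
	if (start_allocs_left <= 0)
		return NULL;
	start_allocs_left--;
	return calloc(nmemb, size);
}

/* Allocate every bitmap first; flip the flag only if all succeeded. */
static int start_tracking(struct start_iommu *iommu, start_alloc_fn alloc)
{
	size_t i;

	for (i = 0; i < iommu->ndmas; i++) {
		iommu->dmas[i].bitmap = alloc(1, iommu->dmas[i].bitmap_bytes);
		if (!iommu->dmas[i].bitmap)
			goto unwind;
	}
	iommu->dirty_page_tracking = true;
	return 0;

unwind:
	/* Free the bitmaps allocated before the failing one. */
	while (i--) {
		free(iommu->dmas[i].bitmap);
		iommu->dmas[i].bitmap = NULL;
	}
	return -ENOMEM;
}
```

On failure the caller sees -ENOMEM with the flag still false and no
bitmap left allocated, so a later start request begins from a clean
state.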
> +			ret = vfio_dma_all_bitmap_alloc(iommu, iommu_pgsizes);
> +			mutex_unlock(&iommu->lock);
> +			return ret;
> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
> +			mutex_lock(&iommu->lock);
> +			iommu->dirty_page_tracking = false;
> +			vfio_dma_all_bitmap_free(iommu);
> +			vfio_remove_unpinned_from_dma_list(iommu);
> +			mutex_unlock(&iommu->lock);
> +			return 0;
> +		} else if (range.flags &
> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
> +			long bsize;
> +			unsigned long pgshift = __ffs(range.pgsize);
> +			uint64_t iommu_pgsizes = vfio_pgsize_bitmap(iommu);
> +			uint64_t iommu_pgmask =
> +				 ((uint64_t)1 << __ffs(iommu_pgsizes)) - 1;
> +
> +			if ((range.pgsize & iommu_pgsizes) != range.pgsize)
> +				return -EINVAL;
> +			if (range.iova & iommu_pgmask)
> +				return -EINVAL;
> +			if (!range.size || range.size & iommu_pgmask)
> +				return -EINVAL;
> +			if (range.iova + range.size < range.iova)
> +				return -EINVAL;
> +			if (!access_ok((void __user *)range.bitmap,
> +				       range.bitmap_size))
> +				return -EINVAL;
> +
> +			bsize = verify_bitmap_size(range.size >> pgshift,
> +						   range.bitmap_size);
> +			if (bsize < 0)
> +				return bsize;
> +
> +			mutex_lock(&iommu->lock);
> +			if (iommu->dirty_page_tracking)
> +				ret = vfio_iova_dirty_bitmap(iommu, range.iova,
> +					 range.size, range.pgsize,
> +					 (unsigned char __user *)range.bitmap);
> +			else
> +				ret = -EINVAL;
> +			mutex_unlock(&iommu->lock);
> +
> +			return ret;
> +		}
>  	}
>  
>  	return -ENOTTY;
> -- 
> 2.7.0
> 
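
On the bitmap sizing mentioned in the commit message (one bit per page
of the smallest supported page size), a small user-space sketch of the
arithmetic follows. The bm_* names are hypothetical. The quoted
dirty_bitmap_bytes() divides by sizeof(unsigned long) rather than by the
number of bits per byte; the two happen to coincide where unsigned long
is 8 bytes, so the sketch spells the rounding out explicitly instead.

```c
#include <limits.h>
#include <stddef.h>

#define BM_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Bytes needed for npages bits, rounded up to a whole number of longs. */
static size_t bm_bytes(size_t npages)
{
	size_t nlongs = (npages + BM_BITS_PER_LONG - 1) / BM_BITS_PER_LONG;

	return nlongs * sizeof(unsigned long);
}

/*
 * Pages covered by a range of `size` bytes, counted at the smallest
 * page size present in the pgsizes mask (lowest set bit, like __ffs()).
 */
static size_t bm_npages(size_t size, unsigned long pgsizes)
{
	unsigned int shift = 0;

	if (!pgsizes)
		return 0;	/* no supported page size: nothing to track */

	while (!((pgsizes >> shift) & 1UL))
		shift++;

	return size >> shift;
}
```

For example, a 1 MiB range with 4 KiB as the smallest supported page
size covers 256 pages, so bm_bytes(256) is 32 bytes.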

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
@ 2020-02-10  9:49     ` Yan Zhao
  0 siblings, 0 replies; 62+ messages in thread
From: Yan Zhao @ 2020-02-10  9:49 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, kvm, eskultet, Yang,
	Ziye, qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Sat, Feb 08, 2020 at 03:42:31AM +0800, Kirti Wankhede wrote:
> VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
> - Start pinned and unpinned pages tracking while migration is active.
> - Stop pinned and unpinned dirty pages tracking. This is also used to
>   stop dirty pages tracking if migration fails or is cancelled.
> - Get dirty pages bitmap. This ioctl returns a bitmap of dirty pages; it
>   is the user space application's responsibility to copy the content of
>   dirty pages from source to destination during migration.
> 
> To prevent DoS attack, memory for bitmap is allocated per vfio_dma
> structure. Bitmap size is calculated considering smallest supported page
> size. Bitmap is allocated when dirty logging is enabled for those
> vfio_dmas whose vpfn list is not empty or whole range is mapped, in
> case of pass-through device.
> 
> There are multiple options as to when the bitmap should be populated:
> * Populate the bitmap for already pinned pages when the bitmap is
>   allocated for a vfio_dma with the smallest supported page size. Update
>   the bitmap from the page pinning and unpinning functions. When the user
>   application queries the bitmap, check whether the requested page size
>   is the same as the page size used to populate the bitmap. If it is,
>   copy the bitmap. If not, re-populate the bitmap according to the
>   requested page size and then copy it to the user.
>   Pros: Bitmap gets populated on the fly after dirty tracking has
>         started.
>   Cons: If the requested page size differs from the smallest supported
>         page size, the bitmap has to be re-populated, with the
>         additional overhead of allocating bitmap memory again for
>         the re-population.
> 
> * Populate the bitmap when it is queried by the user application.
>   Pros: Bitmap is populated with the requested page size. This eliminates
>         the need to re-populate the bitmap if the requested page size
>         differs from the smallest supported page size.
>   Cons: There is a one-time processing cost when the bitmap is queried.
> 
> I prefer the latter option for its simple logic and because it eliminates
> the overhead of bitmap re-population for different page sizes. The
> latter option is implemented in this patch.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 299 ++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 287 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index d386461e5d11..df358dc1c85b 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -70,6 +70,7 @@ struct vfio_iommu {
>  	unsigned int		dma_avail;
>  	bool			v2;
>  	bool			nesting;
> +	bool			dirty_page_tracking;
>  };
>  
>  struct vfio_domain {
> @@ -90,6 +91,7 @@ struct vfio_dma {
>  	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
>  	struct task_struct	*task;
>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
> +	unsigned long		*bitmap;
>  };
>  
>  struct vfio_group {
> @@ -125,6 +127,7 @@ struct vfio_regions {
>  					(!list_empty(&iommu->domain_list))
>  
>  static int put_pfn(unsigned long pfn, int prot);
> +static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu);
>  
>  /*
>   * This code handles mapping and unmapping of user data buffers
> @@ -174,6 +177,57 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
>  	rb_erase(&old->node, &iommu->dma_list);
>  }
>  
> +static inline unsigned long dirty_bitmap_bytes(unsigned int npages)
> +{
> +	if (!npages)
> +		return 0;
> +
> +	return ALIGN(npages, BITS_PER_LONG) / sizeof(unsigned long);
> +}
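A note on the helper above: dividing by sizeof(unsigned long) gives a byte
count only where sizeof(unsigned long) happens to equal BITS_PER_BYTE, i.e.
on 64-bit builds; on 32-bit it would return twice the needed size. Below is
a minimal userspace sketch of the calculation one would probably expect.
The macros are local stand-ins for the kernel ones and the function name is
hypothetical, not part of the patch.

```c
#include <limits.h>
#include <stddef.h>

/* Local stand-ins for the kernel macros (assumptions for this sketch). */
#define BITS_PER_BYTE	8
#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define ALIGN_UP(x, a)	((((x) + (a) - 1) / (a)) * (a))

/*
 * Hypothetical variant of dirty_bitmap_bytes(): round the page count up
 * to a whole number of longs, then convert bits to bytes.
 */
static size_t dirty_bitmap_bytes_sketch(size_t npages)
{
	if (!npages)
		return 0;

	return ALIGN_UP(npages, BITS_PER_LONG) / BITS_PER_BYTE;
}
```

With this form a single page needs one long, and BITS_PER_LONG + 1 pages
need two, regardless of the word size.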
> +
> +static int vfio_dma_bitmap_alloc(struct vfio_iommu *iommu,
> +				 struct vfio_dma *dma, unsigned long pgsizes)
> +{
> +	unsigned long pgshift = __ffs(pgsizes);
> +
> +	if (!RB_EMPTY_ROOT(&dma->pfn_list) || dma->iommu_mapped) {
> +		unsigned long npages = dma->size >> pgshift;
> +		unsigned long bsize = dirty_bitmap_bytes(npages);
> +
> +		dma->bitmap = kvzalloc(bsize, GFP_KERNEL);
> +		if (!dma->bitmap)
> +			return -ENOMEM;
> +	}
> +	return 0;
> +}
> +
> +static int vfio_dma_all_bitmap_alloc(struct vfio_iommu *iommu,
> +				     unsigned long pgsizes)
> +{
> +	struct rb_node *n = rb_first(&iommu->dma_list);
> +	int ret;
> +
> +	for (; n; n = rb_next(n)) {
> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> +
> +		ret = vfio_dma_bitmap_alloc(iommu, dma, pgsizes);
> +		if (ret)
> +			return ret;
> +	}
> +	return 0;
> +}
> +
> +static void vfio_dma_all_bitmap_free(struct vfio_iommu *iommu)
> +{
> +	struct rb_node *n = rb_first(&iommu->dma_list);
> +
> +	for (; n; n = rb_next(n)) {
> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> +
> +		kfree(dma->bitmap);
> +	}
> +}
> +
>  /*
>   * Helper Functions for host iova-pfn list
>   */
> @@ -244,6 +298,29 @@ static void vfio_remove_from_pfn_list(struct vfio_dma *dma,
>  	kfree(vpfn);
>  }
>  
> +static void vfio_remove_unpinned_from_pfn_list(struct vfio_dma *dma)
> +{
> +	struct rb_node *n = rb_first(&dma->pfn_list);
> +
> +	for (; n; n = rb_next(n)) {
> +		struct vfio_pfn *vpfn = rb_entry(n, struct vfio_pfn, node);
> +
> +		if (!vpfn->ref_count)
> +			vfio_remove_from_pfn_list(dma, vpfn);
> +	}
> +}
> +
> +static void vfio_remove_unpinned_from_dma_list(struct vfio_iommu *iommu)
> +{
> +	struct rb_node *n = rb_first(&iommu->dma_list);
> +
> +	for (; n; n = rb_next(n)) {
> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> +
> +		vfio_remove_unpinned_from_pfn_list(dma);
> +	}
> +}
> +
>  static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
>  					       unsigned long iova)
>  {
> @@ -261,7 +338,8 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
>  	vpfn->ref_count--;
>  	if (!vpfn->ref_count) {
>  		ret = put_pfn(vpfn->pfn, dma->prot);
> -		vfio_remove_from_pfn_list(dma, vpfn);
> +		if (!dma->bitmap)
> +			vfio_remove_from_pfn_list(dma, vpfn);
>  	}
>  	return ret;
>  }
> @@ -483,13 +561,14 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
>  	return ret;
>  }
>  
> -static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
> +static int vfio_unpin_page_external(struct vfio_iommu *iommu,
> +				    struct vfio_dma *dma, dma_addr_t iova,
>  				    bool do_accounting)
>  {
>  	int unlocked;
>  	struct vfio_pfn *vpfn = vfio_find_vpfn(dma, iova);
>  
> -	if (!vpfn)
> +	if (!vpfn || !vpfn->ref_count)
>  		return 0;
>  
>  	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn);
> @@ -510,6 +589,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  	unsigned long remote_vaddr;
>  	struct vfio_dma *dma;
>  	bool do_accounting;
> +	unsigned long iommu_pgsizes = vfio_pgsize_bitmap(iommu);
>  
>  	if (!iommu || !user_pfn || !phys_pfn)
>  		return -EINVAL;
> @@ -551,8 +631,10 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  
>  		vpfn = vfio_iova_get_vfio_pfn(dma, iova);
>  		if (vpfn) {
> -			phys_pfn[i] = vpfn->pfn;
> -			continue;
> +			if (vpfn->ref_count > 1) {
> +				phys_pfn[i] = vpfn->pfn;
> +				continue;
> +			}
>  		}
>  
>  		remote_vaddr = dma->vaddr + iova - dma->iova;
> @@ -560,11 +642,23 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  					     do_accounting);
>  		if (ret)
>  			goto pin_unwind;
> -
> -		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> -		if (ret) {
> -			vfio_unpin_page_external(dma, iova, do_accounting);
> -			goto pin_unwind;
> +		if (!vpfn) {
> +			ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> +			if (ret) {
> +				vfio_unpin_page_external(iommu, dma, iova,
> +							 do_accounting);
> +				goto pin_unwind;
> +			}
> +		} else
> +			vpfn->pfn = phys_pfn[i];
> +
> +		if (iommu->dirty_page_tracking && !dma->bitmap) {
> +			ret = vfio_dma_bitmap_alloc(iommu, dma, iommu_pgsizes);
> +			if (ret) {
> +				vfio_unpin_page_external(iommu, dma, iova,
> +							 do_accounting);
> +				goto pin_unwind;
> +			}
>  		}
>  	}
>  
> @@ -578,7 +672,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  
>  		iova = user_pfn[j] << PAGE_SHIFT;
>  		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
> -		vfio_unpin_page_external(dma, iova, do_accounting);
> +		vfio_unpin_page_external(iommu, dma, iova, do_accounting);
>  		phys_pfn[j] = 0;
>  	}
>  pin_done:
> @@ -612,7 +706,7 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
>  		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
>  		if (!dma)
>  			goto unpin_exit;
> -		vfio_unpin_page_external(dma, iova, do_accounting);
> +		vfio_unpin_page_external(iommu, dma, iova, do_accounting);
>  	}
>  
>  unpin_exit:
> @@ -830,6 +924,113 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
>  	return bitmap;
>  }
>  
> +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> +				  size_t size, uint64_t pgsize,
> +				  unsigned char __user *bitmap)
> +{
> +	struct vfio_dma *dma;
> +	dma_addr_t i = iova, iova_limit;
> +	unsigned int bsize, nbits = 0, l = 0;
> +	unsigned long pgshift = __ffs(pgsize);
> +
> +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
> +		int ret, j;
> +		unsigned int npages = 0, shift = 0;
> +		unsigned char temp = 0;
> +
> +		/* mark all pages dirty if all pages are pinned and mapped. */
> +		if (dma->iommu_mapped) {
> +			iova_limit = min(dma->iova + dma->size, iova + size);
> +			npages = iova_limit/pgsize;
> +			bitmap_set(dma->bitmap, 0, npages);
For pass-through devices, it's not good to always report all pinned pages
as dirty. Could they also call vfio_pin_pages() to track dirty pages, or
could some other interface be provided to do that?
> +		} else if (dma->bitmap) {
> +			struct rb_node *n = rb_first(&dma->pfn_list);
> +			bool found = false;
> +
> +			for (; n; n = rb_next(n)) {
> +				struct vfio_pfn *vpfn = rb_entry(n,
> +						struct vfio_pfn, node);
> +				if (vpfn->iova >= i) {
> +					found = true;
> +					break;
> +				}
> +			}
> +
> +			if (!found) {
> +				i += dma->size;
> +				continue;
> +			}
> +
> +			for (; n; n = rb_next(n)) {
> +				unsigned int s;
> +				struct vfio_pfn *vpfn = rb_entry(n,
> +						struct vfio_pfn, node);
> +
> +				if (vpfn->iova >= iova + size)
> +					break;
> +
> +				s = (vpfn->iova - dma->iova) >> pgshift;
> +				bitmap_set(dma->bitmap, s, 1);
> +
dma->bitmap should not be set when user space asks for the dirty bitmap.
Instead, set the bits for all vpfns when dirty page tracking starts, and
clear them after copying the bitmap to user space.
During dirty page tracking, dma->bitmap would then be set when
vfio_pin_pages() is called.
> +				iova_limit = vpfn->iova + pgsize;
> +			}
> +			npages = iova_limit/pgsize;
> +		}
> +
> +		bsize = dirty_bitmap_bytes(npages);
> +		shift = nbits % BITS_PER_BYTE;
> +
> +		if (npages && shift) {
> +			l--;
> +			if (!access_ok((void __user *)bitmap + l,
> +					sizeof(unsigned char)))
> +				return -EINVAL;
> +
> +			ret = __get_user(temp, bitmap + l);
> +			if (ret)
> +				return ret;
> +		}
> +
> +		for (j = 0; j < bsize; j++, l++) {
> +			temp = temp |
> +			       (*((unsigned char *)dma->bitmap + j) << shift);
> +			if (!access_ok((void __user *)bitmap + l,
> +					sizeof(unsigned char)))
> +				return -EINVAL;
> +
> +			ret = __put_user(temp, bitmap + l);
> +			if (ret)
> +				return ret;
> +			if (shift) {
> +				temp = *((unsigned char *)dma->bitmap + j) >>
> +					(BITS_PER_BYTE - shift);
> +			}
> +		}
> +
> +		nbits += npages;
> +
> +		i = min(dma->iova + dma->size, iova + size);
> +		if (i >= iova + size)
> +			break;
> +	}
> +	return 0;
> +}
> +
> +static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
> +{
> +	long bsize;
> +
> +	if (!bitmap_size || bitmap_size > SIZE_MAX)
> +		return -EINVAL;
> +
> +	bsize = dirty_bitmap_bytes(npages);
> +
> +	if (bitmap_size < bsize)
> +		return -EINVAL;
> +
> +	return bsize;
> +}
> +
>  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  			     struct vfio_iommu_type1_dma_unmap *unmap)
>  {
> @@ -2277,6 +2478,80 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  
>  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>  			-EFAULT : 0;
> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
> +		struct vfio_iommu_type1_dirty_bitmap range;
> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> +		int ret;
> +
> +		if (!iommu->v2)
> +			return -EACCES;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
> +				    bitmap);
> +
> +		if (copy_from_user(&range, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (range.argsz < minsz || range.flags & ~mask)
> +			return -EINVAL;
> +
> +		/* only one flag should be set at a time */
> +		if (__ffs(range.flags) != __fls(range.flags))
> +			return -EINVAL;
> +
> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
> +			unsigned long iommu_pgsizes = vfio_pgsize_bitmap(iommu);
> +
> +			mutex_lock(&iommu->lock);
> +			iommu->dirty_page_tracking = true;
iommu->dirty_page_tracking should only be set to true after the bitmap
allocation succeeds.
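Roughly the ordering meant here, as a standalone userspace sketch. The
struct, the callback type and the function names are hypothetical
stand-ins, not the kernel's; locking is omitted, and a real fix would
presumably also free any bitmaps already allocated before the failure.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct vfio_iommu, reduced to one field. */
struct iommu_sketch {
	bool dirty_page_tracking;
};

/* Hypothetical allocation hook: returns 0 or a negative errno. */
typedef int (*alloc_all_fn)(struct iommu_sketch *iommu);

static int alloc_ok(struct iommu_sketch *iommu)
{
	(void)iommu;
	return 0;
}

static int alloc_fail(struct iommu_sketch *iommu)
{
	(void)iommu;
	return -ENOMEM;
}

/* Allocate first; flip the flag only once allocation has succeeded. */
static int start_dirty_tracking(struct iommu_sketch *iommu,
				alloc_all_fn alloc_all)
{
	int ret = alloc_all(iommu);

	if (ret)
		return ret;	/* flag is left untouched on failure */

	iommu->dirty_page_tracking = true;
	return 0;
}
```

This way a failed START never leaves the container claiming that tracking
is active while some vfio_dma has no bitmap.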
> +			ret = vfio_dma_all_bitmap_alloc(iommu, iommu_pgsizes);
> +			mutex_unlock(&iommu->lock);
> +			return ret;
> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
> +			mutex_lock(&iommu->lock);
> +			iommu->dirty_page_tracking = false;
> +			vfio_dma_all_bitmap_free(iommu);
> +			vfio_remove_unpinned_from_dma_list(iommu);
> +			mutex_unlock(&iommu->lock);
> +			return 0;
> +		} else if (range.flags &
> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
> +			long bsize;
> +			unsigned long pgshift = __ffs(range.pgsize);
> +			uint64_t iommu_pgsizes = vfio_pgsize_bitmap(iommu);
> +			uint64_t iommu_pgmask =
> +				 ((uint64_t)1 << __ffs(iommu_pgsizes)) - 1;
> +
> +			if ((range.pgsize & iommu_pgsizes) != range.pgsize)
> +				return -EINVAL;
> +			if (range.iova & iommu_pgmask)
> +				return -EINVAL;
> +			if (!range.size || range.size & iommu_pgmask)
> +				return -EINVAL;
> +			if (range.iova + range.size < range.iova)
> +				return -EINVAL;
> +			if (!access_ok((void __user *)range.bitmap,
> +				       range.bitmap_size))
> +				return -EINVAL;
> +
> +			bsize = verify_bitmap_size(range.size >> pgshift,
> +						   range.bitmap_size);
> +			if (bsize < 0)
> +				return bsize;
> +
> +			mutex_lock(&iommu->lock);
> +			if (iommu->dirty_page_tracking)
> +				ret = vfio_iova_dirty_bitmap(iommu, range.iova,
> +					 range.size, range.pgsize,
> +					 (unsigned char __user *)range.bitmap);
> +			else
> +				ret = -EINVAL;
> +			mutex_unlock(&iommu->lock);
> +
> +			return ret;
> +		}
>  	}
>  
>  	return -ENOTTY;
> -- 
> 2.7.0
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 1/7] vfio: KABI for migration interface for device state
  2020-02-07 19:42   ` Kirti Wankhede
@ 2020-02-10 17:25     ` Alex Williamson
  -1 siblings, 0 replies; 62+ messages in thread
From: Alex Williamson @ 2020-02-10 17:25 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Sat, 8 Feb 2020 01:12:28 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> - Defined MIGRATION region type and sub-type.
> 
> - Defined vfio_device_migration_info structure which will be placed at 0th
>   offset of migration region to get/set VFIO device related information.
>   Defined members of structure and usage on read/write access.
> 
> - Defined device states and state transition details.
> 
> - Defined sequence to be followed while saving and resuming VFIO device.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  include/uapi/linux/vfio.h | 208 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 208 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9e843a147ead..572242620ce9 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
>  #define VFIO_REGION_TYPE_GFX                    (1)
>  #define VFIO_REGION_TYPE_CCW			(2)
> +#define VFIO_REGION_TYPE_MIGRATION              (3)
>  
>  /* sub-types for VFIO_REGION_TYPE_PCI_* */
>  
> @@ -379,6 +380,213 @@ struct vfio_region_gfx_edid {
>  /* sub-types for VFIO_REGION_TYPE_CCW */
>  #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
>  
> +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
> +#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
> +
> +/*
> + * Structure vfio_device_migration_info is placed at 0th offset of
> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> + * information. Field accesses from this structure are only supported at their
> + * native width and alignment, otherwise the result is undefined and vendor
> + * drivers should return an error.
> + *
> + * device_state: (read/write)
> + *      - User application writes this field to inform vendor driver about the
> + *        device state to be transitioned to.
> + *      - Vendor driver should take necessary actions to change device state.
> + *        On successful transition to given state, vendor driver should return
> + *        success on write(device_state, state) system call. If device state
> + *        transition fails, vendor driver should return error, -EFAULT.

s/error, -EFAULT/an appropriate -errno for the fault condition/

> + *      - On user application side, if device state transition fails, i.e. if
> + *        write(device_state, state) returns error, read device_state again to
> + *        determine the current state of the device from vendor driver.
> + *      - Vendor driver should return previous state of the device unless vendor
> + *        driver has encountered an internal error, in which case vendor driver
> + *        may report the device_state VFIO_DEVICE_STATE_ERROR.
> + *	- User application must use the device reset ioctl in order to recover
> + *	  the device from VFIO_DEVICE_STATE_ERROR state. If the device is
> + *	  indicated in a valid device state via reading device_state, the user
> + *	  application may decide attempt to transition the device to any valid
> + *	  state reachable from the current state or terminate itself.
> + *
> + *      device_state consists of 3 bits:
> + *      - If bit 0 set, indicates _RUNNING state. When it's clear, that
> + *	  indicates _STOP state. When device is changed to _STOP, driver should
> + *	  stop device before write() returns.
> + *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
> + *        should start gathering device state information which will be provided
> + *        to VFIO user application to save device's state.
> + *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
> + *        prepare to resume device, data provided through migration region
> + *        should be used to resume device.
> + *      Bits 3 - 31 are reserved for future use. In order to preserve them,
> + *	user application should perform read-modify-write operation on this
> + *	field when modifying the specified bits.
> + *
> + *  +------- _RESUMING
> + *  |+------ _SAVING
> + *  ||+----- _RUNNING
> + *  |||
> + *  000b => Device Stopped, not saving or resuming
> + *  001b => Device running state, default state
> + *  010b => Stop Device & save device state, stop-and-copy state
> + *  011b => Device running and save device state, pre-copy state
> + *  100b => Device stopped and device state is resuming
> + *  101b => Invalid state
> + *  110b => Error state
> + *  111b => Invalid state
> + *
> + * State transitions:
> + *
> + *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
> + *                (100b)     (001b)     (011b)        (010b)       (000b)
> + * 0. Running or Default state
> + *                             |
> + *
> + * 1. Normal Shutdown (optional)
> + *                             |------------------------------------->|
> + *
> + * 2. Save state or Suspend
> + *                             |------------------------->|---------->|
> + *
> + * 3. Save state during live migration
> + *                             |----------->|------------>|---------->|
> + *
> + * 4. Resuming
> + *                  |<---------|
> + *
> + * 5. Resumed
> + *                  |--------->|
> + *
> + * 0. Default state of VFIO device is _RUNNNG when user application starts.
> + * 1. During normal user application shutdown, vfio device state changes
> + *    from _RUNNING to _STOP. This is optional, user application may or may not
> + *    perform this state transition and vendor driver may not need.

s/may not need/must not require, but must support this transition/

> + * 2. When user application save state or suspend application, device state
> + *    transitions from _RUNNING to stop-and-copy state and then to _STOP.
> + *    On state transition from _RUNNING to stop-and-copy, driver must
> + *    stop device, save device state and send it to application through
> + *    migration region. Sequence to be followed for such transition is given
> + *    below.
> + * 3. In user application live migration, state transitions from _RUNNING
> + *    to pre-copy to stop-and-copy to _STOP.
> + *    On state transition from _RUNNING to pre-copy, driver should start
> + *    gathering device state while application is still running and send device
> + *    state data to application through migration region.
> + *    On state transition from pre-copy to stop-and-copy, driver must stop
> + *    device, save device state and send it to user application through
> + *    migration region.
> + *    Sequence to be followed for above two transitions is given below.

Perhaps adding something like "Vendor drivers must support the pre-copy
state even for implementations where no data is provided to the user
until the stop-and-copy state.  The user must not be required to
consume all migration data prior to transitioning to a new state,
including the stop-and-copy state."

> + * 4. To start resuming phase, device state should be transitioned from
> + *    _RUNNING to _RESUMING state.
> + *    In _RESUMING state, driver should use received device state data through
> + *    migration region to resume device.
> + * 5. On providing saved device data to driver, application should change state
> + *    from _RESUMING to _RUNNING.
> + *
> + * pending bytes: (read only)
> + *      Number of pending bytes yet to be migrated from vendor driver
> + *
> + * data_offset: (read only)
> + *      User application should read data_offset in migration region from where
> + *      user application should read device data during _SAVING state or write
> + *      device data during _RESUMING state. See below for detail of sequence to
> + *      be followed.
> + *
> + * data_size: (read/write)
> + *      User application should read data_size to get size of data copied in
> + *      bytes in migration region during _SAVING state and write size of data
> + *      copied in bytes in migration region during _RESUMING state.
> + *
> + * Migration region looks like:
> + *  ------------------------------------------------------------------
> + * |vfio_device_migration_info|    data section                      |
> + * |                          |     ///////////////////////////////  |
> + * ------------------------------------------------------------------
> + *   ^                              ^
> + *  offset 0-trapped part        data_offset
> + *
> + * Structure vfio_device_migration_info is always followed by data section in
> + * the region, so data_offset will always be non-0. Offset from where data is
> + * copied is decided by kernel driver, data section can be trapped or mapped
> + * or partitioned, depending on how kernel driver defines data section.
> + * Data section partition can be defined as mapped by sparse mmap capability.
> + * If mmapped, then data_offset should be page aligned, where as initial section
> + * which contain vfio_device_migration_info structure might not end at offset
> + * which is page aligned. The user is not required to access via mmap regardless
> + * of the region mmap capabilities.
> + * Vendor driver should decide whether to partition data section and how to
> + * partition the data section. Vendor driver should return data_offset
> + * accordingly.
> + *
> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> + * and for _SAVING device state or stop-and-copy phase:
> + * a. read pending_bytes, indicates start of new iteration to get device data.
> + *    Repeatative read on pending_bytes at this stage should have no side
> + *    effect.

s/Repeatative/Repeated/

> + *    If pending_bytes == 0, user application should not iterate to get data
> + *    for that device.
> + *    If pending_bytes > 0, go through below steps.
> + * b. read data_offset, indicates vendor driver to make data available through
> + *    data section. Vendor driver should return this read operation only after
> + *    data is available from (region + data_offset) to (region + data_offset +
> + *    data_size).
> + * c. read data_size, amount of data in bytes available through migration
> + *    region.
> + *    Read on data_offset and data_size should return offset and size of current
> + *    buffer if user application reads those more than once here.
> + * d. read data of data_size bytes from (region + data_offset) from migration
> + *    region.
> + * e. process data.
> + * f. read pending_bytes, this read operation indicates data from previous
> + *    iteration had read. If pending_bytes > 0, goto step b.
> + *
> + * If there is any error during the above sequence, vendor driver can return
> + * error code for next read()/write() operation, that will terminate the loop
> + * and user should take next necessary action, for example, fail migration or
> + * terminate user application.
> + *
> + * User application can transition from _SAVING|_RUNNING (pre-copy state) to
> + * _SAVING (stop-and-copy) state regardless of pending bytes.

OK, you cover one of my concerns from above here.  It probably doesn't
hurt to mention it in both places.
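For reference, steps a-f as seen from the user side might look roughly
like the sketch below. It assumes the region is accessed with pread() on
the device fd at the region offset reported by VFIO_DEVICE_GET_REGION_INFO
(passed in as base). The struct is a local copy of the layout proposed in
this patch; save_iterate(), read_u64() and the sink callback are
hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Local copy of the layout proposed in this patch (illustration only). */
struct mig_info_sketch {
	uint32_t device_state;
	uint32_t reserved;
	uint64_t pending_bytes;
	uint64_t data_offset;
	uint64_t data_size;
} __attribute__((packed));

/* Hypothetical consumer of one chunk of opaque device data. */
typedef int (*sink_fn)(const void *buf, size_t len, void *opaque);

/* Read one 64-bit field at its native width. */
static int read_u64(int fd, off_t base, size_t field, uint64_t *val)
{
	ssize_t n = pread(fd, val, sizeof(*val), base + (off_t)field);

	return n == (ssize_t)sizeof(*val) ? 0 : -1;
}

static int save_iterate(int fd, off_t base, void *buf, size_t buf_len,
			sink_fn sink, void *opaque)
{
	uint64_t pending, off, size;

	/* a. read pending_bytes: start of a new iteration */
	if (read_u64(fd, base, offsetof(struct mig_info_sketch, pending_bytes),
		     &pending))
		return -1;

	while (pending > 0) {
		/* b. read data_offset: driver makes data available */
		if (read_u64(fd, base,
			     offsetof(struct mig_info_sketch, data_offset),
			     &off))
			return -1;
		/* c. read data_size: bytes available in the data section */
		if (read_u64(fd, base,
			     offsetof(struct mig_info_sketch, data_size),
			     &size))
			return -1;
		if (size > buf_len)
			return -1;
		/* d. read the data itself from region + data_offset */
		if (pread(fd, buf, (size_t)size, base + (off_t)off) !=
		    (ssize_t)size)
			return -1;
		/* e. process data (opaque to the user) */
		if (sink(buf, (size_t)size, opaque))
			return -1;
		/* f. read pending_bytes again: previous chunk was consumed */
		if (read_u64(fd, base,
			     offsetof(struct mig_info_sketch, pending_bytes),
			     &pending))
			return -1;
	}
	return 0;
}
```

Any failed read terminates the loop, matching the error handling described
in the comment; what the caller does next (fail migration, exit) is up to
the application.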

> + * User application should iterate in _SAVING (stop-and-copy) until
> + * pending_bytes is 0.
> + *
> + * Sequence to be followed while _RESUMING device state:
> + * While data for this device is available, repeat below steps:
> + * a. read data_offset from where user application should write data.
> + * b. write data of data_size to migration region from data_offset. Data size
> + *    should be data packet size at source during _SAVING.

I find the reference to data_size a bit confusing in this wording,
almost as if it's implied that the user reads data_size on the target.
What if we changed it a little:

 b. write migration data starting at migration region + data_offset for
 length determined by data_size from the migration source.

> + * c. write data_size which indicates vendor driver that data is written in
> + *    migration region. Vendor driver should read this data from migration
> + *    region and resume device's state.

Perhaps "Vendor driver should apply the user provided migration region
data towards the device resume state"?
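The resume side, steps a-c, could be sketched the same way. Again this
assumes pread()/pwrite() on the device fd at the migration region offset
(base); the struct is a local copy of the layout in this patch and
resume_write_chunk() is a hypothetical name. len is the chunk size that
was read from data_size on the migration source.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Local copy of the layout proposed in this patch (illustration only). */
struct mig_info_sketch {
	uint32_t device_state;
	uint32_t reserved;
	uint64_t pending_bytes;
	uint64_t data_offset;
	uint64_t data_size;
} __attribute__((packed));

static int resume_write_chunk(int fd, off_t base, const void *buf,
			      uint64_t len)
{
	uint64_t off;

	/* a. read data_offset: where the data should be written */
	if (pread(fd, &off, sizeof(off),
		  base + (off_t)offsetof(struct mig_info_sketch, data_offset))
	    != (ssize_t)sizeof(off))
		return -1;

	/* b. write migration data starting at region + data_offset */
	if (pwrite(fd, buf, (size_t)len, base + (off_t)off) != (ssize_t)len)
		return -1;

	/* c. write data_size: tells the driver the data is in place */
	if (pwrite(fd, &len, sizeof(len),
		   base + (off_t)offsetof(struct mig_info_sketch, data_size))
	    != (ssize_t)sizeof(len))
		return -1;

	return 0;
}
```

The caller would invoke this once per chunk, in the same order and with the
same chunk sizes as were produced on the source.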

> + *
> + * For user application, data is opaque. User application should write data in
> + * the same order as received and should of same transaction size at source.

Great!

> + */
> +
> +struct vfio_device_migration_info {
> +	__u32 device_state;         /* VFIO device state */
> +#define VFIO_DEVICE_STATE_STOP      (0)
> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> +				     VFIO_DEVICE_STATE_SAVING |  \
> +				     VFIO_DEVICE_STATE_RESUMING)
> +
> +#define VFIO_DEVICE_STATE_VALID(state) \
> +	(state & VFIO_DEVICE_STATE_RESUMING ? \
> +	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
> +
> +#define VFIO_DEVICE_STATE_ERROR			\
> +		(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING)

It looks like this isn't used in this series, so I'm not sure of the
intention of this macro, but I think we decided to only use 110b as the
"error" state.  So should this be something like

#define VFIO_DEVICE_STATE_IS_ERROR(state) \
	(((state) & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \
						VFIO_DEVICE_STATE_RESUMING))

Or if this was intended to be used in setting the device_state to
error, perhaps

#define VFIO_DEVICE_STATE_SET_ERROR(state) \
	(((state) & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_STATE_SAVING | \
					       VFIO_DEVICE_STATE_RESUMING)
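
As a standalone illustration of the two suggestions (flag values copied
from the patch; the _IS_ERROR and _SET_ERROR names are only proposals, not
existing API). The mask is parenthesised before the comparison because ==
binds tighter than & in C, and the macro argument is parenthesised as
well:

```c
#include <stdint.h>

#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
				     VFIO_DEVICE_STATE_SAVING |  \
				     VFIO_DEVICE_STATE_RESUMING)

/* True only for 110b in the low three bits; reserved bits are ignored. */
#define VFIO_DEVICE_STATE_IS_ERROR(state) \
	(((state) & VFIO_DEVICE_STATE_MASK) == \
	 (VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING))

/* Replace the low three bits with 110b, preserving the reserved bits. */
#define VFIO_DEVICE_STATE_SET_ERROR(state) \
	(((state) & ~VFIO_DEVICE_STATE_MASK) | \
	 VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING)

/* Hypothetical helper showing the intended use on a __u32-sized field. */
static inline uint32_t mark_error(uint32_t device_state)
{
	return VFIO_DEVICE_STATE_SET_ERROR(device_state);
}
```

With these definitions 111b and 010b are not reported as the error state,
and setting the error state leaves bits 3 - 31 as they were.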
> +
> +	__u32 reserved;

Can we specify this reserved field as reads return zero, writes are
ignored so that we give ourselves the opportunity to re-purpose it
later?

> +	__u64 pending_bytes;
> +	__u64 data_offset;
> +	__u64 data_size;
> +} __attribute__((packed));
> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within

Thanks,
Alex


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 1/7] vfio: KABI for migration interface for device state
@ 2020-02-10 17:25     ` Alex Williamson
  0 siblings, 0 replies; 62+ messages in thread
From: Alex Williamson @ 2020-02-10 17:25 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, kvm, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, eauger, felipe,
	jonathan.davies, yan.y.zhao, changpeng.liu, Ken.Xue

On Sat, 8 Feb 2020 01:12:28 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> - Defined MIGRATION region type and sub-type.
> 
> - Defined vfio_device_migration_info structure which will be placed at 0th
>   offset of migration region to get/set VFIO device related information.
>   Defined members of structure and usage on read/write access.
> 
> - Defined device states and state transition details.
> 
> - Defined sequence to be followed while saving and resuming VFIO device.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  include/uapi/linux/vfio.h | 208 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 208 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9e843a147ead..572242620ce9 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
>  #define VFIO_REGION_TYPE_GFX                    (1)
>  #define VFIO_REGION_TYPE_CCW			(2)
> +#define VFIO_REGION_TYPE_MIGRATION              (3)
>  
>  /* sub-types for VFIO_REGION_TYPE_PCI_* */
>  
> @@ -379,6 +380,213 @@ struct vfio_region_gfx_edid {
>  /* sub-types for VFIO_REGION_TYPE_CCW */
>  #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
>  
> +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
> +#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
> +
> +/*
> + * Structure vfio_device_migration_info is placed at 0th offset of
> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> + * information. Field accesses from this structure are only supported at their
> + * native width and alignment, otherwise the result is undefined and vendor
> + * drivers should return an error.
> + *
> + * device_state: (read/write)
> + *      - User application writes this field to inform vendor driver about the
> + *        device state to be transitioned to.
> + *      - Vendor driver should take necessary actions to change device state.
> + *        On successful transition to given state, vendor driver should return
> + *        success on write(device_state, state) system call. If device state
> + *        transition fails, vendor driver should return error, -EFAULT.

s/error, -EFAULT/an appropriate -errno for the fault condition/

> + *      - On user application side, if device state transition fails, i.e. if
> + *        write(device_state, state) returns error, read device_state again to
> + *        determine the current state of the device from vendor driver.
> + *      - Vendor driver should return previous state of the device unless vendor
> + *        driver has encountered an internal error, in which case vendor driver
> + *        may report the device_state VFIO_DEVICE_STATE_ERROR.
> + *	- User application must use the device reset ioctl in order to recover
> + *	  the device from VFIO_DEVICE_STATE_ERROR state. If the device is
> + *	  indicated in a valid device state via reading device_state, the user
> + *	  application may decide attempt to transition the device to any valid
> + *	  state reachable from the current state or terminate itself.
> + *
> + *      device_state consists of 3 bits:
> + *      - If bit 0 set, indicates _RUNNING state. When it's clear, that
> + *	  indicates _STOP state. When device is changed to _STOP, driver should
> + *	  stop device before write() returns.
> + *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
> + *        should start gathering device state information which will be provided
> + *        to VFIO user application to save device's state.
> + *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
> + *        prepare to resume device, data provided through migration region
> + *        should be used to resume device.
> + *      Bits 3 - 31 are reserved for future use. In order to preserve them,
> + *	user application should perform read-modify-write operation on this
> + *	field when modifying the specified bits.
> + *
> + *  +------- _RESUMING
> + *  |+------ _SAVING
> + *  ||+----- _RUNNING
> + *  |||
> + *  000b => Device Stopped, not saving or resuming
> + *  001b => Device running state, default state
> + *  010b => Stop Device & save device state, stop-and-copy state
> + *  011b => Device running and save device state, pre-copy state
> + *  100b => Device stopped and device state is resuming
> + *  101b => Invalid state
> + *  110b => Error state
> + *  111b => Invalid state
> + *
> + * State transitions:
> + *
> + *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
> + *                (100b)     (001b)     (011b)        (010b)       (000b)
> + * 0. Running or Default state
> + *                             |
> + *
> + * 1. Normal Shutdown (optional)
> + *                             |------------------------------------->|
> + *
> + * 2. Save state or Suspend
> + *                             |------------------------->|---------->|
> + *
> + * 3. Save state during live migration
> + *                             |----------->|------------>|---------->|
> + *
> + * 4. Resuming
> + *                  |<---------|
> + *
> + * 5. Resumed
> + *                  |--------->|
> + *
> + * 0. Default state of VFIO device is _RUNNNG when user application starts.
> + * 1. During normal user application shutdown, vfio device state changes
> + *    from _RUNNING to _STOP. This is optional, user application may or may not
> + *    perform this state transition and vendor driver may not need.

s/may not need/must not require, but must support this transition/

> + * 2. When user application save state or suspend application, device state
> + *    transitions from _RUNNING to stop-and-copy state and then to _STOP.
> + *    On state transition from _RUNNING to stop-and-copy, driver must
> + *    stop device, save device state and send it to application through
> + *    migration region. Sequence to be followed for such transition is given
> + *    below.
> + * 3. In user application live migration, state transitions from _RUNNING
> + *    to pre-copy to stop-and-copy to _STOP.
> + *    On state transition from _RUNNING to pre-copy, driver should start
> + *    gathering device state while application is still running and send device
> + *    state data to application through migration region.
> + *    On state transition from pre-copy to stop-and-copy, driver must stop
> + *    device, save device state and send it to user application through
> + *    migration region.
> + *    Sequence to be followed for above two transitions is given below.

Perhaps add something like "Vendor drivers must support the pre-copy
state even for implementations where no data is provided to the user
until the stop-and-copy state.  The user must not be required to
consume all migration data prior to transitioning to a new state,
including the stop-and-copy state."

> + * 4. To start resuming phase, device state should be transitioned from
> + *    _RUNNING to _RESUMING state.
> + *    In _RESUMING state, driver should use received device state data through
> + *    migration region to resume device.
> + * 5. On providing saved device data to driver, application should change state
> + *    from _RESUMING to _RUNNING.
> + *
> + * pending bytes: (read only)
> + *      Number of pending bytes yet to be migrated from vendor driver
> + *
> + * data_offset: (read only)
> + *      User application should read data_offset in migration region from where
> + *      user application should read device data during _SAVING state or write
> + *      device data during _RESUMING state. See below for detail of sequence to
> + *      be followed.
> + *
> + * data_size: (read/write)
> + *      User application should read data_size to get size of data copied in
> + *      bytes in migration region during _SAVING state and write size of data
> + *      copied in bytes in migration region during _RESUMING state.
> + *
> + * Migration region looks like:
> + *  ------------------------------------------------------------------
> + * |vfio_device_migration_info|    data section                      |
> + * |                          |     ///////////////////////////////  |
> + * ------------------------------------------------------------------
> + *   ^                              ^
> + *  offset 0-trapped part        data_offset
> + *
> + * Structure vfio_device_migration_info is always followed by data section in
> + * the region, so data_offset will always be non-0. Offset from where data is
> + * copied is decided by kernel driver, data section can be trapped or mapped
> + * or partitioned, depending on how kernel driver defines data section.
> + * Data section partition can be defined as mapped by sparse mmap capability.
> + * If mmapped, then data_offset should be page aligned, where as initial section
> + * which contain vfio_device_migration_info structure might not end at offset
> + * which is page aligned. The user is not required to access via mmap regardless
> + * of the region mmap capabilities.
> + * Vendor driver should decide whether to partition data section and how to
> + * partition the data section. Vendor driver should return data_offset
> + * accordingly.
> + *
> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> + * and for _SAVING device state or stop-and-copy phase:
> + * a. read pending_bytes, indicates start of new iteration to get device data.
> + *    Repeatative read on pending_bytes at this stage should have no side
> + *    effect.

s/Repeatative/Repeated/

> + *    If pending_bytes == 0, user application should not iterate to get data
> + *    for that device.
> + *    If pending_bytes > 0, go through below steps.
> + * b. read data_offset, indicates vendor driver to make data available through
> + *    data section. Vendor driver should return this read operation only after
> + *    data is available from (region + data_offset) to (region + data_offset +
> + *    data_size).
> + * c. read data_size, amount of data in bytes available through migration
> + *    region.
> + *    Read on data_offset and data_size should return offset and size of current
> + *    buffer if user application reads those more than once here.
> + * d. read data of data_size bytes from (region + data_offset) from migration
> + *    region.
> + * e. process data.
> + * f. read pending_bytes, this read operation indicates data from previous
> + *    iteration had read. If pending_bytes > 0, goto step b.
> + *
> + * If there is any error during the above sequence, vendor driver can return
> + * error code for next read()/write() operation, that will terminate the loop
> + * and user should take next necessary action, for example, fail migration or
> + * terminate user application.
> + *
> + * User application can transition from _SAVING|_RUNNING (pre-copy state) to
> + * _SAVING (stop-and-copy) state regardless of pending bytes.

OK, this covers one of my concerns from above.  It probably doesn't
hurt to mention it in both places.
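
For illustration, the a-f sequence could be exercised roughly as below.
This is only a userspace sketch under stated assumptions: struct
demo_dev and the demo_read_*() helpers are hypothetical stand-ins for a
vendor driver's handling of reads of pending_bytes, data_offset and
data_size, and a real user would pread() those fields from the
migration region instead of calling functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy device (hypothetical): hands out its state in fixed-size chunks. */
struct demo_dev {
	const uint8_t *state;	/* device state to migrate */
	uint64_t size;		/* total bytes of state */
	uint64_t pos;		/* bytes already consumed by the user */
	uint64_t chunk;		/* max bytes staged per iteration, must be > 0 */
	uint64_t staged;	/* bytes currently staged, 0 if none */
};

/* Steps a/f: a read while data is staged means "previous chunk consumed". */
static uint64_t demo_read_pending_bytes(struct demo_dev *d)
{
	d->pos += d->staged;
	d->staged = 0;
	return d->size - d->pos;
}

/* Step b: stage the next chunk; the returned offset is arbitrary here. */
static uint64_t demo_read_data_offset(struct demo_dev *d)
{
	uint64_t left = d->size - d->pos;

	d->staged = left < d->chunk ? left : d->chunk;
	return 4096;
}

/* Step c: size of the chunk staged by the data_offset read. */
static uint64_t demo_read_data_size(struct demo_dev *d)
{
	return d->staged;
}

/* Steps a-f in a loop; returns bytes saved, or -1 if out is too small. */
static int64_t demo_save(struct demo_dev *d, uint8_t *out, uint64_t cap)
{
	uint64_t total = 0;

	while (demo_read_pending_bytes(d) > 0) {		/* a, f */
		uint64_t off = demo_read_data_offset(d);	/* b */
		uint64_t len = demo_read_data_size(d);		/* c */

		assert(off != 0);	/* data section follows the header */
		if (len > cap - total)
			return -1;
		memcpy(out + total, d->state + d->pos, len);	/* d */
		total += len;					/* e: keep it */
	}
	return (int64_t)total;
}
```

Note that repeated reads of pending_bytes with nothing staged have no
side effect in this model, and the loop can be abandoned between
iterations, which is what allows a transition to stop-and-copy
regardless of pending bytes.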

> + * User application should iterate in _SAVING (stop-and-copy) until
> + * pending_bytes is 0.
> + *
> + * Sequence to be followed while _RESUMING device state:
> + * While data for this device is available, repeat below steps:
> + * a. read data_offset from where user application should write data.
> + * b. write data of data_size to migration region from data_offset. Data size
> + *    should be data packet size at source during _SAVING.

I find the reference to data_size a bit confusing in this wording,
almost as if it's implied that the user reads data_size on the target.
What if we changed it a little:

 b. write migration data starting at migration region + data_offset for
 length determined by data_size from the migration source.

> + * c. write data_size which indicates vendor driver that data is written in
> + *    migration region. Vendor driver should read this data from migration
> + *    region and resume device's state.

Perhaps "Vendor driver should apply the user provided migration region
data towards the device resume state"?

> + *
> + * For user application, data is opaque. User application should write data in
> + * the same order as received and should of same transaction size at source.

Great!

> + */
> +
> +struct vfio_device_migration_info {
> +	__u32 device_state;         /* VFIO device state */
> +#define VFIO_DEVICE_STATE_STOP      (0)
> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> +				     VFIO_DEVICE_STATE_SAVING |  \
> +				     VFIO_DEVICE_STATE_RESUMING)
> +
> +#define VFIO_DEVICE_STATE_VALID(state) \
> +	(state & VFIO_DEVICE_STATE_RESUMING ? \
> +	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
> +
> +#define VFIO_DEVICE_STATE_ERROR			\
> +		(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING)

It looks like this isn't used in this series, so I'm not sure of the
intention of this macro, but I think we decided to only use 110b as the
"error" state.  So should this be something like

#define VFIO_DEVICE_STATE_IS_ERROR(state) \
	((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \
					      VFIO_DEVICE_STATE_RESUMING))

Or if this was intended to be used in setting the device_state to
error, perhaps

#define VFIO_DEVICE_STATE_SET_ERROR(state) \
	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_STATE_SAVING | \
					     VFIO_DEVICE_STATE_RESUMING)
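
As a rough userspace illustration of the two helpers (not part of the
patch), the sketch below uses hypothetical DEMO_ names that mirror the
proposed bit layout.  The masked value gets its own parentheses because
== binds tighter than & in C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins mirroring the bit layout proposed in the patch. */
#define DEMO_STATE_RUNNING   (1u << 0)
#define DEMO_STATE_SAVING    (1u << 1)
#define DEMO_STATE_RESUMING  (1u << 2)
#define DEMO_STATE_MASK      (DEMO_STATE_RUNNING | DEMO_STATE_SAVING | \
			      DEMO_STATE_RESUMING)

/* 110b is the only error encoding; reserved bits 3-31 are ignored. */
static inline int demo_state_is_error(uint32_t state)
{
	return (state & DEMO_STATE_MASK) ==
	       (DEMO_STATE_SAVING | DEMO_STATE_RESUMING);
}

/* Set the error encoding while preserving the reserved bits. */
static inline uint32_t demo_state_set_error(uint32_t state)
{
	return (state & ~DEMO_STATE_MASK) |
	       DEMO_STATE_SAVING | DEMO_STATE_RESUMING;
}

/* Usage example: a few spot checks of both helpers. */
static inline void demo_state_selfcheck(void)
{
	assert(demo_state_is_error(0x6));	/* 110b */
	assert(!demo_state_is_error(0x7));	/* 111b is invalid, not error */
	assert(demo_state_set_error(0xF1) == 0xF6);
}
```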
> +
> +	__u32 reserved;

Can we specify that reads of this reserved field return zero and
writes are ignored, so that we give ourselves the opportunity to
re-purpose it later?

> +	__u64 pending_bytes;
> +	__u64 data_offset;
> +	__u64 data_size;
> +} __attribute__((packed));
> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within

Thanks,
Alex



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2020-02-07 19:42   ` Kirti Wankhede
@ 2020-02-10 17:25     ` Alex Williamson
  -1 siblings, 0 replies; 62+ messages in thread
From: Alex Williamson @ 2020-02-10 17:25 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Sat, 8 Feb 2020 01:12:31 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
> - Start pinned and unpinned pages tracking while migration is active
> - Stop pinned and unpinned dirty pages tracking. This is also used to
>   stop dirty pages tracking if migration failed or cancelled.
> - Get dirty pages bitmap. This ioctl returns bitmap of dirty pages, its
>   user space application responsibility to copy content of dirty pages
>   from source to destination during migration.
> 
> To prevent DoS attack, memory for bitmap is allocated per vfio_dma
> structure. Bitmap size is calculated considering smallest supported page
> size. Bitmap is allocated when dirty logging is enabled for those
> vfio_dmas whose vpfn list is not empty or whole range is mapped, in
> case of pass-through device.
> 
> There could be multiple option as to when bitmap should be populated:
> * Polulate bitmap for already pinned pages when bitmap is allocated for
>   a vfio_dma with the smallest supported page size. Updates bitmap from
>   page pinning and unpinning functions. When user application queries
>   bitmap, check if requested page size is same as page size used to
>   populated bitmap. If it is equal, copy bitmap. But if not equal,
>   re-populated bitmap according to requested page size and then copy to
>   user.
>   Pros: Bitmap gets populated on the fly after dirty tracking has
>         started.
>   Cons: If requested page size is different than smallest supported
>         page size, then bitmap has to be re-populated again, with
>         additional overhead of allocating bitmap memory again for
>         re-population of bitmap.

No memory needs to be allocated to re-populate the bitmap.  The bitmap
is clear-on-read and by tracking the bitmap in the smallest supported
page size we can guarantee that we can fit the user requested bitmap
size within the space occupied by that minimal page size range of the
bitmap.  Therefore we'd destructively translate the requested region of
the bitmap to a different page size, write it out to the user, and
clear it.  Also we expect userspace to use the minimum page size almost
exclusively, which is optimized by this approach as dirty bit tracking
is spread out over each page pinning operation.
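
To illustrate the in-place translation, here is a minimal userspace
sketch, assuming the requested page size is a power-of-two multiple of
the tracked minimum page size.  The demo_ helpers are hypothetical and
use plain C bit operations rather than the kernel bitmap API.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

static int demo_test_bit(const unsigned long *map, size_t n)
{
	return (map[n / DEMO_BITS_PER_LONG] >> (n % DEMO_BITS_PER_LONG)) & 1UL;
}

static void demo_assign_bit(unsigned long *map, size_t n, int val)
{
	unsigned long mask = 1UL << (n % DEMO_BITS_PER_LONG);

	if (val)
		map[n / DEMO_BITS_PER_LONG] |= mask;
	else
		map[n / DEMO_BITS_PER_LONG] &= ~mask;
}

/*
 * Coarsen a bitmap tracked at the minimum page size to a page size that
 * is (1 << order) times larger, reusing the same storage.  Output bit k
 * is the OR of input bits [k << order, (k + 1) << order).  Since
 * k <= (k << order), each output bit lands on input that has already
 * been read, so no second buffer is needed.
 * Returns the number of valid bits in the coarsened bitmap.
 */
static size_t demo_bitmap_coarsen(unsigned long *map, size_t nbits,
				  unsigned int order)
{
	size_t group, out = 0, in, i;

	assert(order < sizeof(size_t) * CHAR_BIT);
	group = (size_t)1 << order;

	for (in = 0; in < nbits; in += group, out++) {
		int dirty = 0;

		for (i = in; i < in + group && i < nbits; i++)
			dirty |= demo_test_bit(map, i);
		demo_assign_bit(map, out, dirty);
	}

	/* Clear the stale tail so it cannot leak into a later query. */
	for (i = out; i < nbits; i++)
		demo_assign_bit(map, i, 0);

	return out;
}
```

The caller would then write the first "out" bits to the user and clear
them, which is the clear-on-read behavior described above.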

> 
> * Populate bitmap when bitmap is queried by user application.
>   Pros: Bitmap is populated with requested page size. This eliminates
>         the need to re-populate bitmap if requested page size is
>         different than smallest supported pages size.
>   Cons: There is one time processing time, when bitmap is queried.

Another significant Con is that the vpfn list needs to track and manage
unpinned pages, which makes it more complex and intrusive.  The
previous option seems to have both time and complexity advantages,
especially in what we expect to be the most common case, the user
accessing the bitmap with the minimum page size, i.e. PAGE_SIZE.  It's
also not clear why we pre-allocate the bitmap at all with this approach.

> I prefer later option with simple logic and to eliminate over-head of
> bitmap repopulation in case of differnt page sizes. Later option is
> implemented in this patch.

Hmm, we'll see below, but I'm not convinced based on the above rationale.

> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 299 ++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 287 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index d386461e5d11..df358dc1c85b 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -70,6 +70,7 @@ struct vfio_iommu {
>  	unsigned int		dma_avail;
>  	bool			v2;
>  	bool			nesting;
> +	bool			dirty_page_tracking;
>  };
>  
>  struct vfio_domain {
> @@ -90,6 +91,7 @@ struct vfio_dma {
>  	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
>  	struct task_struct	*task;
>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
> +	unsigned long		*bitmap;
>  };
>  
>  struct vfio_group {
> @@ -125,6 +127,7 @@ struct vfio_regions {
>  					(!list_empty(&iommu->domain_list))
>  
>  static int put_pfn(unsigned long pfn, int prot);
> +static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu);
>  
>  /*
>   * This code handles mapping and unmapping of user data buffers
> @@ -174,6 +177,57 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
>  	rb_erase(&old->node, &iommu->dma_list);
>  }
>  
> +static inline unsigned long dirty_bitmap_bytes(unsigned int npages)
> +{
> +	if (!npages)
> +		return 0;
> +
> +	return ALIGN(npages, BITS_PER_LONG) / sizeof(unsigned long);
> +}
> +
> +static int vfio_dma_bitmap_alloc(struct vfio_iommu *iommu,
> +				 struct vfio_dma *dma, unsigned long pgsizes)
> +{
> +	unsigned long pgshift = __ffs(pgsizes);
> +
> +	if (!RB_EMPTY_ROOT(&dma->pfn_list) || dma->iommu_mapped) {
> +		unsigned long npages = dma->size >> pgshift;
> +		unsigned long bsize = dirty_bitmap_bytes(npages);
> +
> +		dma->bitmap = kvzalloc(bsize, GFP_KERNEL);

nit, we don't need to store bsize in a local variable.

> +		if (!dma->bitmap)
> +			return -ENOMEM;
> +	}
> +	return 0;
> +}
> +
> +static int vfio_dma_all_bitmap_alloc(struct vfio_iommu *iommu,
> +				     unsigned long pgsizes)
> +{
> +	struct rb_node *n = rb_first(&iommu->dma_list);
> +	int ret;
> +
> +	for (; n; n = rb_next(n)) {
> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> +
> +		ret = vfio_dma_bitmap_alloc(iommu, dma, pgsizes);
> +		if (ret)
> +			return ret;

This doesn't unwind on failure, so we're left with partially allocated
bitmap cruft.
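
A minimal userspace sketch of the unwind, for illustration only:
struct demo_dma is a hypothetical stand-in for struct vfio_dma, an
array replaces the rb-tree walk, calloc()/free() replace the kernel
allocators, and fail_at exists purely to inject a failure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct vfio_dma, reduced to the bitmap. */
struct demo_dma {
	size_t bitmap_bytes;	/* must be non-zero in this sketch */
	unsigned long *bitmap;
};

/*
 * Allocate a bitmap for every entry.  fail_at injects an allocation
 * failure at that index (pass n or larger for no injected failure).
 * On failure, every bitmap allocated so far is released and its pointer
 * reset, so the caller never sees a partially allocated set.
 */
static int demo_all_bitmap_alloc(struct demo_dma *dmas, size_t n,
				 size_t fail_at)
{
	size_t i;

	assert(dmas != NULL || n == 0);

	for (i = 0; i < n; i++) {
		dmas[i].bitmap = (i == fail_at) ? NULL :
				 calloc(1, dmas[i].bitmap_bytes);
		if (!dmas[i].bitmap)
			goto unwind;
	}
	return 0;

unwind:
	while (i--) {
		free(dmas[i].bitmap);
		dmas[i].bitmap = NULL;
	}
	return -ENOMEM;
}
```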

> +	}
> +	return 0;
> +}
> +
> +static void vfio_dma_all_bitmap_free(struct vfio_iommu *iommu)
> +{
> +	struct rb_node *n = rb_first(&iommu->dma_list);
> +
> +	for (; n; n = rb_next(n)) {
> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> +
> +		kfree(dma->bitmap);

We don't set dma->bitmap = NULL and we don't even prevent the case of a
user making multiple STOP calls, so we have a user-triggerable double
free :(

> +	}
> +}
> +
>  /*
>   * Helper Functions for host iova-pfn list
>   */
> @@ -244,6 +298,29 @@ static void vfio_remove_from_pfn_list(struct vfio_dma *dma,
>  	kfree(vpfn);
>  }
>  
> +static void vfio_remove_unpinned_from_pfn_list(struct vfio_dma *dma)
> +{
> +	struct rb_node *n = rb_first(&dma->pfn_list);
> +
> +	for (; n; n = rb_next(n)) {
> +		struct vfio_pfn *vpfn = rb_entry(n, struct vfio_pfn, node);
> +
> +		if (!vpfn->ref_count)
> +			vfio_remove_from_pfn_list(dma, vpfn);
> +	}
> +}
> +
> +static void vfio_remove_unpinned_from_dma_list(struct vfio_iommu *iommu)
> +{
> +	struct rb_node *n = rb_first(&iommu->dma_list);
> +
> +	for (; n; n = rb_next(n)) {
> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> +
> +		vfio_remove_unpinned_from_pfn_list(dma);
> +	}
> +}
> +
>  static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
>  					       unsigned long iova)
>  {
> @@ -261,7 +338,8 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
>  	vpfn->ref_count--;
>  	if (!vpfn->ref_count) {
>  		ret = put_pfn(vpfn->pfn, dma->prot);
> -		vfio_remove_from_pfn_list(dma, vpfn);
> +		if (!dma->bitmap)
> +			vfio_remove_from_pfn_list(dma, vpfn);
>  	}
>  	return ret;
>  }
> @@ -483,13 +561,14 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
>  	return ret;
>  }
>  
> -static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
> +static int vfio_unpin_page_external(struct vfio_iommu *iommu,

We added a parameter but didn't use it in this patch.

> +				    struct vfio_dma *dma, dma_addr_t iova,
>  				    bool do_accounting)
>  {
>  	int unlocked;
>  	struct vfio_pfn *vpfn = vfio_find_vpfn(dma, iova);
>  
> -	if (!vpfn)
> +	if (!vpfn || !vpfn->ref_count)
>  		return 0;
>  
>  	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn);
> @@ -510,6 +589,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  	unsigned long remote_vaddr;
>  	struct vfio_dma *dma;
>  	bool do_accounting;
> +	unsigned long iommu_pgsizes = vfio_pgsize_bitmap(iommu);
>  
>  	if (!iommu || !user_pfn || !phys_pfn)
>  		return -EINVAL;
> @@ -551,8 +631,10 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  
>  		vpfn = vfio_iova_get_vfio_pfn(dma, iova);
>  		if (vpfn) {
> -			phys_pfn[i] = vpfn->pfn;
> -			continue;
> +			if (vpfn->ref_count > 1) {
> +				phys_pfn[i] = vpfn->pfn;
> +				continue;
> +			}
>  		}
>  
>  		remote_vaddr = dma->vaddr + iova - dma->iova;
> @@ -560,11 +642,23 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  					     do_accounting);
>  		if (ret)
>  			goto pin_unwind;
> -
> -		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> -		if (ret) {
> -			vfio_unpin_page_external(dma, iova, do_accounting);
> -			goto pin_unwind;
> +		if (!vpfn) {
> +			ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> +			if (ret) {
> +				vfio_unpin_page_external(iommu, dma, iova,
> +							 do_accounting);
> +				goto pin_unwind;
> +			}
> +		} else
> +			vpfn->pfn = phys_pfn[i];
> +
> +		if (iommu->dirty_page_tracking && !dma->bitmap) {
> +			ret = vfio_dma_bitmap_alloc(iommu, dma, iommu_pgsizes);
> +			if (ret) {
> +				vfio_unpin_page_external(iommu, dma, iova,
> +							 do_accounting);
> +				goto pin_unwind;
> +			}
>  		}
>  	}
>  
> @@ -578,7 +672,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  
>  		iova = user_pfn[j] << PAGE_SHIFT;
>  		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
> -		vfio_unpin_page_external(dma, iova, do_accounting);
> +		vfio_unpin_page_external(iommu, dma, iova, do_accounting);
>  		phys_pfn[j] = 0;
>  	}
>  pin_done:
> @@ -612,7 +706,7 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
>  		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
>  		if (!dma)
>  			goto unpin_exit;
> -		vfio_unpin_page_external(dma, iova, do_accounting);
> +		vfio_unpin_page_external(iommu, dma, iova, do_accounting);
>  	}
>  
>  unpin_exit:
> @@ -830,6 +924,113 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
>  	return bitmap;
>  }
>  
> +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> +				  size_t size, uint64_t pgsize,
> +				  unsigned char __user *bitmap)
> +{
> +	struct vfio_dma *dma;
> +	dma_addr_t i = iova, iova_limit;
> +	unsigned int bsize, nbits = 0, l = 0;
> +	unsigned long pgshift = __ffs(pgsize);
> +
> +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
> +		int ret, j;
> +		unsigned int npages = 0, shift = 0;
> +		unsigned char temp = 0;
> +
> +		/* mark all pages dirty if all pages are pinned and mapped. */
> +		if (dma->iommu_mapped) {
> +			iova_limit = min(dma->iova + dma->size, iova + size);
> +			npages = iova_limit/pgsize;
> +			bitmap_set(dma->bitmap, 0, npages);

npages is derived from iova_limit, which is the number of bits to set
dirty relative to the first requested iova, not iova zero, ie. the set
of dirty bits is offset from those requested unless iova == dma->iova.

Also I hope dma->bitmap was actually allocated.  Not only does the
START error path potentially leave dirty tracking enabled without all
of the bitmaps allocated, when does the bitmap get allocated for a new
vfio_dma when dirty tracking is enabled?  Seems it only occurs if a
vpfn gets marked dirty.
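
To make the offset point concrete, here is a small userspace sketch of
how the bit range could be derived relative to dma->iova.  The names
are hypothetical, and it assumes page-aligned inputs whose end
addresses do not wrap.

```c
#include <assert.h>
#include <stdint.h>

/* Bit range inside one vfio_dma's bitmap that a request covers. */
struct demo_span {
	uint64_t first_bit;	/* index relative to the start of the dma */
	uint64_t nbits;		/* number of pages in the overlap */
};

/*
 * Intersect the requested range with the dma range and express the
 * overlap in pages, counted from dma_iova rather than from iova zero.
 */
static struct demo_span demo_dirty_span(uint64_t dma_iova, uint64_t dma_size,
					uint64_t req_iova, uint64_t req_size,
					unsigned int pgshift)
{
	uint64_t start = req_iova > dma_iova ? req_iova : dma_iova;
	uint64_t dma_end = dma_iova + dma_size;
	uint64_t req_end = req_iova + req_size;
	uint64_t end = dma_end < req_end ? dma_end : req_end;
	struct demo_span span = { 0, 0 };

	assert(pgshift < 64);

	if (start < end) {
		span.first_bit = (start - dma_iova) >> pgshift;
		span.nbits = (end - start) >> pgshift;
	}
	return span;
}
```

For a vfio_dma at 0x10000 of size 0x8000 and a request for
{0x12000, 0x2000} with 4K pages, this gives two bits starting at bit 2,
whereas iova_limit/pgsize would set twenty bits starting at bit 0.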

> +		} else if (dma->bitmap) {
> +			struct rb_node *n = rb_first(&dma->pfn_list);
> +			bool found = false;
> +
> +			for (; n; n = rb_next(n)) {
> +				struct vfio_pfn *vpfn = rb_entry(n,
> +						struct vfio_pfn, node);
> +				if (vpfn->iova >= i) {
> +					found = true;
> +					break;
> +				}
> +			}
> +
> +			if (!found) {
> +				i += dma->size;
> +				continue;
> +			}
> +
> +			for (; n; n = rb_next(n)) {
> +				unsigned int s;
> +				struct vfio_pfn *vpfn = rb_entry(n,
> +						struct vfio_pfn, node);
> +
> +				if (vpfn->iova >= iova + size)
> +					break;
> +
> +				s = (vpfn->iova - dma->iova) >> pgshift;
> +				bitmap_set(dma->bitmap, s, 1);
> +
> +				iova_limit = vpfn->iova + pgsize;
> +			}
> +			npages = iova_limit/pgsize;

Isn't iova_limit potentially uninitialized here?  For example, if our
vfio_dma covers {0,8192} and we ask for the bitmap of {0,4096} and
there's a vpfn at {4096,8192}.  I think that means vpfn->iova >= i
(4096 >= 0), so we break with found = true, then we test 4096 >= 0 +
4096 and break, and npages = ????/pgsize.

> +		}
> +
> +		bsize = dirty_bitmap_bytes(npages);
> +		shift = nbits % BITS_PER_BYTE;
> +
> +		if (npages && shift) {
> +			l--;
> +			if (!access_ok((void __user *)bitmap + l,
> +					sizeof(unsigned char)))
> +				return -EINVAL;
> +
> +			ret = __get_user(temp, bitmap + l);

I don't understand why we care to get the user's bitmap, are we trying
to leave whatever garbage they might have set in it and only also set
the dirty bits?  That seems unnecessary.

Also why do we need these access_ok() checks when we already checked
the range at the start of the ioctl?

> +			if (ret)
> +				return ret;
> +		}
> +
> +		for (j = 0; j < bsize; j++, l++) {
> +			temp = temp |
> +			       (*((unsigned char *)dma->bitmap + j) << shift);

|=

> +			if (!access_ok((void __user *)bitmap + l,
> +					sizeof(unsigned char)))
> +				return -EINVAL;
> +
> +			ret = __put_user(temp, bitmap + l);
> +			if (ret)
> +				return ret;
> +			if (shift) {
> +				temp = *((unsigned char *)dma->bitmap + j) >>
> +					(BITS_PER_BYTE - shift);
> +			}

When shift == 0, temp just seems to accumulate bits that never get
cleared.

> +		}
> +
> +		nbits += npages;
> +
> +		i = min(dma->iova + dma->size, iova + size);
> +		if (i >= iova + size)
> +			break;

So whether we error or succeed, we leave cruft in dma->bitmap for the
next pass.  It's not clear why we pre-allocated the bitmap; we might as
well just allocate it on demand here.  Actually, if we're not going to
do a copy_to_user() for some range of the bitmap, I'm not sure what its
purpose is at all.  I think the big advantages of the bitmap are that
we can amortize the cost across every pinned page or DMA mapping, we
don't need the overhead of tracking unmapped vpfns, and we can use
copy_to_user() to push the bitmap out.  We're not getting any of those
advantages here.
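
One possible way to avoid both the read-back of the user's bitmap and
the carry byte, sketched in userspace under assumptions: assemble the
result in a zeroed temporary buffer, appending each vfio_dma's bits at
the running bit offset, then push the buffer out with a single copy.
demo_bitmap_append() is a hypothetical helper, not an existing kernel
function.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/*
 * OR nbits bits from src into dst, starting at bit dst_bit of dst.
 * dst must be zero-initialized and large enough for dst_bit + nbits
 * bits.  Nothing is read back from the destination beyond what this
 * function wrote earlier, so there is no stale state to carry across
 * byte boundaries.
 */
static void demo_bitmap_append(uint8_t *dst, size_t dst_bit,
			       const uint8_t *src, size_t nbits)
{
	size_t i;

	assert(dst != NULL && src != NULL);

	for (i = 0; i < nbits; i++) {
		size_t out = dst_bit + i;

		if (src[i / CHAR_BIT] & (1u << (i % CHAR_BIT)))
			dst[out / CHAR_BIT] |=
				(uint8_t)(1u << (out % CHAR_BIT));
	}
}
```

A bit-at-a-time loop is slow but easy to verify; word-wise shifting
would be the obvious optimization once the offsets are known to be
right.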

> +	}
> +	return 0;
> +}
> +
> +static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
> +{
> +	long bsize;
> +
> +	if (!bitmap_size || bitmap_size > SIZE_MAX)
> +		return -EINVAL;
> +
> +	bsize = dirty_bitmap_bytes(npages);
> +
> +	if (bitmap_size < bsize)
> +		return -EINVAL;
> +
> +	return bsize;
> +}

Seems like this could simply return int, -errno or zero for success.
The returned bsize is not used for anything else.
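
Roughly what I have in mind, as a userspace sketch.  The demo_ names
are hypothetical, and rounding to whole 64-bit words is an assumption
matching BITS_PER_LONG on a 64-bit kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Bytes needed for npages bits, rounded up to whole 64-bit words. */
static uint64_t demo_dirty_bitmap_bytes(uint64_t npages)
{
	assert(npages <= UINT64_MAX - 63);	/* keep the round-up from wrapping */
	return (npages + 63) / 64 * 8;
}

/* Returns 0 if the user buffer is large enough, -EINVAL otherwise. */
static int demo_verify_bitmap_size(uint64_t npages, uint64_t bitmap_size)
{
	if (!bitmap_size || bitmap_size < demo_dirty_bitmap_bytes(npages))
		return -EINVAL;

	return 0;
}
```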

> +
>  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  			     struct vfio_iommu_type1_dma_unmap *unmap)
>  {
> @@ -2277,6 +2478,80 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  
>  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>  			-EFAULT : 0;
> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
> +		struct vfio_iommu_type1_dirty_bitmap range;
> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> +		int ret;
> +
> +		if (!iommu->v2)
> +			return -EACCES;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
> +				    bitmap);

We require the user to provide iova, size, pgsize, bitmap_size, and
bitmap fields to START/STOP?  Why?

> +
> +		if (copy_from_user(&range, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (range.argsz < minsz || range.flags & ~mask)
> +			return -EINVAL;
> +
> +		/* only one flag should be set at a time */
> +		if (__ffs(range.flags) != __fls(range.flags))
> +			return -EINVAL;
> +
> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
> +			unsigned long iommu_pgsizes = vfio_pgsize_bitmap(iommu);
> +
> +			mutex_lock(&iommu->lock);
> +			iommu->dirty_page_tracking = true;
> +			ret = vfio_dma_all_bitmap_alloc(iommu, iommu_pgsizes);

So dirty page tracking is enabled even if we fail to allocate all the
bitmaps?  Shouldn't this return an error if dirty tracking is already
enabled?

> +			mutex_unlock(&iommu->lock);
> +			return ret;
> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
> +			mutex_lock(&iommu->lock);
> +			iommu->dirty_page_tracking = false;

Shouldn't we only allow STOP if tracking is enabled?

> +			vfio_dma_all_bitmap_free(iommu);

Here's where that user-induced double free enters the picture.
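
A minimal userspace sketch of the START/STOP gating, under stated
assumptions: struct demo_iommu is a hypothetical, heavily reduced model
with one bitmap standing in for all vfio_dmas, locking is omitted, and
-EINVAL for a repeated START or STOP is only one possible choice of
errno.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, heavily reduced model of the container state. */
struct demo_iommu {
	bool dirty_page_tracking;
	unsigned long *bitmap;	/* one bitmap stands in for all vfio_dmas */
	size_t bitmap_bytes;	/* must be non-zero in this sketch */
};

static int demo_tracking_start(struct demo_iommu *iommu)
{
	assert(iommu != NULL);

	if (iommu->dirty_page_tracking)
		return -EINVAL;		/* already started */

	iommu->bitmap = calloc(1, iommu->bitmap_bytes);
	if (!iommu->bitmap)
		return -ENOMEM;		/* tracking stays disabled */

	/* Only flip the flag once everything it depends on exists. */
	iommu->dirty_page_tracking = true;
	return 0;
}

static int demo_tracking_stop(struct demo_iommu *iommu)
{
	assert(iommu != NULL);

	if (!iommu->dirty_page_tracking)
		return -EINVAL;		/* nothing to stop, nothing to free */

	iommu->dirty_page_tracking = false;
	free(iommu->bitmap);
	iommu->bitmap = NULL;		/* no stale pointer for a second STOP */
	return 0;
}
```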

> +			vfio_remove_unpinned_from_dma_list(iommu);
> +			mutex_unlock(&iommu->lock);
> +			return 0;
> +		} else if (range.flags &
> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
> +			long bsize;
> +			unsigned long pgshift = __ffs(range.pgsize);
> +			uint64_t iommu_pgsizes = vfio_pgsize_bitmap(iommu);
> +			uint64_t iommu_pgmask =
> +				 ((uint64_t)1 << __ffs(iommu_pgsizes)) - 1;
> +
> +			if ((range.pgsize & iommu_pgsizes) != range.pgsize)
> +				return -EINVAL;
> +			if (range.iova & iommu_pgmask)
> +				return -EINVAL;
> +			if (!range.size || range.size & iommu_pgmask)
> +				return -EINVAL;
> +			if (range.iova + range.size < range.iova)
> +				return -EINVAL;
> +			if (!access_ok((void __user *)range.bitmap,
> +				       range.bitmap_size))
> +				return -EINVAL;
> +
> +			bsize = verify_bitmap_size(range.size >> pgshift,
> +						   range.bitmap_size);
> +			if (bsize < 0)
> +				return bsize;
> +
> +			mutex_lock(&iommu->lock);
> +			if (iommu->dirty_page_tracking)
> +				ret = vfio_iova_dirty_bitmap(iommu, range.iova,
> +					 range.size, range.pgsize,
> +					 (unsigned char __user *)range.bitmap);
> +			else
> +				ret = -EINVAL;
> +			mutex_unlock(&iommu->lock);
> +
> +			return ret;
> +		}
>  	}
>  
>  	return -ENOTTY;

Thanks,
Alex


^ permalink raw reply	[flat|nested] 62+ messages in thread

bitmap cruft.

> +	}
> +	return 0;
> +}
> +
> +static void vfio_dma_all_bitmap_free(struct vfio_iommu *iommu)
> +{
> +	struct rb_node *n = rb_first(&iommu->dma_list);
> +
> +	for (; n; n = rb_next(n)) {
> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> +
> +		kfree(dma->bitmap);

We don't set dma->bitmap = NULL and we don't even prevent the case of a
user making multiple STOP calls, so we have a user triggerable double
free :(
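
Both this and the missing unwind noted above come down to keeping the
pointer state consistent.  A minimal user-space sketch follows, assuming
a plain array of regions in place of the rb-tree of vfio_dma entries and
calloc()/free() in place of the kernel allocators; struct region,
budget_calloc() and alloc_budget are hypothetical and exist only to
demonstrate a failing allocation.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

struct region {			/* stand-in for struct vfio_dma */
	size_t npages;
	unsigned long *bitmap;
};

typedef void *(*alloc_fn)(size_t nmemb, size_t size);

/* Fault injection for the demo: fail once the budget is used up. */
static int alloc_budget = -1;	/* -1 means unlimited */

static void *budget_calloc(size_t nmemb, size_t size)
{
	if (alloc_budget == 0)
		return NULL;
	if (alloc_budget > 0)
		alloc_budget--;
	return calloc(nmemb, size);
}

/* Safe to call repeatedly: free(NULL) is a no-op and pointers are reset. */
static void free_all_bitmaps(struct region *r, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		free(r[i].bitmap);
		r[i].bitmap = NULL;
	}
}

/* Either every region gets a bitmap or none does. */
static int alloc_all_bitmaps(struct region *r, size_t n, alloc_fn alloc)
{
	assert(r != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		size_t nlongs = (r[i].npages + BITS_PER_LONG - 1) /
				BITS_PER_LONG;

		r[i].bitmap = alloc(nlongs ? nlongs : 1,
				    sizeof(unsigned long));
		if (!r[i].bitmap) {
			free_all_bitmaps(r, i);	/* unwind regions 0..i-1 */
			return -ENOMEM;
		}
	}
	return 0;
}
```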

> +	}
> +}
> +
>  /*
>   * Helper Functions for host iova-pfn list
>   */
> @@ -244,6 +298,29 @@ static void vfio_remove_from_pfn_list(struct vfio_dma *dma,
>  	kfree(vpfn);
>  }
>  
> +static void vfio_remove_unpinned_from_pfn_list(struct vfio_dma *dma)
> +{
> +	struct rb_node *n = rb_first(&dma->pfn_list);
> +
> +	for (; n; n = rb_next(n)) {
> +		struct vfio_pfn *vpfn = rb_entry(n, struct vfio_pfn, node);
> +
> +		if (!vpfn->ref_count)
> +			vfio_remove_from_pfn_list(dma, vpfn);
> +	}
> +}
> +
> +static void vfio_remove_unpinned_from_dma_list(struct vfio_iommu *iommu)
> +{
> +	struct rb_node *n = rb_first(&iommu->dma_list);
> +
> +	for (; n; n = rb_next(n)) {
> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> +
> +		vfio_remove_unpinned_from_pfn_list(dma);
> +	}
> +}
> +
>  static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
>  					       unsigned long iova)
>  {
> @@ -261,7 +338,8 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
>  	vpfn->ref_count--;
>  	if (!vpfn->ref_count) {
>  		ret = put_pfn(vpfn->pfn, dma->prot);
> -		vfio_remove_from_pfn_list(dma, vpfn);
> +		if (!dma->bitmap)
> +			vfio_remove_from_pfn_list(dma, vpfn);
>  	}
>  	return ret;
>  }
> @@ -483,13 +561,14 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
>  	return ret;
>  }
>  
> -static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
> +static int vfio_unpin_page_external(struct vfio_iommu *iommu,

We added a parameter but didn't use it in this patch.

> +				    struct vfio_dma *dma, dma_addr_t iova,
>  				    bool do_accounting)
>  {
>  	int unlocked;
>  	struct vfio_pfn *vpfn = vfio_find_vpfn(dma, iova);
>  
> -	if (!vpfn)
> +	if (!vpfn || !vpfn->ref_count)
>  		return 0;
>  
>  	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn);
> @@ -510,6 +589,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  	unsigned long remote_vaddr;
>  	struct vfio_dma *dma;
>  	bool do_accounting;
> +	unsigned long iommu_pgsizes = vfio_pgsize_bitmap(iommu);
>  
>  	if (!iommu || !user_pfn || !phys_pfn)
>  		return -EINVAL;
> @@ -551,8 +631,10 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  
>  		vpfn = vfio_iova_get_vfio_pfn(dma, iova);
>  		if (vpfn) {
> -			phys_pfn[i] = vpfn->pfn;
> -			continue;
> +			if (vpfn->ref_count > 1) {
> +				phys_pfn[i] = vpfn->pfn;
> +				continue;
> +			}
>  		}
>  
>  		remote_vaddr = dma->vaddr + iova - dma->iova;
> @@ -560,11 +642,23 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  					     do_accounting);
>  		if (ret)
>  			goto pin_unwind;
> -
> -		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> -		if (ret) {
> -			vfio_unpin_page_external(dma, iova, do_accounting);
> -			goto pin_unwind;
> +		if (!vpfn) {
> +			ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> +			if (ret) {
> +				vfio_unpin_page_external(iommu, dma, iova,
> +							 do_accounting);
> +				goto pin_unwind;
> +			}
> +		} else
> +			vpfn->pfn = phys_pfn[i];
> +
> +		if (iommu->dirty_page_tracking && !dma->bitmap) {
> +			ret = vfio_dma_bitmap_alloc(iommu, dma, iommu_pgsizes);
> +			if (ret) {
> +				vfio_unpin_page_external(iommu, dma, iova,
> +							 do_accounting);
> +				goto pin_unwind;
> +			}
>  		}
>  	}
>  
> @@ -578,7 +672,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  
>  		iova = user_pfn[j] << PAGE_SHIFT;
>  		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
> -		vfio_unpin_page_external(dma, iova, do_accounting);
> +		vfio_unpin_page_external(iommu, dma, iova, do_accounting);
>  		phys_pfn[j] = 0;
>  	}
>  pin_done:
> @@ -612,7 +706,7 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
>  		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
>  		if (!dma)
>  			goto unpin_exit;
> -		vfio_unpin_page_external(dma, iova, do_accounting);
> +		vfio_unpin_page_external(iommu, dma, iova, do_accounting);
>  	}
>  
>  unpin_exit:
> @@ -830,6 +924,113 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
>  	return bitmap;
>  }
>  
> +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> +				  size_t size, uint64_t pgsize,
> +				  unsigned char __user *bitmap)
> +{
> +	struct vfio_dma *dma;
> +	dma_addr_t i = iova, iova_limit;
> +	unsigned int bsize, nbits = 0, l = 0;
> +	unsigned long pgshift = __ffs(pgsize);
> +
> +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
> +		int ret, j;
> +		unsigned int npages = 0, shift = 0;
> +		unsigned char temp = 0;
> +
> +		/* mark all pages dirty if all pages are pinned and mapped. */
> +		if (dma->iommu_mapped) {
> +			iova_limit = min(dma->iova + dma->size, iova + size);
> +			npages = iova_limit/pgsize;
> +			bitmap_set(dma->bitmap, 0, npages);

npages is derived from iova_limit, which is the number of bits to set
dirty relative to the first requested iova, not iova zero, ie. the set
of dirty bits is offset from those requested unless iova == dma->iova.

Also I hope dma->bitmap was actually allocated.  Not only does the
START error path potentially leave dirty tracking enabled without all
the bitmap allocated, when does the bitmap get allocated for a new
vfio_dma when dirty tracking is enabled?  Seems it only occurs if a
vpfn gets marked dirty.
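
For the offset problem in the first paragraph, a rough sketch of the
arithmetic I'd expect, assuming page-aligned inputs and no address
wrap-around.  struct bit_span and overlap_bits() are hypothetical names;
the point is only that both the first bit and the count are relative to
dma->iova.  With a vfio_dma at 0x10000 and a request starting at
0x12000 with 4K pages, the first bit is 2, not 0.

```c
#include <assert.h>
#include <stdint.h>

struct bit_span {
	uint64_t first;		/* first bit, relative to the dma's bitmap */
	uint64_t count;		/* number of bits covered by the request */
};

/*
 * Intersect the requested range [iova, iova + size) with the mapping
 * [dma_iova, dma_iova + dma_size) and express the overlap as a span of
 * bits in the mapping's own bitmap.
 */
static struct bit_span overlap_bits(uint64_t dma_iova, uint64_t dma_size,
				    uint64_t iova, uint64_t size,
				    unsigned int pgshift)
{
	uint64_t start = iova > dma_iova ? iova : dma_iova;
	uint64_t dma_end = dma_iova + dma_size;
	uint64_t req_end = iova + size;
	uint64_t end = dma_end < req_end ? dma_end : req_end;
	struct bit_span span = { 0, 0 };

	assert(pgshift < 64);

	if (start >= end)
		return span;	/* no overlap, nothing to report */

	span.first = (start - dma_iova) >> pgshift;
	span.count = (end - start) >> pgshift;
	return span;
}
```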

> +		} else if (dma->bitmap) {
> +			struct rb_node *n = rb_first(&dma->pfn_list);
> +			bool found = false;
> +
> +			for (; n; n = rb_next(n)) {
> +				struct vfio_pfn *vpfn = rb_entry(n,
> +						struct vfio_pfn, node);
> +				if (vpfn->iova >= i) {
> +					found = true;
> +					break;
> +				}
> +			}
> +
> +			if (!found) {
> +				i += dma->size;
> +				continue;
> +			}
> +
> +			for (; n; n = rb_next(n)) {
> +				unsigned int s;
> +				struct vfio_pfn *vpfn = rb_entry(n,
> +						struct vfio_pfn, node);
> +
> +				if (vpfn->iova >= iova + size)
> +					break;
> +
> +				s = (vpfn->iova - dma->iova) >> pgshift;
> +				bitmap_set(dma->bitmap, s, 1);
> +
> +				iova_limit = vpfn->iova + pgsize;
> +			}
> +			npages = iova_limit/pgsize;

Isn't iova_limit potentially uninitialized here?  For example, if our
vfio_dma covers {0,8192} and we ask for the bitmap of {0,4096} and
there's a vpfn at {4096,8192}.  I think that means vpfn->iova >= i
(4096 >= 0), so we break with found = true, then we test 4096 >= 0 +
4096 and break, and npages = ????/pgsize.

> +		}
> +
> +		bsize = dirty_bitmap_bytes(npages);
> +		shift = nbits % BITS_PER_BYTE;
> +
> +		if (npages && shift) {
> +			l--;
> +			if (!access_ok((void __user *)bitmap + l,
> +					sizeof(unsigned char)))
> +				return -EINVAL;
> +
> +			ret = __get_user(temp, bitmap + l);

I don't understand why we care to get the user's bitmap, are we trying
to leave whatever garbage they might have set in it and only also set
the dirty bits?  That seems unnecessary.

Also why do we need these access_ok() checks when we already checked
the range at the start of the ioctl?

> +			if (ret)
> +				return ret;
> +		}
> +
> +		for (j = 0; j < bsize; j++, l++) {
> +			temp = temp |
> +			       (*((unsigned char *)dma->bitmap + j) << shift);

|=

> +			if (!access_ok((void __user *)bitmap + l,
> +					sizeof(unsigned char)))
> +				return -EINVAL;
> +
> +			ret = __put_user(temp, bitmap + l);
> +			if (ret)
> +				return ret;
> +			if (shift) {
> +				temp = *((unsigned char *)dma->bitmap + j) >>
> +					(BITS_PER_BYTE - shift);
> +			}

When shift == 0, temp just seems to accumulate bits that never get
cleared.
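
For comparison, a user-space sketch of a byte-wise append that resets
the carry on every byte and remembers the partly filled last byte
itself, so it never needs to read the destination back.  It assumes
8-bit bytes and LSB-first bit order; struct bit_writer and
bit_writer_append() are hypothetical names, and plain stores stand in
for __put_user().

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

struct bit_writer {
	unsigned char *dst;	/* stand-in for the user bitmap */
	size_t nbits;		/* bits emitted so far */
	unsigned int partial;	/* value of the last, partly filled byte */
};

/* Append the low "nbits" bits of src at the writer's current bit offset. */
static void bit_writer_append(struct bit_writer *w,
			      const unsigned char *src, size_t nbits)
{
	size_t shift, l, nbytes, j;
	unsigned int carry, last = 0;

	assert(w && w->dst && src);
	if (!nbits)
		return;

	shift = w->nbits % CHAR_BIT;
	l = w->nbits / CHAR_BIT;
	nbytes = (nbits + CHAR_BIT - 1) / CHAR_BIT;
	carry = shift ? w->partial : 0;

	for (j = 0; j < nbytes; j++, l++) {
		unsigned int v = src[j];

		/* Ignore bits past nbits in the final source byte. */
		if (j == nbytes - 1 && nbits % CHAR_BIT)
			v &= (1u << (nbits % CHAR_BIT)) - 1;

		carry |= v << shift;
		last = carry & UCHAR_MAX;
		w->dst[l] = (unsigned char)last;
		/* Keep only the bits that spilled; zero when shift == 0. */
		carry >>= CHAR_BIT;
	}

	/* The shifted tail may need one more destination byte. */
	if (shift + nbits > nbytes * CHAR_BIT) {
		last = carry;
		w->dst[l] = (unsigned char)last;
	}

	w->nbits += nbits;
	w->partial = (w->nbits % CHAR_BIT) ? last : 0;
}
```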

> +		}
> +
> +		nbits += npages;
> +
> +		i = min(dma->iova + dma->size, iova + size);
> +		if (i >= iova + size)
> +			break;

So whether we error or succeed, we leave cruft in dma->bitmap for the
next pass.  It doesn't seem to make any sense why we pre-allocated the
bitmap, we might as well just allocate it on demand here.  Actually, if
we're not going to do a copy_to_user() for some range of the bitmap,
I'm not sure what its purpose is at all.  I think the big advantages
of the bitmap are that we can amortize the cost across every pinned
page or DMA mapping, we don't need the overhead of tracking unmapped
vpfns, and we can use copy_to_user() to push the bitmap out.  We're not
getting any of those advantages here.

> +	}
> +	return 0;
> +}
> +
> +static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
> +{
> +	long bsize;
> +
> +	if (!bitmap_size || bitmap_size > SIZE_MAX)
> +		return -EINVAL;
> +
> +	bsize = dirty_bitmap_bytes(npages);
> +
> +	if (bitmap_size < bsize)
> +		return -EINVAL;
> +
> +	return bsize;
> +}

Seems like this could simply return int, -errno or zero for success.
The returned bsize is not used for anything else.
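
Something along these lines would do, shown as a user-space sketch.  It
assumes the required size is the page count rounded up to whole longs,
and it uses fixed-width types so it builds outside the kernel; the
function names mirror the patch but the bodies are illustrative only.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>

/* Bytes needed for npages bits, rounded up to a whole number of longs. */
static uint64_t dirty_bitmap_bytes(uint64_t npages)
{
	uint64_t bits_per_long = sizeof(unsigned long) * CHAR_BIT;
	uint64_t nlongs = npages / bits_per_long +
			  (npages % bits_per_long != 0);

	return nlongs * sizeof(unsigned long);
}

/* Returns 0 if the user's buffer is large enough, -EINVAL otherwise. */
static int verify_bitmap_size(uint64_t npages, uint64_t bitmap_size)
{
	if (!bitmap_size || bitmap_size < dirty_bitmap_bytes(npages))
		return -EINVAL;

	return 0;
}
```

The caller then just tests for a non-zero return and has no bsize to
carry around.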

> +
>  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  			     struct vfio_iommu_type1_dma_unmap *unmap)
>  {
> @@ -2277,6 +2478,80 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  
>  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>  			-EFAULT : 0;
> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
> +		struct vfio_iommu_type1_dirty_bitmap range;
> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> +		int ret;
> +
> +		if (!iommu->v2)
> +			return -EACCES;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
> +				    bitmap);

We require the user to provide iova, size, pgsize, bitmap_size, and
bitmap fields to START/STOP?  Why?

> +
> +		if (copy_from_user(&range, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (range.argsz < minsz || range.flags & ~mask)
> +			return -EINVAL;
> +
> +		/* only one flag should be set at a time */
> +		if (__ffs(range.flags) != __fls(range.flags))
> +			return -EINVAL;
> +
> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
> +			unsigned long iommu_pgsizes = vfio_pgsize_bitmap(iommu);
> +
> +			mutex_lock(&iommu->lock);
> +			iommu->dirty_page_tracking = true;
> +			ret = vfio_dma_all_bitmap_alloc(iommu, iommu_pgsizes);

So dirty page tracking is enabled even if we fail to allocate all the
bitmaps?  Shouldn't this return an error if dirty tracking is already
enabled?

> +			mutex_unlock(&iommu->lock);
> +			return ret;
> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
> +			mutex_lock(&iommu->lock);
> +			iommu->dirty_page_tracking = false;

Shouldn't we only allow STOP if tracking is enabled?

> +			vfio_dma_all_bitmap_free(iommu);

Here's where that user induced double free enters the picture.
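
Pulling the START/STOP comments together, a minimal user-space sketch
of the state handling I have in mind.  It assumes a single bitmap and
calloc()/free() in place of the per-vfio_dma kernel allocations, and the
choice of -EINVAL for a redundant START or STOP is only an assumption;
struct tracker and its helpers are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct tracker {
	bool enabled;
	size_t nlongs;		/* bitmap length, must be non-zero */
	unsigned long *bitmap;
};

static int tracker_start(struct tracker *t)
{
	assert(t != NULL);

	if (t->enabled)
		return -EINVAL;	/* already tracking */

	t->bitmap = calloc(t->nlongs, sizeof(*t->bitmap));
	if (!t->bitmap)
		return -ENOMEM;	/* tracking stays disabled on failure */

	t->enabled = true;	/* flip the flag only once setup succeeded */
	return 0;
}

static int tracker_stop(struct tracker *t)
{
	assert(t != NULL);

	if (!t->enabled)
		return -EINVAL;	/* nothing to stop, nothing to free */

	t->enabled = false;
	free(t->bitmap);
	t->bitmap = NULL;	/* a later STOP cannot free it again */
	return 0;
}
```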

> +			vfio_remove_unpinned_from_dma_list(iommu);
> +			mutex_unlock(&iommu->lock);
> +			return 0;
> +		} else if (range.flags &
> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
> +			long bsize;
> +			unsigned long pgshift = __ffs(range.pgsize);
> +			uint64_t iommu_pgsizes = vfio_pgsize_bitmap(iommu);
> +			uint64_t iommu_pgmask =
> +				 ((uint64_t)1 << __ffs(iommu_pgsizes)) - 1;
> +
> +			if ((range.pgsize & iommu_pgsizes) != range.pgsize)
> +				return -EINVAL;
> +			if (range.iova & iommu_pgmask)
> +				return -EINVAL;
> +			if (!range.size || range.size & iommu_pgmask)
> +				return -EINVAL;
> +			if (range.iova + range.size < range.iova)
> +				return -EINVAL;
> +			if (!access_ok((void __user *)range.bitmap,
> +				       range.bitmap_size))
> +				return -EINVAL;
> +
> +			bsize = verify_bitmap_size(range.size >> pgshift,
> +						   range.bitmap_size);
> +			if (bsize < 0)
> +				return bsize;
> +
> +			mutex_lock(&iommu->lock);
> +			if (iommu->dirty_page_tracking)
> +				ret = vfio_iova_dirty_bitmap(iommu, range.iova,
> +					 range.size, range.pgsize,
> +					 (unsigned char __user *)range.bitmap);
> +			else
> +				ret = -EINVAL;
> +			mutex_unlock(&iommu->lock);
> +
> +			return ret;
> +		}
>  	}
>  
>  	return -ENOTTY;

Thanks,
Alex



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 5/7] vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap
  2020-02-07 19:42   ` Kirti Wankhede
@ 2020-02-10 17:48     ` Alex Williamson
  -1 siblings, 0 replies; 62+ messages in thread
From: Alex Williamson @ 2020-02-10 17:48 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Sat, 8 Feb 2020 01:12:32 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Pages, pinned by external interface for requested IO virtual address
> range,  might get unpinned  and unmapped while migration is active and
> device is still running, that is, in pre-copy phase while guest driver
> still could access those pages. Host device can write to these pages while
> those were mapped. Such pages should be marked dirty so that after
> migration guest driver should still be able to complete the operation.
> 
> To get bitmap during unmap, user should set flag
> VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP, bitmap memory should be allocated and
> zeroed by user space application. Bitmap size and page size should be set
> by user application.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 56 +++++++++++++++++++++++++++++++++++++----
>  include/uapi/linux/vfio.h       | 12 +++++++++
>  2 files changed, 63 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index df358dc1c85b..4e6ad0513932 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -1032,7 +1032,8 @@ static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
>  }
>  
>  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> -			     struct vfio_iommu_type1_dma_unmap *unmap)
> +			     struct vfio_iommu_type1_dma_unmap *unmap,
> +			     unsigned long *bitmap)
>  {
>  	uint64_t mask;
>  	struct vfio_dma *dma, *dma_last = NULL;
> @@ -1107,6 +1108,15 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  		if (dma->task->mm != current->mm)
>  			break;
>  
> +		if ((unmap->flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
> +		    (dma_last != dma))
> +			vfio_iova_dirty_bitmap(iommu, dma->iova, dma->size,
> +					       unmap->bitmap_pgsize,
> +					      (unsigned char __user *) bitmap);
> +		else
> +			vfio_remove_unpinned_from_pfn_list(dma);

Isn't there a race here?  A vendor driver could have an outstanding
page pin request that's blocked on the iommu mutex.  As soon as we
release the mutex to notify vendor drivers of the unmap, it would get
pinned, but on the next iteration we're not going to collect any new
dirty pages.  This whole sequence seems to favor tracking unpinned
pages in the dirty bitmap rather than the vpfn list with ref_count = 0.

Maybe this is also the reason for the access_ok() checks in
vfio_iova_dirty_bitmap(), but couldn't we do that in advance here for
the whole bitmap?  Thanks,

Alex

> +
> +
>  		if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
>  			struct vfio_iommu_type1_dma_unmap nb_unmap;
>  
> @@ -1132,6 +1142,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  						    &nb_unmap);
>  			goto again;
>  		}
> +
>  		unmapped += dma->size;
>  		vfio_remove_dma(iommu, dma);
>  	}
> @@ -2462,22 +2473,57 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  
>  	} else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
>  		struct vfio_iommu_type1_dma_unmap unmap;
> -		long ret;
> +		unsigned long *bitmap = NULL;
> +		long ret, bsize;
>  
>  		minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
>  
>  		if (copy_from_user(&unmap, (void __user *)arg, minsz))
>  			return -EFAULT;
>  
> -		if (unmap.argsz < minsz || unmap.flags)
> +		if (unmap.argsz < minsz ||
> +		    unmap.flags & ~VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)
>  			return -EINVAL;
>  
> -		ret = vfio_dma_do_unmap(iommu, &unmap);
> +		if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
> +			unsigned long pgshift;
> +			uint64_t iommu_pgsizes = vfio_pgsize_bitmap(iommu);
> +			uint64_t iommu_pgmask =
> +				 ((uint64_t)1 << __ffs(iommu_pgsizes)) - 1;
> +
> +			if (copy_from_user(&unmap, (void __user *)arg,
> +					   sizeof(unmap)))
> +				return -EFAULT;
> +
> +			pgshift = __ffs(unmap.bitmap_pgsize);
> +
> +			if (((unmap.bitmap_pgsize - 1) & iommu_pgmask) !=
> +			     (unmap.bitmap_pgsize - 1))
> +				return -EINVAL;
> +
> +			if ((unmap.bitmap_pgsize & iommu_pgsizes) !=
> +			     unmap.bitmap_pgsize)
> +				return -EINVAL;
> +			if (unmap.iova + unmap.size < unmap.iova)
> +				return -EINVAL;
> +			if (!access_ok((void __user *)unmap.bitmap,
> +				       unmap.bitmap_size))
> +				return -EINVAL;
> +
> +			bsize = verify_bitmap_size(unmap.size >> pgshift,
> +						   unmap.bitmap_size);
> +			if (bsize < 0)
> +				return bsize;
> +			bitmap = unmap.bitmap;
> +		}
> +
> +		ret = vfio_dma_do_unmap(iommu, &unmap, bitmap);
>  		if (ret)
>  			return ret;
>  
> -		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> +		ret = copy_to_user((void __user *)arg, &unmap, minsz) ?
>  			-EFAULT : 0;
> +		return ret;
>  	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
>  		struct vfio_iommu_type1_dirty_bitmap range;
>  		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index b1b03c720749..a852e729b5a2 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -985,12 +985,24 @@ struct vfio_iommu_type1_dma_map {
>   * field.  No guarantee is made to the user that arbitrary unmaps of iova
>   * or size different from those used in the original mapping call will
>   * succeed.
> + * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get dirty bitmap
> + * before unmapping IO virtual addresses. When this flag is set, user should
> + * allocate memory to get bitmap, clear the bitmap memory by setting zero and
> + * should set size of allocated memory in bitmap_size field. One bit in bitmap
> + * represents per page , page of user provided page size in 'bitmap_pgsize',
> + * consecutively starting from iova offset. Bit set indicates page at that
> + * offset from iova is dirty. Bitmap of pages in the range of unmapped size is
> + * returned in bitmap.
>   */
>  struct vfio_iommu_type1_dma_unmap {
>  	__u32	argsz;
>  	__u32	flags;
> +#define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
>  	__u64	iova;				/* IO virtual address */
>  	__u64	size;				/* Size of mapping (bytes) */
> +	__u64        bitmap_pgsize;		/* page size for bitmap */
> +	__u64        bitmap_size;               /* in bytes */
> +	void __user *bitmap;                    /* one bit per page */
>  };
>  
>  #define VFIO_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 7/7] vfio: Selective dirty page tracking if IOMMU backed device pins pages
  2020-02-07 19:42   ` Kirti Wankhede
@ 2020-02-10 18:14     ` Alex Williamson
  -1 siblings, 0 replies; 62+ messages in thread
From: Alex Williamson @ 2020-02-10 18:14 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Sat, 8 Feb 2020 01:12:34 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Added a check such that only singleton IOMMU groups can pin pages.
> From the point when vendor driver pins any pages, consider IOMMU group
> dirty page scope to be limited to pinned pages.
> 
> To optimize to avoid walking list often, added flag
> pinned_page_dirty_scope to indicate if all of the vfio_groups for each
> vfio_domain in the domain_list dirty page scope is limited to pinned
> pages. This flag is updated on first pinned pages request for that IOMMU
> group and on attaching/detaching group.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  drivers/vfio/vfio.c             | 13 +++++++-
>  drivers/vfio/vfio_iommu_type1.c | 72 +++++++++++++++++++++++++++++++++++++++--
>  include/linux/vfio.h            |  4 ++-
>  3 files changed, 84 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index c8482624ca34..a941c860b440 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -87,6 +87,7 @@ struct vfio_group {
>  	bool				noiommu;
>  	struct kvm			*kvm;
>  	struct blocking_notifier_head	notifier;
> +	bool				is_singleton;

There's already a hole in the structure alignment under noiommu, we can
add this there and avoid actually increasing the structure size.

>  };
>  
>  struct vfio_device {
> @@ -838,6 +839,12 @@ int vfio_add_group_dev(struct device *dev,
>  		return PTR_ERR(device);
>  	}
>  
> +	mutex_lock(&group->device_lock);
> +	group->is_singleton = false;
> +	if (list_is_singular(&group->device_list))
> +		group->is_singleton = true;
> +	mutex_unlock(&group->device_lock);
> +

vfio_create_group() should set the initial value of is_singleton,
vfio_group_create_device() and vfio_device_release() should be where
it's modified.  It might be easier to simply have a device counter on
the group that gets incremented and decremented where is_singleton is
just a macro or alias for counter == 1.

>  	/*
>  	 * Drop all but the vfio_device reference.  The vfio_device holds
>  	 * a reference to the vfio_group, which holds a reference to the
> @@ -1895,6 +1902,9 @@ int vfio_pin_pages(struct device *dev, unsigned long *user_pfn, int npage,
>  	if (!group)
>  		return -ENODEV;
>  
> +	if (!group->is_singleton)
> +		return -EINVAL;
> +
>  	ret = vfio_group_add_container_user(group);
>  	if (ret)
>  		goto err_pin_pages;
> @@ -1902,7 +1912,8 @@ int vfio_pin_pages(struct device *dev, unsigned long *user_pfn, int npage,
>  	container = group->container;
>  	driver = container->iommu_driver;
>  	if (likely(driver && driver->ops->pin_pages))
> -		ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
> +		ret = driver->ops->pin_pages(container->iommu_data,
> +					     group->iommu_group, user_pfn,
>  					     npage, prot, phys_pfn);
>  	else
>  		ret = -ENOTTY;


Don't we also need to prevent a device from being added to a singleton
group that has had pinned pages?  I think the group would set the flag
here (on success), clear it in __vfio_group_unset_container() and
perhaps vfio_group_create_device() would error if the group has pinned
pages.

> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index f748a3dbe9f9..a787a2bcd757 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -71,6 +71,7 @@ struct vfio_iommu {
>  	bool			v2;
>  	bool			nesting;
>  	bool			dirty_page_tracking;
> +	bool			pinned_page_dirty_scope;
>  };
>  
>  struct vfio_domain {
> @@ -98,6 +99,7 @@ struct vfio_group {
>  	struct iommu_group	*iommu_group;
>  	struct list_head	next;
>  	bool			mdev_group;	/* An mdev group */
> +	bool			has_pinned_pages;
>  };
>  
>  struct vfio_iova {
> @@ -129,6 +131,10 @@ struct vfio_regions {
>  static int put_pfn(unsigned long pfn, int prot);
>  static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu);
>  
> +static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
> +					       struct iommu_group *iommu_group);
> +
> +static void update_pinned_page_dirty_scope(struct vfio_iommu *iommu);
>  /*
>   * This code handles mapping and unmapping of user data buffers
>   * into DMA'ble space using the IOMMU
> @@ -580,11 +586,13 @@ static int vfio_unpin_page_external(struct vfio_iommu *iommu,
>  }
>  
>  static int vfio_iommu_type1_pin_pages(void *iommu_data,
> +				      struct iommu_group *iommu_group,
>  				      unsigned long *user_pfn,
>  				      int npage, int prot,
>  				      unsigned long *phys_pfn)
>  {
>  	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_group *group;
>  	int i, j, ret;
>  	unsigned long remote_vaddr;
>  	struct vfio_dma *dma;
> @@ -661,8 +669,14 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  			}
>  		}
>  	}
> -
>  	ret = i;
> +
> +	group = vfio_iommu_find_iommu_group(iommu, iommu_group);
> +	if (!group->has_pinned_pages) {
> +		group->has_pinned_pages = true;
> +		update_pinned_page_dirty_scope(iommu);
> +	}

If vfio.c were tracking whether a group had pinned pages, it could pass
that as an arg to this function, and the entire group lookup and dirty
scope processing could be conditional on whether vfio tells us this
group has already had pinned pages.  Thanks,

Alex
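
For illustration only, a userspace sketch of how such an argument might
gate the work.  All names here (struct backend_sketch, sketch_pin_done(),
the first_pin parameter, the lookups counter) are hypothetical; the scope
walk mirrors update_pinned_page_dirty_scope() from the patch but over a
plain array rather than the domain and group lists.

```c
#include <stdbool.h>
#include <stddef.h>

struct backend_group_sketch {
	int id;				/* stands in for the iommu_group pointer */
	bool has_pinned_pages;
};

struct backend_sketch {
	struct backend_group_sketch *groups;
	size_t ngroups;
	bool pinned_page_dirty_scope;
	unsigned int lookups;		/* instrumentation for this sketch only */
};

/* Scope is limited to pinned pages only if every group has pinned pages. */
static void sketch_update_scope(struct backend_sketch *b)
{
	b->pinned_page_dirty_scope = true;
	for (size_t i = 0; i < b->ngroups; i++) {
		if (!b->groups[i].has_pinned_pages) {
			b->pinned_page_dirty_scope = false;
			return;
		}
	}
}

/* first_pin is supplied by the caller (vfio.c in the suggestion above). */
static void sketch_pin_done(struct backend_sketch *b, int group_id,
			    bool first_pin)
{
	if (!first_pin)
		return;			/* common case: no lookup, no list walk */

	b->lookups++;
	for (size_t i = 0; i < b->ngroups; i++) {
		if (b->groups[i].id == group_id) {
			b->groups[i].has_pinned_pages = true;
			break;
		}
	}
	sketch_update_scope(b);
}
```

With this shape, only the first pin request for a group pays for the
lookup and the scope recomputation.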

> +
>  	goto pin_done;
>  
>  pin_unwind:
> @@ -938,8 +952,11 @@ static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
>  		unsigned int npages = 0, shift = 0;
>  		unsigned char temp = 0;
>  
> -		/* mark all pages dirty if all pages are pinned and mapped. */
> -		if (dma->iommu_mapped) {
> +		/*
> +		 * mark all pages dirty if any IOMMU capable device is not able
> +		 * to report dirty pages and all pages are pinned and mapped.
> +		 */
> +		if (!iommu->pinned_page_dirty_scope && dma->iommu_mapped) {
>  			iova_limit = min(dma->iova + dma->size, iova + size);
>  			npages = iova_limit/pgsize;
>  			bitmap_set(dma->bitmap, 0, npages);
> @@ -1479,6 +1496,51 @@ static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
>  	return NULL;
>  }
>  
> +static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
> +					       struct iommu_group *iommu_group)
> +{
> +	struct vfio_domain *domain;
> +	struct vfio_group *group = NULL;
> +
> +	list_for_each_entry(domain, &iommu->domain_list, next) {
> +		group = find_iommu_group(domain, iommu_group);
> +		if (group)
> +			return group;
> +	}
> +
> +	if (iommu->external_domain)
> +		group = find_iommu_group(iommu->external_domain, iommu_group);
> +
> +	return group;
> +}
> +
> +static void update_pinned_page_dirty_scope(struct vfio_iommu *iommu)
> +{
> +	struct vfio_domain *domain;
> +	struct vfio_group *group;
> +
> +	list_for_each_entry(domain, &iommu->domain_list, next) {
> +		list_for_each_entry(group, &domain->group_list, next) {
> +			if (!group->has_pinned_pages) {
> +				iommu->pinned_page_dirty_scope = false;
> +				return;
> +			}
> +		}
> +	}
> +
> +	if (iommu->external_domain) {
> +		domain = iommu->external_domain;
> +		list_for_each_entry(group, &domain->group_list, next) {
> +			if (!group->has_pinned_pages) {
> +				iommu->pinned_page_dirty_scope = false;
> +				return;
> +			}
> +		}
> +	}
> +
> +	iommu->pinned_page_dirty_scope = true;
> +}
> +
>  static bool vfio_iommu_has_sw_msi(struct list_head *group_resv_regions,
>  				  phys_addr_t *base)
>  {
> @@ -1885,6 +1947,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  
>  			list_add(&group->next,
>  				 &iommu->external_domain->group_list);
> +			update_pinned_page_dirty_scope(iommu);
>  			mutex_unlock(&iommu->lock);
>  
>  			return 0;
> @@ -2007,6 +2070,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  done:
>  	/* Delete the old one and insert new iova list */
>  	vfio_iommu_iova_insert_copy(iommu, &iova_copy);
> +	update_pinned_page_dirty_scope(iommu);
>  	mutex_unlock(&iommu->lock);
>  	vfio_iommu_resv_free(&group_resv_regions);
>  
> @@ -2021,6 +2085,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  out_free:
>  	kfree(domain);
>  	kfree(group);
> +	update_pinned_page_dirty_scope(iommu);
>  	mutex_unlock(&iommu->lock);
>  	return ret;
>  }
> @@ -2225,6 +2290,7 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
>  		vfio_iommu_iova_free(&iova_copy);
>  
>  detach_group_done:
> +	update_pinned_page_dirty_scope(iommu);
>  	mutex_unlock(&iommu->lock);
>  }
>  
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index e42a711a2800..da29802d6276 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -72,7 +72,9 @@ struct vfio_iommu_driver_ops {
>  					struct iommu_group *group);
>  	void		(*detach_group)(void *iommu_data,
>  					struct iommu_group *group);
> -	int		(*pin_pages)(void *iommu_data, unsigned long *user_pfn,
> +	int		(*pin_pages)(void *iommu_data,
> +				     struct iommu_group *group,
> +				     unsigned long *user_pfn,
>  				     int npage, int prot,
>  				     unsigned long *phys_pfn);
>  	int		(*unpin_pages)(void *iommu_data,


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2020-02-10  9:49     ` Yan Zhao
@ 2020-02-10 19:44       ` Alex Williamson
  -1 siblings, 0 replies; 62+ messages in thread
From: Alex Williamson @ 2020-02-10 19:44 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Kirti Wankhede, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng,
	Liu, Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

On Mon, 10 Feb 2020 04:49:54 -0500
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Sat, Feb 08, 2020 at 03:42:31AM +0800, Kirti Wankhede wrote:
> > VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
> > - Start pinned and unpinned pages tracking while migration is active
> > - Stop pinned and unpinned dirty pages tracking. This is also used to
> >   stop dirty pages tracking if migration fails or is cancelled.
> > - Get dirty pages bitmap. This ioctl returns a bitmap of dirty pages; it is
> >   the user space application's responsibility to copy the content of dirty
> >   pages from source to destination during migration.
> > 
> > To prevent DoS attack, memory for bitmap is allocated per vfio_dma
> > structure. Bitmap size is calculated considering smallest supported page
> > size. Bitmap is allocated when dirty logging is enabled for those
> > vfio_dmas whose vpfn list is not empty or whole range is mapped, in
> > case of pass-through device.
> > 
> > There could be multiple options as to when the bitmap should be populated:
> > * Populate the bitmap for already pinned pages when the bitmap is allocated
> >   for a vfio_dma with the smallest supported page size. Update the bitmap
> >   from the page pinning and unpinning functions. When the user application
> >   queries the bitmap, check whether the requested page size is the same as
> >   the page size used to populate the bitmap. If it is equal, copy the
> >   bitmap. If not, re-populate the bitmap according to the requested page
> >   size and then copy it to the user.
> >   Pros: Bitmap gets populated on the fly after dirty tracking has
> >         started.
> >   Cons: If requested page size is different than smallest supported
> >         page size, then bitmap has to be re-populated again, with
> >         additional overhead of allocating bitmap memory again for
> >         re-population of bitmap.
> > 
> > * Populate bitmap when bitmap is queried by user application.
> >   Pros: Bitmap is populated with requested page size. This eliminates
> >         the need to re-populate bitmap if requested page size is
> >         different than smallest supported page size.
> >   Cons: There is a one-time processing cost when the bitmap is queried.
> > 
> > I prefer the latter option for its simpler logic and because it eliminates
> > the overhead of bitmap re-population in case of different page sizes. The
> > latter option is implemented in this patch.
> > 
> > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 299 ++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 287 insertions(+), 12 deletions(-)
> > 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index d386461e5d11..df358dc1c85b 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
[snip]
> > @@ -830,6 +924,113 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
> >  	return bitmap;
> >  }
> >  
> > +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> > +				  size_t size, uint64_t pgsize,
> > +				  unsigned char __user *bitmap)
> > +{
> > +	struct vfio_dma *dma;
> > +	dma_addr_t i = iova, iova_limit;
> > +	unsigned int bsize, nbits = 0, l = 0;
> > +	unsigned long pgshift = __ffs(pgsize);
> > +
> > +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
> > +		int ret, j;
> > +		unsigned int npages = 0, shift = 0;
> > +		unsigned char temp = 0;
> > +
> > +		/* mark all pages dirty if all pages are pinned and mapped. */
> > +		if (dma->iommu_mapped) {
> > +			iova_limit = min(dma->iova + dma->size, iova + size);
> > +			npages = iova_limit/pgsize;
> > +			bitmap_set(dma->bitmap, 0, npages);  
> For pass-through devices, it's not good to always return all pinned pages as
> dirty. Could it also call vfio_pin_pages to track dirty pages, or is any
> other interface provided to do that?

See patch 7/7.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2020-02-10 19:44       ` Alex Williamson
@ 2020-02-11  2:52         ` Yan Zhao
  -1 siblings, 0 replies; 62+ messages in thread
From: Yan Zhao @ 2020-02-11  2:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng,
	Liu, Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

On Tue, Feb 11, 2020 at 03:44:54AM +0800, Alex Williamson wrote:
> On Mon, 10 Feb 2020 04:49:54 -0500
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Sat, Feb 08, 2020 at 03:42:31AM +0800, Kirti Wankhede wrote:
> > > VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
> > > - Start pinned and unpinned pages tracking while migration is active
> > > - Stop pinned and unpinned dirty pages tracking. This is also used to
> > >   stop dirty pages tracking if migration fails or is cancelled.
> > > - Get dirty pages bitmap. This ioctl returns a bitmap of dirty pages; it is
> > >   the user space application's responsibility to copy the content of dirty
> > >   pages from source to destination during migration.
> > > 
> > > To prevent DoS attack, memory for bitmap is allocated per vfio_dma
> > > structure. Bitmap size is calculated considering smallest supported page
> > > size. Bitmap is allocated when dirty logging is enabled for those
> > > vfio_dmas whose vpfn list is not empty or whole range is mapped, in
> > > case of pass-through device.
> > > 
> > > There could be multiple options as to when the bitmap should be populated:
> > > * Populate the bitmap for already pinned pages when the bitmap is allocated
> > >   for a vfio_dma with the smallest supported page size. Update the bitmap
> > >   from the page pinning and unpinning functions. When the user application
> > >   queries the bitmap, check whether the requested page size is the same as
> > >   the page size used to populate the bitmap. If it is equal, copy the
> > >   bitmap. If not, re-populate the bitmap according to the requested page
> > >   size and then copy it to the user.
> > >   Pros: Bitmap gets populated on the fly after dirty tracking has
> > >         started.
> > >   Cons: If requested page size is different than smallest supported
> > >         page size, then bitmap has to be re-populated again, with
> > >         additional overhead of allocating bitmap memory again for
> > >         re-population of bitmap.
> > > 
> > > * Populate bitmap when bitmap is queried by user application.
> > >   Pros: Bitmap is populated with requested page size. This eliminates
> > >         the need to re-populate bitmap if requested page size is
> > >         different than smallest supported page size.
> > >   Cons: There is a one-time processing cost when the bitmap is queried.
> > > 
> > > I prefer the latter option for its simpler logic and because it eliminates
> > > the overhead of bitmap re-population in case of different page sizes. The
> > > latter option is implemented in this patch.
> > > 
> > > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > > ---
> > >  drivers/vfio/vfio_iommu_type1.c | 299 ++++++++++++++++++++++++++++++++++++++--
> > >  1 file changed, 287 insertions(+), 12 deletions(-)
> > > 
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > index d386461e5d11..df358dc1c85b 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> [snip]
> > > @@ -830,6 +924,113 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
> > >  	return bitmap;
> > >  }
> > >  
> > > +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> > > +				  size_t size, uint64_t pgsize,
> > > +				  unsigned char __user *bitmap)
> > > +{
> > > +	struct vfio_dma *dma;
> > > +	dma_addr_t i = iova, iova_limit;
> > > +	unsigned int bsize, nbits = 0, l = 0;
> > > +	unsigned long pgshift = __ffs(pgsize);
> > > +
> > > +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
> > > +		int ret, j;
> > > +		unsigned int npages = 0, shift = 0;
> > > +		unsigned char temp = 0;
> > > +
> > > +		/* mark all pages dirty if all pages are pinned and mapped. */
> > > +		if (dma->iommu_mapped) {
> > > +			iova_limit = min(dma->iova + dma->size, iova + size);
> > > +			npages = iova_limit/pgsize;
> > > +			bitmap_set(dma->bitmap, 0, npages);  
> > For pass-through devices, it's not good to always return all pinned pages as
> > dirty. Could it also call vfio_pin_pages to track dirty pages, or is any
> > other interface provided to do that?
> 
> See patch 7/7.  Thanks,
>
Hi Alex and Kirti,
for pass-through devices, patch 7/7 does let the vendor driver mark pages
dirty by calling vfio_pin_pages, but its overhead is much higher than the
previous approach of generating a bitmap directly for the user.
It also requires the pass-through device's vendor driver to track guest
operations in order to know when to call vfio_pin_pages.
There are still use cases where a pass-through device is able to track
dirty pages in its hardware buffer, so is there a way for it to pass its
dirty bitmap to the user?

Thanks
Yan

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
@ 2020-02-11  2:52         ` Yan Zhao
  0 siblings, 0 replies; 62+ messages in thread
From: Yan Zhao @ 2020-02-11  2:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, kvm, eskultet, Yang,
	Ziye, qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, Kirti Wankhede, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Tue, Feb 11, 2020 at 03:44:54AM +0800, Alex Williamson wrote:
> On Mon, 10 Feb 2020 04:49:54 -0500
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Sat, Feb 08, 2020 at 03:42:31AM +0800, Kirti Wankhede wrote:
> > > VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
> > > - Start pinned and unpinned pages tracking while migration is active
> > > - Stop pinned and unpinned dirty pages tracking. This is also used to
> > >   stop dirty pages tracking if migration fails or is cancelled.
> > > - Get dirty pages bitmap. This ioctl returns a bitmap of dirty pages; it is
> > >   the user space application's responsibility to copy the content of dirty
> > >   pages from source to destination during migration.
> > > 
> > > To prevent DoS attack, memory for bitmap is allocated per vfio_dma
> > > structure. Bitmap size is calculated considering smallest supported page
> > > size. Bitmap is allocated when dirty logging is enabled for those
> > > vfio_dmas whose vpfn list is not empty or whole range is mapped, in
> > > case of pass-through device.
> > > 
> > > There could be multiple options as to when the bitmap should be populated:
> > > * Populate the bitmap for already pinned pages when the bitmap is allocated
> > >   for a vfio_dma with the smallest supported page size. Update the bitmap
> > >   from the page pinning and unpinning functions. When the user application
> > >   queries the bitmap, check whether the requested page size is the same as
> > >   the page size used to populate the bitmap. If it is equal, copy the
> > >   bitmap. If not, re-populate the bitmap according to the requested page
> > >   size and then copy it to the user.
> > >   Pros: Bitmap gets populated on the fly after dirty tracking has
> > >         started.
> > >   Cons: If requested page size is different than smallest supported
> > >         page size, then bitmap has to be re-populated again, with
> > >         additional overhead of allocating bitmap memory again for
> > >         re-population of bitmap.
> > > 
> > > * Populate bitmap when bitmap is queried by user application.
> > >   Pros: Bitmap is populated with requested page size. This eliminates
> > >         the need to re-populate bitmap if requested page size is
> > >         different than smallest supported pages size.
> > >   Cons: There is one time processing time, when bitmap is queried.
> > > 
> > > I prefer the latter option for its simple logic and to eliminate the
> > > overhead of bitmap repopulation in case of different page sizes. The
> > > latter option is implemented in this patch.
> > > 
> > > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > > ---
> > >  drivers/vfio/vfio_iommu_type1.c | 299 ++++++++++++++++++++++++++++++++++++++--
> > >  1 file changed, 287 insertions(+), 12 deletions(-)
> > > 
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > index d386461e5d11..df358dc1c85b 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> [snip]
> > > @@ -830,6 +924,113 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
> > >  	return bitmap;
> > >  }
> > >  
> > > +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> > > +				  size_t size, uint64_t pgsize,
> > > +				  unsigned char __user *bitmap)
> > > +{
> > > +	struct vfio_dma *dma;
> > > +	dma_addr_t i = iova, iova_limit;
> > > +	unsigned int bsize, nbits = 0, l = 0;
> > > +	unsigned long pgshift = __ffs(pgsize);
> > > +
> > > +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
> > > +		int ret, j;
> > > +		unsigned int npages = 0, shift = 0;
> > > +		unsigned char temp = 0;
> > > +
> > > +		/* mark all pages dirty if all pages are pinned and mapped. */
> > > +		if (dma->iommu_mapped) {
> > > +			iova_limit = min(dma->iova + dma->size, iova + size);
> > > +			npages = iova_limit/pgsize;
> > > +			bitmap_set(dma->bitmap, 0, npages);  
> > for pass-through devices, it's not good to always return all pinned pages as
> > dirty. could it also call vfio_pin_pages to track dirty pages? or any
> > other interface provided to do that?
> 
> See patch 7/7.  Thanks,
>
Hi Alex and Kirti,
for pass-through devices, patch 7/7 does let the vendor driver mark pages
dirty by calling vfio_pin_pages, but its overhead is much higher than the
previous approach of generating a bitmap directly for the user.
It also requires the pass-through device's vendor driver to track guest
operations in order to know when to call vfio_pin_pages.
There are still use cases where a pass-through device can track dirty
pages in its hardware buffer, so is there a way for it to pass its
dirty bitmap to the user?

Thanks
Yan


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2020-02-11  2:52         ` Yan Zhao
@ 2020-02-11  3:45           ` Alex Williamson
  -1 siblings, 0 replies; 62+ messages in thread
From: Alex Williamson @ 2020-02-11  3:45 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Kirti Wankhede, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng,
	Liu, Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

On Mon, 10 Feb 2020 21:52:51 -0500
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Tue, Feb 11, 2020 at 03:44:54AM +0800, Alex Williamson wrote:
> > On Mon, 10 Feb 2020 04:49:54 -0500
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> > > On Sat, Feb 08, 2020 at 03:42:31AM +0800, Kirti Wankhede wrote:  
> > > > VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
> > > > - Start pinned and unpinned pages tracking while migration is active
> > > > - Stop pinned and unpinned dirty pages tracking. This is also used to
> > > >   stop dirty pages tracking if migration failed or cancelled.
> > > > - Get dirty pages bitmap. This ioctl returns bitmap of dirty pages, its
> > > >   user space application responsibility to copy content of dirty pages
> > > >   from source to destination during migration.
> > > > 
> > > > To prevent DoS attack, memory for bitmap is allocated per vfio_dma
> > > > structure. Bitmap size is calculated considering smallest supported page
> > > > size. Bitmap is allocated when dirty logging is enabled for those
> > > > vfio_dmas whose vpfn list is not empty or whole range is mapped, in
> > > > case of pass-through device.
> > > > 
> > > > There could be multiple options as to when the bitmap should be populated:
> > > > * Populate bitmap for already pinned pages when bitmap is allocated for
> > > >   a vfio_dma with the smallest supported page size. Updates bitmap from
> > > >   page pinning and unpinning functions. When user application queries
> > > >   bitmap, check if requested page size is same as page size used to
> > > >   populate bitmap. If it is equal, copy bitmap. But if not equal,
> > > >   re-populate bitmap according to requested page size and then copy to
> > > >   user.
> > > >   Pros: Bitmap gets populated on the fly after dirty tracking has
> > > >         started.
> > > >   Cons: If requested page size is different than smallest supported
> > > >         page size, then bitmap has to be re-populated again, with
> > > >         additional overhead of allocating bitmap memory again for
> > > >         re-population of bitmap.
> > > > 
> > > > * Populate bitmap when bitmap is queried by user application.
> > > >   Pros: Bitmap is populated with requested page size. This eliminates
> > > >         the need to re-populate bitmap if requested page size is
> > > >         different than smallest supported pages size.
> > > >   Cons: There is one time processing time, when bitmap is queried.
> > > > 
> > > > I prefer the latter option for its simple logic and to eliminate the
> > > > overhead of bitmap repopulation in case of different page sizes. The
> > > > latter option is implemented in this patch.
> > > > 
> > > > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > > > ---
> > > >  drivers/vfio/vfio_iommu_type1.c | 299 ++++++++++++++++++++++++++++++++++++++--
> > > >  1 file changed, 287 insertions(+), 12 deletions(-)
> > > > 
> > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > index d386461e5d11..df358dc1c85b 100644
> > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > +++ b/drivers/vfio/vfio_iommu_type1.c  
> > [snip]  
> > > > @@ -830,6 +924,113 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
> > > >  	return bitmap;
> > > >  }
> > > >  
> > > > +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> > > > +				  size_t size, uint64_t pgsize,
> > > > +				  unsigned char __user *bitmap)
> > > > +{
> > > > +	struct vfio_dma *dma;
> > > > +	dma_addr_t i = iova, iova_limit;
> > > > +	unsigned int bsize, nbits = 0, l = 0;
> > > > +	unsigned long pgshift = __ffs(pgsize);
> > > > +
> > > > +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
> > > > +		int ret, j;
> > > > +		unsigned int npages = 0, shift = 0;
> > > > +		unsigned char temp = 0;
> > > > +
> > > > +		/* mark all pages dirty if all pages are pinned and mapped. */
> > > > +		if (dma->iommu_mapped) {
> > > > +			iova_limit = min(dma->iova + dma->size, iova + size);
> > > > +			npages = iova_limit/pgsize;
> > > > +			bitmap_set(dma->bitmap, 0, npages);    
> > > for pass-through devices, it's not good to always return all pinned pages as
> > > dirty. could it also call vfio_pin_pages to track dirty pages? or any
> > > other interface provided to do that?  
> > 
> > See patch 7/7.  Thanks,
> >  
> hi Alex and Kirti,
> for pass-through devices, though patch 7/7 enables the vendor driver to
> set dirty pages by calling vfio_pin_pages, however, its overhead is much
> higher than the previous way of generating a bitmap directly to user.
> And it also requires pass-through device vendor driver to track guest
> operations to know when to call vfio_pin_pages.
> There are still use cases like a pass-through device is able to track
> dirty pages in its hardware buffer, so is there a way for it pass its
> dirty bitmap to user?

Not currently and this sounds like another argument in favor of using
the dirty bitmap per vfio_dma to directly track dirty pages.
Passthrough drivers could be provided an interface to set dirty bits
which could be merged with pfn list entries when the user requests the
bitmap, rather than requiring passthrough drivers to unnecessarily
allocate pfn list entries directly.  Thanks,

Alex
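
The merge described above can be sketched roughly as follows. This is only
an illustration in plain user-space C, not code from the patch set: the
sketch_* names are hypothetical, and it assumes one bit per page, with the
bits derived from the pinned-pfn list and the bits set by a passthrough
driver held in two bitmaps of equal length that are OR-ed together when the
user asks for the bitmap.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define WORDS_FOR(nbits) (((nbits) + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical helper: mark one page dirty in a per-range bitmap. */
static void sketch_set_dirty(unsigned long *bitmap, size_t page)
{
	assert(bitmap != NULL);
	bitmap[page / BITS_PER_WORD] |= 1UL << (page % BITS_PER_WORD);
}

/* Hypothetical helper: return non-zero if the page is marked dirty. */
static int sketch_test_dirty(const unsigned long *bitmap, size_t page)
{
	assert(bitmap != NULL);
	return (bitmap[page / BITS_PER_WORD] >> (page % BITS_PER_WORD)) & 1UL;
}

/*
 * Hypothetical merge at query time: OR the bits reported by the driver
 * (src) into the bits derived from pinned pages (dst). Both bitmaps are
 * assumed to cover the same nbits pages.
 */
static void sketch_merge_dirty(unsigned long *dst, const unsigned long *src,
			       size_t nbits)
{
	size_t i;

	for (i = 0; i < WORDS_FOR(nbits); i++)
		dst[i] |= src[i];
}
```

With this shape a driver would only ever set bits, and no pfn list entry
has to be allocated just to report a page as dirty.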


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2020-02-11  3:45           ` Alex Williamson
@ 2020-02-11  4:11             ` Yan Zhao
  -1 siblings, 0 replies; 62+ messages in thread
From: Yan Zhao @ 2020-02-11  4:11 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng,
	Liu, Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

On Tue, Feb 11, 2020 at 11:45:43AM +0800, Alex Williamson wrote:
> On Mon, 10 Feb 2020 21:52:51 -0500
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Tue, Feb 11, 2020 at 03:44:54AM +0800, Alex Williamson wrote:
> > > On Mon, 10 Feb 2020 04:49:54 -0500
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >   
> > > > On Sat, Feb 08, 2020 at 03:42:31AM +0800, Kirti Wankhede wrote:  
> > > > > VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
> > > > > - Start pinned and unpinned pages tracking while migration is active
> > > > > - Stop pinned and unpinned dirty pages tracking. This is also used to
> > > > >   stop dirty pages tracking if migration failed or cancelled.
> > > > > - Get dirty pages bitmap. This ioctl returns bitmap of dirty pages, its
> > > > >   user space application responsibility to copy content of dirty pages
> > > > >   from source to destination during migration.
> > > > > 
> > > > > To prevent DoS attack, memory for bitmap is allocated per vfio_dma
> > > > > structure. Bitmap size is calculated considering smallest supported page
> > > > > size. Bitmap is allocated when dirty logging is enabled for those
> > > > > vfio_dmas whose vpfn list is not empty or whole range is mapped, in
> > > > > case of pass-through device.
> > > > > 
> > > > > There could be multiple options as to when the bitmap should be populated:
> > > > > * Populate bitmap for already pinned pages when bitmap is allocated for
> > > > >   a vfio_dma with the smallest supported page size. Updates bitmap from
> > > > >   page pinning and unpinning functions. When user application queries
> > > > >   bitmap, check if requested page size is same as page size used to
> > > > >   populate bitmap. If it is equal, copy bitmap. But if not equal,
> > > > >   re-populate bitmap according to requested page size and then copy to
> > > > >   user.
> > > > >   Pros: Bitmap gets populated on the fly after dirty tracking has
> > > > >         started.
> > > > >   Cons: If requested page size is different than smallest supported
> > > > >         page size, then bitmap has to be re-populated again, with
> > > > >         additional overhead of allocating bitmap memory again for
> > > > >         re-population of bitmap.
> > > > > 
> > > > > * Populate bitmap when bitmap is queried by user application.
> > > > >   Pros: Bitmap is populated with requested page size. This eliminates
> > > > >         the need to re-populate bitmap if requested page size is
> > > > >         different than smallest supported pages size.
> > > > >   Cons: There is one time processing time, when bitmap is queried.
> > > > > 
> > > > > I prefer the latter option for its simple logic and to eliminate the
> > > > > overhead of bitmap repopulation in case of different page sizes. The
> > > > > latter option is implemented in this patch.
> > > > > 
> > > > > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > > > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > > > > ---
> > > > >  drivers/vfio/vfio_iommu_type1.c | 299 ++++++++++++++++++++++++++++++++++++++--
> > > > >  1 file changed, 287 insertions(+), 12 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > > index d386461e5d11..df358dc1c85b 100644
> > > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > > +++ b/drivers/vfio/vfio_iommu_type1.c  
> > > [snip]  
> > > > > @@ -830,6 +924,113 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
> > > > >  	return bitmap;
> > > > >  }
> > > > >  
> > > > > +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> > > > > +				  size_t size, uint64_t pgsize,
> > > > > +				  unsigned char __user *bitmap)
> > > > > +{
> > > > > +	struct vfio_dma *dma;
> > > > > +	dma_addr_t i = iova, iova_limit;
> > > > > +	unsigned int bsize, nbits = 0, l = 0;
> > > > > +	unsigned long pgshift = __ffs(pgsize);
> > > > > +
> > > > > +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
> > > > > +		int ret, j;
> > > > > +		unsigned int npages = 0, shift = 0;
> > > > > +		unsigned char temp = 0;
> > > > > +
> > > > > +		/* mark all pages dirty if all pages are pinned and mapped. */
> > > > > +		if (dma->iommu_mapped) {
> > > > > +			iova_limit = min(dma->iova + dma->size, iova + size);
> > > > > +			npages = iova_limit/pgsize;
> > > > > +			bitmap_set(dma->bitmap, 0, npages);    
> > > > for pass-through devices, it's not good to always return all pinned pages as
> > > > dirty. could it also call vfio_pin_pages to track dirty pages? or any
> > > > other interface provided to do that?  
> > > 
> > > See patch 7/7.  Thanks,
> > >  
> > hi Alex and Kirti,
> > for pass-through devices, though patch 7/7 enables the vendor driver to
> > set dirty pages by calling vfio_pin_pages, however, its overhead is much
> > higher than the previous way of generating a bitmap directly to user.
> > And it also requires pass-through device vendor driver to track guest
> > operations to know when to call vfio_pin_pages.
> > There are still use cases like a pass-through device is able to track
> > dirty pages in its hardware buffer, so is there a way for it pass its
> > dirty bitmap to user?
> 
> Not currently and this sounds like another argument in favor of using
> the dirty bitmap per vfio_dma to directly track dirty pages.
It may need an interface to get the max iova across all vfio_dma entries,
and then generate a hardware bitmap covering the whole guest system memory.

> Passthrough drivers could be provided an interface to set dirty bits
> which could be merged with pfn list entries when the user requests the
> bitmap, rather than requiring passthrough drivers to unnecessarily
> allocate pfn list entries directly.  Thanks,
Yes, that's better.
And for devices able to track dirty pages in hardware,
maybe an interface to let vfio know where the hardware bitmap is?

Thanks
Yan
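
As a rough illustration of the sizing question raised above, the sketch
below computes the highest mapped iova and the number of bytes a one-bit-
per-page bitmap would need to cover it. Everything here is an assumption
for illustration: struct sketch_range stands in for the iova/size pair of a
vfio_dma, the sketch_* helpers are hypothetical, and the bitmap is assumed
to start at iova 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the iova and size of one mapped range. */
struct sketch_range {
	uint64_t iova;
	uint64_t size;
};

/*
 * Highest end address (exclusive) across all ranges, or 0 if there are
 * none. Overflow of iova + size is not handled in this sketch.
 */
static uint64_t sketch_max_iova_end(const struct sketch_range *r, size_t n)
{
	uint64_t max_end = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t end = r[i].iova + r[i].size;

		if (end > max_end)
			max_end = end;
	}
	return max_end;
}

/* Bytes needed for one bit per page covering [0, max_end). */
static uint64_t sketch_bitmap_bytes(uint64_t max_end, uint64_t pgsize)
{
	assert(pgsize != 0);

	uint64_t npages = (max_end + pgsize - 1) / pgsize;

	return (npages + 7) / 8;
}
```

A bitmap sized this way also covers any holes between mappings, so a sparse
iova layout makes it larger than the sum of per-range bitmaps.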

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 1/7] vfio: KABI for migration interface for device state
  2020-02-10 17:25     ` Alex Williamson
@ 2020-02-12 20:56       ` Kirti Wankhede
  -1 siblings, 0 replies; 62+ messages in thread
From: Kirti Wankhede @ 2020-02-12 20:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm



On 2/10/2020 10:55 PM, Alex Williamson wrote:
> On Sat, 8 Feb 2020 01:12:28 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> - Defined MIGRATION region type and sub-type.
>>
>> - Defined vfio_device_migration_info structure which will be placed at 0th
>>    offset of migration region to get/set VFIO device related information.
>>    Defined members of structure and usage on read/write access.
>>
>> - Defined device states and state transition details.
>>
>> - Defined sequence to be followed while saving and resuming VFIO device.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   include/uapi/linux/vfio.h | 208 ++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 208 insertions(+)
>>
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 9e843a147ead..572242620ce9 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
>>   #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
>>   #define VFIO_REGION_TYPE_GFX                    (1)
>>   #define VFIO_REGION_TYPE_CCW			(2)
>> +#define VFIO_REGION_TYPE_MIGRATION              (3)
>>   
>>   /* sub-types for VFIO_REGION_TYPE_PCI_* */
>>   
>> @@ -379,6 +380,213 @@ struct vfio_region_gfx_edid {
>>   /* sub-types for VFIO_REGION_TYPE_CCW */
>>   #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
>>   
>> +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
>> +#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
>> +
>> +/*
>> + * Structure vfio_device_migration_info is placed at 0th offset of
>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
>> + * information. Field accesses from this structure are only supported at their
>> + * native width and alignment, otherwise the result is undefined and vendor
>> + * drivers should return an error.
>> + *
>> + * device_state: (read/write)
>> + *      - User application writes this field to inform vendor driver about the
>> + *        device state to be transitioned to.
>> + *      - Vendor driver should take necessary actions to change device state.
>> + *        On successful transition to given state, vendor driver should return
>> + *        success on write(device_state, state) system call. If device state
>> + *        transition fails, vendor driver should return error, -EFAULT.
> 
> s/error, -EFAULT/an appropriate -errno for the fault condition/
> 
>> + *      - On user application side, if device state transition fails, i.e. if
>> + *        write(device_state, state) returns error, read device_state again to
>> + *        determine the current state of the device from vendor driver.
>> + *      - Vendor driver should return previous state of the device unless vendor
>> + *        driver has encountered an internal error, in which case vendor driver
>> + *        may report the device_state VFIO_DEVICE_STATE_ERROR.
>> + *	- User application must use the device reset ioctl in order to recover
>> + *	  the device from VFIO_DEVICE_STATE_ERROR state. If the device is
>> + *	  indicated in a valid device state via reading device_state, the user
>> + *	  application may decide to attempt to transition the device to any valid
>> + *	  state reachable from the current state or terminate itself.
>> + *
>> + *      device_state consists of 3 bits:
>> + *      - If bit 0 set, indicates _RUNNING state. When it's clear, that
>> + *	  indicates _STOP state. When device is changed to _STOP, driver should
>> + *	  stop device before write() returns.
>> + *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
>> + *        should start gathering device state information which will be provided
>> + *        to VFIO user application to save device's state.
>> + *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
>> + *        prepare to resume device, data provided through migration region
>> + *        should be used to resume device.
>> + *      Bits 3 - 31 are reserved for future use. In order to preserve them,
>> + *	user application should perform read-modify-write operation on this
>> + *	field when modifying the specified bits.
>> + *
>> + *  +------- _RESUMING
>> + *  |+------ _SAVING
>> + *  ||+----- _RUNNING
>> + *  |||
>> + *  000b => Device Stopped, not saving or resuming
>> + *  001b => Device running state, default state
>> + *  010b => Stop Device & save device state, stop-and-copy state
>> + *  011b => Device running and save device state, pre-copy state
>> + *  100b => Device stopped and device state is resuming
>> + *  101b => Invalid state
>> + *  110b => Error state
>> + *  111b => Invalid state
>> + *
>> + * State transitions:
>> + *
>> + *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
>> + *                (100b)     (001b)     (011b)        (010b)       (000b)
>> + * 0. Running or Default state
>> + *                             |
>> + *
>> + * 1. Normal Shutdown (optional)
>> + *                             |------------------------------------->|
>> + *
>> + * 2. Save state or Suspend
>> + *                             |------------------------->|---------->|
>> + *
>> + * 3. Save state during live migration
>> + *                             |----------->|------------>|---------->|
>> + *
>> + * 4. Resuming
>> + *                  |<---------|
>> + *
>> + * 5. Resumed
>> + *                  |--------->|
>> + *
>> + * 0. Default state of VFIO device is _RUNNING when user application starts.
>> + * 1. During normal user application shutdown, vfio device state changes
>> + *    from _RUNNING to _STOP. This is optional, user application may or may not
>> + *    perform this state transition and vendor driver may not need.
> 
> s/may not need/must not require, but must support this transition/
> 
>> + * 2. When user application save state or suspend application, device state
>> + *    transitions from _RUNNING to stop-and-copy state and then to _STOP.
>> + *    On state transition from _RUNNING to stop-and-copy, driver must
>> + *    stop device, save device state and send it to application through
>> + *    migration region. Sequence to be followed for such transition is given
>> + *    below.
>> + * 3. In user application live migration, state transitions from _RUNNING
>> + *    to pre-copy to stop-and-copy to _STOP.
>> + *    On state transition from _RUNNING to pre-copy, driver should start
>> + *    gathering device state while application is still running and send device
>> + *    state data to application through migration region.
>> + *    On state transition from pre-copy to stop-and-copy, driver must stop
>> + *    device, save device state and send it to user application through
>> + *    migration region.
>> + *    Sequence to be followed for above two transitions is given below.
> 
> Perhaps adding something like "Vendor drivers must support the pre-copy
> state even for implementations where no data is provided to the user
> until the stop-and-copy state.  The user must not be required to
> consume all migration data prior to transitioning to a new state,
> including the stop-and-copy state."
> 
>> + * 4. To start resuming phase, device state should be transitioned from
>> + *    _RUNNING to _RESUMING state.
>> + *    In _RESUMING state, driver should use received device state data through
>> + *    migration region to resume device.
>> + * 5. On providing saved device data to driver, application should change state
>> + *    from _RESUMING to _RUNNING.
>> + *
>> + * pending bytes: (read only)
>> + *      Number of pending bytes yet to be migrated from vendor driver
>> + *
>> + * data_offset: (read only)
>> + *      User application should read data_offset in migration region from where
>> + *      user application should read device data during _SAVING state or write
>> + *      device data during _RESUMING state. See below for detail of sequence to
>> + *      be followed.
>> + *
>> + * data_size: (read/write)
>> + *      User application should read data_size to get size of data copied in
>> + *      bytes in migration region during _SAVING state and write size of data
>> + *      copied in bytes in migration region during _RESUMING state.
>> + *
>> + * Migration region looks like:
>> + *  ------------------------------------------------------------------
>> + * |vfio_device_migration_info|    data section                      |
>> + * |                          |     ///////////////////////////////  |
>> + * ------------------------------------------------------------------
>> + *   ^                              ^
>> + *  offset 0-trapped part        data_offset
>> + *
>> + * Structure vfio_device_migration_info is always followed by data section in
>> + * the region, so data_offset will always be non-0. Offset from where data is
>> + * copied is decided by kernel driver, data section can be trapped or mapped
>> + * or partitioned, depending on how kernel driver defines data section.
>> + * Data section partition can be defined as mapped by sparse mmap capability.
>> + * If mmapped, then data_offset should be page aligned, where as initial section
>> + * which contain vfio_device_migration_info structure might not end at offset
>> + * which is page aligned. The user is not required to access via mmap regardless
>> + * of the region mmap capabilities.
>> + * Vendor driver should decide whether to partition data section and how to
>> + * partition the data section. Vendor driver should return data_offset
>> + * accordingly.
>> + *
>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
>> + * and for _SAVING device state or stop-and-copy phase:
>> + * a. read pending_bytes, indicates start of new iteration to get device data.
>> + *    Repeatative read on pending_bytes at this stage should have no side
>> + *    effect.
> 
> s/Repeatative/Repeated/
> 
>> + *    If pending_bytes == 0, user application should not iterate to get data
>> + *    for that device.
>> + *    If pending_bytes > 0, go through below steps.
>> + * b. read data_offset, indicates vendor driver to make data available through
>> + *    data section. Vendor driver should return this read operation only after
>> + *    data is available from (region + data_offset) to (region + data_offset +
>> + *    data_size).
>> + * c. read data_size, amount of data in bytes available through migration
>> + *    region.
>> + *    Read on data_offset and data_size should return offset and size of current
>> + *    buffer if user application reads those more than once here.
>> + * d. read data of data_size bytes from (region + data_offset) from migration
>> + *    region.
>> + * e. process data.
>> + * f. read pending_bytes, this read operation indicates data from previous
>> + *    iteration had read. If pending_bytes > 0, goto step b.
>> + *
>> + * If there is any error during the above sequence, vendor driver can return
>> + * error code for next read()/write() operation, that will terminate the loop
>> + * and user should take next necessary action, for example, fail migration or
>> + * terminate user application.
>> + *
>> + * User application can transition from _SAVING|_RUNNING (pre-copy state) to
>> + * _SAVING (stop-and-copy) state regardless of pending bytes.
> 
> Ok, you cover one of my concerns above here.  Maybe doesn't hurt to
> mention in both places.
> 
>> + * User application should iterate in _SAVING (stop-and-copy) until
>> + * pending_bytes is 0.
>> + *
>> + * Sequence to be followed while _RESUMING device state:
>> + * While data for this device is available, repeat below steps:
>> + * a. read data_offset from where user application should write data.
>> + * b. write data of data_size to migration region from data_offset. Data size
>> + *    should be data packet size at source during _SAVING.
> 
> I find the reference to data_size a bit confusing in this wording,
> almost as if it's implied that the user reads data_size on the target.
> What if we changed it a little:
> 
>   b. write migration data starting at migration region + data_offset for
>   length determined by data_size from the migration source.
> 
>> + * c. write data_size which indicates vendor driver that data is written in
>> + *    migration region. Vendor driver should read this data from migration
>> + *    region and resume device's state.
> 
> Perhaps "Vendor driver should apply the user provided migration region
> data towards the device resume state"?
> 

Ok. Updating as per all above comments.

>> + *
>> + * For user application, data is opaque. User application should write data in
>> + * the same order as received and should of same transaction size at source.
> 
> Great!
> 
>> + */
>> +
>> +struct vfio_device_migration_info {
>> +	__u32 device_state;         /* VFIO device state */
>> +#define VFIO_DEVICE_STATE_STOP      (0)
>> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
>> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
>> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
>> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
>> +				     VFIO_DEVICE_STATE_SAVING |  \
>> +				     VFIO_DEVICE_STATE_RESUMING)
>> +
>> +#define VFIO_DEVICE_STATE_VALID(state) \
>> +	(state & VFIO_DEVICE_STATE_RESUMING ? \
>> +	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
>> +
>> +#define VFIO_DEVICE_STATE_ERROR			\
>> +		(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING)
> 
> It looks like this isn't used in this series, so I'm not sure the
> intention of this macro, but I think we decided to only use 110b as the
> "error" state.  So should this be something like
> 
> #define VFIO_DEVICE_STATE_IS_ERROR(state) \
> 	(((state) & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \
> 					      VFIO_DEVICE_STATE_RESUMING))
> 
> Or if this was intended to be used in setting the device_state to
> error, perhaps
> 
> #define VFIO_DEVICE_STATE_SET_ERROR(state) \
> 	(((state) & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_STATE_SAVING | \
> 					       VFIO_DEVICE_STATE_RESUMING)

This is also intended to set device_state; the vendor driver would set
the error state. Adding both of the above macros.

>> +
>> +	__u32 reserved;
> 
> Can we specify this reserved field as reads return zero, writes are
> ignored so that we give ourselves the opportunity to re-purpose it
> later?
> 
>

Ok. Adding

Thanks,
Kirti

>> +	__u64 pending_bytes;
>> +	__u64 data_offset;
>> +	__u64 data_size;
>> +} __attribute__((packed));
>> +
>>   /*
>>    * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>>    * which allows direct access to non-MSIX registers which happened to be within
> 
> Thanks,
> Alex
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 1/7] vfio: KABI for migration interface for device state
@ 2020-02-12 20:56       ` Kirti Wankhede
  0 siblings, 0 replies; 62+ messages in thread
From: Kirti Wankhede @ 2020-02-12 20:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, kvm, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, eauger, felipe,
	jonathan.davies, yan.y.zhao, changpeng.liu, Ken.Xue



On 2/10/2020 10:55 PM, Alex Williamson wrote:
> On Sat, 8 Feb 2020 01:12:28 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> - Defined MIGRATION region type and sub-type.
>>
>> - Defined vfio_device_migration_info structure which will be placed at 0th
>>    offset of migration region to get/set VFIO device related information.
>>    Defined members of structure and usage on read/write access.
>>
>> - Defined device states and state transition details.
>>
>> - Defined sequence to be followed while saving and resuming VFIO device.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   include/uapi/linux/vfio.h | 208 ++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 208 insertions(+)
>>
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 9e843a147ead..572242620ce9 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
>>   #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
>>   #define VFIO_REGION_TYPE_GFX                    (1)
>>   #define VFIO_REGION_TYPE_CCW			(2)
>> +#define VFIO_REGION_TYPE_MIGRATION              (3)
>>   
>>   /* sub-types for VFIO_REGION_TYPE_PCI_* */
>>   
>> @@ -379,6 +380,213 @@ struct vfio_region_gfx_edid {
>>   /* sub-types for VFIO_REGION_TYPE_CCW */
>>   #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
>>   
>> +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
>> +#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
>> +
>> +/*
>> + * Structure vfio_device_migration_info is placed at 0th offset of
>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
>> + * information. Field accesses from this structure are only supported at their
>> + * native width and alignment, otherwise the result is undefined and vendor
>> + * drivers should return an error.
>> + *
>> + * device_state: (read/write)
>> + *      - User application writes this field to inform vendor driver about the
>> + *        device state to be transitioned to.
>> + *      - Vendor driver should take necessary actions to change device state.
>> + *        On successful transition to given state, vendor driver should return
>> + *        success on write(device_state, state) system call. If device state
>> + *        transition fails, vendor driver should return error, -EFAULT.
> 
> s/error, -EFAULT/an appropriate -errno for the fault condition/
> 
>> + *      - On user application side, if device state transition fails, i.e. if
>> + *        write(device_state, state) returns error, read device_state again to
>> + *        determine the current state of the device from vendor driver.
>> + *      - Vendor driver should return previous state of the device unless vendor
>> + *        driver has encountered an internal error, in which case vendor driver
>> + *        may report the device_state VFIO_DEVICE_STATE_ERROR.
>> + *	- User application must use the device reset ioctl in order to recover
>> + *	  the device from VFIO_DEVICE_STATE_ERROR state. If the device is
>> + *	  indicated in a valid device state via reading device_state, the user
>> + *	  application may decide attempt to transition the device to any valid
>> + *	  state reachable from the current state or terminate itself.
>> + *
>> + *      device_state consists of 3 bits:
>> + *      - If bit 0 set, indicates _RUNNING state. When it's clear, that
>> + *	  indicates _STOP state. When device is changed to _STOP, driver should
>> + *	  stop device before write() returns.
>> + *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
>> + *        should start gathering device state information which will be provided
>> + *        to VFIO user application to save device's state.
>> + *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
>> + *        prepare to resume device, data provided through migration region
>> + *        should be used to resume device.
>> + *      Bits 3 - 31 are reserved for future use. In order to preserve them,
>> + *	user application should perform read-modify-write operation on this
>> + *	field when modifying the specified bits.
>> + *
>> + *  +------- _RESUMING
>> + *  |+------ _SAVING
>> + *  ||+----- _RUNNING
>> + *  |||
>> + *  000b => Device Stopped, not saving or resuming
>> + *  001b => Device running state, default state
>> + *  010b => Stop Device & save device state, stop-and-copy state
>> + *  011b => Device running and save device state, pre-copy state
>> + *  100b => Device stopped and device state is resuming
>> + *  101b => Invalid state
>> + *  110b => Error state
>> + *  111b => Invalid state
>> + *
>> + * State transitions:
>> + *
>> + *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
>> + *                (100b)     (001b)     (011b)        (010b)       (000b)
>> + * 0. Running or Default state
>> + *                             |
>> + *
>> + * 1. Normal Shutdown (optional)
>> + *                             |------------------------------------->|
>> + *
>> + * 2. Save state or Suspend
>> + *                             |------------------------->|---------->|
>> + *
>> + * 3. Save state during live migration
>> + *                             |----------->|------------>|---------->|
>> + *
>> + * 4. Resuming
>> + *                  |<---------|
>> + *
>> + * 5. Resumed
>> + *                  |--------->|
>> + *
>> + * 0. Default state of VFIO device is _RUNNING when user application starts.
>> + * 1. During normal user application shutdown, vfio device state changes
>> + *    from _RUNNING to _STOP. This is optional, user application may or may not
>> + *    perform this state transition and vendor driver may not need.
> 
> s/may not need/must not require, but must support this transition/
> 
>> + * 2. When user application save state or suspend application, device state
>> + *    transitions from _RUNNING to stop-and-copy state and then to _STOP.
>> + *    On state transition from _RUNNING to stop-and-copy, driver must
>> + *    stop device, save device state and send it to application through
>> + *    migration region. Sequence to be followed for such transition is given
>> + *    below.
>> + * 3. In user application live migration, state transitions from _RUNNING
>> + *    to pre-copy to stop-and-copy to _STOP.
>> + *    On state transition from _RUNNING to pre-copy, driver should start
>> + *    gathering device state while application is still running and send device
>> + *    state data to application through migration region.
>> + *    On state transition from pre-copy to stop-and-copy, driver must stop
>> + *    device, save device state and send it to user application through
>> + *    migration region.
>> + *    Sequence to be followed for above two transitions is given below.
> 
> Perhaps adding something like "Vendor drivers must support the pre-copy
> state even for implementations where no data is provided to the user
> until the stop-and-copy state.  The user must not be required to
> consume all migration data prior to transitioning to a new state,
> including the stop-and-copy state."
> 
>> + * 4. To start resuming phase, device state should be transitioned from
>> + *    _RUNNING to _RESUMING state.
>> + *    In _RESUMING state, driver should use received device state data through
>> + *    migration region to resume device.
>> + * 5. On providing saved device data to driver, application should change state
>> + *    from _RESUMING to _RUNNING.
>> + *
>> + * pending bytes: (read only)
>> + *      Number of pending bytes yet to be migrated from vendor driver
>> + *
>> + * data_offset: (read only)
>> + *      User application should read data_offset in migration region from where
>> + *      user application should read device data during _SAVING state or write
>> + *      device data during _RESUMING state. See below for detail of sequence to
>> + *      be followed.
>> + *
>> + * data_size: (read/write)
>> + *      User application should read data_size to get size of data copied in
>> + *      bytes in migration region during _SAVING state and write size of data
>> + *      copied in bytes in migration region during _RESUMING state.
>> + *
>> + * Migration region looks like:
>> + *  ------------------------------------------------------------------
>> + * |vfio_device_migration_info|    data section                      |
>> + * |                          |     ///////////////////////////////  |
>> + * ------------------------------------------------------------------
>> + *   ^                              ^
>> + *  offset 0-trapped part        data_offset
>> + *
>> + * Structure vfio_device_migration_info is always followed by data section in
>> + * the region, so data_offset will always be non-0. Offset from where data is
>> + * copied is decided by kernel driver, data section can be trapped or mapped
>> + * or partitioned, depending on how kernel driver defines data section.
>> + * Data section partition can be defined as mapped by sparse mmap capability.
>> + * If mmapped, then data_offset should be page aligned, where as initial section
>> + * which contain vfio_device_migration_info structure might not end at offset
>> + * which is page aligned. The user is not required to access via mmap regardless
>> + * of the region mmap capabilities.
>> + * Vendor driver should decide whether to partition data section and how to
>> + * partition the data section. Vendor driver should return data_offset
>> + * accordingly.
>> + *
>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
>> + * and for _SAVING device state or stop-and-copy phase:
>> + * a. read pending_bytes, indicates start of new iteration to get device data.
>> + *    Repeatative read on pending_bytes at this stage should have no side
>> + *    effect.
> 
> s/Repeatative/Repeated/
> 
>> + *    If pending_bytes == 0, user application should not iterate to get data
>> + *    for that device.
>> + *    If pending_bytes > 0, go through below steps.
>> + * b. read data_offset, indicates vendor driver to make data available through
>> + *    data section. Vendor driver should return this read operation only after
>> + *    data is available from (region + data_offset) to (region + data_offset +
>> + *    data_size).
>> + * c. read data_size, amount of data in bytes available through migration
>> + *    region.
>> + *    Read on data_offset and data_size should return offset and size of current
>> + *    buffer if user application reads those more than once here.
>> + * d. read data of data_size bytes from (region + data_offset) from migration
>> + *    region.
>> + * e. process data.
>> + * f. read pending_bytes, this read operation indicates data from previous
>> + *    iteration had read. If pending_bytes > 0, goto step b.
>> + *
>> + * If there is any error during the above sequence, vendor driver can return
>> + * error code for next read()/write() operation, that will terminate the loop
>> + * and user should take next necessary action, for example, fail migration or
>> + * terminate user application.
>> + *
>> + * User application can transition from _SAVING|_RUNNING (pre-copy state) to
>> + * _SAVING (stop-and-copy) state regardless of pending bytes.
> 
> Ok, you cover one of my concerns above here.  Maybe doesn't hurt to
> mention in both places.
> 
>> + * User application should iterate in _SAVING (stop-and-copy) until
>> + * pending_bytes is 0.
>> + *
>> + * Sequence to be followed while _RESUMING device state:
>> + * While data for this device is available, repeat below steps:
>> + * a. read data_offset from where user application should write data.
>> + * b. write data of data_size to migration region from data_offset. Data size
>> + *    should be data packet size at source during _SAVING.
> 
> I find the reference to data_size a bit confusing in this wording,
> almost as if it's implied that the user reads data_size on the target.
> What if we changed it a little:
> 
>   b. write migration data starting at migration region + data_offset for
>   length determined by data_size from the migration source.
> 
>> + * c. write data_size which indicates vendor driver that data is written in
>> + *    migration region. Vendor driver should read this data from migration
>> + *    region and resume device's state.
> 
> Perhaps "Vendor driver should apply the user provided migration region
> data towards the device resume state"?
> 

Ok. Updating as per all above comments.
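
For reference, the user-side iteration for the _SAVING sequence (steps a
to f above) could look roughly like the sketch below. It is only an
illustration under stated assumptions: the fake_read_*() helpers, the
packets[] contents and DATA_OFFSET are hypothetical stand-ins for
pread() calls on the migration region of a real device, and the
in-memory driver model is not part of this series.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DATA_OFFSET 64	/* hypothetical; a real driver chooses this */

/* Hypothetical device data, exposed one packet per iteration. */
static const char *packets[] = { "regs", "fifo-state", "irq" };
#define NPACKETS (sizeof(packets) / sizeof(packets[0]))

static unsigned char region[128];	/* stand-in for the migration region */
static size_t cur;			/* next packet to expose */
static int staged;			/* packet copied to the data section? */

/* Steps a and f: a read after staging marks the packet as consumed. */
static uint64_t fake_read_pending_bytes(void)
{
	uint64_t pending = 0;

	if (staged) {
		cur++;
		staged = 0;
	}
	for (size_t i = cur; i < NPACKETS; i++)
		pending += strlen(packets[i]);
	return pending;
}

/* Step b: returns only after data is available in the data section. */
static uint64_t fake_read_data_offset(void)
{
	if (!staged && cur < NPACKETS) {
		memcpy(region + DATA_OFFSET, packets[cur], strlen(packets[cur]));
		staged = 1;
	}
	return DATA_OFFSET;
}

/* Step c: size of the packet currently staged. */
static uint64_t fake_read_data_size(void)
{
	return staged ? strlen(packets[cur]) : 0;
}

/* User side: drain the device into out[], return the bytes saved. */
static size_t save_device(unsigned char *out, size_t cap)
{
	size_t total = 0;
	uint64_t pending = fake_read_pending_bytes();		/* a */

	while (pending > 0) {
		uint64_t off = fake_read_data_offset();		/* b */
		uint64_t size = fake_read_data_size();		/* c */

		assert(total + size <= cap);
		memcpy(out + total, region + off, size);	/* d */
		total += size;					/* e */
		pending = fake_read_pending_bytes();		/* f */
	}
	return total;
}
```

In this model the read of pending_bytes is what tells the driver that
the previous packet was consumed, so repeated reads of pending_bytes,
data_offset or data_size within one iteration return the same values.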

>> + *
>> + * For user application, data is opaque. User application should write data in
>> + * the same order as received and should of same transaction size at source.
> 
> Great!
> 
>> + */
>> +
>> +struct vfio_device_migration_info {
>> +	__u32 device_state;         /* VFIO device state */
>> +#define VFIO_DEVICE_STATE_STOP      (0)
>> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
>> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
>> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
>> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
>> +				     VFIO_DEVICE_STATE_SAVING |  \
>> +				     VFIO_DEVICE_STATE_RESUMING)
>> +
>> +#define VFIO_DEVICE_STATE_VALID(state) \
>> +	(state & VFIO_DEVICE_STATE_RESUMING ? \
>> +	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
>> +
>> +#define VFIO_DEVICE_STATE_ERROR			\
>> +		(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING)
> 
> It looks like this isn't used in this series, so I'm not sure the
> intention of this macro, but I think we decided to only use 110b as the
> "error" state.  So should this be something like
> 
> #define VFIO_DEVICE_STATE_IS_ERROR(state) \
> 	(((state) & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \
> 					      VFIO_DEVICE_STATE_RESUMING))
> 
> Or if this was intended to be used in setting the device_state to
> error, perhaps
> 
> #define VFIO_DEVICE_STATE_SET_ERROR(state) \
> 	(((state) & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_STATE_SAVING | \
> 					       VFIO_DEVICE_STATE_RESUMING)

This is also intended to set device_state; the vendor driver would set
the error state. Adding both of the above macros.
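
A minimal standalone sketch of how the two helpers could behave, with
the masked comparison fully parenthesized. In C, == binds tighter than
&, so an unparenthesized "state & MASK == X" evaluates as
"state & (MASK == X)". The mark_error() wrapper is a hypothetical name
used only for illustration; the state bit values are the ones from the
patch.

```c
#include <assert.h>
#include <stdint.h>

#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
				     VFIO_DEVICE_STATE_SAVING |  \
				     VFIO_DEVICE_STATE_RESUMING)

/* True only for 110b in the low three bits; reserved bits are ignored. */
#define VFIO_DEVICE_STATE_IS_ERROR(state) \
	(((state) & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \
					      VFIO_DEVICE_STATE_RESUMING))

/* Replace the low three bits with 110b, keep reserved bits 3 - 31. */
#define VFIO_DEVICE_STATE_SET_ERROR(state) \
	(((state) & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_STATE_SAVING | \
					       VFIO_DEVICE_STATE_RESUMING)

/* Hypothetical vendor-driver helper: move device_state to the error state. */
static uint32_t mark_error(uint32_t device_state)
{
	uint32_t next = VFIO_DEVICE_STATE_SET_ERROR(device_state);

	assert(VFIO_DEVICE_STATE_IS_ERROR(next));
	return next;
}
```

With these definitions, mark_error() on a running device (001b) yields
110b and leaves any value in bits 3 - 31 untouched.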

>> +
>> +	__u32 reserved;
> 
> Can we specify this reserved field as reads return zero, writes are
> ignored so that we give ourselves the opportunity to re-purpose it
> later?
> 
>

Ok. Adding

Thanks,
Kirti

>> +	__u64 pending_bytes;
>> +	__u64 data_offset;
>> +	__u64 data_size;
>> +} __attribute__((packed));
>> +
>>   /*
>>    * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>>    * which allows direct access to non-MSIX registers which happened to be within
> 
> Thanks,
> Alex
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2020-02-10 17:25     ` Alex Williamson
@ 2020-02-12 20:56       ` Kirti Wankhede
  -1 siblings, 0 replies; 62+ messages in thread
From: Kirti Wankhede @ 2020-02-12 20:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm



On 2/10/2020 10:55 PM, Alex Williamson wrote:
> On Sat, 8 Feb 2020 01:12:31 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
>> - Start pinned and unpinned pages tracking while migration is active
>> - Stop pinned and unpinned dirty pages tracking. This is also used to
>>    stop dirty pages tracking if migration fails or is cancelled.
>> - Get dirty pages bitmap. This ioctl returns bitmap of dirty pages; it is
>>    the user space application's responsibility to copy content of dirty
>>    pages from source to destination during migration.
>>
>> To prevent DoS attack, memory for bitmap is allocated per vfio_dma
>> structure. Bitmap size is calculated considering smallest supported page
>> size. Bitmap is allocated when dirty logging is enabled for those
>> vfio_dmas whose vpfn list is not empty or whole range is mapped, in
>> case of pass-through device.
>>
>> There could be multiple options as to when bitmap should be populated:
>> * Populate bitmap for already pinned pages when bitmap is allocated for
>>    a vfio_dma with the smallest supported page size. Update bitmap from
>>    page pinning and unpinning functions. When user application queries
>>    bitmap, check if requested page size is same as page size used to
>>    populate bitmap. If it is equal, copy bitmap. But if not equal,
>>    re-populate bitmap according to requested page size and then copy to
>>    user.
>>    Pros: Bitmap gets populated on the fly after dirty tracking has
>>          started.
>>    Cons: If requested page size is different than smallest supported
>>          page size, then bitmap has to be re-populated again, with
>>          additional overhead of allocating bitmap memory again for
>>          re-population of bitmap.
> 
> No memory needs to be allocated to re-populate the bitmap.  The bitmap
> is clear-on-read and by tracking the bitmap in the smallest supported
> page size we can guarantee that we can fit the user requested bitmap
> size within the space occupied by that minimal page size range of the
> bitmap.  Therefore we'd destructively translate the requested region of
> the bitmap to a different page size, write it out to the user, and
> clear it.  Also we expect userspace to use the minimum page size almost
> exclusively, which is optimized by this approach as dirty bit tracking
> is spread out over each page pinning operation.
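
To make the in-place translation concrete, here is a small user-space
sketch of coarsening a bitmap tracked at the minimum page size to a
larger requested page size without a second allocation. It is an
illustration only: coarsen_bitmap(), test_bit_ul() and assign_bit_ul()
are hypothetical names, and "ratio" is assumed to be the requested page
size divided by the minimum page size.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

static int test_bit_ul(const unsigned long *map, size_t nr)
{
	return (map[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1UL;
}

static void assign_bit_ul(unsigned long *map, size_t nr, int val)
{
	unsigned long mask = 1UL << (nr % BITS_PER_LONG);

	if (val)
		map[nr / BITS_PER_LONG] |= mask;
	else
		map[nr / BITS_PER_LONG] &= ~mask;
}

/*
 * Destructively fold each group of "ratio" fine-grained bits into one
 * coarse bit. Output bit j is written only after input bits
 * [j * ratio, (j + 1) * ratio) were read, and j <= j * ratio, so no
 * unread input is overwritten. Returns the number of valid output bits.
 */
static size_t coarsen_bitmap(unsigned long *map, size_t npages, size_t ratio)
{
	size_t nout;

	assert(ratio >= 1);
	nout = (npages + ratio - 1) / ratio;

	for (size_t j = 0; j < nout; j++) {
		int dirty = 0;

		for (size_t i = j * ratio; i < (j + 1) * ratio && i < npages; i++)
			dirty |= test_bit_ul(map, i);
		assign_bit_ul(map, j, dirty);
	}
	/* Bits past the coarse result are stale; clear them. */
	for (size_t i = nout; i < npages; i++)
		assign_bit_ul(map, i, 0);
	return nout;
}
```

After the call, the first nout bits can be copied out to the user and
then cleared, which matches the clear-on-read behaviour described above.
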
> 
>>
>> * Populate bitmap when bitmap is queried by user application.
>>    Pros: Bitmap is populated with requested page size. This eliminates
>>          the need to re-populate bitmap if requested page size is
>>          different than smallest supported page size.
>>    Cons: There is a one-time processing cost when bitmap is queried.
> 
> Another significant Con is that the vpfn list needs to track and manage
> unpinned pages, which makes it more complex and intrusive.  The
> previous option seems to have both time and complexity advantages,
> especially in the case we expect to be most common of the user
> accessing the bitmap with the minimum page size, ie. PAGE_SIZE.  It's
> also not clear why we pre-allocate the bitmap at all with this approach.
> 
>> I prefer the latter option for its simple logic and because it
>> eliminates the overhead of bitmap repopulation in case of different
>> page sizes. The latter option is implemented in this patch.
> 
> Hmm, we'll see below, but I'm not convinced based on the above rationale.
> 
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   drivers/vfio/vfio_iommu_type1.c | 299 ++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 287 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index d386461e5d11..df358dc1c85b 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -70,6 +70,7 @@ struct vfio_iommu {
>>   	unsigned int		dma_avail;
>>   	bool			v2;
>>   	bool			nesting;
>> +	bool			dirty_page_tracking;
>>   };
>>   
>>   struct vfio_domain {
>> @@ -90,6 +91,7 @@ struct vfio_dma {
>>   	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
>>   	struct task_struct	*task;
>>   	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
>> +	unsigned long		*bitmap;
>>   };
>>   
>>   struct vfio_group {
>> @@ -125,6 +127,7 @@ struct vfio_regions {
>>   					(!list_empty(&iommu->domain_list))
>>   
>>   static int put_pfn(unsigned long pfn, int prot);
>> +static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu);
>>   
>>   /*
>>    * This code handles mapping and unmapping of user data buffers
>> @@ -174,6 +177,57 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
>>   	rb_erase(&old->node, &iommu->dma_list);
>>   }
>>   
>> +static inline unsigned long dirty_bitmap_bytes(unsigned int npages)
>> +{
>> +	if (!npages)
>> +		return 0;
>> +
>> +	return ALIGN(npages, BITS_PER_LONG) / sizeof(unsigned long);
>> +}
>> +
>> +static int vfio_dma_bitmap_alloc(struct vfio_iommu *iommu,
>> +				 struct vfio_dma *dma, unsigned long pgsizes)
>> +{
>> +	unsigned long pgshift = __ffs(pgsizes);
>> +
>> +	if (!RB_EMPTY_ROOT(&dma->pfn_list) || dma->iommu_mapped) {
>> +		unsigned long npages = dma->size >> pgshift;
>> +		unsigned long bsize = dirty_bitmap_bytes(npages);
>> +
>> +		dma->bitmap = kvzalloc(bsize, GFP_KERNEL);
> 
> nit, we don't need to store bsize in a local variable.
> 
>> +		if (!dma->bitmap)
>> +			return -ENOMEM;
>> +	}
>> +	return 0;
>> +}
>> +
>> +static int vfio_dma_all_bitmap_alloc(struct vfio_iommu *iommu,
>> +				     unsigned long pgsizes)
>> +{
>> +	struct rb_node *n = rb_first(&iommu->dma_list);
>> +	int ret;
>> +
>> +	for (; n; n = rb_next(n)) {
>> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
>> +
>> +		ret = vfio_dma_bitmap_alloc(iommu, dma, pgsizes);
>> +		if (ret)
>> +			return ret;
> 
> This doesn't unwind on failure, so we're left with partially allocated
> bitmap cruft.
>

Good point. Adding unwind on failure.
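
For illustration only, a minimal user-space sketch of such an unwind. It
assumes a plain array stands in for the iommu->dma_list rb-tree and that
calloc()/free() stand in for the kernel allocators; struct mock_dma,
fail_at and the mock_* helpers are hypothetical names, not code from this
series:

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct vfio_dma; only the bitmap matters here. */
struct mock_dma {
	size_t npages;
	unsigned long *bitmap;
};

/* Test hook: index of the allocation that should fail, or -1 for none. */
static int fail_at = -1;

static int mock_bitmap_alloc(struct mock_dma *dma, int idx)
{
	size_t bits_per_long = CHAR_BIT * sizeof(unsigned long);
	size_t nlongs = (dma->npages + bits_per_long - 1) / bits_per_long;

	assert(!dma->bitmap);	/* never overwrite a live bitmap */
	if (idx == fail_at)
		return -ENOMEM;	/* simulated allocation failure */

	dma->bitmap = calloc(nlongs ? nlongs : 1, sizeof(unsigned long));
	return dma->bitmap ? 0 : -ENOMEM;
}

/* Allocate a bitmap for every entry; on failure free what was allocated. */
static int mock_all_bitmap_alloc(struct mock_dma *dmas, int n)
{
	int i, ret = 0;

	for (i = 0; i < n; i++) {
		ret = mock_bitmap_alloc(&dmas[i], i);
		if (ret)
			goto unwind;
	}
	return 0;

unwind:
	while (--i >= 0) {
		free(dmas[i].bitmap);
		dmas[i].bitmap = NULL;
	}
	return ret;
}
```

With this shape a failed START leaves no bitmap behind, so the caller only
has to clear the tracking flag before returning the error.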

>> +	}
>> +	return 0;
>> +}
>> +
>> +static void vfio_dma_all_bitmap_free(struct vfio_iommu *iommu)
>> +{
>> +	struct rb_node *n = rb_first(&iommu->dma_list);
>> +
>> +	for (; n; n = rb_next(n)) {
>> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
>> +
>> +		kfree(dma->bitmap);
> 
> We don't set dma->bitmap = NULL and we don't even prevent the case of a
> user making multiple STOP calls, so we have a user triggerable double
> free :(
> 

Ok.
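
A small user-space sketch of the free path with the pointer reset, again
with hypothetical mock_* names and free() standing in for the kernel
call. In the kernel, memory obtained from kvzalloc() is normally released
with kvfree() rather than kfree(), so the real fix would presumably
change that call as well:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct vfio_dma. */
struct mock_dma {
	unsigned long *bitmap;
};

static int frees;	/* counts real frees, for illustration only */

static void mock_bitmap_free(struct mock_dma *dma)
{
	assert(dma);
	if (dma->bitmap)
		frees++;
	free(dma->bitmap);	/* free(NULL) is a no-op */
	dma->bitmap = NULL;	/* a repeated call cannot free twice */
}

static void mock_all_bitmap_free(struct mock_dma *dmas, int n)
{
	int i;

	for (i = 0; i < n; i++)
		mock_bitmap_free(&dmas[i]);
}
```

Calling mock_all_bitmap_free() twice in a row is then harmless, which is
the property a repeated STOP needs.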

>> +	}
>> +}
>> +
>>   /*
>>    * Helper Functions for host iova-pfn list
>>    */
>> @@ -244,6 +298,29 @@ static void vfio_remove_from_pfn_list(struct vfio_dma *dma,
>>   	kfree(vpfn);
>>   }
>>   
>> +static void vfio_remove_unpinned_from_pfn_list(struct vfio_dma *dma)
>> +{
>> +	struct rb_node *n = rb_first(&dma->pfn_list);
>> +
>> +	for (; n; n = rb_next(n)) {
>> +		struct vfio_pfn *vpfn = rb_entry(n, struct vfio_pfn, node);
>> +
>> +		if (!vpfn->ref_count)
>> +			vfio_remove_from_pfn_list(dma, vpfn);
>> +	}
>> +}
>> +
>> +static void vfio_remove_unpinned_from_dma_list(struct vfio_iommu *iommu)
>> +{
>> +	struct rb_node *n = rb_first(&iommu->dma_list);
>> +
>> +	for (; n; n = rb_next(n)) {
>> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
>> +
>> +		vfio_remove_unpinned_from_pfn_list(dma);
>> +	}
>> +}
>> +
>>   static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
>>   					       unsigned long iova)
>>   {
>> @@ -261,7 +338,8 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
>>   	vpfn->ref_count--;
>>   	if (!vpfn->ref_count) {
>>   		ret = put_pfn(vpfn->pfn, dma->prot);
>> -		vfio_remove_from_pfn_list(dma, vpfn);
>> +		if (!dma->bitmap)
>> +			vfio_remove_from_pfn_list(dma, vpfn);
>>   	}
>>   	return ret;
>>   }
>> @@ -483,13 +561,14 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
>>   	return ret;
>>   }
>>   
>> -static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
>> +static int vfio_unpin_page_external(struct vfio_iommu *iommu,
> 
> We added a parameter but didn't use it in this patch.
> 

OK, moving it to the relevant patch.

>> +				    struct vfio_dma *dma, dma_addr_t iova,
>>   				    bool do_accounting)
>>   {
>>   	int unlocked;
>>   	struct vfio_pfn *vpfn = vfio_find_vpfn(dma, iova);
>>   
>> -	if (!vpfn)
>> +	if (!vpfn || !vpfn->ref_count)
>>   		return 0;
>>   
>>   	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn);
>> @@ -510,6 +589,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>   	unsigned long remote_vaddr;
>>   	struct vfio_dma *dma;
>>   	bool do_accounting;
>> +	unsigned long iommu_pgsizes = vfio_pgsize_bitmap(iommu);
>>   
>>   	if (!iommu || !user_pfn || !phys_pfn)
>>   		return -EINVAL;
>> @@ -551,8 +631,10 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>   
>>   		vpfn = vfio_iova_get_vfio_pfn(dma, iova);
>>   		if (vpfn) {
>> -			phys_pfn[i] = vpfn->pfn;
>> -			continue;
>> +			if (vpfn->ref_count > 1) {
>> +				phys_pfn[i] = vpfn->pfn;
>> +				continue;
>> +			}
>>   		}
>>   
>>   		remote_vaddr = dma->vaddr + iova - dma->iova;
>> @@ -560,11 +642,23 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>   					     do_accounting);
>>   		if (ret)
>>   			goto pin_unwind;
>> -
>> -		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
>> -		if (ret) {
>> -			vfio_unpin_page_external(dma, iova, do_accounting);
>> -			goto pin_unwind;
>> +		if (!vpfn) {
>> +			ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
>> +			if (ret) {
>> +				vfio_unpin_page_external(iommu, dma, iova,
>> +							 do_accounting);
>> +				goto pin_unwind;
>> +			}
>> +		} else
>> +			vpfn->pfn = phys_pfn[i];
>> +
>> +		if (iommu->dirty_page_tracking && !dma->bitmap) {
>> +			ret = vfio_dma_bitmap_alloc(iommu, dma, iommu_pgsizes);
>> +			if (ret) {
>> +				vfio_unpin_page_external(iommu, dma, iova,
>> +							 do_accounting);
>> +				goto pin_unwind;
>> +			}
>>   		}
>>   	}
>>   
>> @@ -578,7 +672,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>   
>>   		iova = user_pfn[j] << PAGE_SHIFT;
>>   		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
>> -		vfio_unpin_page_external(dma, iova, do_accounting);
>> +		vfio_unpin_page_external(iommu, dma, iova, do_accounting);
>>   		phys_pfn[j] = 0;
>>   	}
>>   pin_done:
>> @@ -612,7 +706,7 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
>>   		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
>>   		if (!dma)
>>   			goto unpin_exit;
>> -		vfio_unpin_page_external(dma, iova, do_accounting);
>> +		vfio_unpin_page_external(iommu, dma, iova, do_accounting);
>>   	}
>>   
>>   unpin_exit:
>> @@ -830,6 +924,113 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
>>   	return bitmap;
>>   }
>>   
>> +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
>> +				  size_t size, uint64_t pgsize,
>> +				  unsigned char __user *bitmap)
>> +{
>> +	struct vfio_dma *dma;
>> +	dma_addr_t i = iova, iova_limit;
>> +	unsigned int bsize, nbits = 0, l = 0;
>> +	unsigned long pgshift = __ffs(pgsize);
>> +
>> +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
>> +		int ret, j;
>> +		unsigned int npages = 0, shift = 0;
>> +		unsigned char temp = 0;
>> +
>> +		/* mark all pages dirty if all pages are pinned and mapped. */
>> +		if (dma->iommu_mapped) {
>> +			iova_limit = min(dma->iova + dma->size, iova + size);
>> +			npages = iova_limit/pgsize;
>> +			bitmap_set(dma->bitmap, 0, npages);
> 
> npages is derived from iova_limit, which is the number of bits to set
> dirty relative to the first requested iova, not iova zero, ie. the set
> of dirty bits is offset from those requested unless iova == dma->iova.
> 

Right, fixing.
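
To make the offset issue concrete, here is a hedged sketch of computing
the bit range relative to the start of the mapping rather than to iova
zero. struct bit_range and dirty_bit_range() are hypothetical helpers
written for this note, not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

struct bit_range {
	uint64_t first;	/* first bit, relative to the start of the mapping */
	uint64_t count;	/* number of bits (pages) covered */
};

/* Intersect the requested range with one mapping and express it in bits. */
static struct bit_range dirty_bit_range(uint64_t dma_iova, uint64_t dma_size,
					uint64_t req_iova, uint64_t req_size,
					unsigned int pgshift)
{
	uint64_t start = req_iova > dma_iova ? req_iova : dma_iova;
	uint64_t dma_end = dma_iova + dma_size;
	uint64_t req_end = req_iova + req_size;
	uint64_t end = dma_end < req_end ? dma_end : req_end;
	struct bit_range r = { 0, 0 };

	assert(pgshift < 64);
	if (start >= end)
		return r;	/* no overlap */

	r.first = (start - dma_iova) >> pgshift;
	r.count = (end - start) >> pgshift;
	return r;
}
```

For a mapping at 0x2000 of size 0x8000 and a request for {0x4000, 0x2000}
with 4K pages, this gives bits 2 and 3 of the mapping's bitmap, whereas
iova_limit/pgsize counted from bit 0 would give 6 bits.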

> Also I hope dma->bitmap was actually allocated.  Not only does the
> START error path potentially leave dirty tracking enabled without all
> the bitmap allocated, when does the bitmap get allocated for a new
> vfio_dma when dirty tracking is enabled?  Seems it only occurs if a
> vpfn gets marked dirty.
> 

Right.

Fixing error paths.


>> +		} else if (dma->bitmap) {
>> +			struct rb_node *n = rb_first(&dma->pfn_list);
>> +			bool found = false;
>> +
>> +			for (; n; n = rb_next(n)) {
>> +				struct vfio_pfn *vpfn = rb_entry(n,
>> +						struct vfio_pfn, node);
>> +				if (vpfn->iova >= i) {
>> +					found = true;
>> +					break;
>> +				}
>> +			}
>> +
>> +			if (!found) {
>> +				i += dma->size;
>> +				continue;
>> +			}
>> +
>> +			for (; n; n = rb_next(n)) {
>> +				unsigned int s;
>> +				struct vfio_pfn *vpfn = rb_entry(n,
>> +						struct vfio_pfn, node);
>> +
>> +				if (vpfn->iova >= iova + size)
>> +					break;
>> +
>> +				s = (vpfn->iova - dma->iova) >> pgshift;
>> +				bitmap_set(dma->bitmap, s, 1);
>> +
>> +				iova_limit = vpfn->iova + pgsize;
>> +			}
>> +			npages = iova_limit/pgsize;
> 
> Isn't iova_limit potentially uninitialized here?  For example, if our
> vfio_dma covers {0,8192} and we ask for the bitmap of {0,4096} and
> there's a vpfn at {4096,8192}.  I think that means vpfn->iova >= i
> (4096 >= 0), so we break with found = true, then we test 4096 >= 0 +
> 4096 and break, and npages = ????/pgsize.
> 

Right, fixing it.

>> +		}
>> +
>> +		bsize = dirty_bitmap_bytes(npages);
>> +		shift = nbits % BITS_PER_BYTE;
>> +
>> +		if (npages && shift) {
>> +			l--;
>> +			if (!access_ok((void __user *)bitmap + l,
>> +					sizeof(unsigned char)))
>> +				return -EINVAL;
>> +
>> +			ret = __get_user(temp, bitmap + l);
> 
> I don't understand why we care to get the user's bitmap, are we trying
> to leave whatever garbage they might have set in it and only also set
> the dirty bits?  That seems unnecessary.
> 

Suppose the dma mapped ranges are {start, end}:
{0, 0xa000}, {0xa000, 0x10000}

The bitmap is requested for 0 - 0x10000, and suppose all pages are dirty.
In the first iteration, dma {0, 0xa000} has 10 pages, so 10 bits are set
and put_user() writes 2 bytes (00000011 11111111b).
In the second iteration, dma {0xa000, 0x10000} has 6 pages, and these bits
should be appended to the previous byte. So get_user() that byte, then
shift-OR the rest of the bitmap; the result should be (11111111 11111111b).

Without get_user() and shift-OR, the resulting bitmap would be
111111 00000011 11111111b, which would be wrong.
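
The shift-OR step can be sketched in user space as follows. This is an
illustration under assumptions: dst is ordinary memory here (in the patch
it is the user buffer, hence get_user()/put_user()), bits of src above
nbits are assumed to be zero, and bitmap_append() is a hypothetical
helper name:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Append the low 'nbits' bits of 'src' to 'dst' starting at bit position
 * 'pos' (bit 0 is the least significant bit of dst[0]). Bits of 'src'
 * above 'nbits' must be zero. Returns the new bit position.
 */
static size_t bitmap_append(unsigned char *dst, size_t pos,
			    const unsigned char *src, size_t nbits)
{
	unsigned int shift = pos % 8;
	size_t nbytes = (nbits + 7) / 8;
	size_t l = pos / 8;
	size_t j;

	assert(dst && src);
	for (j = 0; j < nbytes; j++, l++) {
		unsigned char spill = 0;

		/* low part lands in the current, possibly partial, byte */
		dst[l] |= (unsigned char)(src[j] << shift);

		/* high part carries over into the next byte */
		if (shift)
			spill = (unsigned char)(src[j] >> (8 - shift));
		if (spill)
			dst[l + 1] |= spill;
	}
	return pos + nbits;
}
```

Appending {0xff, 0x03} (10 bits) and then {0x3f} (6 bits) to a zeroed
2-byte buffer yields 0xff 0xff, matching the example above.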

> Also why do we need these access_ok() checks when we already checked
> the range at the start of the ioctl?

Since the pointer is updated at runtime here, it is better to check it
before using it.

> 
>> +			if (ret)
>> +				return ret;
>> +		}
>> +
>> +		for (j = 0; j < bsize; j++, l++) {
>> +			temp = temp |
>> +			       (*((unsigned char *)dma->bitmap + j) << shift);
> 
> |=
> 
>> +			if (!access_ok((void __user *)bitmap + l,
>> +					sizeof(unsigned char)))
>> +				return -EINVAL;
>> +
>> +			ret = __put_user(temp, bitmap + l);
>> +			if (ret)
>> +				return ret;
>> +			if (shift) {
>> +				temp = *((unsigned char *)dma->bitmap + j) >>
>> +					(BITS_PER_BYTE - shift);
>> +			}
> 
> When shift == 0, temp just seems to accumulate bits that never get
> cleared.
> 

I hope the example above explains the shift logic.

>> +		}
>> +
>> +		nbits += npages;
>> +
>> +		i = min(dma->iova + dma->size, iova + size);
>> +		if (i >= iova + size)
>> +			break;
> 
> So whether we error or succeed, we leave cruft in dma->bitmap for the
> next pass.  It doesn't seem to make any sense why we pre-allocated the
> bitmap, we might as well just allocate it on demand here.  Actually, if
> we're not going to do a copy_to_user() for some range of the bitmap,
> I'm not sure what its purpose is at all.  I think the big advantages
> of the bitmap are that we can amortize the cost across every pinned
> page or DMA mapping, we don't need the overhead of tracking unmapped
> vpfns, and we can use copy_to_user() to push the bitmap out.  We're not
> getting any of those advantages here.
> 

That would still not work if the dma range size is not a multiple of 8
pages. See the example above.


>> +	}
>> +	return 0;
>> +}
>> +
>> +static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
>> +{
>> +	long bsize;
>> +
>> +	if (!bitmap_size || bitmap_size > SIZE_MAX)
>> +		return -EINVAL;
>> +
>> +	bsize = dirty_bitmap_bytes(npages);
>> +
>> +	if (bitmap_size < bsize)
>> +		return -EINVAL;
>> +
>> +	return bsize;
>> +}
> 
> Seems like this could simply return int, -errno or zero for success.
> The returned bsize is not used for anything else.
> 

ok.
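
One possible shape for the int-returning variant, as a user-space sketch.
The rounding to whole 64-bit words and the overflow guard are assumptions
of this sketch, not something settled in this thread:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Return 0 if bitmap_size can hold npages bits, -EINVAL otherwise. */
static int verify_bitmap_size(uint64_t npages, uint64_t bitmap_size)
{
	uint64_t bsize;

	if (!bitmap_size || npages > UINT64_MAX - 63)
		return -EINVAL;

	/* bytes needed, rounded up to whole 64-bit words (sketch assumption) */
	bsize = (npages + 63) / 64 * 8;

	return bitmap_size < bsize ? -EINVAL : 0;
}
```

The caller then only tests for a non-zero return and no longer needs the
local bsize.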

>> +
>>   static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>   			     struct vfio_iommu_type1_dma_unmap *unmap)
>>   {
>> @@ -2277,6 +2478,80 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>>   
>>   		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>>   			-EFAULT : 0;
>> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
>> +		struct vfio_iommu_type1_dirty_bitmap range;
>> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
>> +		int ret;
>> +
>> +		if (!iommu->v2)
>> +			return -EACCES;
>> +
>> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
>> +				    bitmap);
> 
> We require the user to provide iova, size, pgsize, bitmap_size, and
> bitmap fields to START/STOP?  Why?
>

No, but those are part of the structure.

>> +
>> +		if (copy_from_user(&range, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (range.argsz < minsz || range.flags & ~mask)
>> +			return -EINVAL;
>> +
>> +		/* only one flag should be set at a time */
>> +		if (__ffs(range.flags) != __fls(range.flags))
>> +			return -EINVAL;
>> +
>> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
>> +			unsigned long iommu_pgsizes = vfio_pgsize_bitmap(iommu);
>> +
>> +			mutex_lock(&iommu->lock);
>> +			iommu->dirty_page_tracking = true;
>> +			ret = vfio_dma_all_bitmap_alloc(iommu, iommu_pgsizes);
> 
> So dirty page tracking is enabled even if we fail to allocate all the
> bitmaps?  Shouldn't this return an error if dirty tracking is already
> enabled?
> 

Adding error handling here in next patch.

>> +			mutex_unlock(&iommu->lock);
>> +			return ret;
>> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
>> +			mutex_lock(&iommu->lock);
>> +			iommu->dirty_page_tracking = false;
> 
> Shouldn't we only allow STOP if tracking is enabled?
> 

Right, adding.

>> +			vfio_dma_all_bitmap_free(iommu);
> 
> Here's where that user induced double free enters the picture.
> 

Error handling as mentioned above will prevent double free.
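
As a rough user-space sketch of the START/STOP state handling discussed
above: struct mock_iommu is hypothetical, a single bitmap stands in for
all per-vfio_dma bitmaps, locking is omitted, and the choice of -EINVAL
for the wrong-state cases is an assumption:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

struct mock_iommu {
	bool dirty_page_tracking;
	unsigned long *bitmap;	/* stands in for all per-dma bitmaps */
	size_t nlongs;		/* must be non-zero in this sketch */
};

static int mock_dirty_start(struct mock_iommu *iommu)
{
	assert(iommu && iommu->nlongs);
	if (iommu->dirty_page_tracking)
		return -EINVAL;		/* already started */

	iommu->bitmap = calloc(iommu->nlongs, sizeof(unsigned long));
	if (!iommu->bitmap)
		return -ENOMEM;		/* flag stays false on failure */

	iommu->dirty_page_tracking = true;
	return 0;
}

static int mock_dirty_stop(struct mock_iommu *iommu)
{
	assert(iommu);
	if (!iommu->dirty_page_tracking)
		return -EINVAL;		/* nothing to stop, nothing to free */

	iommu->dirty_page_tracking = false;
	free(iommu->bitmap);
	iommu->bitmap = NULL;
	return 0;
}
```

Because the flag is only set after a successful allocation and STOP bails
out when tracking is off, a second STOP never reaches the free.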

Thanks,
Kirti

>> +			vfio_remove_unpinned_from_dma_list(iommu);
>> +			mutex_unlock(&iommu->lock);
>> +			return 0;
>> +		} else if (range.flags &
>> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
>> +			long bsize;
>> +			unsigned long pgshift = __ffs(range.pgsize);
>> +			uint64_t iommu_pgsizes = vfio_pgsize_bitmap(iommu);
>> +			uint64_t iommu_pgmask =
>> +				 ((uint64_t)1 << __ffs(iommu_pgsizes)) - 1;
>> +
>> +			if ((range.pgsize & iommu_pgsizes) != range.pgsize)
>> +				return -EINVAL;
>> +			if (range.iova & iommu_pgmask)
>> +				return -EINVAL;
>> +			if (!range.size || range.size & iommu_pgmask)
>> +				return -EINVAL;
>> +			if (range.iova + range.size < range.iova)
>> +				return -EINVAL;
>> +			if (!access_ok((void __user *)range.bitmap,
>> +				       range.bitmap_size))
>> +				return -EINVAL;
>> +
>> +			bsize = verify_bitmap_size(range.size >> pgshift,
>> +						   range.bitmap_size);
>> +			if (bsize < 0)
>> +				return bsize;
>> +
>> +			mutex_lock(&iommu->lock);
>> +			if (iommu->dirty_page_tracking)
>> +				ret = vfio_iova_dirty_bitmap(iommu, range.iova,
>> +					 range.size, range.pgsize,
>> +					 (unsigned char __user *)range.bitmap);
>> +			else
>> +				ret = -EINVAL;
>> +			mutex_unlock(&iommu->lock);
>> +
>> +			return ret;
>> +		}
>>   	}
>>   
>>   	return -ENOTTY;
> 
> Thanks,
> Alex
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
@ 2020-02-12 20:56       ` Kirti Wankhede
  0 siblings, 0 replies; 62+ messages in thread
From: Kirti Wankhede @ 2020-02-12 20:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, kvm, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, eauger, felipe,
	jonathan.davies, yan.y.zhao, changpeng.liu, Ken.Xue



On 2/10/2020 10:55 PM, Alex Williamson wrote:
> On Sat, 8 Feb 2020 01:12:31 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
>> - Start pinned and unpinned pages tracking while migration is active
>> - Stop pinned and unpinned dirty pages tracking. This is also used to
>>    stop dirty pages tracking if migration failed or cancelled.
>> - Get dirty pages bitmap. This ioctl returns bitmap of dirty pages; it is
>>    the user space application's responsibility to copy content of dirty
>>    pages from source to destination during migration.
>>
>> To prevent DoS attack, memory for bitmap is allocated per vfio_dma
>> structure. Bitmap size is calculated considering smallest supported page
>> size. Bitmap is allocated when dirty logging is enabled for those
>> vfio_dmas whose vpfn list is not empty or whole range is mapped, in
>> case of pass-through device.
>>
>> There could be multiple options as to when bitmap should be populated:
>> * Populate bitmap for already pinned pages when bitmap is allocated for
>>    a vfio_dma with the smallest supported page size. Updates bitmap from
>>    page pinning and unpinning functions. When user application queries
>>    bitmap, check if requested page size is same as page size used to
>>    populate bitmap. If it is equal, copy bitmap. But if not equal,
>>    re-populate bitmap according to requested page size and then copy to
>>    user.
>>    Pros: Bitmap gets populated on the fly after dirty tracking has
>>          started.
>>    Cons: If requested page size is different than smallest supported
>>          page size, then bitmap has to be re-populated again, with
>>          additional overhead of allocating bitmap memory again for
>>          re-population of bitmap.
> 
> No memory needs to be allocated to re-populate the bitmap.  The bitmap
> is clear-on-read and by tracking the bitmap in the smallest supported
> page size we can guarantee that we can fit the user requested bitmap
> size within the space occupied by that minimal page size range of the
> bitmap.  Therefore we'd destructively translate the requested region of
> the bitmap to a different page size, write it out to the user, and
> clear it.  Also we expect userspace to use the minimum page size almost
> exclusively, which is optimized by this approach as dirty bit tracking
> is spread out over each page pinning operation.
> 
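
A hedged sketch of the in-place translation described above, in user
space. It assumes the bitmap is tracked at the minimum page size and that
the requested page size is 'ratio' times larger; bitmap_coarsen() and the
mock_* helpers are hypothetical names. Writing the result out to the user
and clearing it afterwards are not shown:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MOCK_BITS_PER_LONG (CHAR_BIT * sizeof(unsigned long))

static int mock_test_and_clear(unsigned long *map, size_t n)
{
	unsigned long mask = 1UL << (n % MOCK_BITS_PER_LONG);
	int was_set = (map[n / MOCK_BITS_PER_LONG] & mask) != 0;

	map[n / MOCK_BITS_PER_LONG] &= ~mask;
	return was_set;
}

static void mock_set(unsigned long *map, size_t n)
{
	map[n / MOCK_BITS_PER_LONG] |= 1UL << (n % MOCK_BITS_PER_LONG);
}

/*
 * Coarsen 'nbits' minimum-page-size bits in place so that each output bit
 * covers 'ratio' input pages. Output bit k is written only after input
 * bits [k * ratio, (k + 1) * ratio) were read and cleared, and
 * k <= k * ratio, so no unread input is overwritten. Returns the number
 * of output bits.
 */
static size_t bitmap_coarsen(unsigned long *map, size_t nbits, size_t ratio)
{
	size_t nout, out, in;

	assert(ratio >= 1);
	nout = (nbits + ratio - 1) / ratio;

	for (out = 0; out < nout; out++) {
		int dirty = 0;

		for (in = out * ratio; in < (out + 1) * ratio && in < nbits; in++)
			dirty |= mock_test_and_clear(map, in);
		if (dirty)
			mock_set(map, out);
	}
	return nout;
}
```

The coarser bitmap always fits in the space of the finer one, which is
why no extra allocation is needed for the translation.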
>>
>> * Populate bitmap when bitmap is queried by user application.
>>    Pros: Bitmap is populated with requested page size. This eliminates
>>          the need to re-populate bitmap if requested page size is
>>          different than smallest supported page size.
>>    Cons: There is a one-time processing cost when bitmap is queried.
> 
> Another significant Con is that the vpfn list needs to track and manage
> unpinned pages, which makes it more complex and intrusive.  The
> previous option seems to have both time and complexity advantages,
> especially in the case we expect to be most common of the user
> accessing the bitmap with the minimum page size, ie. PAGE_SIZE.  It's
> also not clear why we pre-allocate the bitmap at all with this approach.
> 
>> I prefer the latter option with simple logic and to eliminate overhead of
>> bitmap repopulation in case of different page sizes. The latter option is
>> implemented in this patch.
> 
> Hmm, we'll see below, but I'm not convinced based on the above rationale.
> 
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   drivers/vfio/vfio_iommu_type1.c | 299 ++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 287 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index d386461e5d11..df358dc1c85b 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -70,6 +70,7 @@ struct vfio_iommu {
>>   	unsigned int		dma_avail;
>>   	bool			v2;
>>   	bool			nesting;
>> +	bool			dirty_page_tracking;
>>   };
>>   
>>   struct vfio_domain {
>> @@ -90,6 +91,7 @@ struct vfio_dma {
>>   	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
>>   	struct task_struct	*task;
>>   	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
>> +	unsigned long		*bitmap;
>>   };
>>   
>>   struct vfio_group {
>> @@ -125,6 +127,7 @@ struct vfio_regions {
>>   					(!list_empty(&iommu->domain_list))
>>   
>>   static int put_pfn(unsigned long pfn, int prot);
>> +static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu);
>>   
>>   /*
>>    * This code handles mapping and unmapping of user data buffers
>> @@ -174,6 +177,57 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
>>   	rb_erase(&old->node, &iommu->dma_list);
>>   }
>>   
>> +static inline unsigned long dirty_bitmap_bytes(unsigned int npages)
>> +{
>> +	if (!npages)
>> +		return 0;
>> +
>> +	return ALIGN(npages, BITS_PER_LONG) / sizeof(unsigned long);
>> +}
>> +
>> +static int vfio_dma_bitmap_alloc(struct vfio_iommu *iommu,
>> +				 struct vfio_dma *dma, unsigned long pgsizes)
>> +{
>> +	unsigned long pgshift = __ffs(pgsizes);
>> +
>> +	if (!RB_EMPTY_ROOT(&dma->pfn_list) || dma->iommu_mapped) {
>> +		unsigned long npages = dma->size >> pgshift;
>> +		unsigned long bsize = dirty_bitmap_bytes(npages);
>> +
>> +		dma->bitmap = kvzalloc(bsize, GFP_KERNEL);
> 
> nit, we don't need to store bsize in a local variable.
> 
>> +		if (!dma->bitmap)
>> +			return -ENOMEM;
>> +	}
>> +	return 0;
>> +}
>> +
>> +static int vfio_dma_all_bitmap_alloc(struct vfio_iommu *iommu,
>> +				     unsigned long pgsizes)
>> +{
>> +	struct rb_node *n = rb_first(&iommu->dma_list);
>> +	int ret;
>> +
>> +	for (; n; n = rb_next(n)) {
>> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
>> +
>> +		ret = vfio_dma_bitmap_alloc(iommu, dma, pgsizes);
>> +		if (ret)
>> +			return ret;
> 
> This doesn't unwind on failure, so we're left with partially allocated
> bitmap cruft.
>

Good point. Adding unwind on failure.

>> +	}
>> +	return 0;
>> +}
>> +
>> +static void vfio_dma_all_bitmap_free(struct vfio_iommu *iommu)
>> +{
>> +	struct rb_node *n = rb_first(&iommu->dma_list);
>> +
>> +	for (; n; n = rb_next(n)) {
>> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
>> +
>> +		kfree(dma->bitmap);
> 
> We don't set dma->bitmap = NULL and we don't even prevent the case of a
> user making multiple STOP calls, so we have a user triggerable double
> free :(
> 

Ok.

>> +	}
>> +}
>> +
>>   /*
>>    * Helper Functions for host iova-pfn list
>>    */
>> @@ -244,6 +298,29 @@ static void vfio_remove_from_pfn_list(struct vfio_dma *dma,
>>   	kfree(vpfn);
>>   }
>>   
>> +static void vfio_remove_unpinned_from_pfn_list(struct vfio_dma *dma)
>> +{
>> +	struct rb_node *n = rb_first(&dma->pfn_list);
>> +
>> +	for (; n; n = rb_next(n)) {
>> +		struct vfio_pfn *vpfn = rb_entry(n, struct vfio_pfn, node);
>> +
>> +		if (!vpfn->ref_count)
>> +			vfio_remove_from_pfn_list(dma, vpfn);
>> +	}
>> +}
>> +
>> +static void vfio_remove_unpinned_from_dma_list(struct vfio_iommu *iommu)
>> +{
>> +	struct rb_node *n = rb_first(&iommu->dma_list);
>> +
>> +	for (; n; n = rb_next(n)) {
>> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
>> +
>> +		vfio_remove_unpinned_from_pfn_list(dma);
>> +	}
>> +}
>> +
>>   static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
>>   					       unsigned long iova)
>>   {
>> @@ -261,7 +338,8 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
>>   	vpfn->ref_count--;
>>   	if (!vpfn->ref_count) {
>>   		ret = put_pfn(vpfn->pfn, dma->prot);
>> -		vfio_remove_from_pfn_list(dma, vpfn);
>> +		if (!dma->bitmap)
>> +			vfio_remove_from_pfn_list(dma, vpfn);
>>   	}
>>   	return ret;
>>   }
>> @@ -483,13 +561,14 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
>>   	return ret;
>>   }
>>   
>> -static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
>> +static int vfio_unpin_page_external(struct vfio_iommu *iommu,
> 
> We added a parameter but didn't use it in this patch.
> 

OK, moving it to the relevant patch.

>> +				    struct vfio_dma *dma, dma_addr_t iova,
>>   				    bool do_accounting)
>>   {
>>   	int unlocked;
>>   	struct vfio_pfn *vpfn = vfio_find_vpfn(dma, iova);
>>   
>> -	if (!vpfn)
>> +	if (!vpfn || !vpfn->ref_count)
>>   		return 0;
>>   
>>   	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn);
>> @@ -510,6 +589,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>   	unsigned long remote_vaddr;
>>   	struct vfio_dma *dma;
>>   	bool do_accounting;
>> +	unsigned long iommu_pgsizes = vfio_pgsize_bitmap(iommu);
>>   
>>   	if (!iommu || !user_pfn || !phys_pfn)
>>   		return -EINVAL;
>> @@ -551,8 +631,10 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>   
>>   		vpfn = vfio_iova_get_vfio_pfn(dma, iova);
>>   		if (vpfn) {
>> -			phys_pfn[i] = vpfn->pfn;
>> -			continue;
>> +			if (vpfn->ref_count > 1) {
>> +				phys_pfn[i] = vpfn->pfn;
>> +				continue;
>> +			}
>>   		}
>>   
>>   		remote_vaddr = dma->vaddr + iova - dma->iova;
>> @@ -560,11 +642,23 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>   					     do_accounting);
>>   		if (ret)
>>   			goto pin_unwind;
>> -
>> -		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
>> -		if (ret) {
>> -			vfio_unpin_page_external(dma, iova, do_accounting);
>> -			goto pin_unwind;
>> +		if (!vpfn) {
>> +			ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
>> +			if (ret) {
>> +				vfio_unpin_page_external(iommu, dma, iova,
>> +							 do_accounting);
>> +				goto pin_unwind;
>> +			}
>> +		} else
>> +			vpfn->pfn = phys_pfn[i];
>> +
>> +		if (iommu->dirty_page_tracking && !dma->bitmap) {
>> +			ret = vfio_dma_bitmap_alloc(iommu, dma, iommu_pgsizes);
>> +			if (ret) {
>> +				vfio_unpin_page_external(iommu, dma, iova,
>> +							 do_accounting);
>> +				goto pin_unwind;
>> +			}
>>   		}
>>   	}
>>   
>> @@ -578,7 +672,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>   
>>   		iova = user_pfn[j] << PAGE_SHIFT;
>>   		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
>> -		vfio_unpin_page_external(dma, iova, do_accounting);
>> +		vfio_unpin_page_external(iommu, dma, iova, do_accounting);
>>   		phys_pfn[j] = 0;
>>   	}
>>   pin_done:
>> @@ -612,7 +706,7 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
>>   		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
>>   		if (!dma)
>>   			goto unpin_exit;
>> -		vfio_unpin_page_external(dma, iova, do_accounting);
>> +		vfio_unpin_page_external(iommu, dma, iova, do_accounting);
>>   	}
>>   
>>   unpin_exit:
>> @@ -830,6 +924,113 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
>>   	return bitmap;
>>   }
>>   
>> +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
>> +				  size_t size, uint64_t pgsize,
>> +				  unsigned char __user *bitmap)
>> +{
>> +	struct vfio_dma *dma;
>> +	dma_addr_t i = iova, iova_limit;
>> +	unsigned int bsize, nbits = 0, l = 0;
>> +	unsigned long pgshift = __ffs(pgsize);
>> +
>> +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
>> +		int ret, j;
>> +		unsigned int npages = 0, shift = 0;
>> +		unsigned char temp = 0;
>> +
>> +		/* mark all pages dirty if all pages are pinned and mapped. */
>> +		if (dma->iommu_mapped) {
>> +			iova_limit = min(dma->iova + dma->size, iova + size);
>> +			npages = iova_limit/pgsize;
>> +			bitmap_set(dma->bitmap, 0, npages);
> 
> npages is derived from iova_limit, which is the number of bits to set
> dirty relative to the first requested iova, not iova zero, ie. the set
> of dirty bits is offset from those requested unless iova == dma->iova.
> 

Right, fixing.

> Also I hope dma->bitmap was actually allocated.  Not only does the
> START error path potentially leave dirty tracking enabled without all
> the bitmap allocated, when does the bitmap get allocated for a new
> vfio_dma when dirty tracking is enabled?  Seems it only occurs if a
> vpfn gets marked dirty.
> 

Right.

Fixing error paths.


>> +		} else if (dma->bitmap) {
>> +			struct rb_node *n = rb_first(&dma->pfn_list);
>> +			bool found = false;
>> +
>> +			for (; n; n = rb_next(n)) {
>> +				struct vfio_pfn *vpfn = rb_entry(n,
>> +						struct vfio_pfn, node);
>> +				if (vpfn->iova >= i) {
>> +					found = true;
>> +					break;
>> +				}
>> +			}
>> +
>> +			if (!found) {
>> +				i += dma->size;
>> +				continue;
>> +			}
>> +
>> +			for (; n; n = rb_next(n)) {
>> +				unsigned int s;
>> +				struct vfio_pfn *vpfn = rb_entry(n,
>> +						struct vfio_pfn, node);
>> +
>> +				if (vpfn->iova >= iova + size)
>> +					break;
>> +
>> +				s = (vpfn->iova - dma->iova) >> pgshift;
>> +				bitmap_set(dma->bitmap, s, 1);
>> +
>> +				iova_limit = vpfn->iova + pgsize;
>> +			}
>> +			npages = iova_limit/pgsize;
> 
> Isn't iova_limit potentially uninitialized here?  For example, if our
> vfio_dma covers {0,8192} and we ask for the bitmap of {0,4096} and
> there's a vpfn at {4096,8192}.  I think that means vpfn->iova >= i
> (4096 >= 0), so we break with found = true, then we test 4096 >= 0 +
> 4096 and break, and npages = ????/pgsize.
> 

Right, fixing it.

>> +		}
>> +
>> +		bsize = dirty_bitmap_bytes(npages);
>> +		shift = nbits % BITS_PER_BYTE;
>> +
>> +		if (npages && shift) {
>> +			l--;
>> +			if (!access_ok((void __user *)bitmap + l,
>> +					sizeof(unsigned char)))
>> +				return -EINVAL;
>> +
>> +			ret = __get_user(temp, bitmap + l);
> 
> I don't understand why we care to get the user's bitmap, are we trying
> to leave whatever garbage they might have set in it and only also set
> the dirty bits?  That seems unnecessary.
> 

Suppose the dma mapped ranges are {start, end}:
{0, 0xa000}, {0xa000, 0x10000}

The bitmap is requested for 0 - 0x10000, and suppose all pages are dirty.
In the first iteration, dma {0, 0xa000} has 10 pages, so 10 bits are set
and put_user() writes 2 bytes (00000011 11111111b).
In the second iteration, dma {0xa000, 0x10000} has 6 pages, and these bits
should be appended to the previous byte. So get_user() that byte, then
shift-OR the rest of the bitmap; the result should be (11111111 11111111b).

Without get_user() and shift-OR, the resulting bitmap would be
111111 00000011 11111111b, which would be wrong.

> Also why do we need these access_ok() checks when we already checked
> the range at the start of the ioctl?

Since the pointer is updated at runtime here, it is better to check it
before using it.

> 
>> +			if (ret)
>> +				return ret;
>> +		}
>> +
>> +		for (j = 0; j < bsize; j++, l++) {
>> +			temp = temp |
>> +			       (*((unsigned char *)dma->bitmap + j) << shift);
> 
> |=
> 
>> +			if (!access_ok((void __user *)bitmap + l,
>> +					sizeof(unsigned char)))
>> +				return -EINVAL;
>> +
>> +			ret = __put_user(temp, bitmap + l);
>> +			if (ret)
>> +				return ret;
>> +			if (shift) {
>> +				temp = *((unsigned char *)dma->bitmap + j) >>
>> +					(BITS_PER_BYTE - shift);
>> +			}
> 
> When shift == 0, temp just seems to accumulate bits that never get
> cleared.
> 

I hope the example above explains the shift logic.

>> +		}
>> +
>> +		nbits += npages;
>> +
>> +		i = min(dma->iova + dma->size, iova + size);
>> +		if (i >= iova + size)
>> +			break;
> 
> So whether we error or succeed, we leave cruft in dma->bitmap for the
> next pass.  It doesn't seem to make any sense why we pre-allocated the
> bitmap, we might as well just allocate it on demand here.  Actually, if
> we're not going to do a copy_to_user() for some range of the bitmap,
> I'm not sure what its purpose is at all.  I think the big advantages
> of the bitmap are that we can amortize the cost across every pinned
> page or DMA mapping, we don't need the overhead of tracking unmapped
> vpfns, and we can use copy_to_user() to push the bitmap out.  We're not
> getting any of those advantages here.
> 

That would still not work if the dma range size is not a multiple of 8
pages. See the example above.


>> +	}
>> +	return 0;
>> +}
>> +
>> +static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
>> +{
>> +	long bsize;
>> +
>> +	if (!bitmap_size || bitmap_size > SIZE_MAX)
>> +		return -EINVAL;
>> +
>> +	bsize = dirty_bitmap_bytes(npages);
>> +
>> +	if (bitmap_size < bsize)
>> +		return -EINVAL;
>> +
>> +	return bsize;
>> +}
> 
> Seems like this could simply return int, -errno or zero for success.
> The returned bsize is not used for anything else.
> 

ok.

>> +
>>   static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>   			     struct vfio_iommu_type1_dma_unmap *unmap)
>>   {
>> @@ -2277,6 +2478,80 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>>   
>>   		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>>   			-EFAULT : 0;
>> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
>> +		struct vfio_iommu_type1_dirty_bitmap range;
>> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
>> +		int ret;
>> +
>> +		if (!iommu->v2)
>> +			return -EACCES;
>> +
>> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
>> +				    bitmap);
> 
> We require the user to provide iova, size, pgsize, bitmap_size, and
> bitmap fields to START/STOP?  Why?
>

No, but those are part of the structure.

>> +
>> +		if (copy_from_user(&range, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (range.argsz < minsz || range.flags & ~mask)
>> +			return -EINVAL;
>> +
>> +		/* only one flag should be set at a time */
>> +		if (__ffs(range.flags) != __fls(range.flags))
>> +			return -EINVAL;
>> +
>> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
>> +			unsigned long iommu_pgsizes = vfio_pgsize_bitmap(iommu);
>> +
>> +			mutex_lock(&iommu->lock);
>> +			iommu->dirty_page_tracking = true;
>> +			ret = vfio_dma_all_bitmap_alloc(iommu, iommu_pgsizes);
> 
> So dirty page tracking is enabled even if we fail to allocate all the
> bitmaps?  Shouldn't this return an error if dirty tracking is already
> enabled?
> 

Adding error handling here in next patch.

>> +			mutex_unlock(&iommu->lock);
>> +			return ret;
>> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
>> +			mutex_lock(&iommu->lock);
>> +			iommu->dirty_page_tracking = false;
> 
> Shouldn't we only allow STOP if tracking is enabled?
> 

Right, adding.

>> +			vfio_dma_all_bitmap_free(iommu);
> 
> Here's where that user induced double free enters the picture.
> 

Error handling as mentioned above will prevent double free.

Thanks,
Kirti

>> +			vfio_remove_unpinned_from_dma_list(iommu);
>> +			mutex_unlock(&iommu->lock);
>> +			return 0;
>> +		} else if (range.flags &
>> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
>> +			long bsize;
>> +			unsigned long pgshift = __ffs(range.pgsize);
>> +			uint64_t iommu_pgsizes = vfio_pgsize_bitmap(iommu);
>> +			uint64_t iommu_pgmask =
>> +				 ((uint64_t)1 << __ffs(iommu_pgsizes)) - 1;
>> +
>> +			if ((range.pgsize & iommu_pgsizes) != range.pgsize)
>> +				return -EINVAL;
>> +			if (range.iova & iommu_pgmask)
>> +				return -EINVAL;
>> +			if (!range.size || range.size & iommu_pgmask)
>> +				return -EINVAL;
>> +			if (range.iova + range.size < range.iova)
>> +				return -EINVAL;
>> +			if (!access_ok((void __user *)range.bitmap,
>> +				       range.bitmap_size))
>> +				return -EINVAL;
>> +
>> +			bsize = verify_bitmap_size(range.size >> pgshift,
>> +						   range.bitmap_size);
>> +			if (bsize < 0)
>> +				return bsize;
>> +
>> +			mutex_lock(&iommu->lock);
>> +			if (iommu->dirty_page_tracking)
>> +				ret = vfio_iova_dirty_bitmap(iommu, range.iova,
>> +					 range.size, range.pgsize,
>> +					 (unsigned char __user *)range.bitmap);
>> +			else
>> +				ret = -EINVAL;
>> +			mutex_unlock(&iommu->lock);
>> +
>> +			return ret;
>> +		}
>>   	}
>>   
>>   	return -ENOTTY;
> 
> Thanks,
> Alex
> 



* Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2020-02-12 20:56       ` Kirti Wankhede
@ 2020-02-12 23:13         ` Alex Williamson
  -1 siblings, 0 replies; 62+ messages in thread
From: Alex Williamson @ 2020-02-12 23:13 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Thu, 13 Feb 2020 02:26:23 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 2/10/2020 10:55 PM, Alex Williamson wrote:
> > On Sat, 8 Feb 2020 01:12:31 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
> >> - Start pinned and unpinned pages tracking while migration is active
> >> - Stop pinned and unpinned dirty pages tracking. This is also used to
> >>    stop dirty pages tracking if migration failed or cancelled.
> >> - Get dirty pages bitmap. This ioctl returns bitmap of dirty pages; it's
> >>    the user space application's responsibility to copy content of dirty
> >>    pages from source to destination during migration.
> >>
> >> To prevent DoS attack, memory for bitmap is allocated per vfio_dma
> >> structure. Bitmap size is calculated considering smallest supported page
> >> size. Bitmap is allocated when dirty logging is enabled for those
> >> vfio_dmas whose vpfn list is not empty or whole range is mapped, in
> >> case of pass-through device.
> >>
> >> There could be multiple options as to when the bitmap should be populated:
> >> * Populate bitmap for already pinned pages when bitmap is allocated for
> >>    a vfio_dma with the smallest supported page size. Update bitmap from
> >>    page pinning and unpinning functions. When user application queries
> >>    bitmap, check if requested page size is same as page size used to
> >>    populate bitmap. If it is equal, copy bitmap. But if not equal,
> >>    re-populate bitmap according to requested page size and then copy to
> >>    user.
> >>    Pros: Bitmap gets populated on the fly after dirty tracking has
> >>          started.
> >>    Cons: If requested page size is different than smallest supported
> >>          page size, then bitmap has to be re-populated again, with
> >>          additional overhead of allocating bitmap memory again for
> >>          re-population of bitmap.  
> > 
> > No memory needs to be allocated to re-populate the bitmap.  The bitmap
> > is clear-on-read and by tracking the bitmap in the smallest supported
> > page size we can guarantee that we can fit the user requested bitmap
> > size within the space occupied by that minimal page size range of the
> > bitmap.  Therefore we'd destructively translate the requested region of
> > the bitmap to a different page size, write it out to the user, and
> > clear it.  Also we expect userspace to use the minimum page size almost
> > exclusively, which is optimized by this approach as dirty bit tracking
> > is spread out over each page pinning operation.
> >   
> >>
> >> * Populate bitmap when bitmap is queried by user application.
> >>    Pros: Bitmap is populated with requested page size. This eliminates
> >>          the need to re-populate bitmap if requested page size is
> >>          different than smallest supported pages size.
> >>    Cons: There is a one-time processing cost when bitmap is queried.
> > 
> > Another significant Con is that the vpfn list needs to track and manage
> > unpinned pages, which makes it more complex and intrusive.  The
> > previous option seems to have both time and complexity advantages,
> > especially in the case we expect to be most common of the user
> > accessing the bitmap with the minimum page size, ie. PAGE_SIZE.  It's
> > also not clear why we pre-allocate the bitmap at all with this approach.
> >   
> >> I prefer the latter option for its simple logic and to eliminate the
> >> overhead of bitmap repopulation in case of different page sizes. The
> >> latter option is implemented in this patch.
> > 
> > Hmm, we'll see below, but I'm not convinced based on the above rationale.
> >   
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>   drivers/vfio/vfio_iommu_type1.c | 299 ++++++++++++++++++++++++++++++++++++++--
> >>   1 file changed, 287 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >> index d386461e5d11..df358dc1c85b 100644
> >> --- a/drivers/vfio/vfio_iommu_type1.c
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >> @@ -70,6 +70,7 @@ struct vfio_iommu {
> >>   	unsigned int		dma_avail;
> >>   	bool			v2;
> >>   	bool			nesting;
> >> +	bool			dirty_page_tracking;
> >>   };
> >>   
> >>   struct vfio_domain {
> >> @@ -90,6 +91,7 @@ struct vfio_dma {
> >>   	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
> >>   	struct task_struct	*task;
> >>   	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
> >> +	unsigned long		*bitmap;
> >>   };
> >>   
> >>   struct vfio_group {
> >> @@ -125,6 +127,7 @@ struct vfio_regions {
> >>   					(!list_empty(&iommu->domain_list))
> >>   
> >>   static int put_pfn(unsigned long pfn, int prot);
> >> +static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu);
> >>   
> >>   /*
> >>    * This code handles mapping and unmapping of user data buffers
> >> @@ -174,6 +177,57 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
> >>   	rb_erase(&old->node, &iommu->dma_list);
> >>   }
> >>   
> >> +static inline unsigned long dirty_bitmap_bytes(unsigned int npages)
> >> +{
> >> +	if (!npages)
> >> +		return 0;
> >> +
> >> +	return ALIGN(npages, BITS_PER_LONG) / sizeof(unsigned long);
> >> +}
> >> +
> >> +static int vfio_dma_bitmap_alloc(struct vfio_iommu *iommu,
> >> +				 struct vfio_dma *dma, unsigned long pgsizes)
> >> +{
> >> +	unsigned long pgshift = __ffs(pgsizes);
> >> +
> >> +	if (!RB_EMPTY_ROOT(&dma->pfn_list) || dma->iommu_mapped) {
> >> +		unsigned long npages = dma->size >> pgshift;
> >> +		unsigned long bsize = dirty_bitmap_bytes(npages);
> >> +
> >> +		dma->bitmap = kvzalloc(bsize, GFP_KERNEL);  
> > 
> > nit, we don't need to store bsize in a local variable.
> >   
> >> +		if (!dma->bitmap)
> >> +			return -ENOMEM;
> >> +	}
> >> +	return 0;
> >> +}
> >> +
> >> +static int vfio_dma_all_bitmap_alloc(struct vfio_iommu *iommu,
> >> +				     unsigned long pgsizes)
> >> +{
> >> +	struct rb_node *n = rb_first(&iommu->dma_list);
> >> +	int ret;
> >> +
> >> +	for (; n; n = rb_next(n)) {
> >> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> >> +
> >> +		ret = vfio_dma_bitmap_alloc(iommu, dma, pgsizes);
> >> +		if (ret)
> >> +			return ret;  
> > 
> > This doesn't unwind on failure, so we're left with partially allocated
> > bitmap cruft.
> >  
> 
> Good point. Adding unwind on failure.
> 
> >> +	}
> >> +	return 0;
> >> +}
> >> +
> >> +static void vfio_dma_all_bitmap_free(struct vfio_iommu *iommu)
> >> +{
> >> +	struct rb_node *n = rb_first(&iommu->dma_list);
> >> +
> >> +	for (; n; n = rb_next(n)) {
> >> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> >> +
> >> +		kfree(dma->bitmap);  
> > 
> > We don't set dma->bitmap = NULL and we don't even prevent the case of a
> > user making multiple STOP calls, so we have a user triggerable double
> > free :(
> >   
> 
> Ok.
> 
> >> +	}
> >> +}
> >> +
> >>   /*
> >>    * Helper Functions for host iova-pfn list
> >>    */
> >> @@ -244,6 +298,29 @@ static void vfio_remove_from_pfn_list(struct vfio_dma *dma,
> >>   	kfree(vpfn);
> >>   }
> >>   
> >> +static void vfio_remove_unpinned_from_pfn_list(struct vfio_dma *dma)
> >> +{
> >> +	struct rb_node *n = rb_first(&dma->pfn_list);
> >> +
> >> +	for (; n; n = rb_next(n)) {
> >> +		struct vfio_pfn *vpfn = rb_entry(n, struct vfio_pfn, node);
> >> +
> >> +		if (!vpfn->ref_count)
> >> +			vfio_remove_from_pfn_list(dma, vpfn);
> >> +	}
> >> +}
> >> +
> >> +static void vfio_remove_unpinned_from_dma_list(struct vfio_iommu *iommu)
> >> +{
> >> +	struct rb_node *n = rb_first(&iommu->dma_list);
> >> +
> >> +	for (; n; n = rb_next(n)) {
> >> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> >> +
> >> +		vfio_remove_unpinned_from_pfn_list(dma);
> >> +	}
> >> +}
> >> +
> >>   static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
> >>   					       unsigned long iova)
> >>   {
> >> @@ -261,7 +338,8 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
> >>   	vpfn->ref_count--;
> >>   	if (!vpfn->ref_count) {
> >>   		ret = put_pfn(vpfn->pfn, dma->prot);
> >> -		vfio_remove_from_pfn_list(dma, vpfn);
> >> +		if (!dma->bitmap)
> >> +			vfio_remove_from_pfn_list(dma, vpfn);
> >>   	}
> >>   	return ret;
> >>   }
> >> @@ -483,13 +561,14 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
> >>   	return ret;
> >>   }
> >>   
> >> -static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
> >> +static int vfio_unpin_page_external(struct vfio_iommu *iommu,  
> > 
> > We added a parameter but didn't use it in this patch.
> >   
> 
> Ok, Moving it to relevant patch.
> 
> >> +				    struct vfio_dma *dma, dma_addr_t iova,
> >>   				    bool do_accounting)
> >>   {
> >>   	int unlocked;
> >>   	struct vfio_pfn *vpfn = vfio_find_vpfn(dma, iova);
> >>   
> >> -	if (!vpfn)
> >> +	if (!vpfn || !vpfn->ref_count)
> >>   		return 0;
> >>   
> >>   	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn);
> >> @@ -510,6 +589,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>   	unsigned long remote_vaddr;
> >>   	struct vfio_dma *dma;
> >>   	bool do_accounting;
> >> +	unsigned long iommu_pgsizes = vfio_pgsize_bitmap(iommu);
> >>   
> >>   	if (!iommu || !user_pfn || !phys_pfn)
> >>   		return -EINVAL;
> >> @@ -551,8 +631,10 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>   
> >>   		vpfn = vfio_iova_get_vfio_pfn(dma, iova);
> >>   		if (vpfn) {
> >> -			phys_pfn[i] = vpfn->pfn;
> >> -			continue;
> >> +			if (vpfn->ref_count > 1) {
> >> +				phys_pfn[i] = vpfn->pfn;
> >> +				continue;
> >> +			}
> >>   		}
> >>   
> >>   		remote_vaddr = dma->vaddr + iova - dma->iova;
> >> @@ -560,11 +642,23 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>   					     do_accounting);
> >>   		if (ret)
> >>   			goto pin_unwind;
> >> -
> >> -		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> >> -		if (ret) {
> >> -			vfio_unpin_page_external(dma, iova, do_accounting);
> >> -			goto pin_unwind;
> >> +		if (!vpfn) {
> >> +			ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> >> +			if (ret) {
> >> +				vfio_unpin_page_external(iommu, dma, iova,
> >> +							 do_accounting);
> >> +				goto pin_unwind;
> >> +			}
> >> +		} else
> >> +			vpfn->pfn = phys_pfn[i];
> >> +
> >> +		if (iommu->dirty_page_tracking && !dma->bitmap) {
> >> +			ret = vfio_dma_bitmap_alloc(iommu, dma, iommu_pgsizes);
> >> +			if (ret) {
> >> +				vfio_unpin_page_external(iommu, dma, iova,
> >> +							 do_accounting);
> >> +				goto pin_unwind;
> >> +			}
> >>   		}
> >>   	}
> >>   
> >> @@ -578,7 +672,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>   
> >>   		iova = user_pfn[j] << PAGE_SHIFT;
> >>   		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
> >> -		vfio_unpin_page_external(dma, iova, do_accounting);
> >> +		vfio_unpin_page_external(iommu, dma, iova, do_accounting);
> >>   		phys_pfn[j] = 0;
> >>   	}
> >>   pin_done:
> >> @@ -612,7 +706,7 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
> >>   		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
> >>   		if (!dma)
> >>   			goto unpin_exit;
> >> -		vfio_unpin_page_external(dma, iova, do_accounting);
> >> +		vfio_unpin_page_external(iommu, dma, iova, do_accounting);
> >>   	}
> >>   
> >>   unpin_exit:
> >> @@ -830,6 +924,113 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
> >>   	return bitmap;
> >>   }
> >>   
> >> +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> >> +				  size_t size, uint64_t pgsize,
> >> +				  unsigned char __user *bitmap)
> >> +{
> >> +	struct vfio_dma *dma;
> >> +	dma_addr_t i = iova, iova_limit;
> >> +	unsigned int bsize, nbits = 0, l = 0;
> >> +	unsigned long pgshift = __ffs(pgsize);
> >> +
> >> +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
> >> +		int ret, j;
> >> +		unsigned int npages = 0, shift = 0;
> >> +		unsigned char temp = 0;
> >> +
> >> +		/* mark all pages dirty if all pages are pinned and mapped. */
> >> +		if (dma->iommu_mapped) {
> >> +			iova_limit = min(dma->iova + dma->size, iova + size);
> >> +			npages = iova_limit/pgsize;
> >> +			bitmap_set(dma->bitmap, 0, npages);  
> > 
> > npages is derived from iova_limit, which is the number of bits to set
> > dirty relative to the first requested iova, not iova zero, ie. the set
> > of dirty bits is offset from those requested unless iova == dma->iova.
> >   
> 
> Right, fixing.
> 
> > Also I hope dma->bitmap was actually allocated.  Not only does the
> > START error path potentially leave dirty tracking enabled without all
> > the bitmap allocated, when does the bitmap get allocated for a new
> > vfio_dma when dirty tracking is enabled?  Seems it only occurs if a
> > vpfn gets marked dirty.
> >   
> 
> Right.
> 
> Fixing error paths.
> 
> 
> >> +		} else if (dma->bitmap) {
> >> +			struct rb_node *n = rb_first(&dma->pfn_list);
> >> +			bool found = false;
> >> +
> >> +			for (; n; n = rb_next(n)) {
> >> +				struct vfio_pfn *vpfn = rb_entry(n,
> >> +						struct vfio_pfn, node);
> >> +				if (vpfn->iova >= i) {
> >> +					found = true;
> >> +					break;
> >> +				}
> >> +			}
> >> +
> >> +			if (!found) {
> >> +				i += dma->size;
> >> +				continue;
> >> +			}
> >> +
> >> +			for (; n; n = rb_next(n)) {
> >> +				unsigned int s;
> >> +				struct vfio_pfn *vpfn = rb_entry(n,
> >> +						struct vfio_pfn, node);
> >> +
> >> +				if (vpfn->iova >= iova + size)
> >> +					break;
> >> +
> >> +				s = (vpfn->iova - dma->iova) >> pgshift;
> >> +				bitmap_set(dma->bitmap, s, 1);
> >> +
> >> +				iova_limit = vpfn->iova + pgsize;
> >> +			}
> >> +			npages = iova_limit/pgsize;  
> > 
> > Isn't iova_limit potentially uninitialized here?  For example, if our
> > vfio_dma covers {0,8192} and we ask for the bitmap of {0,4096} and
> > there's a vpfn at {4096,8192}.  I think that means vpfn->iova >= i
> > (4096 >= 0), so we break with found = true, then we test 4096 >= 0 +
> > 4096 and break, and npages = ????/pgsize.
> >   
> 
> Right, Fixing it.
> 
> >> +		}
> >> +
> >> +		bsize = dirty_bitmap_bytes(npages);
> >> +		shift = nbits % BITS_PER_BYTE;
> >> +
> >> +		if (npages && shift) {
> >> +			l--;
> >> +			if (!access_ok((void __user *)bitmap + l,
> >> +					sizeof(unsigned char)))
> >> +				return -EINVAL;
> >> +
> >> +			ret = __get_user(temp, bitmap + l);  
> > 
> > I don't understand why we care to get the user's bitmap, are we trying
> > to leave whatever garbage they might have set in it and only also set
> > the dirty bits?  That seems unnecessary.
> >   
> 
> Suppose dma mapped ranges are {start, size}:
> {0, 0xa000}, {0xa000, 0x10000}
> 
> Bitmap is requested for 0 - 0x10000. Suppose all pages are dirty.
> Then in the first iteration, for dma {0, 0xa000}, there are 10 pages, so 10
> bits are set and put_user() happens for 2 bytes (00000011 11111111b).
> In the second iteration, for dma {0xa000, 0x10000}, there are 6 pages and
> these bits should be appended to the previous byte. So get_user() that byte,
> then shift-OR the rest of the bitmap; the result should be (11111111 11111111b).
> 
> Without get_user() and shift-OR, resulting bitmap would be
> 111111 00000011 11111111b which would be wrong.

Seems like if we use a put_user() approach then we should look for
adjacent vfio_dmas within the same byte/word/dword before we push it to
the user to avoid this sort of inefficiency.

> > Also why do we need these access_ok() checks when we already checked
> > the range at the start of the ioctl?  
> 
> Since the pointer is updated at runtime here, it is better to check that
> pointer before using it.

Sorry, I still don't understand this, we check access_ok() with a
pointer and a length, therefore as long as we're incrementing the
pointer within that length, why do we need to retest?

> >   
> >> +			if (ret)
> >> +				return ret;
> >> +		}
> >> +
> >> +		for (j = 0; j < bsize; j++, l++) {
> >> +			temp = temp |
> >> +			       (*((unsigned char *)dma->bitmap + j) << shift);  
> > 
> > |=
> >   
> >> +			if (!access_ok((void __user *)bitmap + l,
> >> +					sizeof(unsigned char)))
> >> +				return -EINVAL;
> >> +
> >> +			ret = __put_user(temp, bitmap + l);
> >> +			if (ret)
> >> +				return ret;
> >> +			if (shift) {
> >> +				temp = *((unsigned char *)dma->bitmap + j) >>
> >> +					(BITS_PER_BYTE - shift);
> >> +			}  
> > 
> > When shift == 0, temp just seems to accumulate bits that never get
> > cleared.
> >   
> 
> Hope example above explains the shift logic.

But that example is when shift is non-zero.  When shift is zero, each
iteration of the loop just ORs in new bits to temp without ever
clearing the bits for the previous iteration.


> >> +		}
> >> +
> >> +		nbits += npages;
> >> +
> >> +		i = min(dma->iova + dma->size, iova + size);
> >> +		if (i >= iova + size)
> >> +			break;  
> > 
> > So whether we error or succeed, we leave cruft in dma->bitmap for the
> > next pass.  It doesn't seem to make any sense why we pre-allocated the
> > bitmap, we might as well just allocate it on demand here.  Actually, if
> > we're not going to do a copy_to_user() for some range of the bitmap,
> > I'm not sure what its purpose is at all.  I think the big advantages
> > of the bitmap are that we can amortize the cost across every pinned
> > page or DMA mapping, we don't need the overhead of tracking unmapped
> > vpfns, and we can use copy_to_user() to push the bitmap out.  We're not
> > getting any of those advantages here.
> >   
> 
> That would still not work if the dma range size is not a multiple of 8 pages.
> See example above.

I don't understand this comment, what about the example above justifies
the bitmap?  As I understand the above algorithm, we find a vfio_dma
overlapping the request and populate the bitmap for that range.  Then
we go back and put_user() for each byte that we touched.  We could
instead simply work on a one byte buffer as we enumerate the requested
range and do a put_user() every time we reach the end of it and have bits
set.  That would greatly simplify the above example.  But I would expect
that we're a) more likely to get asked for ranges covering a single
vfio_dma and b) we're going to spend far more time operating in the
middle of the range and limiting ourselves to one-byte operations there
seems absurd.  If we want to specify that the user provides 4-byte
aligned buffers and naturally aligned iova ranges to make our lives
easier in the kernel, now would be the time to do that.

> >> +	}
> >> +	return 0;
> >> +}
> >> +
> >> +static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
> >> +{
> >> +	long bsize;
> >> +
> >> +	if (!bitmap_size || bitmap_size > SIZE_MAX)
> >> +		return -EINVAL;
> >> +
> >> +	bsize = dirty_bitmap_bytes(npages);
> >> +
> >> +	if (bitmap_size < bsize)
> >> +		return -EINVAL;
> >> +
> >> +	return bsize;
> >> +}  
> > 
> > Seems like this could simply return int, -errno or zero for success.
> > The returned bsize is not used for anything else.
> >   
> 
> ok.
> 
> >> +
> >>   static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> >>   			     struct vfio_iommu_type1_dma_unmap *unmap)
> >>   {
> >> @@ -2277,6 +2478,80 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >>   
> >>   		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> >>   			-EFAULT : 0;
> >> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
> >> +		struct vfio_iommu_type1_dirty_bitmap range;
> >> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
> >> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
> >> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> >> +		int ret;
> >> +
> >> +		if (!iommu->v2)
> >> +			return -EACCES;
> >> +
> >> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
> >> +				    bitmap);  
> > 
> > We require the user to provide iova, size, pgsize, bitmap_size, and
> > bitmap fields to START/STOP?  Why?
> >  
> 
> No, but those are part of the structure.

But we do require it, minsz here includes all those fields, which would
probably make a user scratch their head wondering why they need to pass
irrelevant data for START/STOP.  It almost implies that we support
starting and stopping dirty logging for specific ranges of the IOVA
space.  We could define the structure, for example:

struct vfio_iommu_type1_dirty_bitmap {
	__u32	argsz;
	__u32	flags;
	__u8	data[];
};

struct vfio_iommu_type1_dirty_bitmap_get {
	__u64	iova;
	__u64	size;
	__u64	pgsize;
	__u64	bitmap_size;
	void __user *bitmap;
};

Where data[] is defined as the latter structure when FLAG_GET_BITMAP is
specified.  BTW, don't we need to specify the trailing void* as __u64?
We could theoretically be talking to an ILP32 user process.  Thanks,

Alex

> >> +
> >> +		if (copy_from_user(&range, (void __user *)arg, minsz))
> >> +			return -EFAULT;
> >> +
> >> +		if (range.argsz < minsz || range.flags & ~mask)
> >> +			return -EINVAL;
> >> +
> >> +		/* only one flag should be set at a time */
> >> +		if (__ffs(range.flags) != __fls(range.flags))
> >> +			return -EINVAL;
> >> +
> >> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
> >> +			unsigned long iommu_pgsizes = vfio_pgsize_bitmap(iommu);
> >> +
> >> +			mutex_lock(&iommu->lock);
> >> +			iommu->dirty_page_tracking = true;
> >> +			ret = vfio_dma_all_bitmap_alloc(iommu, iommu_pgsizes);  
> > 
> > So dirty page tracking is enabled even if we fail to allocate all the
> > bitmaps?  Shouldn't this return an error if dirty tracking is already
> > enabled?
> >   
> 
> Adding error handling here in next patch.
> 
> >> +			mutex_unlock(&iommu->lock);
> >> +			return ret;
> >> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
> >> +			mutex_lock(&iommu->lock);
> >> +			iommu->dirty_page_tracking = false;  
> > 
> > Shouldn't we only allow STOP if tracking is enabled?
> >   
> 
> Right, adding.
> 
> >> +			vfio_dma_all_bitmap_free(iommu);  
> > 
> > Here's where that user induced double free enters the picture.
> >   
> 
> Error handling as mentioned above will prevent double free.
> 
> Thanks,
> Kirti
> 
> >> +			vfio_remove_unpinned_from_dma_list(iommu);
> >> +			mutex_unlock(&iommu->lock);
> >> +			return 0;
> >> +		} else if (range.flags &
> >> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
> >> +			long bsize;
> >> +			unsigned long pgshift = __ffs(range.pgsize);
> >> +			uint64_t iommu_pgsizes = vfio_pgsize_bitmap(iommu);
> >> +			uint64_t iommu_pgmask =
> >> +				 ((uint64_t)1 << __ffs(iommu_pgsizes)) - 1;
> >> +
> >> +			if ((range.pgsize & iommu_pgsizes) != range.pgsize)
> >> +				return -EINVAL;
> >> +			if (range.iova & iommu_pgmask)
> >> +				return -EINVAL;
> >> +			if (!range.size || range.size & iommu_pgmask)
> >> +				return -EINVAL;
> >> +			if (range.iova + range.size < range.iova)
> >> +				return -EINVAL;
> >> +			if (!access_ok((void __user *)range.bitmap,
> >> +				       range.bitmap_size))
> >> +				return -EINVAL;
> >> +
> >> +			bsize = verify_bitmap_size(range.size >> pgshift,
> >> +						   range.bitmap_size);
> >> +			if (bsize < 0)
> >> +				return bsize;
> >> +
> >> +			mutex_lock(&iommu->lock);
> >> +			if (iommu->dirty_page_tracking)
> >> +				ret = vfio_iova_dirty_bitmap(iommu, range.iova,
> >> +					 range.size, range.pgsize,
> >> +					 (unsigned char __user *)range.bitmap);
> >> +			else
> >> +				ret = -EINVAL;
> >> +			mutex_unlock(&iommu->lock);
> >> +
> >> +			return ret;
> >> +		}
> >>   	}
> >>   
> >>   	return -ENOTTY;  
> > 
> > Thanks,
> > Alex
> >   
> 



> > bitmap cruft.
> >  
> 
> Good point. Adding unwind on failure.
> 
> >> +	}
> >> +	return 0;
> >> +}
> >> +
> >> +static void vfio_dma_all_bitmap_free(struct vfio_iommu *iommu)
> >> +{
> >> +	struct rb_node *n = rb_first(&iommu->dma_list);
> >> +
> >> +	for (; n; n = rb_next(n)) {
> >> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> >> +
> >> +		kfree(dma->bitmap);  
> > 
> > We don't set dma->bitmap = NULL and we don't even prevent the case of a
> > user making multiple STOP calls, so we have a user triggerable double
> > free :(
> >   
> 
> Ok.
> 
> >> +	}
> >> +}
> >> +
> >>   /*
> >>    * Helper Functions for host iova-pfn list
> >>    */
> >> @@ -244,6 +298,29 @@ static void vfio_remove_from_pfn_list(struct vfio_dma *dma,
> >>   	kfree(vpfn);
> >>   }
> >>   
> >> +static void vfio_remove_unpinned_from_pfn_list(struct vfio_dma *dma)
> >> +{
> >> +	struct rb_node *n = rb_first(&dma->pfn_list);
> >> +
> >> +	for (; n; n = rb_next(n)) {
> >> +		struct vfio_pfn *vpfn = rb_entry(n, struct vfio_pfn, node);
> >> +
> >> +		if (!vpfn->ref_count)
> >> +			vfio_remove_from_pfn_list(dma, vpfn);
> >> +	}
> >> +}
> >> +
> >> +static void vfio_remove_unpinned_from_dma_list(struct vfio_iommu *iommu)
> >> +{
> >> +	struct rb_node *n = rb_first(&iommu->dma_list);
> >> +
> >> +	for (; n; n = rb_next(n)) {
> >> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> >> +
> >> +		vfio_remove_unpinned_from_pfn_list(dma);
> >> +	}
> >> +}
> >> +
> >>   static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
> >>   					       unsigned long iova)
> >>   {
> >> @@ -261,7 +338,8 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
> >>   	vpfn->ref_count--;
> >>   	if (!vpfn->ref_count) {
> >>   		ret = put_pfn(vpfn->pfn, dma->prot);
> >> -		vfio_remove_from_pfn_list(dma, vpfn);
> >> +		if (!dma->bitmap)
> >> +			vfio_remove_from_pfn_list(dma, vpfn);
> >>   	}
> >>   	return ret;
> >>   }
> >> @@ -483,13 +561,14 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
> >>   	return ret;
> >>   }
> >>   
> >> -static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
> >> +static int vfio_unpin_page_external(struct vfio_iommu *iommu,  
> > 
> > We added a parameter but didn't use it in this patch.
> >   
> 
> Ok, Moving it to relevant patch.
> 
> >> +				    struct vfio_dma *dma, dma_addr_t iova,
> >>   				    bool do_accounting)
> >>   {
> >>   	int unlocked;
> >>   	struct vfio_pfn *vpfn = vfio_find_vpfn(dma, iova);
> >>   
> >> -	if (!vpfn)
> >> +	if (!vpfn || !vpfn->ref_count)
> >>   		return 0;
> >>   
> >>   	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn);
> >> @@ -510,6 +589,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>   	unsigned long remote_vaddr;
> >>   	struct vfio_dma *dma;
> >>   	bool do_accounting;
> >> +	unsigned long iommu_pgsizes = vfio_pgsize_bitmap(iommu);
> >>   
> >>   	if (!iommu || !user_pfn || !phys_pfn)
> >>   		return -EINVAL;
> >> @@ -551,8 +631,10 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>   
> >>   		vpfn = vfio_iova_get_vfio_pfn(dma, iova);
> >>   		if (vpfn) {
> >> -			phys_pfn[i] = vpfn->pfn;
> >> -			continue;
> >> +			if (vpfn->ref_count > 1) {
> >> +				phys_pfn[i] = vpfn->pfn;
> >> +				continue;
> >> +			}
> >>   		}
> >>   
> >>   		remote_vaddr = dma->vaddr + iova - dma->iova;
> >> @@ -560,11 +642,23 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>   					     do_accounting);
> >>   		if (ret)
> >>   			goto pin_unwind;
> >> -
> >> -		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> >> -		if (ret) {
> >> -			vfio_unpin_page_external(dma, iova, do_accounting);
> >> -			goto pin_unwind;
> >> +		if (!vpfn) {
> >> +			ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> >> +			if (ret) {
> >> +				vfio_unpin_page_external(iommu, dma, iova,
> >> +							 do_accounting);
> >> +				goto pin_unwind;
> >> +			}
> >> +		} else
> >> +			vpfn->pfn = phys_pfn[i];
> >> +
> >> +		if (iommu->dirty_page_tracking && !dma->bitmap) {
> >> +			ret = vfio_dma_bitmap_alloc(iommu, dma, iommu_pgsizes);
> >> +			if (ret) {
> >> +				vfio_unpin_page_external(iommu, dma, iova,
> >> +							 do_accounting);
> >> +				goto pin_unwind;
> >> +			}
> >>   		}
> >>   	}
> >>   
> >> @@ -578,7 +672,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>   
> >>   		iova = user_pfn[j] << PAGE_SHIFT;
> >>   		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
> >> -		vfio_unpin_page_external(dma, iova, do_accounting);
> >> +		vfio_unpin_page_external(iommu, dma, iova, do_accounting);
> >>   		phys_pfn[j] = 0;
> >>   	}
> >>   pin_done:
> >> @@ -612,7 +706,7 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
> >>   		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
> >>   		if (!dma)
> >>   			goto unpin_exit;
> >> -		vfio_unpin_page_external(dma, iova, do_accounting);
> >> +		vfio_unpin_page_external(iommu, dma, iova, do_accounting);
> >>   	}
> >>   
> >>   unpin_exit:
> >> @@ -830,6 +924,113 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
> >>   	return bitmap;
> >>   }
> >>   
> >> +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> >> +				  size_t size, uint64_t pgsize,
> >> +				  unsigned char __user *bitmap)
> >> +{
> >> +	struct vfio_dma *dma;
> >> +	dma_addr_t i = iova, iova_limit;
> >> +	unsigned int bsize, nbits = 0, l = 0;
> >> +	unsigned long pgshift = __ffs(pgsize);
> >> +
> >> +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
> >> +		int ret, j;
> >> +		unsigned int npages = 0, shift = 0;
> >> +		unsigned char temp = 0;
> >> +
> >> +		/* mark all pages dirty if all pages are pinned and mapped. */
> >> +		if (dma->iommu_mapped) {
> >> +			iova_limit = min(dma->iova + dma->size, iova + size);
> >> +			npages = iova_limit/pgsize;
> >> +			bitmap_set(dma->bitmap, 0, npages);  
> > 
> > npages is derived from iova_limit, which is the number of bits to set
> > dirty relative to the first requested iova, not iova zero, ie. the set
> > of dirty bits is offset from those requested unless iova == dma->iova.
> >   
> 
> Right, fixing.
> 
> > Also I hope dma->bitmap was actually allocated.  Not only does the
> > START error path potentially leave dirty tracking enabled without all
> > the bitmap allocated, when does the bitmap get allocated for a new
> > vfio_dma when dirty tracking is enabled?  Seems it only occurs if a
> > vpfn gets marked dirty.
> >   
> 
> Right.
> 
> Fixing error paths.
> 
> 
> >> +		} else if (dma->bitmap) {
> >> +			struct rb_node *n = rb_first(&dma->pfn_list);
> >> +			bool found = false;
> >> +
> >> +			for (; n; n = rb_next(n)) {
> >> +				struct vfio_pfn *vpfn = rb_entry(n,
> >> +						struct vfio_pfn, node);
> >> +				if (vpfn->iova >= i) {
> >> +					found = true;
> >> +					break;
> >> +				}
> >> +			}
> >> +
> >> +			if (!found) {
> >> +				i += dma->size;
> >> +				continue;
> >> +			}
> >> +
> >> +			for (; n; n = rb_next(n)) {
> >> +				unsigned int s;
> >> +				struct vfio_pfn *vpfn = rb_entry(n,
> >> +						struct vfio_pfn, node);
> >> +
> >> +				if (vpfn->iova >= iova + size)
> >> +					break;
> >> +
> >> +				s = (vpfn->iova - dma->iova) >> pgshift;
> >> +				bitmap_set(dma->bitmap, s, 1);
> >> +
> >> +				iova_limit = vpfn->iova + pgsize;
> >> +			}
> >> +			npages = iova_limit/pgsize;  
> > 
> > Isn't iova_limit potentially uninitialized here?  For example, if our
> > vfio_dma covers {0,8192} and we ask for the bitmap of {0,4096} and
> > there's a vpfn at {4096,8192}.  I think that means vpfn->iova >= i
> > (4096 >= 0), so we break with found = true, then we test 4096 >= 0 +
> > 4096 and break, and npages = ????/pgsize.
> >   
> 
> Right, Fixing it.
> 
> >> +		}
> >> +
> >> +		bsize = dirty_bitmap_bytes(npages);
> >> +		shift = nbits % BITS_PER_BYTE;
> >> +
> >> +		if (npages && shift) {
> >> +			l--;
> >> +			if (!access_ok((void __user *)bitmap + l,
> >> +					sizeof(unsigned char)))
> >> +				return -EINVAL;
> >> +
> >> +			ret = __get_user(temp, bitmap + l);  
> > 
> > I don't understand why we care to get the user's bitmap, are we trying
> > to leave whatever garbage they might have set in it and only also set
> > the dirty bits?  That seems unnecessary.
> >   
> 
> Suppose dma mapped ranges are {start, size}:
> {0, 0xa000}, {0xa000, 0x10000}
> 
> Bitmap asked from 0 - 0x10000. Say suppose all pages are dirty.
> Then in first iteration for dma {0,0xa000} there are 10 pages, so 10 
> bits are set, put_user() happens for 2 bytes, (00000011 11111111b).
> In second iteration for dma {0xa000, 0x10000} there are 6 pages and 
> these bits should be appended to previous byte. So get_user() that byte, 
> then shift-OR rest of the bitmap, result should be: (11111111 11111111b)
> 
> Without get_user() and shift-OR, resulting bitmap would be
> 111111 00000011 11111111b which would be wrong.

Seems like if we use a put_user() approach then we should look for
adjacent vfio_dmas within the same byte/word/dword before we push it to
the user to avoid this sort of inefficiency.
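
As an illustration of staging adjacent ranges before anything is written
out, here is a minimal userspace sketch.  It is not the patch's code:
bitmap_set_range() and stage_dirty_ranges() are hypothetical names, every
page is assumed dirty, and the staging buffer merely stands in for whatever
would later be pushed to the user in one go.  With the ranges from the
example above (10 pages followed by 6 pages) both staged bytes end up fully
set, with no read-back of the user buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: set 'count' consecutive bits starting at bit 'start'
 * in a byte bitmap where bit n lives in byte n / 8 at position n % 8.
 */
static void bitmap_set_range(uint8_t *bitmap, size_t start, size_t count)
{
	for (size_t bit = start; bit < start + count; bit++)
		bitmap[bit / 8] |= (uint8_t)(1u << (bit % 8));
}

/*
 * Hypothetical helper: append the dirty bits of several adjacent ranges
 * (given as page counts, all pages assumed dirty) into one zeroed staging
 * bitmap.  Returns the total number of bits staged.
 */
static size_t stage_dirty_ranges(uint8_t *staging, size_t staging_bytes,
				 const size_t *range_pages, size_t nranges)
{
	size_t nbits = 0;

	for (size_t r = 0; r < nranges; r++) {
		/* The caller must size the staging buffer for all ranges. */
		assert(nbits + range_pages[r] <= staging_bytes * 8);
		bitmap_set_range(staging, nbits, range_pages[r]);
		nbits += range_pages[r];
	}
	return nbits;
}
```

Because a range that ends mid-byte simply leaves the next free bit position
in nbits, the following range continues in the same byte without any
shift-and-OR against data fetched from userspace.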

> > Also why do we need these access_ok() checks when we already checked
> > the range at the start of the ioctl?  
> 
> Since pointer is updated runtime here, better to check that pointer 
> before using that pointer.

Sorry, I still don't understand this, we check access_ok() with a
pointer and a length, therefore as long as we're incrementing the
pointer within that length, why do we need to retest?

> >   
> >> +			if (ret)
> >> +				return ret;
> >> +		}
> >> +
> >> +		for (j = 0; j < bsize; j++, l++) {
> >> +			temp = temp |
> >> +			       (*((unsigned char *)dma->bitmap + j) << shift);  
> > 
> > |=
> >   
> >> +			if (!access_ok((void __user *)bitmap + l,
> >> +					sizeof(unsigned char)))
> >> +				return -EINVAL;
> >> +
> >> +			ret = __put_user(temp, bitmap + l);
> >> +			if (ret)
> >> +				return ret;
> >> +			if (shift) {
> >> +				temp = *((unsigned char *)dma->bitmap + j) >>
> >> +					(BITS_PER_BYTE - shift);
> >> +			}  
> > 
> > When shift == 0, temp just seems to accumulate bits that never get
> > cleared.
> >   
> 
> Hope example above explains the shift logic.

But that example is when shift is non-zero.  When shift is zero, each
iteration of the loop just ORs in new bits to temp without ever
clearing the bits for the previous iteration.
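
To make the shift == 0 case concrete, here is a small userspace sketch of
the same byte loop with the carry reset on every iteration.  copy_shifted()
is a hypothetical name, plain arrays stand in for dma->bitmap and the user
buffer, and this is only one possible way to fix the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy 'nbytes' of 'src' into 'dst', shifted left by
 * 'shift' bits (0..7).  'first' holds the partial byte left over from the
 * previous range; it is zero when the previous range ended on a byte
 * boundary.
 */
static void copy_shifted(uint8_t *dst, const uint8_t *src, size_t nbytes,
			 unsigned int shift, uint8_t first)
{
	uint8_t temp = first;

	assert(shift < 8);

	for (size_t j = 0; j < nbytes; j++) {
		temp |= (uint8_t)(src[j] << shift);
		dst[j] = temp;
		/*
		 * Carry the bits shifted out of this byte into the next one.
		 * With shift == 0 nothing is carried, so temp must be
		 * cleared rather than left holding the previous byte.
		 */
		temp = shift ? (uint8_t)(src[j] >> (8 - shift)) : 0;
	}
}
```

With shift == 0 and source bytes 0x01, 0x02 the output is 0x01, 0x02.  If
temp were left untouched between iterations, the second byte would come out
as 0x03, which is the accumulation described above.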


> >> +		}
> >> +
> >> +		nbits += npages;
> >> +
> >> +		i = min(dma->iova + dma->size, iova + size);
> >> +		if (i >= iova + size)
> >> +			break;  
> > 
> > So whether we error or succeed, we leave cruft in dma->bitmap for the
> > next pass.  It doesn't seem to make any sense why we pre-allocated the
> > bitmap, we might as well just allocate it on demand here.  Actually, if
> > we're not going to do a copy_to_user() for some range of the bitmap,
> > I'm not sure what its purpose is at all.  I think the big advantages
> > of the bitmap are that we can amortize the cost across every pinned
> > page or DMA mapping, we don't need the overhead of tracking unmapped
> > vpfns, and we can use copy_to_user() to push the bitmap out.  We're not
> > getting any of those advantages here.
> >   
> 
> That would still not work if dma range size is not multiples of 8 pages. 
> See example above.

I don't understand this comment, what about the example above justifies
the bitmap?  As I understand the above algorithm, we find a vfio_dma
overlapping the request and populate the bitmap for that range.  Then
we go back and put_user() for each byte that we touched.  We could
instead simply work on a one byte buffer as we enumerate the requested
range and do a put_user() every time we reach the end of it and have bits
set.  That would greatly simplify the above example.  But I would expect
that we're a) more likely to get asked for ranges covering a single
vfio_dma and b) we're going to spend far more time operating in the
middle of the range and limiting ourselves to one-byte operations there
seems absurd.  If we want to specify that the user provides 4-byte
aligned buffers and naturally aligned iova ranges to make our lives
easier in the kernel, now would be the time to do that.
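
One possible reading of "naturally aligned" for this purpose is sketched
below in userspace C.  The helper name range_is_word_aligned() and the
choice of a 32-bit bitmap word are assumptions made for illustration only;
nothing here is defined by the proposed ABI.  The idea is that a request
starting and ending on a multiple of (bits per word * page size) always maps
to whole bitmap words, so no sub-word shifting is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: does the IOVA range [iova, iova + size) start and
 * end on a boundary that corresponds to a whole bitmap word?
 * 'pgshift' is log2 of the page size, 'bits_per_word' the number of pages
 * tracked by one bitmap word (32 for a 4-byte word).
 */
static bool range_is_word_aligned(uint64_t iova, uint64_t size,
				  unsigned int pgshift,
				  unsigned int bits_per_word)
{
	uint64_t span;

	assert(bits_per_word > 0 && pgshift < 32);

	/* Bytes of IOVA space covered by one bitmap word. */
	span = (uint64_t)bits_per_word << pgshift;

	return (iova % span == 0) && (size % span == 0);
}
```

For 4K pages and 32-bit words the span is 0x20000, so a request for
{0, 0x20000} would qualify while one starting at 0xa000 would not.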

> >> +	}
> >> +	return 0;
> >> +}
> >> +
> >> +static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
> >> +{
> >> +	long bsize;
> >> +
> >> +	if (!bitmap_size || bitmap_size > SIZE_MAX)
> >> +		return -EINVAL;
> >> +
> >> +	bsize = dirty_bitmap_bytes(npages);
> >> +
> >> +	if (bitmap_size < bsize)
> >> +		return -EINVAL;
> >> +
> >> +	return bsize;
> >> +}  
> > 
> > Seems like this could simply return int, -errno or zero for success.
> > The returned bsize is not used for anything else.
> >   
> 
> ok.
> 
> >> +
> >>   static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> >>   			     struct vfio_iommu_type1_dma_unmap *unmap)
> >>   {
> >> @@ -2277,6 +2478,80 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >>   
> >>   		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> >>   			-EFAULT : 0;
> >> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
> >> +		struct vfio_iommu_type1_dirty_bitmap range;
> >> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
> >> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
> >> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> >> +		int ret;
> >> +
> >> +		if (!iommu->v2)
> >> +			return -EACCES;
> >> +
> >> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
> >> +				    bitmap);  
> > 
> > We require the user to provide iova, size, pgsize, bitmap_size, and
> > bitmap fields to START/STOP?  Why?
> >  
> 
> No. But those are part of structure.

But we do require it, minsz here includes all those fields, which would
probably make a user scratch their head wondering why they need to pass
irrelevant data for START/STOP.  It almost implies that we support
starting and stopping dirty logging for specific ranges of the IOVA
space.  We could define the structure, for example:

struct vfio_iommu_type1_dirty_bitmap {
	__u32	argsz;
	__u32	flags;
	__u8	data[];
};

struct vfio_iommu_type1_dirty_bitmap_get {
	__u64	iova;
	__u64	size;
	__u64	pgsize;
	__u64	bitmap_size;
	void __user *bitmap;
};

Where data[] is defined as the latter structure when FLAG_GET_BITMAP is
specified.  BTW, don't we need to specify the trailing void* as __u64?
We could theoretically be talking to an ILP32 user process.  Thanks,

Alex
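
A rough userspace sketch of the fixed-width alternative follows.  The struct
name dirty_bitmap_get_fixed and the two conversion helpers are hypothetical
and only illustrate the layout concern: a void * member is 4 bytes for an
ILP32 process and 8 bytes for an LP64 one, whereas a 64-bit integer field
keeps the structure the same size for both.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: the user pointer is carried as a 64-bit integer. */
struct dirty_bitmap_get_fixed {
	uint64_t iova;
	uint64_t size;
	uint64_t pgsize;
	uint64_t bitmap_size;
	uint64_t bitmap;	/* user address, zero-extended on ILP32 */
};

/* Five 8-byte fields, so the size does not depend on pointer width. */
static_assert(sizeof(struct dirty_bitmap_get_fixed) == 40,
	      "layout must not depend on pointer width");

/* Hypothetical helpers to move between a pointer and the 64-bit field. */
static uint64_t ptr_to_u64(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}

static void *u64_to_ptr(uint64_t v)
{
	return (void *)(uintptr_t)v;
}
```

A caller would fill the field with ptr_to_u64(buffer), and the receiving
side would convert it back with u64_to_ptr() before use.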

> >> +
> >> +		if (copy_from_user(&range, (void __user *)arg, minsz))
> >> +			return -EFAULT;
> >> +
> >> +		if (range.argsz < minsz || range.flags & ~mask)
> >> +			return -EINVAL;
> >> +
> >> +		/* only one flag should be set at a time */
> >> +		if (__ffs(range.flags) != __fls(range.flags))
> >> +			return -EINVAL;
> >> +
> >> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
> >> +			unsigned long iommu_pgsizes = vfio_pgsize_bitmap(iommu);
> >> +
> >> +			mutex_lock(&iommu->lock);
> >> +			iommu->dirty_page_tracking = true;
> >> +			ret = vfio_dma_all_bitmap_alloc(iommu, iommu_pgsizes);  
> > 
> > So dirty page tracking is enabled even if we fail to allocate all the
> > bitmaps?  Shouldn't this return an error if dirty tracking is already
> > enabled?
> >   
> 
> Adding error handling here in next patch.
> 
> >> +			mutex_unlock(&iommu->lock);
> >> +			return ret;
> >> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
> >> +			mutex_lock(&iommu->lock);
> >> +			iommu->dirty_page_tracking = false;  
> > 
> > Shouldn't we only allow STOP if tracking is enabled?
> >   
> 
> Right,adding.
> 
> >> +			vfio_dma_all_bitmap_free(iommu);  
> > 
> > Here's where that user induced double free enters the picture.
> >   
> 
> Error handling as mentioned above will prevent double free.
> 
> Thanks,
> Kirti
> 
> >> +			vfio_remove_unpinned_from_dma_list(iommu);
> >> +			mutex_unlock(&iommu->lock);
> >> +			return 0;
> >> +		} else if (range.flags &
> >> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
> >> +			long bsize;
> >> +			unsigned long pgshift = __ffs(range.pgsize);
> >> +			uint64_t iommu_pgsizes = vfio_pgsize_bitmap(iommu);
> >> +			uint64_t iommu_pgmask =
> >> +				 ((uint64_t)1 << __ffs(iommu_pgsizes)) - 1;
> >> +
> >> +			if ((range.pgsize & iommu_pgsizes) != range.pgsize)
> >> +				return -EINVAL;
> >> +			if (range.iova & iommu_pgmask)
> >> +				return -EINVAL;
> >> +			if (!range.size || range.size & iommu_pgmask)
> >> +				return -EINVAL;
> >> +			if (range.iova + range.size < range.iova)
> >> +				return -EINVAL;
> >> +			if (!access_ok((void __user *)range.bitmap,
> >> +				       range.bitmap_size))
> >> +				return -EINVAL;
> >> +
> >> +			bsize = verify_bitmap_size(range.size >> pgshift,
> >> +						   range.bitmap_size);
> >> +			if (bsize < 0)
> >> +				return bsize;
> >> +
> >> +			mutex_lock(&iommu->lock);
> >> +			if (iommu->dirty_page_tracking)
> >> +				ret = vfio_iova_dirty_bitmap(iommu, range.iova,
> >> +					 range.size, range.pgsize,
> >> +					 (unsigned char __user *)range.bitmap);
> >> +			else
> >> +				ret = -EINVAL;
> >> +			mutex_unlock(&iommu->lock);
> >> +
> >> +			return ret;
> >> +		}
> >>   	}
> >>   
> >>   	return -ENOTTY;  
> > 
> > Thanks,
> > Alex
> >   
> 



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2020-02-12 23:13         ` Alex Williamson
@ 2020-02-13 20:11           ` Kirti Wankhede
  -1 siblings, 0 replies; 62+ messages in thread
From: Kirti Wankhede @ 2020-02-13 20:11 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm


<snip>

>>>>    
>>>> +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
>>>> +				  size_t size, uint64_t pgsize,
>>>> +				  unsigned char __user *bitmap)
>>>> +{
>>>> +	struct vfio_dma *dma;
>>>> +	dma_addr_t i = iova, iova_limit;
>>>> +	unsigned int bsize, nbits = 0, l = 0;
>>>> +	unsigned long pgshift = __ffs(pgsize);
>>>> +
>>>> +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
>>>> +		int ret, j;
>>>> +		unsigned int npages = 0, shift = 0;
>>>> +		unsigned char temp = 0;
>>>> +
>>>> +		/* mark all pages dirty if all pages are pinned and mapped. */
>>>> +		if (dma->iommu_mapped) {
>>>> +			iova_limit = min(dma->iova + dma->size, iova + size);
>>>> +			npages = iova_limit/pgsize;
>>>> +			bitmap_set(dma->bitmap, 0, npages);
>>>
>>> npages is derived from iova_limit, which is the number of bits to set
>>> dirty relative to the first requested iova, not iova zero, ie. the set
>>> of dirty bits is offset from those requested unless iova == dma->iova.
>>>    
>>
>> Right, fixing.
>>
>>> Also I hope dma->bitmap was actually allocated.  Not only does the
>>> START error path potentially leave dirty tracking enabled without all
>>> the bitmap allocated, when does the bitmap get allocated for a new
>>> vfio_dma when dirty tracking is enabled?  Seems it only occurs if a
>>> vpfn gets marked dirty.
>>>    
>>
>> Right.
>>
>> Fixing error paths.
>>
>>
>>>> +		} else if (dma->bitmap) {
>>>> +			struct rb_node *n = rb_first(&dma->pfn_list);
>>>> +			bool found = false;
>>>> +
>>>> +			for (; n; n = rb_next(n)) {
>>>> +				struct vfio_pfn *vpfn = rb_entry(n,
>>>> +						struct vfio_pfn, node);
>>>> +				if (vpfn->iova >= i) {
>>>> +					found = true;
>>>> +					break;
>>>> +				}
>>>> +			}
>>>> +
>>>> +			if (!found) {
>>>> +				i += dma->size;
>>>> +				continue;
>>>> +			}
>>>> +
>>>> +			for (; n; n = rb_next(n)) {
>>>> +				unsigned int s;
>>>> +				struct vfio_pfn *vpfn = rb_entry(n,
>>>> +						struct vfio_pfn, node);
>>>> +
>>>> +				if (vpfn->iova >= iova + size)
>>>> +					break;
>>>> +
>>>> +				s = (vpfn->iova - dma->iova) >> pgshift;
>>>> +				bitmap_set(dma->bitmap, s, 1);
>>>> +
>>>> +				iova_limit = vpfn->iova + pgsize;
>>>> +			}
>>>> +			npages = iova_limit/pgsize;
>>>
>>> Isn't iova_limit potentially uninitialized here?  For example, if our
>>> vfio_dma covers {0,8192} and we ask for the bitmap of {0,4096} and
>>> there's a vpfn at {4096,8192}.  I think that means vpfn->iova >= i
>>> (4096 >= 0), so we break with found = true, then we test 4096 >= 0 +
>>> 4096 and break, and npages = ????/pgsize.
>>>    
>>
>> Right, Fixing it.
>>
>>>> +		}
>>>> +
>>>> +		bsize = dirty_bitmap_bytes(npages);
>>>> +		shift = nbits % BITS_PER_BYTE;
>>>> +
>>>> +		if (npages && shift) {
>>>> +			l--;
>>>> +			if (!access_ok((void __user *)bitmap + l,
>>>> +					sizeof(unsigned char)))
>>>> +				return -EINVAL;
>>>> +
>>>> +			ret = __get_user(temp, bitmap + l);
>>>
>>> I don't understand why we care to get the user's bitmap, are we trying
>>> to leave whatever garbage they might have set in it and only also set
>>> the dirty bits?  That seems unnecessary.
>>>    
>>
>> Suppose dma mapped ranges are {start, size}:
>> {0, 0xa000}, {0xa000, 0x10000}
>>
>> Bitmap asked from 0 - 0x10000. Say suppose all pages are dirty.
>> Then in first iteration for dma {0,0xa000} there are 10 pages, so 10
>> bits are set, put_user() happens for 2 bytes, (00000011 11111111b).
>> In second iteration for dma {0xa000, 0x10000} there are 6 pages and
>> these bits should be appended to previous byte. So get_user() that byte,
>> then shift-OR rest of the bitmap, result should be: (11111111 11111111b)
>>
>> Without get_user() and shift-OR, resulting bitmap would be
>> 111111 00000011 11111111b which would be wrong.
> 
> Seems like if we use a put_user() approach then we should look for
> adjacent vfio_dmas within the same byte/word/dword before we push it to
> the user to avoid this sort of inefficiency.
> 

Won't that complicate the logic further?

>>> Also why do we need these access_ok() checks when we already checked
>>> the range at the start of the ioctl?
>>
>> Since pointer is updated runtime here, better to check that pointer
>> before using that pointer.
> 
> Sorry, I still don't understand this, we check access_ok() with a
> pointer and a length, therefore as long as we're incrementing the
> pointer within that length, why do we need to retest?
> 

Ideally, the caller of put_user() and get_user() must check the pointer
passed to these functions with access_ok() before calling them. That
makes sure the pointer is still valid after pointer arithmetic. Maybe
let's remove the previous check of pointer and length, but keep these
checks.

>>>    
>>>> +			if (ret)
>>>> +				return ret;
>>>> +		}
>>>> +
>>>> +		for (j = 0; j < bsize; j++, l++) {
>>>> +			temp = temp |
>>>> +			       (*((unsigned char *)dma->bitmap + j) << shift);
>>>
>>> |=
>>>    
>>>> +			if (!access_ok((void __user *)bitmap + l,
>>>> +					sizeof(unsigned char)))
>>>> +				return -EINVAL;
>>>> +
>>>> +			ret = __put_user(temp, bitmap + l);
>>>> +			if (ret)
>>>> +				return ret;
>>>> +			if (shift) {
>>>> +				temp = *((unsigned char *)dma->bitmap + j) >>
>>>> +					(BITS_PER_BYTE - shift);
>>>> +			}
>>>
>>> When shift == 0, temp just seems to accumulate bits that never get
>>> cleared.
>>>    
>>
>> Hope example above explains the shift logic.
> 
> But that example is when shift is non-zero.  When shift is zero, each
> iteration of the loop just ORs in new bits to temp without ever
> clearing the bits for the previous iteration.
> 
> 

Oh right, fixing it.

>>>> +		}
>>>> +
>>>> +		nbits += npages;
>>>> +
>>>> +		i = min(dma->iova + dma->size, iova + size);
>>>> +		if (i >= iova + size)
>>>> +			break;
>>>
>>> So whether we error or succeed, we leave cruft in dma->bitmap for the
>>> next pass.  It doesn't seem to make any sense why we pre-allocated the
>>> bitmap, we might as well just allocate it on demand here.  Actually, if
>>> we're not going to do a copy_to_user() for some range of the bitmap,
>>> I'm not sure what its purpose is at all.  I think the big advantages
>>> of the bitmap are that we can amortize the cost across every pinned
>>> page or DMA mapping, we don't need the overhead of tracking unmapped
>>> vpfns, and we can use copy_to_user() to push the bitmap out.  We're not
>>> getting any of those advantages here.
>>>    
>>
>> That would still not work if dma range size is not multiples of 8 pages.
>> See example above.
> 
> I don't understand this comment, what about the example above justifies
> the bitmap?

copy_to_user() could be used if dma range size is not multiple of 8 pages.

>  As I understand the above algorithm, we find a vfio_dma
> overlapping the request and populate the bitmap for that range.  Then
> we go back and put_user() for each byte that we touched.  We could
> instead simply work on a one byte buffer as we enumerate the requested
> range and do a put_user() every time we reach the end of it and have bits
> set. That would greatly simplify the above example.  But I would expect
> that we're a) more likely to get asked for ranges covering a single
> vfio_dma 

QEMU asks for a single vfio_dma during each iteration.

If we restrict this ABI to cover a single vfio_dma only, then it
simplifies the logic here. That was my original suggestion. Should we
think about that again?

> and b) we're going to spend far more time operating in the
> middle of the range and limiting ourselves to one-byte operations there
> seems absurd.  If we want to specify that the user provides 4-byte
> aligned buffers and naturally aligned iova ranges to make our lives
> easier in the kernel, now would be the time to do that.
> 
>>>> +	}
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
>>>> +{
>>>> +	long bsize;
>>>> +
>>>> +	if (!bitmap_size || bitmap_size > SIZE_MAX)
>>>> +		return -EINVAL;
>>>> +
>>>> +	bsize = dirty_bitmap_bytes(npages);
>>>> +
>>>> +	if (bitmap_size < bsize)
>>>> +		return -EINVAL;
>>>> +
>>>> +	return bsize;
>>>> +}
>>>
>>> Seems like this could simply return int, -errno or zero for success.
>>> The returned bsize is not used for anything else.
>>>    
>>
>> ok.
>>
>>>> +
>>>>    static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>>>    			     struct vfio_iommu_type1_dma_unmap *unmap)
>>>>    {
>>>> @@ -2277,6 +2478,80 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>>>>    
>>>>    		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>>>>    			-EFAULT : 0;
>>>> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
>>>> +		struct vfio_iommu_type1_dirty_bitmap range;
>>>> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
>>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
>>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
>>>> +		int ret;
>>>> +
>>>> +		if (!iommu->v2)
>>>> +			return -EACCES;
>>>> +
>>>> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
>>>> +				    bitmap);
>>>
>>> We require the user to provide iova, size, pgsize, bitmap_size, and
>>> bitmap fields to START/STOP?  Why?
>>>   
>>
>> No. But those are part of structure.
> 
> But we do require it, minsz here includes all those fields, which would
> probably make a user scratch their head wondering why they need to pass
> irrelevant data for START/STOP.  It almost implies that we support
> starting and stopping dirty logging for specific ranges of the IOVA
> space.  We could define the structure, for example:
> 
> struct vfio_iommu_type1_dirty_bitmap {
> 	__u32	argsz;
> 	__u32	flags;
> 	__u8	data[];
> };
> 
> struct vfio_iommu_type1_dirty_bitmap_get {
> 	__u64	iova;
> 	__u64	size;
> 	__u64	pgsize;
> 	__u64	bitmap_size;
> 	void __user *bitmap;
> };
> 
> Where data[] is defined as the latter structure when FLAG_GET_BITMAP is
> specified.

Ok. Changing as above.

>  BTW, don't we need to specify the trailing void* as __u64?
> We could theoretically be talking to an ILP32 user process.  Thanks,
> 

Even on ILP32, a void* field reserves the size required to hold a
pointer address. I don't think using void* should be a problem.

Thanks,
Kirti


> Alex
> 
>>>> +
>>>> +		if (copy_from_user(&range, (void __user *)arg, minsz))
>>>> +			return -EFAULT;
>>>> +
>>>> +		if (range.argsz < minsz || range.flags & ~mask)
>>>> +			return -EINVAL;
>>>> +
>>>> +		/* only one flag should be set at a time */
>>>> +		if (__ffs(range.flags) != __fls(range.flags))
>>>> +			return -EINVAL;
>>>> +
>>>> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
>>>> +			unsigned long iommu_pgsizes = vfio_pgsize_bitmap(iommu);
>>>> +
>>>> +			mutex_lock(&iommu->lock);
>>>> +			iommu->dirty_page_tracking = true;
>>>> +			ret = vfio_dma_all_bitmap_alloc(iommu, iommu_pgsizes);
>>>
>>> So dirty page tracking is enabled even if we fail to allocate all the
>>> bitmaps?  Shouldn't this return an error if dirty tracking is already
>>> enabled?
>>>    
>>
>> Adding error handling here in next patch.
>>
>>>> +			mutex_unlock(&iommu->lock);
>>>> +			return ret;
>>>> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
>>>> +			mutex_lock(&iommu->lock);
>>>> +			iommu->dirty_page_tracking = false;
>>>
>>> Shouldn't we only allow STOP if tracking is enabled?
>>>    
>>
>> Right,adding.
>>
>>>> +			vfio_dma_all_bitmap_free(iommu);
>>>
>>> Here's where that user induced double free enters the picture.
>>>    
>>
>> Error handling as mentioned above will prevent double free.
>>
>> Thanks,
>> Kirti
>>
>>>> +			vfio_remove_unpinned_from_dma_list(iommu);
>>>> +			mutex_unlock(&iommu->lock);
>>>> +			return 0;
>>>> +		} else if (range.flags &
>>>> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
>>>> +			long bsize;
>>>> +			unsigned long pgshift = __ffs(range.pgsize);
>>>> +			uint64_t iommu_pgsizes = vfio_pgsize_bitmap(iommu);
>>>> +			uint64_t iommu_pgmask =
>>>> +				 ((uint64_t)1 << __ffs(iommu_pgsizes)) - 1;
>>>> +
>>>> +			if ((range.pgsize & iommu_pgsizes) != range.pgsize)
>>>> +				return -EINVAL;
>>>> +			if (range.iova & iommu_pgmask)
>>>> +				return -EINVAL;
>>>> +			if (!range.size || range.size & iommu_pgmask)
>>>> +				return -EINVAL;
>>>> +			if (range.iova + range.size < range.iova)
>>>> +				return -EINVAL;
>>>> +			if (!access_ok((void __user *)range.bitmap,
>>>> +				       range.bitmap_size))
>>>> +				return -EINVAL;
>>>> +
>>>> +			bsize = verify_bitmap_size(range.size >> pgshift,
>>>> +						   range.bitmap_size);
>>>> +			if (bsize < 0)
>>>> +				return bsize;
>>>> +
>>>> +			mutex_lock(&iommu->lock);
>>>> +			if (iommu->dirty_page_tracking)
>>>> +				ret = vfio_iova_dirty_bitmap(iommu, range.iova,
>>>> +					 range.size, range.pgsize,
>>>> +					 (unsigned char __user *)range.bitmap);
>>>> +			else
>>>> +				ret = -EINVAL;
>>>> +			mutex_unlock(&iommu->lock);
>>>> +
>>>> +			return ret;
>>>> +		}
>>>>    	}
>>>>    
>>>>    	return -ENOTTY;
>>>
>>> Thanks,
>>> Alex
>>>    
>>
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
@ 2020-02-13 20:11           ` Kirti Wankhede
  0 siblings, 0 replies; 62+ messages in thread
From: Kirti Wankhede @ 2020-02-13 20:11 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, kvm, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, eauger, felipe,
	jonathan.davies, yan.y.zhao, changpeng.liu, Ken.Xue


<snip>

>>>>    
>>>> +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
>>>> +				  size_t size, uint64_t pgsize,
>>>> +				  unsigned char __user *bitmap)
>>>> +{
>>>> +	struct vfio_dma *dma;
>>>> +	dma_addr_t i = iova, iova_limit;
>>>> +	unsigned int bsize, nbits = 0, l = 0;
>>>> +	unsigned long pgshift = __ffs(pgsize);
>>>> +
>>>> +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
>>>> +		int ret, j;
>>>> +		unsigned int npages = 0, shift = 0;
>>>> +		unsigned char temp = 0;
>>>> +
>>>> +		/* mark all pages dirty if all pages are pinned and mapped. */
>>>> +		if (dma->iommu_mapped) {
>>>> +			iova_limit = min(dma->iova + dma->size, iova + size);
>>>> +			npages = iova_limit/pgsize;
>>>> +			bitmap_set(dma->bitmap, 0, npages);
>>>
>>> npages is derived from iova_limit, which is the number of bits to set
>>> dirty relative to the first requested iova, not iova zero, ie. the set
>>> of dirty bits is offset from those requested unless iova == dma->iova.
>>>    
>>
>> Right, fixing.
>>
>>> Also I hope dma->bitmap was actually allocated.  Not only does the
>>> START error path potentially leave dirty tracking enabled without all
>>> the bitmap allocated, when does the bitmap get allocated for a new
>>> vfio_dma when dirty tracking is enabled?  Seems it only occurs if a
>>> vpfn gets marked dirty.
>>>    
>>
>> Right.
>>
>> Fixing error paths.
>>
>>
>>>> +		} else if (dma->bitmap) {
>>>> +			struct rb_node *n = rb_first(&dma->pfn_list);
>>>> +			bool found = false;
>>>> +
>>>> +			for (; n; n = rb_next(n)) {
>>>> +				struct vfio_pfn *vpfn = rb_entry(n,
>>>> +						struct vfio_pfn, node);
>>>> +				if (vpfn->iova >= i) {
>>>> +					found = true;
>>>> +					break;
>>>> +				}
>>>> +			}
>>>> +
>>>> +			if (!found) {
>>>> +				i += dma->size;
>>>> +				continue;
>>>> +			}
>>>> +
>>>> +			for (; n; n = rb_next(n)) {
>>>> +				unsigned int s;
>>>> +				struct vfio_pfn *vpfn = rb_entry(n,
>>>> +						struct vfio_pfn, node);
>>>> +
>>>> +				if (vpfn->iova >= iova + size)
>>>> +					break;
>>>> +
>>>> +				s = (vpfn->iova - dma->iova) >> pgshift;
>>>> +				bitmap_set(dma->bitmap, s, 1);
>>>> +
>>>> +				iova_limit = vpfn->iova + pgsize;
>>>> +			}
>>>> +			npages = iova_limit/pgsize;
>>>
>>> Isn't iova_limit potentially uninitialized here?  For example, if our
>>> vfio_dma covers {0,8192} and we ask for the bitmap of {0,4096} and
>>> there's a vpfn at {4096,8192}.  I think that means vpfn->iova >= i
>>> (4096 >= 0), so we break with found = true, then we test 4096 >= 0 +
>>> 4096 and break, and npages = ????/pgsize.
>>>    
>>
>> Right, Fixing it.
>>
>>>> +		}
>>>> +
>>>> +		bsize = dirty_bitmap_bytes(npages);
>>>> +		shift = nbits % BITS_PER_BYTE;
>>>> +
>>>> +		if (npages && shift) {
>>>> +			l--;
>>>> +			if (!access_ok((void __user *)bitmap + l,
>>>> +					sizeof(unsigned char)))
>>>> +				return -EINVAL;
>>>> +
>>>> +			ret = __get_user(temp, bitmap + l);
>>>
>>> I don't understand why we care to get the user's bitmap, are we trying
>>> to leave whatever garbage they might have set in it and only also set
>>> the dirty bits?  That seems unnecessary.
>>>    
>>
>> Suppose dma mapped ranges are {start, size}:
>> {0, 0xa000}, {0xa000, 0x10000}
>>
>> Bitmap asked from 0 - 0x10000. Say suppose all pages are dirty.
>> Then in first iteration for dma {0,0xa000} there are 10 pages, so 10
>> bits are set, put_user() happens for 2 bytes, (00000011 11111111b).
>> In second iteration for dma {0xa000, 0x10000} there are 6 pages and
>> these bits should be appended to previous byte. So get_user() that byte,
>> then shift-OR rest of the bitmap, result should be: (11111111 11111111b)
>>
>> Without get_user() and shift-OR, resulting bitmap would be
>> 111111 00000011 11111111b which would be wrong.
> 
> Seems like if we use a put_user() approach then we should look for
> adjacent vfio_dmas within the same byte/word/dword before we push it to
> the user to avoid this sort of inefficiency.
> 

Won't that add more complication to logic?

>>> Also why do we need these access_ok() checks when we already checked
>>> the range at the start of the ioctl?
>>
>> Since pointer is updated runtime here, better to check that pointer
>> before using that pointer.
> 
> Sorry, I still don't understand this, we check access_ok() with a
> pointer and a length, therefore as long as we're incrementing the
> pointer within that length, why do we need to retest?
> 

Ideally, the caller of put_user() and get_user() should check the pointer
passed to these functions with access_ok() before calling them. That makes
sure the pointer is still valid after pointer arithmetic. Maybe let's
remove the earlier check of pointer and length, but keep these checks.

>>>    
>>>> +			if (ret)
>>>> +				return ret;
>>>> +		}
>>>> +
>>>> +		for (j = 0; j < bsize; j++, l++) {
>>>> +			temp = temp |
>>>> +			       (*((unsigned char *)dma->bitmap + j) << shift);
>>>
>>> |=
>>>    
>>>> +			if (!access_ok((void __user *)bitmap + l,
>>>> +					sizeof(unsigned char)))
>>>> +				return -EINVAL;
>>>> +
>>>> +			ret = __put_user(temp, bitmap + l);
>>>> +			if (ret)
>>>> +				return ret;
>>>> +			if (shift) {
>>>> +				temp = *((unsigned char *)dma->bitmap + j) >>
>>>> +					(BITS_PER_BYTE - shift);
>>>> +			}
>>>
>>> When shift == 0, temp just seems to accumulate bits that never get
>>> cleared.
>>>    
>>
>> Hope example above explains the shift logic.
> 
> But that example is when shift is non-zero.  When shift is zero, each
> iteration of the loop just ORs in new bits to temp without ever
> clearing the bits for the previous iteration.
> 
> 

Oh right, fixing it.
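
For illustration, here is a minimal userspace sketch of the byte-merge
step discussed above, with the carry reset on every pass so that the
shift == 0 case does not accumulate stale bits. It works on plain memory
rather than user pointers; append_bits() and the two demo helpers are
hypothetical names, not code from the patch, and the sketch assumes the
unused high bits of the last source byte are zero.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Append nbits bits of src to dst, starting at bit offset dst_bits.
 * Assumes dst has room for (dst_bits + nbits) bits.
 */
static void append_bits(unsigned char *dst, size_t dst_bits,
			const unsigned char *src, size_t nbits)
{
	size_t shift = dst_bits % CHAR_BIT;
	size_t l = dst_bits / CHAR_BIT;
	size_t nbytes = (nbits + CHAR_BIT - 1) / CHAR_BIT;
	size_t total = (dst_bits + nbits + CHAR_BIT - 1) / CHAR_BIT;
	unsigned char carry = 0;
	size_t j;

	assert(dst && src);

	/* Keep the bits already stored in the partially filled byte. */
	if (shift)
		carry = dst[l] & (unsigned char)((1u << shift) - 1);

	for (j = 0; j < nbytes; j++, l++) {
		dst[l] = (unsigned char)(carry | (src[j] << shift));
		/* Reset the carry on every pass, also when shift == 0. */
		carry = shift ?
			(unsigned char)(src[j] >> (CHAR_BIT - shift)) : 0;
	}

	/* Spill the last carry if the result crosses into one more byte. */
	if (l < total)
		dst[l] = carry;
}

/* The example from this thread: 10 dirty pages, then 6 dirty pages. */
static unsigned int demo_unaligned(void)
{
	unsigned char out[2] = { 0xff, 0x03 };	/* first range already stored */
	const unsigned char second[1] = { 0x3f };

	append_bits(out, 10, second, 6);
	return out[0] | (out[1] << 8);
}

/* shift == 0: consecutive bytes must not leak into each other. */
static unsigned int demo_aligned(void)
{
	unsigned char out[2] = { 0, 0 };
	const unsigned char src[2] = { 0x01, 0x02 };

	append_bits(out, 0, src, 16);
	return out[0] | (out[1] << 8);
}
```

With the carry left uncleared, the second byte in the aligned case would
pick up the bits of the first one.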

>>>> +		}
>>>> +
>>>> +		nbits += npages;
>>>> +
>>>> +		i = min(dma->iova + dma->size, iova + size);
>>>> +		if (i >= iova + size)
>>>> +			break;
>>>
>>> So whether we error or succeed, we leave cruft in dma->bitmap for the
>>> next pass.  It doesn't seem to make any sense why we pre-allocated the
>>> bitmap, we might as well just allocate it on demand here.  Actually, if
>>> we're not going to do a copy_to_user() for some range of the bitmap,
>>> I'm not sure what it's purpose is at all.  I think the big advantages
>>> of the bitmap are that we can't amortize the cost across every pinned
>>> page or DMA mapping, we don't need the overhead of tracking unmapped
>>> vpfns, and we can use copy_to_user() to push the bitmap out.  We're not
>>> getting any of those advantages here.
>>>    
>>
>> That would still not work if dma range size is not multiples of 8 pages.
>> See example above.
> 
> I don't understand this comment, what about the example above justifies
> the bitmap?

copy_to_user() could be used if dma range size is not multiple of 8 pages.

>  As I understand the above algorithm, we find a vfio_dma
> overlapping the request and populate the bitmap for that range.  Then
> we go back and put_user() for each byte that we touched.  We could
> instead simply work on a one byte buffer as we enumerate the requested
> range and do a put_user() every time we reach the end of it and have bits
> set. That would greatly simplify the above example.  But I would expect
> that we're a) more likely to get asked for ranges covering a single
> vfio_dma 

QEMU asks for a single vfio_dma during each iteration.

If we restrict this ABI to cover a single vfio_dma only, then it
simplifies the logic here. That was my original suggestion. Should we
think about that again?

> and b) we're going to spend far more time operating in the
> middle of the range and limiting ourselves to one-byte operations there
> seems absurd.  If we want to specify that the user provides 4-byte
> aligned buffers and naturally aligned iova ranges to make our lives
> easier in the kernel, now would be the time to do that.
> 
>>>> +	}
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
>>>> +{
>>>> +	long bsize;
>>>> +
>>>> +	if (!bitmap_size || bitmap_size > SIZE_MAX)
>>>> +		return -EINVAL;
>>>> +
>>>> +	bsize = dirty_bitmap_bytes(npages);
>>>> +
>>>> +	if (bitmap_size < bsize)
>>>> +		return -EINVAL;
>>>> +
>>>> +	return bsize;
>>>> +}
>>>
>>> Seems like this could simply return int, -errno or zero for success.
>>> The returned bsize is not used for anything else.
>>>    
>>
>> ok.
>>
>>>> +
>>>>    static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>>>    			     struct vfio_iommu_type1_dma_unmap *unmap)
>>>>    {
>>>> @@ -2277,6 +2478,80 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>>>>    
>>>>    		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>>>>    			-EFAULT : 0;
>>>> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
>>>> +		struct vfio_iommu_type1_dirty_bitmap range;
>>>> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
>>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
>>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
>>>> +		int ret;
>>>> +
>>>> +		if (!iommu->v2)
>>>> +			return -EACCES;
>>>> +
>>>> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
>>>> +				    bitmap);
>>>
>>> We require the user to provide iova, size, pgsize, bitmap_size, and
>>> bitmap fields to START/STOP?  Why?
>>>   
>>
>> No. But those are part of structure.
> 
> But we do require it, minsz here includes all those fields, which would
> probably make a user scratch their head wondering why they need to pass
> irrelevant data for START/STOP.  It almost implies that we support
> starting and stopping dirty logging for specific ranges of the IOVA
> space.  We could define the structure, for example:
> 
> struct vfio_iommu_type1_dirty_bitmap {
> 	__u32	argsz;
> 	__u32	flags;
> 	__u8	data[];
> };
> 
> struct vfio_iommu_type1_dirty_bitmap_get {
> 	__u64	iova;
> 	__u64	size;
> 	__u64	pgsize;
> 	__u64	bitmap_size;
> 	void __user *bitmap;
> };
> 
> Where data[] is defined as the latter structure when FLAG_GET_BITMAP is
> specified.

Ok. Changing as above.
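
As a rough sketch of what the split header/payload layout means for
argument checking: START and STOP would only need the 8-byte header,
while GET_BITMAP would also need the payload. The flag values, the
dirty_* names and the use of a 64-bit field for the bitmap pointer are
assumptions made for this example, not the final uapi. The explicit
!flags test is there because the kernel's __ffs() is undefined for a
zero argument.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values, for illustration only. */
#define DIRTY_FLAG_START	(1u << 0)
#define DIRTY_FLAG_STOP		(1u << 1)
#define DIRTY_FLAG_GET_BITMAP	(1u << 2)
#define DIRTY_FLAG_MASK		(DIRTY_FLAG_START | DIRTY_FLAG_STOP | \
				 DIRTY_FLAG_GET_BITMAP)

struct dirty_hdr {
	uint32_t argsz;
	uint32_t flags;
	uint8_t  data[];	/* struct dirty_get when GET_BITMAP is set */
};

struct dirty_get {
	uint64_t iova;
	uint64_t size;
	uint64_t pgsize;
	uint64_t bitmap_size;
	uint64_t bitmap;	/* user pointer carried as a 64-bit value */
};

static_assert(sizeof(struct dirty_get) == 40, "payload layout is fixed");

/* Return 0 if argsz and flags are acceptable, -EINVAL otherwise. */
static int dirty_check_args(uint32_t argsz, uint32_t flags)
{
	size_t minsz = offsetof(struct dirty_hdr, data);

	/* Exactly one known flag: non-zero, no unknown bits, power of two. */
	if (!flags || (flags & ~DIRTY_FLAG_MASK) || (flags & (flags - 1)))
		return -EINVAL;

	/* Only GET_BITMAP needs the range and bitmap fields. */
	if (flags & DIRTY_FLAG_GET_BITMAP)
		minsz += sizeof(struct dirty_get);

	return argsz < minsz ? -EINVAL : 0;
}
```

A caller issuing START or STOP then passes argsz = 8 and nothing else.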

>  BTW, don't we need to specify the trailing void* as __u64?
> We could theoretically be talking to an ILP32 user process.  Thanks,
> 

Even on ILP32, using a void* pointer will reserve the size required to
store a pointer address. I don't think using void* should be a problem.

Thanks,
Kirti


> Alex
> 
>>>> +
>>>> +		if (copy_from_user(&range, (void __user *)arg, minsz))
>>>> +			return -EFAULT;
>>>> +
>>>> +		if (range.argsz < minsz || range.flags & ~mask)
>>>> +			return -EINVAL;
>>>> +
>>>> +		/* only one flag should be set at a time */
>>>> +		if (__ffs(range.flags) != __fls(range.flags))
>>>> +			return -EINVAL;
>>>> +
>>>> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
>>>> +			unsigned long iommu_pgsizes = vfio_pgsize_bitmap(iommu);
>>>> +
>>>> +			mutex_lock(&iommu->lock);
>>>> +			iommu->dirty_page_tracking = true;
>>>> +			ret = vfio_dma_all_bitmap_alloc(iommu, iommu_pgsizes);
>>>
>>> So dirty page tracking is enabled even if we fail to allocate all the
>>> bitmaps?  Shouldn't this return an error if dirty tracking is already
>>> enabled?
>>>    
>>
>> Adding error handling here in next patch.
>>
>>>> +			mutex_unlock(&iommu->lock);
>>>> +			return ret;
>>>> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
>>>> +			mutex_lock(&iommu->lock);
>>>> +			iommu->dirty_page_tracking = false;
>>>
>>> Shouldn't we only allow STOP if tracking is enabled?
>>>    
>>
>> Right,adding.
>>
>>>> +			vfio_dma_all_bitmap_free(iommu);
>>>
>>> Here's where that user induced double free enters the picture.
>>>    
>>
>> Error handling as mentioned above will prevent double free.
>>
>> Thanks,
>> Kirti
>>
>>>> +			vfio_remove_unpinned_from_dma_list(iommu);
>>>> +			mutex_unlock(&iommu->lock);
>>>> +			return 0;
>>>> +		} else if (range.flags &
>>>> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
>>>> +			long bsize;
>>>> +			unsigned long pgshift = __ffs(range.pgsize);
>>>> +			uint64_t iommu_pgsizes = vfio_pgsize_bitmap(iommu);
>>>> +			uint64_t iommu_pgmask =
>>>> +				 ((uint64_t)1 << __ffs(iommu_pgsizes)) - 1;
>>>> +
>>>> +			if ((range.pgsize & iommu_pgsizes) != range.pgsize)
>>>> +				return -EINVAL;
>>>> +			if (range.iova & iommu_pgmask)
>>>> +				return -EINVAL;
>>>> +			if (!range.size || range.size & iommu_pgmask)
>>>> +				return -EINVAL;
>>>> +			if (range.iova + range.size < range.iova)
>>>> +				return -EINVAL;
>>>> +			if (!access_ok((void __user *)range.bitmap,
>>>> +				       range.bitmap_size))
>>>> +				return -EINVAL;
>>>> +
>>>> +			bsize = verify_bitmap_size(range.size >> pgshift,
>>>> +						   range.bitmap_size);
>>>> +			if (bsize < 0)
>>>> +				return bsize;
>>>> +
>>>> +			mutex_lock(&iommu->lock);
>>>> +			if (iommu->dirty_page_tracking)
>>>> +				ret = vfio_iova_dirty_bitmap(iommu, range.iova,
>>>> +					 range.size, range.pgsize,
>>>> +					 (unsigned char __user *)range.bitmap);
>>>> +			else
>>>> +				ret = -EINVAL;
>>>> +			mutex_unlock(&iommu->lock);
>>>> +
>>>> +			return ret;
>>>> +		}
>>>>    	}
>>>>    
>>>>    	return -ENOTTY;
>>>
>>> Thanks,
>>> Alex
>>>    
>>
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2020-02-13 20:11           ` Kirti Wankhede
@ 2020-02-13 23:20             ` Alex Williamson
  -1 siblings, 0 replies; 62+ messages in thread
From: Alex Williamson @ 2020-02-13 23:20 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Fri, 14 Feb 2020 01:41:35 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> <snip>
> 
> >>>>    
> >>>> +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> >>>> +				  size_t size, uint64_t pgsize,
> >>>> +				  unsigned char __user *bitmap)
> >>>> +{
> >>>> +	struct vfio_dma *dma;
> >>>> +	dma_addr_t i = iova, iova_limit;
> >>>> +	unsigned int bsize, nbits = 0, l = 0;
> >>>> +	unsigned long pgshift = __ffs(pgsize);
> >>>> +
> >>>> +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
> >>>> +		int ret, j;
> >>>> +		unsigned int npages = 0, shift = 0;
> >>>> +		unsigned char temp = 0;
> >>>> +
> >>>> +		/* mark all pages dirty if all pages are pinned and mapped. */
> >>>> +		if (dma->iommu_mapped) {
> >>>> +			iova_limit = min(dma->iova + dma->size, iova + size);
> >>>> +			npages = iova_limit/pgsize;
> >>>> +			bitmap_set(dma->bitmap, 0, npages);  
> >>>
> >>> npages is derived from iova_limit, which is the number of bits to set
> >>> dirty relative to the first requested iova, not iova zero, ie. the set
> >>> of dirty bits is offset from those requested unless iova == dma->iova.
> >>>      
> >>
> >> Right, fixing.
> >>  
> >>> Also I hope dma->bitmap was actually allocated.  Not only does the
> >>> START error path potentially leave dirty tracking enabled without all
> >>> the bitmap allocated, when does the bitmap get allocated for a new
> >>> vfio_dma when dirty tracking is enabled?  Seems it only occurs if a
> >>> vpfn gets marked dirty.
> >>>      
> >>
> >> Right.
> >>
> >> Fixing error paths.
> >>
> >>  
> >>>> +		} else if (dma->bitmap) {
> >>>> +			struct rb_node *n = rb_first(&dma->pfn_list);
> >>>> +			bool found = false;
> >>>> +
> >>>> +			for (; n; n = rb_next(n)) {
> >>>> +				struct vfio_pfn *vpfn = rb_entry(n,
> >>>> +						struct vfio_pfn, node);
> >>>> +				if (vpfn->iova >= i) {
> >>>> +					found = true;
> >>>> +					break;
> >>>> +				}
> >>>> +			}
> >>>> +
> >>>> +			if (!found) {
> >>>> +				i += dma->size;
> >>>> +				continue;
> >>>> +			}
> >>>> +
> >>>> +			for (; n; n = rb_next(n)) {
> >>>> +				unsigned int s;
> >>>> +				struct vfio_pfn *vpfn = rb_entry(n,
> >>>> +						struct vfio_pfn, node);
> >>>> +
> >>>> +				if (vpfn->iova >= iova + size)
> >>>> +					break;
> >>>> +
> >>>> +				s = (vpfn->iova - dma->iova) >> pgshift;
> >>>> +				bitmap_set(dma->bitmap, s, 1);
> >>>> +
> >>>> +				iova_limit = vpfn->iova + pgsize;
> >>>> +			}
> >>>> +			npages = iova_limit/pgsize;  
> >>>
> >>> Isn't iova_limit potentially uninitialized here?  For example, if our
> >>> vfio_dma covers {0,8192} and we ask for the bitmap of {0,4096} and
> >>> there's a vpfn at {4096,8192}.  I think that means vpfn->iova >= i
> >>> (4096 >= 0), so we break with found = true, then we test 4096 >= 0 +
> >>> 4096 and break, and npages = ????/pgsize.
> >>>      
> >>
> >> Right, Fixing it.
> >>  
> >>>> +		}
> >>>> +
> >>>> +		bsize = dirty_bitmap_bytes(npages);
> >>>> +		shift = nbits % BITS_PER_BYTE;
> >>>> +
> >>>> +		if (npages && shift) {
> >>>> +			l--;
> >>>> +			if (!access_ok((void __user *)bitmap + l,
> >>>> +					sizeof(unsigned char)))
> >>>> +				return -EINVAL;
> >>>> +
> >>>> +			ret = __get_user(temp, bitmap + l);  
> >>>
> >>> I don't understand why we care to get the user's bitmap, are we trying
> >>> to leave whatever garbage they might have set in it and only also set
> >>> the dirty bits?  That seems unnecessary.
> >>>      
> >>
> >> Suppose dma mapped ranges are {start, size}:
> >> {0, 0xa000}, {0xa000, 0x10000}
> >>
> >> Bitmap asked from 0 - 0x10000. Say suppose all pages are dirty.
> >> Then in first iteration for dma {0,0xa000} there are 10 pages, so 10
> >> bits are set, put_user() happens for 2 bytes, (00000011 11111111b).
> >> In second iteration for dma {0xa000, 0x10000} there are 6 pages and
> >> these bits should be appended to previous byte. So get_user() that byte,
> >> then shift-OR rest of the bitmap, result should be: (11111111 11111111b)
> >>
> >> Without get_user() and shift-OR, resulting bitmap would be
> >> 111111 00000011 11111111b which would be wrong.  
> > 
> > Seems like if we use a put_user() approach then we should look for
> > adjacent vfio_dmas within the same byte/word/dword before we push it to
> > the user to avoid this sort of inefficiency.
> >   
> 
> Won't that add more complication to logic?

I'm tempted to think it might be less complicated.
 
> >>> Also why do we need these access_ok() checks when we already checked
> >>> the range at the start of the ioctl?  
> >>
> >> Since pointer is updated runtime here, better to check that pointer
> >> before using that pointer.  
> > 
> > Sorry, I still don't understand this, we check access_ok() with a
> > pointer and a length, therefore as long as we're incrementing the
> > pointer within that length, why do we need to retest?
> >   
> 
> Ideally, the caller of put_user() and get_user() should check the pointer
> passed to these functions with access_ok() before calling them. That makes
> sure the pointer is still valid after pointer arithmetic. Maybe let's
> remove the earlier check of pointer and length, but keep these checks.

So we don't trust that we can increment a pointer within a range that
we've already tested with access_ok() and expect it to still be ok?  I
think the point of having access_ok() and __put_user() is that we can
batch many __put_user() calls under a single access_ok() check.  I
don't see any justification here why if we already tested
access_ok(ptr, 2) that we still need to test access_ok(ptr + 0, 1) and
access_ok(ptr + 1, 1), and removing the initial test is clearly the
wrong optimization if we agree there is redundancy here.	

> >>>> +			if (ret)
> >>>> +				return ret;
> >>>> +		}
> >>>> +
> >>>> +		for (j = 0; j < bsize; j++, l++) {
> >>>> +			temp = temp |
> >>>> +			       (*((unsigned char *)dma->bitmap + j) << shift);  
> >>>
> >>> |=
> >>>      
> >>>> +			if (!access_ok((void __user *)bitmap + l,
> >>>> +					sizeof(unsigned char)))
> >>>> +				return -EINVAL;
> >>>> +
> >>>> +			ret = __put_user(temp, bitmap + l);
> >>>> +			if (ret)
> >>>> +				return ret;
> >>>> +			if (shift) {
> >>>> +				temp = *((unsigned char *)dma->bitmap + j) >>
> >>>> +					(BITS_PER_BYTE - shift);
> >>>> +			}  
> >>>
> >>> When shift == 0, temp just seems to accumulate bits that never get
> >>> cleared.
> >>>      
> >>
> >> Hope example above explains the shift logic.  
> > 
> > But that example is when shift is non-zero.  When shift is zero, each
> > iteration of the loop just ORs in new bits to temp without ever
> > clearing the bits for the previous iteration.
> > 
> >   
> 
> Oh right, fixing it.
> 
> >>>> +		}
> >>>> +
> >>>> +		nbits += npages;
> >>>> +
> >>>> +		i = min(dma->iova + dma->size, iova + size);
> >>>> +		if (i >= iova + size)
> >>>> +			break;  
> >>>
> >>> So whether we error or succeed, we leave cruft in dma->bitmap for the
> >>> next pass.  It doesn't seem to make any sense why we pre-allocated the
> >>> bitmap, we might as well just allocate it on demand here.  Actually, if
> >>> we're not going to do a copy_to_user() for some range of the bitmap,
> >>> I'm not sure what it's purpose is at all.  I think the big advantages
> >>> of the bitmap are that we can't amortize the cost across every pinned
> >>> page or DMA mapping, we don't need the overhead of tracking unmapped
> >>> vpfns, and we can use copy_to_user() to push the bitmap out.  We're not
> >>> getting any of those advantages here.
> >>>      
> >>
> >> That would still not work if dma range size is not multiples of 8 pages.
> >> See example above.  
> > 
> > I don't understand this comment, what about the example above justifies
> > the bitmap?  
> 
> copy_to_user() could be used if dma range size is not multiple of 8 pages.

s/is not/is/ ?

And we expect that to be a far more common case, right?  I don't think
there are too many ranges for a guest that are only mapped in sub-32KB
chunks.
 
> >  As I understand the above algorithm, we find a vfio_dma
> > overlapping the request and populate the bitmap for that range.  Then
> > we go back and put_user() for each byte that we touched.  We could
> > instead simply work on a one byte buffer as we enumerate the requested
> > range and do a put_user() every time we reach the end of it and have bits
> > set. That would greatly simplify the above example.  But I would expect
> > that we're a) more likely to get asked for ranges covering a single
> > vfio_dma   
> 
> QEMU ask for single vfio_dma during each iteration.
> 
> If we restrict this ABI to cover single vfio_dma only, then it 
> simplifies the logic here. That was my original suggestion. Should we 
> think about that again?

But we currently allow unmaps that overlap multiple vfio_dmas as long
as no vfio_dma is bisected, so I think that implies that an unmap while
asking for the dirty bitmap has even further restricted semantics.  I'm
also reluctant to design an ABI around what happens to be the current
QEMU implementation.

If we take your example above, ranges {0x0000,0xa000} and
{0xa000,0x10000} ({start,end}), I think you're working with the
following two bitmaps in this implementation:

00000011 11111111b
00111111b

And we need to combine those into:

11111111 11111111b

Right?

But it seems like that would be easier if the second bitmap was instead:

11111100b

Then we wouldn't need to worry about the entire bitmap being shifted by
the bit offset within the byte, which limits our fixes to the boundary
byte and allows us to use copy_to_user() directly for the bulk of the
copy.  So how do we get there?

I think we start with allocating the vfio_dma bitmap to account for
this initial offset, so we calculate bitmap_base_iova as:
  (iova & ~((PAGE_SIZE << 3) - 1))
We then use bitmap_base_iova in calculating which bits to set.

The user needs to follow the same rules, and maybe this adds some value
to the user providing the bitmap size rather than the kernel
calculating it.  For example, if the user wanted the dirty bitmap for
the range {0xa000,0x10000} above, they'd provide at least a 1 byte
bitmap, but we'd return bit #2 set to indicate 0xa000 is dirty.

Effectively the user can ask for any iova range, but the buffer will be
filled relative to the zeroth bit of the bitmap following the above
bitmap_base_iova formula (and replacing PAGE_SIZE with the user
requested pgsize).  I'm tempted to make this explicit in the user
interface (ie. only allow bitmaps starting on aligned pages), but a
user is able to map and unmap single pages and we need to support
returning a dirty bitmap with an unmap, so I don't think we can do that.
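
The base calculation above can be sketched in plain C as follows. This
is a userspace illustration with hypothetical helper names
(bitmap_bit(), bitmap_bytes()), using the requested pgsize in place of
PAGE_SIZE and assuming pgsize is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Start of the 8-page window (one bitmap byte) that contains iova. */
static uint64_t bitmap_base_iova(uint64_t iova, uint64_t pgsize)
{
	assert(pgsize && !(pgsize & (pgsize - 1)));
	return iova & ~((pgsize << 3) - 1);
}

/*
 * Bit index of the page at iova, counted from bit 0 of the first
 * bitmap byte of a request that starts at range_start.
 */
static unsigned int bitmap_bit(uint64_t range_start, uint64_t iova,
			       uint64_t pgsize)
{
	uint64_t base = bitmap_base_iova(range_start, pgsize);

	return (unsigned int)((iova - base) / pgsize);
}

/* Bytes a user buffer needs for the range [start, start + size). */
static uint64_t bitmap_bytes(uint64_t start, uint64_t size, uint64_t pgsize)
{
	uint64_t base = bitmap_base_iova(start, pgsize);
	uint64_t bits = (start + size - base) / pgsize;

	return (bits + 7) / 8;
}
```

For the range {0xa000,0x10000} with 4K pages this gives a base of
0x8000, a one byte bitmap, and bit #2 for the page at 0xa000.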

So now are we biting off more than we can chew trying to transpose the
bitmap between page sizes?  If asked for the previous range with an 8K
pgsize, we'd somehow need to translate 11111100b into 00001110b.
What's worse, the user could ask for just the 8K page at 0xa000 and we'd
need to return 00000010b while leaving our internal bitmap as
11110000b after we mark the bits clean.  Seems like this is really
only tenable if we do multiples of PAGE_SIZE pages within a byte, so
for 4K we'd have 32K, 64K, 128K, 256K, etc.  I'm somewhat losing sight
of what this accomplishes though and whether we need this in the first
implementation.  Should we simplify by dropping this aspect of it,
supporting only the minimum iommu page size, and focus on actually
using the bitmaps effectively?
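
To make the page size transposition concrete, here is a small sketch of
what reporting an 8K granule from a 4K internal bitmap would involve for
a single byte. The helper names are made up for this example, and it
assumes both bitmaps are counted from the same base iova (0x8000 in the
example above).

```c
#include <assert.h>

/* Fold one byte of a 4K bitmap into 8K granules (result in low nibble). */
static unsigned char fold_4k_to_8k(unsigned char b4k)
{
	unsigned char out = 0;
	unsigned int m;

	/* 8K page m is dirty if either of its two 4K pages is dirty. */
	for (m = 0; m < 4; m++)
		if (b4k & (3u << (2 * m)))
			out |= (unsigned char)(1u << m);
	return out;
}

/* Report 8K page m from *b4k and mark both underlying 4K pages clean. */
static unsigned char get_and_clear_8k(unsigned char *b4k, unsigned int m)
{
	unsigned char mask, dirty;

	assert(b4k && m < 4);
	mask = (unsigned char)(3u << (2 * m));
	dirty = (*b4k & mask) ? (unsigned char)(1u << m) : 0;
	*b4k &= (unsigned char)~mask;
	return dirty;
}

/* The single 8K page at 0xa000: returned bits in the high byte. */
static unsigned int demo_single_8k_page(void)
{
	unsigned char internal = 0xfc;	/* 11111100b */
	unsigned char reported = get_and_clear_8k(&internal, 1);

	return ((unsigned int)reported << 8) | internal;
}
```

This reproduces 11111100b to 00001110b for the whole byte, and
00000010b with 11110000b left behind for the single 8K page.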
 
> > and b) we're going to spend far more time operating in the
> > middle of the range and limiting ourselves to one-byte operations there
> > seems absurd.  If we want to specify that the user provides 4-byte
> > aligned buffers and naturally aligned iova ranges to make our lives
> > easier in the kernel, now would be the time to do that.
> >   
> >>>> +	}
> >>>> +	return 0;
> >>>> +}
> >>>> +
> >>>> +static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
> >>>> +{
> >>>> +	long bsize;
> >>>> +
> >>>> +	if (!bitmap_size || bitmap_size > SIZE_MAX)
> >>>> +		return -EINVAL;
> >>>> +
> >>>> +	bsize = dirty_bitmap_bytes(npages);
> >>>> +
> >>>> +	if (bitmap_size < bsize)
> >>>> +		return -EINVAL;
> >>>> +
> >>>> +	return bsize;
> >>>> +}  
> >>>
> >>> Seems like this could simply return int, -errno or zero for success.
> >>> The returned bsize is not used for anything else.
> >>>      
> >>
> >> ok.
> >>  
> >>>> +
> >>>>    static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> >>>>    			     struct vfio_iommu_type1_dma_unmap *unmap)
> >>>>    {
> >>>> @@ -2277,6 +2478,80 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >>>>    
> >>>>    		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> >>>>    			-EFAULT : 0;
> >>>> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
> >>>> +		struct vfio_iommu_type1_dirty_bitmap range;
> >>>> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
> >>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
> >>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> >>>> +		int ret;
> >>>> +
> >>>> +		if (!iommu->v2)
> >>>> +			return -EACCES;
> >>>> +
> >>>> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
> >>>> +				    bitmap);  
> >>>
> >>> We require the user to provide iova, size, pgsize, bitmap_size, and
> >>> bitmap fields to START/STOP?  Why?
> >>>     
> >>
> >> No. But those are part of structure.  
> > 
> > But we do require it, minsz here includes all those fields, which would
> > probably make a user scratch their head wondering why they need to pass
> > irrelevant data for START/STOP.  It almost implies that we support
> > starting and stopping dirty logging for specific ranges of the IOVA
> > space.  We could define the structure, for example:
> > 
> > struct vfio_iommu_type1_dirty_bitmap {
> > 	__u32	argsz;
> > 	__u32	flags;
> > 	__u8	data[];
> > };
> > 
> > struct vfio_iommu_type1_dirty_bitmap_get {
> > 	__u64	iova;
> > 	__u64	size;
> > 	__u64	pgsize;
> > 	__u64	bitmap_size;
> > 	void __user *bitmap;
> > };
> > 
> > Where data[] is defined as the latter structure when FLAG_GET_BITMAP is
> > specified.  
> 
> Ok. Changing as above.
> 
> >  BTW, don't we need to specify the trailing void* as __u64?
> > We could theoretically be talking to an ILP32 user process.  Thanks,
> >   
> 
> Even on ILP32, using void* pointer will reserve the size required to 
> save a pointer address. I don't think using void* should be problem.

I think you're still assuming sizeof(void *) is the same in kernel vs
userspace whereas I'm thinking about an ILP32 user running on an LP64
kernel.  Thanks,

Alex
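
A short sketch of the layout concern, under the assumption that the
pointer is carried in a fixed-width field: the first struct has the same
size and offsets no matter how wide a pointer is in the process that
fills it in, while the size of the last member of the second struct
follows sizeof(void *) of whoever compiles it. The struct and helper
names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Payload with a fixed-width field for the user pointer. */
struct dirty_get_u64 {
	uint64_t iova;
	uint64_t size;
	uint64_t pgsize;
	uint64_t bitmap_size;
	uint64_t bitmap;	/* user address, cast from a pointer */
};

/* Same fields, but the last one is a native pointer. */
struct dirty_get_ptr {
	uint64_t iova;
	uint64_t size;
	uint64_t pgsize;
	uint64_t bitmap_size;
	void *bitmap;
};

/* These hold for both ILP32 and LP64 builds of this file. */
static_assert(sizeof(struct dirty_get_u64) == 40, "fixed size");
static_assert(offsetof(struct dirty_get_u64, bitmap) == 32, "fixed offset");

/* How userspace would store a pointer into the fixed-width field. */
static uint64_t ptr_to_u64(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}

/* Bytes of the pointer variant that actually carry the address. */
static size_t ptr_field_bytes(void)
{
	return sizeof(((struct dirty_get_ptr *)0)->bitmap);
}
```

Built as ILP32, ptr_field_bytes() is 4, so a 64-bit kernel reading an
8-byte pointer from that struct would not see the same field.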


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
@ 2020-02-13 23:20             ` Alex Williamson
  0 siblings, 0 replies; 62+ messages in thread
From: Alex Williamson @ 2020-02-13 23:20 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, kevin.tian, yi.l.liu, cjia, kvm, eskultet,
	ziye.yang, qemu-devel, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, eauger, felipe,
	jonathan.davies, yan.y.zhao, changpeng.liu, Ken.Xue

On Fri, 14 Feb 2020 01:41:35 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> <snip>
> 
> >>>>    
> >>>> +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> >>>> +				  size_t size, uint64_t pgsize,
> >>>> +				  unsigned char __user *bitmap)
> >>>> +{
> >>>> +	struct vfio_dma *dma;
> >>>> +	dma_addr_t i = iova, iova_limit;
> >>>> +	unsigned int bsize, nbits = 0, l = 0;
> >>>> +	unsigned long pgshift = __ffs(pgsize);
> >>>> +
> >>>> +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
> >>>> +		int ret, j;
> >>>> +		unsigned int npages = 0, shift = 0;
> >>>> +		unsigned char temp = 0;
> >>>> +
> >>>> +		/* mark all pages dirty if all pages are pinned and mapped. */
> >>>> +		if (dma->iommu_mapped) {
> >>>> +			iova_limit = min(dma->iova + dma->size, iova + size);
> >>>> +			npages = iova_limit/pgsize;
> >>>> +			bitmap_set(dma->bitmap, 0, npages);  
> >>>
> >>> npages is derived from iova_limit, which is the number of bits to set
> >>> dirty relative to the first requested iova, not iova zero, ie. the set
> >>> of dirty bits is offset from those requested unless iova == dma->iova.
> >>>      
> >>
> >> Right, fixing.
> >>  
> >>> Also I hope dma->bitmap was actually allocated.  Not only does the
> >>> START error path potentially leave dirty tracking enabled without all
> >>> the bitmap allocated, when does the bitmap get allocated for a new
> >>> vfio_dma when dirty tracking is enabled?  Seems it only occurs if a
> >>> vpfn gets marked dirty.
> >>>      
> >>
> >> Right.
> >>
> >> Fixing error paths.
> >>
> >>  
> >>>> +		} else if (dma->bitmap) {
> >>>> +			struct rb_node *n = rb_first(&dma->pfn_list);
> >>>> +			bool found = false;
> >>>> +
> >>>> +			for (; n; n = rb_next(n)) {
> >>>> +				struct vfio_pfn *vpfn = rb_entry(n,
> >>>> +						struct vfio_pfn, node);
> >>>> +				if (vpfn->iova >= i) {
> >>>> +					found = true;
> >>>> +					break;
> >>>> +				}
> >>>> +			}
> >>>> +
> >>>> +			if (!found) {
> >>>> +				i += dma->size;
> >>>> +				continue;
> >>>> +			}
> >>>> +
> >>>> +			for (; n; n = rb_next(n)) {
> >>>> +				unsigned int s;
> >>>> +				struct vfio_pfn *vpfn = rb_entry(n,
> >>>> +						struct vfio_pfn, node);
> >>>> +
> >>>> +				if (vpfn->iova >= iova + size)
> >>>> +					break;
> >>>> +
> >>>> +				s = (vpfn->iova - dma->iova) >> pgshift;
> >>>> +				bitmap_set(dma->bitmap, s, 1);
> >>>> +
> >>>> +				iova_limit = vpfn->iova + pgsize;
> >>>> +			}
> >>>> +			npages = iova_limit/pgsize;  
> >>>
> >>> Isn't iova_limit potentially uninitialized here?  For example, if our
> >>> vfio_dma covers {0,8192} and we ask for the bitmap of {0,4096} and
> >>> there's a vpfn at {4096,8192}.  I think that means vpfn->iova >= i
> >>> (4096 >= 0), so we break with found = true, then we test 4096 >= 0 +
> >>> 4096 and break, and npages = ????/pgsize.
> >>>      
> >>
> >> Right, Fixing it.
> >>  
> >>>> +		}
> >>>> +
> >>>> +		bsize = dirty_bitmap_bytes(npages);
> >>>> +		shift = nbits % BITS_PER_BYTE;
> >>>> +
> >>>> +		if (npages && shift) {
> >>>> +			l--;
> >>>> +			if (!access_ok((void __user *)bitmap + l,
> >>>> +					sizeof(unsigned char)))
> >>>> +				return -EINVAL;
> >>>> +
> >>>> +			ret = __get_user(temp, bitmap + l);  
> >>>
> >>> I don't understand why we care to get the user's bitmap, are we trying
> >>> to leave whatever garbage they might have set in it and only also set
> >>> the dirty bits?  That seems unnecessary.
> >>>      
> >>
> >> Suppose dma mapped ranges are {start, size}:
> >> {0, 0xa000}, {0xa000, 0x10000}
> >>
> >> Bitmap asked from 0 - 0x10000. Say suppose all pages are dirty.
> >> Then in first iteration for dma {0,0xa000} there are 10 pages, so 10
> >> bits are set, put_user() happens for 2 bytes, (00000011 11111111b).
> >> In second iteration for dma {0xa000, 0x10000} there are 6 pages and
> >> these bits should be appended to previous byte. So get_user() that byte,
> >> then shift-OR rest of the bitmap, result should be: (11111111 11111111b)
> >>
> >> Without get_user() and shift-OR, resulting bitmap would be
> >> 111111 00000011 11111111b which would be wrong.  
> > 
> > Seems like if we use a put_user() approach then we should look for
> > adjacent vfio_dmas within the same byte/word/dword before we push it to
> > the user to avoid this sort of inefficiency.
> >   
> 
> Won't that add more complication to logic?

I'm tempted to think it might be less complicated.
 
> >>> Also why do we need these access_ok() checks when we already checked
> >>> the range at the start of the ioctl?  
> >>
> >> Since pointer is updated runtime here, better to check that pointer
> >> before using that pointer.  
> > 
> > Sorry, I still don't understand this, we check access_ok() with a
> > pointer and a length, therefore as long as we're incrementing the
> > pointer within that length, why do we need to retest?
> >   
> 
> Ideally caller for put_user() and get_user() must check the pointer with 
> access_ok() which is used as argument to these functions before calling 
> this function. That makes sure that pointer is correct after pointer 
> arithmetic. Maybe let's remove the previous check of pointer and length, 
> but keep these checks.

So we don't trust that we can increment a pointer within a range that
we've already tested with access_ok() and expect it to still be ok?  I
think the point of having access_ok() and __put_user() is that we can
batch many __put_user() calls under a single access_ok() check.  I
don't see any justification here why if we already tested
access_ok(ptr, 2) that we still need to test access_ok(ptr + 0, 1) and
access_ok(ptr + 1, 1), and removing the initial test is clearly the
wrong optimization if we agree there is redundancy here.	
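
To make the batching argument concrete, here is a minimal userspace
sketch, not kernel code. range_ok() and put_byte_unchecked() are
hypothetical stand-ins for access_ok() and __put_user(); the only point
is that one range check up front covers every later store whose offset
stays below the checked length.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for access_ok(): is [off, off + len) inside the window? */
static inline int range_ok(size_t window_len, size_t off, size_t len)
{
	return off <= window_len && len <= window_len - off;
}

/* Stand-in for __put_user(): a store that performs no check of its own. */
static inline void put_byte_unchecked(unsigned char *dst, unsigned char val)
{
	*dst = val;
}

/*
 * Push nbytes of bitmap to the "user" buffer after a single range check.
 * Returns 0 on success, -1 if the whole range does not fit.
 */
static inline int push_bitmap(unsigned char *ubuf, size_t ubuf_len,
			      const unsigned char *src, size_t nbytes)
{
	size_t l;

	assert(ubuf && src);

	if (!range_ok(ubuf_len, 0, nbytes))
		return -1;

	/* l stays in [0, nbytes), which the check above already covers. */
	for (l = 0; l < nbytes; l++)
		put_byte_unchecked(ubuf + l, src[l]);

	return 0;
}
```

The loop bound is what keeps the offset in range, so no per-byte retest
is needed as long as the bound and the checked length are the same value.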

> >>>> +			if (ret)
> >>>> +				return ret;
> >>>> +		}
> >>>> +
> >>>> +		for (j = 0; j < bsize; j++, l++) {
> >>>> +			temp = temp |
> >>>> +			       (*((unsigned char *)dma->bitmap + j) << shift);  
> >>>
> >>> |=
> >>>      
> >>>> +			if (!access_ok((void __user *)bitmap + l,
> >>>> +					sizeof(unsigned char)))
> >>>> +				return -EINVAL;
> >>>> +
> >>>> +			ret = __put_user(temp, bitmap + l);
> >>>> +			if (ret)
> >>>> +				return ret;
> >>>> +			if (shift) {
> >>>> +				temp = *((unsigned char *)dma->bitmap + j) >>
> >>>> +					(BITS_PER_BYTE - shift);
> >>>> +			}  
> >>>
> >>> When shift == 0, temp just seems to accumulate bits that never get
> >>> cleared.
> >>>      
> >>
> >> Hope example above explains the shift logic.  
> > 
> > But that example is when shift is non-zero.  When shift is zero, each
> > iteration of the loop just ORs in new bits to temp without ever
> > clearing the bits for the previous iteration.
> > 
> >   
> 
> Oh right, fixing it.
> 
> >>>> +		}
> >>>> +
> >>>> +		nbits += npages;
> >>>> +
> >>>> +		i = min(dma->iova + dma->size, iova + size);
> >>>> +		if (i >= iova + size)
> >>>> +			break;  
> >>>
> >>> So whether we error or succeed, we leave cruft in dma->bitmap for the
> >>> next pass.  It doesn't seem to make any sense why we pre-allocated the
> >>> bitmap, we might as well just allocate it on demand here.  Actually, if
> >>> we're not going to do a copy_to_user() for some range of the bitmap,
> >>> I'm not sure what its purpose is at all.  I think the big advantages
> >>> of the bitmap are that we can't amortize the cost across every pinned
> >>> page or DMA mapping, we don't need the overhead of tracking unmapped
> >>> vpfns, and we can use copy_to_user() to push the bitmap out.  We're not
> >>> getting any of those advantages here.
> >>>      
> >>
> >> That would still not work if dma range size is not multiples of 8 pages.
> >> See example above.  
> > 
> > I don't understand this comment, what about the example above justifies
> > the bitmap?  
> 
> copy_to_user() could be used if dma range size is not multiple of 8 pages.

s/is not/is/ ?

And we expect that to be a far more common case, right?  I don't think
there are too many ranges for a guest that are only mapped in sub-32KB
chunks.
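
As a rough illustration of the arithmetic behind the 32KB figure: one
bitmap byte describes 8 pages, so with 4KB pages it covers 32KB of IOVA
space.  The sketch below, with hypothetical helper names, tests whether a
range maps onto whole bitmap bytes, which is the case where no bit
shifting is needed.

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_BYTE 8

/* IOVA space described by one bitmap byte (32KB for 4KB pages). */
static inline uint64_t iova_span_per_byte(uint64_t pgsize)
{
	return pgsize * BITS_PER_BYTE;
}

/*
 * Nonzero if a range of `size` bytes starting `offset` bytes into the
 * request covers whole bitmap bytes only.
 */
static inline int range_is_byte_aligned(uint64_t offset, uint64_t size,
					uint64_t pgsize)
{
	uint64_t span = iova_span_per_byte(pgsize);

	/* pgsize is assumed to be a nonzero power of two */
	assert(pgsize && (pgsize & (pgsize - 1)) == 0);

	return (offset % span) == 0 && (size % span) == 0;
}
```

For the {0, 0xa000} range in the example this returns zero, since 10
pages leave the second bitmap byte only partly filled.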
 
> >  As I understand the above algorithm, we find a vfio_dma
> > overlapping the request and populate the bitmap for that range.  Then
> > we go back and put_user() for each byte that we touched.  We could
> > instead simply work on a one byte buffer as we enumerate the requested
> > range and do a put_user() every time we reach the end of it and have bits
> > set. That would greatly simplify the above example.  But I would expect
> > that we're a) more likely to get asked for ranges covering a single
> > vfio_dma   
> 
> QEMU ask for single vfio_dma during each iteration.
> 
> If we restrict this ABI to cover single vfio_dma only, then it 
> simplifies the logic here. That was my original suggestion. Should we 
> think about that again?

But we currently allow unmaps that overlap multiple vfio_dmas as long
as no vfio_dma is bisected, so I think that implies that an unmap while
asking for the dirty bitmap has even further restricted semantics.  I'm
also reluctant to design an ABI around what happens to be the current
QEMU implementation.

If we take your example above, ranges {0x0000,0xa000} and
{0xa000,0x10000} ({start,end}), I think you're working with the
following two bitmaps in this implementation:

00000011 11111111b
00111111b

And we need to combine those into:

11111111 11111111b

Right?

But it seems like that would be easier if the second bitmap was instead:

11111100b

Then we wouldn't need to worry about the entire bitmap being shifted by
the bit offset within the byte, which limits our fixes to the boundary
byte and allows us to use copy_to_user() directly for the bulk of the
copy.  So how do we get there?

I think we start with allocating the vfio_dma bitmap to account for
this initial offset, so we calculate bitmap_base_iova as:
  (iova & ~((PAGE_SIZE << 3) - 1))
We then use bitmap_base_iova in calculating which bits to set.

The user needs to follow the same rules, and maybe this adds some value
to the user providing the bitmap size rather than the kernel
calculating it.  For example, if the user wanted the dirty bitmap for
the range {0xa000,0x10000} above, they'd provide at least a 1 byte
bitmap, but we'd return bit #2 set to indicate 0xa000 is dirty.
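
A minimal sketch of that calculation, assuming pgsize is a power of two
and with PAGE_SIZE replaced by the requested pgsize.  The helper names
are made up for illustration and do not come from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Start of the 8-page window that contains iova. */
static inline uint64_t bitmap_base_iova(uint64_t iova, uint64_t pgsize)
{
	/* pgsize is assumed to be a nonzero power of two */
	assert(pgsize && (pgsize & (pgsize - 1)) == 0);

	return iova & ~((pgsize << 3) - 1);
}

/* Bit index of the page at iova in a bitmap whose bit 0 is `base`. */
static inline unsigned int dirty_bit_index(uint64_t iova, uint64_t base,
					   uint64_t pgsize)
{
	return (unsigned int)((iova - base) / pgsize);
}
```

With a 4K pgsize, iova 0xa000 yields a base of 0x8000 and bit index 2,
matching the example above.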

Effectively the user can ask for any iova range, but the buffer will be
filled relative to the zeroth bit of the bitmap following the above
bitmap_base_iova formula (and replacing PAGE_SIZE with the user
requested pgsize).  I'm tempted to make this explicit in the user
interface (ie. only allow bitmaps starting on aligned pages), but a
user is able to map and unmap single pages and we need to support
returning a dirty bitmap with an unmap, so I don't think we can do that.
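
With byte-aligned per-vfio_dma bitmaps, combining adjacent ranges reduces
to OR-ing the one byte they may share and copying the rest.  The sketch
below is a userspace model under these assumptions: the output buffer
starts zeroed, ranges are merged in ascending order, and memcpy() stands
in for copy_to_user().  merge_range_bitmap() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Merge a byte-aligned range bitmap into the output buffer.
 * dst_byte is the index of the output byte that src[0] maps onto.
 * Only the first byte can be shared with the previous range, so it is
 * OR-ed in; the remaining bytes are copied wholesale.
 */
static inline void merge_range_bitmap(unsigned char *dst, size_t dst_byte,
				      const unsigned char *src,
				      size_t src_bytes)
{
	assert(dst && src);

	if (!src_bytes)
		return;

	dst[dst_byte] |= src[0];
	memcpy(dst + dst_byte + 1, src + 1, src_bytes - 1);
}
```

For the example ranges, the first bitmap is the bytes {0xff, 0x03} and
the second is {0xfc} landing on output byte 1, which gives 0xff 0xff.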

So now are we biting off more than we can chew trying to transpose the
bitmap between page sizes?  If asked for the previous range with an 8K
pgsize, we'd somehow need to translate 11111100b into 00001110b.
What's worse, the user could ask for just the 8K page at 0xa000 and we'd
need to return 00000010b while leaving our internal bitmap as
11110000b after we mark the bits clean.  Seems like this is really
only tenable if we do multiples of PAGE_SIZE pages within a byte, so
for 4K we'd have 32K, 64K, 128K, 256K, etc.  I'm somewhat losing sight
of what this accomplishes though and whether we need this in the first
implementation.  Should we simplify by dropping this aspect of it,
supporting only the minimum iommu page size, and focus on actually
using the bitmaps effectively?
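
For reference, the transposition in the example can be sketched as
folding groups of fine-grained bits into one coarse bit.  This assumes
both bitmaps share the same base and that the page size ratio is a power
of two no larger than 8; coarsen_byte() and clear_reported() are
hypothetical helpers, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Fold one byte of a fine-grained bitmap into a coarser one: output bit k
 * is set if any of the `factor` input bits it covers is set.
 * factor is the ratio of page sizes, e.g. 2 for 4K -> 8K.
 */
static inline uint8_t coarsen_byte(uint8_t fine, unsigned int factor)
{
	uint8_t coarse = 0;
	uint8_t group_mask;
	unsigned int k;

	assert(factor == 1 || factor == 2 || factor == 4 || factor == 8);

	group_mask = (uint8_t)((1u << factor) - 1);
	for (k = 0; k < 8 / factor; k++)
		if ((fine >> (k * factor)) & group_mask)
			coarse |= (uint8_t)(1u << k);

	return coarse;
}

/* Clear the fine-grained bits covered by coarse bit k once reported. */
static inline uint8_t clear_reported(uint8_t fine, unsigned int k,
				     unsigned int factor)
{
	return fine & (uint8_t)~(((1u << factor) - 1) << (k * factor));
}
```

Folding 11111100b by a factor of 2 gives 00001110b, and clearing coarse
bit 1 leaves 11110000b, which are the values quoted above.  Even in this
simplified form, the per-byte bookkeeping is noticeable.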
 
> > and b) we're going to spend far more time operating in the
> > middle of the range and limiting ourselves to one-byte operations there
> > seems absurd.  If we want to specify that the user provides 4-byte
> > aligned buffers and naturally aligned iova ranges to make our lives
> > easier in the kernel, now would be the time to do that.
> >   
> >>>> +	}
> >>>> +	return 0;
> >>>> +}
> >>>> +
> >>>> +static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
> >>>> +{
> >>>> +	long bsize;
> >>>> +
> >>>> +	if (!bitmap_size || bitmap_size > SIZE_MAX)
> >>>> +		return -EINVAL;
> >>>> +
> >>>> +	bsize = dirty_bitmap_bytes(npages);
> >>>> +
> >>>> +	if (bitmap_size < bsize)
> >>>> +		return -EINVAL;
> >>>> +
> >>>> +	return bsize;
> >>>> +}  
> >>>
> >>> Seems like this could simply return int, -errno or zero for success.
> >>> The returned bsize is not used for anything else.
> >>>      
> >>
> >> ok.
> >>  
> >>>> +
> >>>>    static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> >>>>    			     struct vfio_iommu_type1_dma_unmap *unmap)
> >>>>    {
> >>>> @@ -2277,6 +2478,80 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >>>>    
> >>>>    		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> >>>>    			-EFAULT : 0;
> >>>> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
> >>>> +		struct vfio_iommu_type1_dirty_bitmap range;
> >>>> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
> >>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
> >>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> >>>> +		int ret;
> >>>> +
> >>>> +		if (!iommu->v2)
> >>>> +			return -EACCES;
> >>>> +
> >>>> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
> >>>> +				    bitmap);  
> >>>
> >>> We require the user to provide iova, size, pgsize, bitmap_size, and
> >>> bitmap fields to START/STOP?  Why?
> >>>     
> >>
> >> No. But those are part of structure.  
> > 
> > But we do require it, minsz here includes all those fields, which would
> > probably make a user scratch their head wondering why they need to pass
> > irrelevant data for START/STOP.  It almost implies that we support
> > starting and stopping dirty logging for specific ranges of the IOVA
> > space.  We could define the structure, for example:
> > 
> > struct vfio_iommu_type1_dirty_bitmap {
> > 	__u32	argsz;
> > 	__u32	flags;
> > 	__u8	data[];
> > };
> > 
> > struct vfio_iommu_type1_dirty_bitmap_get {
> > 	__u64	iova;
> > 	__u64	size;
> > 	__u64	pgsize;
> > 	__u64	bitmap_size;
> > 	void __user *bitmap;
> > };
> > 
> > Where data[] is defined as the latter structure when FLAG_GET_BITMAP is
> > specified.  
> 
> Ok. Changing as above.
> 
> >  BTW, don't we need to specify the trailing void* as __u64?
> > We could theoretically be talking to an ILP32 user process.  Thanks,
> >   
> 
> Even on ILP32, using void* pointer will reserve the size required to 
> save a pointer address. I don't think using void* should be problem.

I think you're still assuming sizeof(void *) is the same in kernel vs
userspace whereas I'm thinking about an ILP32 user running on an LP64
kernel.  Thanks,

Alex
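
A sketch of the layout concern, using standard C types in place of the
kernel's __u64.  The structure and helper names are illustrative only and
are not the final uAPI.  Carrying the user address as a fixed 64-bit
integer keeps the field width identical for a 32-bit user process and a
64-bit kernel, which a pointer-typed member would not.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative layout: the user pointer travels as a 64-bit integer. */
struct dirty_bitmap_get_example {
	uint64_t iova;
	uint64_t size;
	uint64_t pgsize;
	uint64_t bitmap_size;
	uint64_t bitmap;	/* user address, widened to 64 bits */
};

/* Five 64-bit fields: the size does not depend on pointer width. */
static_assert(sizeof(struct dirty_bitmap_get_example) == 40,
	      "layout must not depend on pointer width");

/* Userspace side: widen a pointer into the 64-bit field. */
static inline uint64_t example_ptr_to_u64(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}

/* Receiving side: narrow the 64-bit field back to a native pointer. */
static inline void *example_u64_to_ptr(uint64_t v)
{
	return (void *)(uintptr_t)v;
}
```

With a `void *` member instead, the last field would be 4 bytes wide in
an ILP32 build and 8 bytes wide in an LP64 build, so the two sides would
disagree about the structure.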




* Re: [PATCH v12 Kernel 1/7] vfio: KABI for migration interface for device state
  2020-02-07 19:42   ` Kirti Wankhede
@ 2020-02-14 10:21     ` Cornelia Huck
  -1 siblings, 0 replies; 62+ messages in thread
From: Cornelia Huck @ 2020-02-14 10:21 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, cjia, kevin.tian, ziye.yang, changpeng.liu,
	yi.l.liu, mlevitsk, eskultet, dgilbert, jonathan.davies, eauger,
	aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	zhi.a.wang, yan.y.zhao, qemu-devel, kvm

On Sat, 8 Feb 2020 01:12:28 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

(...)

Minor wording nits:

> +/*
> + * Structure vfio_device_migration_info is placed at 0th offset of

"...at the 0th offset..."

> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> + * information. Field accesses from this structure are only supported at their
> + * native width and alignment, otherwise the result is undefined and vendor
> + * drivers should return an error.
> + *
> + * device_state: (read/write)
> + *      - User application writes this field to inform vendor driver about the

I'd probably add a definite article before "user application",
"vendor driver", etc. Not sure if it's too much churn.

> + *        device state to be transitioned to.
> + *      - Vendor driver should take necessary actions to change device state.
> + *        On successful transition to given state, vendor driver should return
> + *        success on write(device_state, state) system call. If device state
> + *        transition fails, vendor driver should return error, -EFAULT.
> + *      - On user application side, if device state transition fails, i.e. if
> + *        write(device_state, state) returns error, read device_state again to
> + *        determine the current state of the device from vendor driver.
> + *      - Vendor driver should return previous state of the device unless vendor
> + *        driver has encountered an internal error, in which case vendor driver
> + *        may report the device_state VFIO_DEVICE_STATE_ERROR.
> + *	- User application must use the device reset ioctl in order to recover
> + *	  the device from VFIO_DEVICE_STATE_ERROR state. If the device is
> + *	  indicated in a valid device state via reading device_state, the user
> + *	  application may decide attempt to transition the device to any valid
> + *	  state reachable from the current state or terminate itself.
> + *
> + *      device_state consists of 3 bits:
> + *      - If bit 0 set, indicates _RUNNING state. When it's clear, that
> + *	  indicates _STOP state. When device is changed to _STOP, driver should
> + *	  stop device before write() returns.

"If set, bit 0 indicates _RUNNING state. If unset, it indicates _STOP
state. When the device is changed to _STOP state, the driver should
stop the device before write() returns."

?

> + *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
> + *        should start gathering device state information which will be provided
> + *        to VFIO user application to save device's state.

"If set, bit 1 indicates _SAVING state. When it is set, the driver
should start to gather the device state information that will be
provided to the VFIO user application to save the device's state."

?

> + *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
> + *        prepare to resume device, data provided through migration region
> + *        should be used to resume device.

"If set, bit 2 indicates _RESUMING state. When it is set, the driver
should prepare to resume the device, using the data provided via the
migration region."

?

> + *      Bits 3 - 31 are reserved for future use. In order to preserve them,
> + *	user application should perform read-modify-write operation on this
> + *	field when modifying the specified bits.

"In order to preserve them, the user application should use a
read-modify-write operation on the device_state field when modifying
the state."

?
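
To illustrate the read-modify-write wording, a minimal sketch.  The
EX_STATE_* names are made up for this example; only the bit positions
come from the comment under review.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names; bit positions follow the comment being reviewed. */
#define EX_STATE_RUNNING	(1u << 0)	/* clear means _STOP */
#define EX_STATE_SAVING		(1u << 1)
#define EX_STATE_RESUMING	(1u << 2)
#define EX_STATE_MASK \
	(EX_STATE_RUNNING | EX_STATE_SAVING | EX_STATE_RESUMING)

static_assert(EX_STATE_MASK == 0x7, "device_state uses bits 0-2");

/*
 * Read-modify-write: replace bits 0-2 with new_state and keep the
 * reserved bits 3-31 exactly as they were read.
 */
static inline uint32_t ex_next_device_state(uint32_t current,
					    uint32_t new_state)
{
	return (current & ~EX_STATE_MASK) | (new_state & EX_STATE_MASK);
}
```

The user application would read device_state, pass the value through a
helper like this, and write the result back.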


(...)



* Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2020-02-13 23:20             ` Alex Williamson
@ 2020-02-17 19:13               ` Kirti Wankhede
  -1 siblings, 0 replies; 62+ messages in thread
From: Kirti Wankhede @ 2020-02-17 19:13 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm



On 2/14/2020 4:50 AM, Alex Williamson wrote:
> On Fri, 14 Feb 2020 01:41:35 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> <snip>
>>
>>>>>>     
>>>>>> +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
>>>>>> +				  size_t size, uint64_t pgsize,
>>>>>> +				  unsigned char __user *bitmap)
>>>>>> +{
>>>>>> +	struct vfio_dma *dma;
>>>>>> +	dma_addr_t i = iova, iova_limit;
>>>>>> +	unsigned int bsize, nbits = 0, l = 0;
>>>>>> +	unsigned long pgshift = __ffs(pgsize);
>>>>>> +
>>>>>> +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
>>>>>> +		int ret, j;
>>>>>> +		unsigned int npages = 0, shift = 0;
>>>>>> +		unsigned char temp = 0;
>>>>>> +
>>>>>> +		/* mark all pages dirty if all pages are pinned and mapped. */
>>>>>> +		if (dma->iommu_mapped) {
>>>>>> +			iova_limit = min(dma->iova + dma->size, iova + size);
>>>>>> +			npages = iova_limit/pgsize;
>>>>>> +			bitmap_set(dma->bitmap, 0, npages);
>>>>>
>>>>> npages is derived from iova_limit, which is the number of bits to set
>>>>> dirty relative to the first requested iova, not iova zero, ie. the set
>>>>> of dirty bits is offset from those requested unless iova == dma->iova.
>>>>>       
>>>>
>>>> Right, fixing.
>>>>   
>>>>> Also I hope dma->bitmap was actually allocated.  Not only does the
>>>>> START error path potentially leave dirty tracking enabled without all
>>>>> the bitmap allocated, when does the bitmap get allocated for a new
>>>>> vfio_dma when dirty tracking is enabled?  Seems it only occurs if a
>>>>> vpfn gets marked dirty.
>>>>>       
>>>>
>>>> Right.
>>>>
>>>> Fixing error paths.
>>>>
>>>>   
>>>>>> +		} else if (dma->bitmap) {
>>>>>> +			struct rb_node *n = rb_first(&dma->pfn_list);
>>>>>> +			bool found = false;
>>>>>> +
>>>>>> +			for (; n; n = rb_next(n)) {
>>>>>> +				struct vfio_pfn *vpfn = rb_entry(n,
>>>>>> +						struct vfio_pfn, node);
>>>>>> +				if (vpfn->iova >= i) {
>>>>>> +					found = true;
>>>>>> +					break;
>>>>>> +				}
>>>>>> +			}
>>>>>> +
>>>>>> +			if (!found) {
>>>>>> +				i += dma->size;
>>>>>> +				continue;
>>>>>> +			}
>>>>>> +
>>>>>> +			for (; n; n = rb_next(n)) {
>>>>>> +				unsigned int s;
>>>>>> +				struct vfio_pfn *vpfn = rb_entry(n,
>>>>>> +						struct vfio_pfn, node);
>>>>>> +
>>>>>> +				if (vpfn->iova >= iova + size)
>>>>>> +					break;
>>>>>> +
>>>>>> +				s = (vpfn->iova - dma->iova) >> pgshift;
>>>>>> +				bitmap_set(dma->bitmap, s, 1);
>>>>>> +
>>>>>> +				iova_limit = vpfn->iova + pgsize;
>>>>>> +			}
>>>>>> +			npages = iova_limit/pgsize;
>>>>>
>>>>> Isn't iova_limit potentially uninitialized here?  For example, if our
>>>>> vfio_dma covers {0,8192} and we ask for the bitmap of {0,4096} and
>>>>> there's a vpfn at {4096,8192}.  I think that means vpfn->iova >= i
>>>>> (4096 >= 0), so we break with found = true, then we test 4096 >= 0 +
>>>>> 4096 and break, and npages = ????/pgsize.
>>>>>       
>>>>
>>>> Right, Fixing it.
>>>>   
>>>>>> +		}
>>>>>> +
>>>>>> +		bsize = dirty_bitmap_bytes(npages);
>>>>>> +		shift = nbits % BITS_PER_BYTE;
>>>>>> +
>>>>>> +		if (npages && shift) {
>>>>>> +			l--;
>>>>>> +			if (!access_ok((void __user *)bitmap + l,
>>>>>> +					sizeof(unsigned char)))
>>>>>> +				return -EINVAL;
>>>>>> +
>>>>>> +			ret = __get_user(temp, bitmap + l);
>>>>>
>>>>> I don't understand why we care to get the user's bitmap, are we trying
>>>>> to leave whatever garbage they might have set in it and only also set
>>>>> the dirty bits?  That seems unnecessary.
>>>>>       
>>>>
>>>> Suppose dma mapped ranges are {start, size}:
>>>> {0, 0xa000}, {0xa000, 0x10000}
>>>>
>>>> Bitmap asked from 0 - 0x10000. Say suppose all pages are dirty.
>>>> Then in first iteration for dma {0,0xa000} there are 10 pages, so 10
>>>> bits are set, put_user() happens for 2 bytes, (00000011 11111111b).
>>>> In second iteration for dma {0xa000, 0x10000} there are 6 pages and
>>>> these bits should be appended to previous byte. So get_user() that byte,
>>>> then shift-OR rest of the bitmap, result should be: (11111111 11111111b)
>>>>
>>>> Without get_user() and shift-OR, resulting bitmap would be
>>>> 111111 00000011 11111111b which would be wrong.
>>>
>>> Seems like if we use a put_user() approach then we should look for
>>> adjacent vfio_dmas within the same byte/word/dword before we push it to
>>> the user to avoid this sort of inefficiency.
>>>    
>>
>> Won't that add more complication to logic?
> 
> I'm tempted to think it might be less complicated.
>   
>>>>> Also why do we need these access_ok() checks when we already checked
>>>>> the range at the start of the ioctl?
>>>>
>>>> Since pointer is updated runtime here, better to check that pointer
>>>> before using that pointer.
>>>
>>> Sorry, I still don't understand this, we check access_ok() with a
>>> pointer and a length, therefore as long as we're incrementing the
>>> pointer within that length, why do we need to retest?
>>>    
>>
>> Ideally caller for put_user() and get_user() must check the pointer with
>> access_ok() which is used as argument to these functions before calling
>> this function. That makes sure that pointer is correct after pointer
>> arithmetic. Maybe let's remove the previous check of pointer and length,
>> but keep these checks.
> 
> So we don't trust that we can increment a pointer within a range that
> we've already tested with access_ok() and expect it to still be ok?  I
> think the point of having access_ok() and __put_user() is that we can
> batch many __put_user() calls under a single access_ok() check.  I
> don't see any justification here why if we already tested
> access_ok(ptr, 2) that we still need to test access_ok(ptr + 0, 1) and
> access_ok(ptr + 1, 1), and removing the initial test is clearly the
> wrong optimization if we agree there is redundancy here.	
> 

With access_ok(ptr + x, 1), where x is variable, x must not go out of
range. If we go with only the initial test, then there should be a check
that x stays within the tested range.

>>>>>> +			if (ret)
>>>>>> +				return ret;
>>>>>> +		}
>>>>>> +
>>>>>> +		for (j = 0; j < bsize; j++, l++) {
>>>>>> +			temp = temp |
>>>>>> +			       (*((unsigned char *)dma->bitmap + j) << shift);
>>>>>
>>>>> |=
>>>>>       
>>>>>> +			if (!access_ok((void __user *)bitmap + l,
>>>>>> +					sizeof(unsigned char)))
>>>>>> +				return -EINVAL;
>>>>>> +
>>>>>> +			ret = __put_user(temp, bitmap + l);
>>>>>> +			if (ret)
>>>>>> +				return ret;
>>>>>> +			if (shift) {
>>>>>> +				temp = *((unsigned char *)dma->bitmap + j) >>
>>>>>> +					(BITS_PER_BYTE - shift);
>>>>>> +			}
>>>>>
>>>>> When shift == 0, temp just seems to accumulate bits that never get
>>>>> cleared.
>>>>>       
>>>>
>>>> Hope example above explains the shift logic.
>>>
>>> But that example is when shift is non-zero.  When shift is zero, each
>>> iteration of the loop just ORs in new bits to temp without ever
>>> clearing the bits for the previous iteration.
>>>
>>>    
>>
>> Oh right, fixing it.
>>
>>>>>> +		}
>>>>>> +
>>>>>> +		nbits += npages;
>>>>>> +
>>>>>> +		i = min(dma->iova + dma->size, iova + size);
>>>>>> +		if (i >= iova + size)
>>>>>> +			break;
>>>>>
>>>>> So whether we error or succeed, we leave cruft in dma->bitmap for the
>>>>> next pass.  It doesn't seem to make any sense why we pre-allocated the
>>>>> bitmap, we might as well just allocate it on demand here.  Actually, if
>>>>> we're not going to do a copy_to_user() for some range of the bitmap,
>>>>> I'm not sure what its purpose is at all.  I think the big advantages
>>>>> of the bitmap are that we can't amortize the cost across every pinned
>>>>> page or DMA mapping, we don't need the overhead of tracking unmapped
>>>>> vpfns, and we can use copy_to_user() to push the bitmap out.  We're not
>>>>> getting any of those advantages here.
>>>>>       
>>>>
>>>> That would still not work if dma range size is not multiples of 8 pages.
>>>> See example above.
>>>
>>> I don't understand this comment, what about the example above justifies
>>> the bitmap?
>>
>> copy_to_user() could be used if dma range size is not multiple of 8 pages.
> 
> s/is not/is/ ?
> 

My bad, you're right.

> And we expect that to be a far more common case, right?  I don't think
> there are too many ranges for a guest that are only mapped in sub-32KB
> chunks.
>   
>>>   As I understand the above algorithm, we find a vfio_dma
>>> overlapping the request and populate the bitmap for that range.  Then
>>> we go back and put_user() for each byte that we touched.  We could
>>> instead simply work on a one byte buffer as we enumerate the requested
>>> range and do a put_user() every time we reach the end of it and have bits
>>> set. That would greatly simplify the above example.  But I would expect
>>> that we're a) more likely to get asked for ranges covering a single
>>> vfio_dma
>>
>> QEMU ask for single vfio_dma during each iteration.
>>
>> If we restrict this ABI to cover single vfio_dma only, then it
>> simplifies the logic here. That was my original suggestion. Should we
>> think about that again?
> 
> But we currently allow unmaps that overlap multiple vfio_dmas as long
> as no vfio_dma is bisected, so I think that implies that an unmap while
> asking for the dirty bitmap has even further restricted semantics.  I'm
> also reluctant to design an ABI around what happens to be the current
> QEMU implementation.
> 
> If we take your example above, ranges {0x0000,0xa000} and
> {0xa000,0x10000} ({start,end}), I think you're working with the
> following two bitmaps in this implementation:
> 
> 00000011 11111111b
> 00111111b
> 
> And we need to combine those into:
> 
> 11111111 11111111b
> 
> Right?
> 
> But it seems like that would be easier if the second bitmap was instead:
> 
> 11111100b
> 
> Then we wouldn't need to worry about the entire bitmap being shifted by
> the bit offset within the byte, which limits our fixes to the boundary
> byte and allows us to use copy_to_user() directly for the bulk of the
> copy.  So how do we get there?
> 
> I think we start with allocating the vfio_dma bitmap to account for
> this initial offset, so we calculate bitmap_base_iova as:
>    (iova & ~((PAGE_SIZE << 3) - 1))
> We then use bitmap_base_iova in calculating which bits to set.
> 
> The user needs to follow the same rules, and maybe this adds some value
> to the user providing the bitmap size rather than the kernel
> calculating it.  For example, if the user wanted the dirty bitmap for
> the range {0xa000,0x10000} above, they'd provide at least a 1 byte
> bitmap, but we'd return bit #2 set to indicate 0xa000 is dirty.
> 
> Effectively the user can ask for any iova range, but the buffer will be
> filled relative to the zeroth bit of the bitmap following the above
> bitmap_base_iova formula (and replacing PAGE_SIZE with the user
> requested pgsize).  I'm tempted to make this explicit in the user
> interface (ie. only allow bitmaps starting on aligned pages), but a
> user is able to map and unmap single pages and we need to support
> returning a dirty bitmap with an unmap, so I don't think we can do that.
> 

Sigh, finding adjacent vfio_dmas within the same byte seems simpler than 
this.
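
For reference, the indexing rule proposed above can be sketched in a few
lines of plain C. This is only an illustration of the arithmetic, not the
patch's code: the helper name dirty_bit_index() is hypothetical, and pgsize
stands in for PAGE_SIZE or the user-requested page size.

```c
#include <stdint.h>

/*
 * Round an iova down to the start of the byte that holds its dirty bit,
 * i.e. to a multiple of 8 pages. Same formula as quoted above, with
 * pgsize in place of PAGE_SIZE.
 */
static uint64_t bitmap_base_iova(uint64_t iova, uint64_t pgsize)
{
	return iova & ~((pgsize << 3) - 1);
}

/*
 * Hypothetical helper: bit position of 'iova' in a bitmap whose bit 0
 * corresponds to bitmap_base_iova(range_start, pgsize).
 */
static unsigned int dirty_bit_index(uint64_t range_start, uint64_t iova,
				    uint64_t pgsize)
{
	return (unsigned int)((iova - bitmap_base_iova(range_start, pgsize)) /
			      pgsize);
}
```

With a 4K page size, a request starting at 0xa000 gets a base of 0x8000, so
the page at 0xa000 lands on bit #2 of the first byte, as in the example
above.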

> So now are we biting off more than we can chew trying to transpose the
> bitmap between page sizes?  If asked for the previous range with an 8K
> pgsize, we'd somehow need to translate 11111100b into 00001110b.
> What's worse, the user could ask for just the 8K page at 0xa000 and we'd
> need to return back 00000010b while leaving our internal bitmap as
> 11110000b after we mark the bits clean.  Seems like this is really
> only tenable if we do multiples of PAGE_SIZE pages within a byte, so
> for 4K we'd have 32K, 64K, 128K, 256K, etc.  I'm somewhat losing sight
> on what this accomplishes though and whether we need this in the first
> implementation.  Should we simplify by dropping this aspect of it,
> supporting only the minimum iommu page size, and focus on actually
> using the bitmaps effectively?
>   

Sure, this will help push the first implementation; we can add
optimizations later.
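
To make the cost of the page-size transposition concrete, here is a small
sketch of the 4K to 8K case discussed above. It is a user-space
illustration under the assumption that the bitmap base stays the same and
pairs of 4K bits fold into one 8K bit; fold_pairs() and clear_pair() are
hypothetical names, not part of the patch.

```c
#include <stdint.h>

/*
 * Fold one byte of a 4K-granule bitmap into 8K granules: output bit k is
 * set if input bit 2k or 2k+1 is set. Eight input bits give four output
 * bits.
 */
static uint8_t fold_pairs(uint8_t fine)
{
	uint8_t coarse = 0;
	unsigned int k;

	for (k = 0; k < 4; k++)
		if (fine & (0x3u << (2 * k)))
			coarse |= (uint8_t)(1u << k);
	return coarse;
}

/*
 * After reporting coarse bit k to the user, clear the two fine-grained
 * bits that back it in the internal bitmap.
 */
static uint8_t clear_pair(uint8_t fine, unsigned int k)
{
	return (uint8_t)(fine & ~(0x3u << (2 * k)));
}
```

fold_pairs(0xfc) gives 0x0e (11111100b becomes 00001110b), and reporting
only the 8K page at 0xa000 means clear_pair(0xfc, 1), which leaves 0xf0
(11110000b) behind. Every reported bit needs this kind of bookkeeping,
which is why restricting the first implementation to the minimum iommu page
size is attractive.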

>>> and b) we're going to spend far more time operating in the
>>> middle of the range and limiting ourselves to one-byte operations there
>>> seems absurd.  If we want to specify that the user provides 4-byte
>>> aligned buffers and naturally aligned iova ranges to make our lives
>>> easier in the kernel, now would be the time to do that.
>>>    
>>>>>> +	}
>>>>>> +	return 0;
>>>>>> +}
>>>>>> +
>>>>>> +static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
>>>>>> +{
>>>>>> +	long bsize;
>>>>>> +
>>>>>> +	if (!bitmap_size || bitmap_size > SIZE_MAX)
>>>>>> +		return -EINVAL;
>>>>>> +
>>>>>> +	bsize = dirty_bitmap_bytes(npages);
>>>>>> +
>>>>>> +	if (bitmap_size < bsize)
>>>>>> +		return -EINVAL;
>>>>>> +
>>>>>> +	return bsize;
>>>>>> +}
>>>>>
>>>>> Seems like this could simply return int, -errno or zero for success.
>>>>> The returned bsize is not used for anything else.
>>>>>       
>>>>
>>>> ok.
>>>>   
>>>>>> +
>>>>>>     static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>>>>>     			     struct vfio_iommu_type1_dma_unmap *unmap)
>>>>>>     {
>>>>>> @@ -2277,6 +2478,80 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>>>>>>     
>>>>>>     		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>>>>>>     			-EFAULT : 0;
>>>>>> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
>>>>>> +		struct vfio_iommu_type1_dirty_bitmap range;
>>>>>> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
>>>>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
>>>>>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
>>>>>> +		int ret;
>>>>>> +
>>>>>> +		if (!iommu->v2)
>>>>>> +			return -EACCES;
>>>>>> +
>>>>>> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
>>>>>> +				    bitmap);
>>>>>
>>>>> We require the user to provide iova, size, pgsize, bitmap_size, and
>>>>> bitmap fields to START/STOP?  Why?
>>>>>      
>>>>
>>>> No. But those are part of structure.
>>>
>>> But we do require it, minsz here includes all those fields, which would
>>> probably make a user scratch their head wondering why they need to pass
>>> irrelevant data for START/STOP.  It almost implies that we support
>>> starting and stopping dirty logging for specific ranges of the IOVA
>>> space.  We could define the structure, for example:
>>>
>>> struct vfio_iommu_type1_dirty_bitmap {
>>> 	__u32	argsz;
>>> 	__u32	flags;
>>> 	__u8	data[];
>>> };
>>>
>>> struct vfio_iommu_type1_dirty_bitmap_get {
>>> 	__u64	iova;
>>> 	__u64	size;
>>> 	__u64	pgsize;
>>> 	__u64	bitmap_size;
>>> 	void __user *bitmap;
>>> };
>>>
>>> Where data[] is defined as the latter structure when FLAG_GET_BITMAP is
>>> specified.
>>
>> Ok. Changing as above.
>>
>>>   BTW, don't we need to specify the trailing void* as __u64?
>>> We could theoretically be talking to an ILP32 user process.  Thanks,
>>>    
>>
>> Even on ILP32, using a void* pointer will reserve the size required to
>> save a pointer address. I don't think using void* should be a problem.
> 
> I think you're still assuming sizeof(void *) is the same in kernel vs
> userspace whereas I'm thinking about an ILP32 user running on an LP64
> kernel.  Thanks,
> 
Ok. Changing it to __u64
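
A minimal sketch of why the fixed-width field matters, written as plain
user-space C. The struct names are hypothetical and uint64_t stands in for
__u64; this is not the final uAPI definition.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Pointer member: its width follows the ABI of whoever compiles the
 * struct, so an ILP32 process and an LP64 kernel can disagree on it.
 */
struct get_bitmap_ptr {
	uint64_t iova;
	uint64_t size;
	uint64_t pgsize;
	uint64_t bitmap_size;
	void *bitmap;
};

/* Fixed-width member: the same layout for 32-bit and 64-bit builds. */
struct get_bitmap_u64 {
	uint64_t iova;
	uint64_t size;
	uint64_t pgsize;
	uint64_t bitmap_size;
	uint64_t bitmap;	/* user address carried as an integer */
};

_Static_assert(sizeof(struct get_bitmap_u64) == 40, "layout is ABI-independent");
_Static_assert(offsetof(struct get_bitmap_u64, bitmap) == 32, "fixed offset");

/* User side: widen the pointer through uintptr_t before storing it. */
static uint64_t ptr_to_u64(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}

/* Receiving side: narrow the integer back to a pointer of native width. */
static void *u64_to_ptr(uint64_t v)
{
	return (void *)(uintptr_t)v;
}
```

The static assertions hold regardless of pointer width, whereas
sizeof(struct get_bitmap_ptr) depends on the ABI the code is built for.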

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 62+ messages in thread


* Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2020-02-17 19:13               ` Kirti Wankhede
@ 2020-02-17 20:55                 ` Alex Williamson
  -1 siblings, 0 replies; 62+ messages in thread
From: Alex Williamson @ 2020-02-17 20:55 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Tue, 18 Feb 2020 00:43:48 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 2/14/2020 4:50 AM, Alex Williamson wrote:
> > On Fri, 14 Feb 2020 01:41:35 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> <snip>
> >>  
> >>>>>>     
> >>>>>> +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> >>>>>> +				  size_t size, uint64_t pgsize,
> >>>>>> +				  unsigned char __user *bitmap)
> >>>>>> +{
> >>>>>> +	struct vfio_dma *dma;
> >>>>>> +	dma_addr_t i = iova, iova_limit;
> >>>>>> +	unsigned int bsize, nbits = 0, l = 0;
> >>>>>> +	unsigned long pgshift = __ffs(pgsize);
> >>>>>> +
> >>>>>> +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
> >>>>>> +		int ret, j;
> >>>>>> +		unsigned int npages = 0, shift = 0;
> >>>>>> +		unsigned char temp = 0;
> >>>>>> +
> >>>>>> +		/* mark all pages dirty if all pages are pinned and mapped. */
> >>>>>> +		if (dma->iommu_mapped) {
> >>>>>> +			iova_limit = min(dma->iova + dma->size, iova + size);
> >>>>>> +			npages = iova_limit/pgsize;
> >>>>>> +			bitmap_set(dma->bitmap, 0, npages);  
> >>>>>
> >>>>> npages is derived from iova_limit, which is the number of bits to set
> >>>>> dirty relative to the first requested iova, not iova zero, ie. the set
> >>>>> of dirty bits is offset from those requested unless iova == dma->iova.
> >>>>>         
> >>>>
> >>>> Right, fixing.
> >>>>     
> >>>>> Also I hope dma->bitmap was actually allocated.  Not only does the
> >>>>> START error path potentially leave dirty tracking enabled without all
> >>>>> the bitmap allocated, when does the bitmap get allocated for a new
> >>>>> vfio_dma when dirty tracking is enabled?  Seems it only occurs if a
> >>>>> vpfn gets marked dirty.
> >>>>>         
> >>>>
> >>>> Right.
> >>>>
> >>>> Fixing error paths.
> >>>>
> >>>>     
> >>>>>> +		} else if (dma->bitmap) {
> >>>>>> +			struct rb_node *n = rb_first(&dma->pfn_list);
> >>>>>> +			bool found = false;
> >>>>>> +
> >>>>>> +			for (; n; n = rb_next(n)) {
> >>>>>> +				struct vfio_pfn *vpfn = rb_entry(n,
> >>>>>> +						struct vfio_pfn, node);
> >>>>>> +				if (vpfn->iova >= i) {
> >>>>>> +					found = true;
> >>>>>> +					break;
> >>>>>> +				}
> >>>>>> +			}
> >>>>>> +
> >>>>>> +			if (!found) {
> >>>>>> +				i += dma->size;
> >>>>>> +				continue;
> >>>>>> +			}
> >>>>>> +
> >>>>>> +			for (; n; n = rb_next(n)) {
> >>>>>> +				unsigned int s;
> >>>>>> +				struct vfio_pfn *vpfn = rb_entry(n,
> >>>>>> +						struct vfio_pfn, node);
> >>>>>> +
> >>>>>> +				if (vpfn->iova >= iova + size)
> >>>>>> +					break;
> >>>>>> +
> >>>>>> +				s = (vpfn->iova - dma->iova) >> pgshift;
> >>>>>> +				bitmap_set(dma->bitmap, s, 1);
> >>>>>> +
> >>>>>> +				iova_limit = vpfn->iova + pgsize;
> >>>>>> +			}
> >>>>>> +			npages = iova_limit/pgsize;  
> >>>>>
> >>>>> Isn't iova_limit potentially uninitialized here?  For example, if our
> >>>>> vfio_dma covers {0,8192} and we ask for the bitmap of {0,4096} and
> >>>>> there's a vpfn at {4096,8192}.  I think that means vpfn->iova >= i
> >>>>> (4096 >= 0), so we break with found = true, then we test 4096 >= 0 +
> >>>>> 4096 and break, and npages = ????/pgsize.
> >>>>>         
> >>>>
> >>>> Right, Fixing it.
> >>>>     
> >>>>>> +		}
> >>>>>> +
> >>>>>> +		bsize = dirty_bitmap_bytes(npages);
> >>>>>> +		shift = nbits % BITS_PER_BYTE;
> >>>>>> +
> >>>>>> +		if (npages && shift) {
> >>>>>> +			l--;
> >>>>>> +			if (!access_ok((void __user *)bitmap + l,
> >>>>>> +					sizeof(unsigned char)))
> >>>>>> +				return -EINVAL;
> >>>>>> +
> >>>>>> +			ret = __get_user(temp, bitmap + l);  
> >>>>>
> >>>>> I don't understand why we care to get the user's bitmap, are we trying
> >>>>> to leave whatever garbage they might have set in it and only also set
> >>>>> the dirty bits?  That seems unnecessary.
> >>>>>         
> >>>>
> >>>> Suppose dma mapped ranges are {start, size}:
> >>>> {0, 0xa000}, {0xa000, 0x10000}
> >>>>
> >>>> Bitmap asked from 0 - 0x10000. Suppose all pages are dirty.
> >>>> Then in first iteration for dma {0,0xa000} there are 10 pages, so 10
> >>>> bits are set, put_user() happens for 2 bytes, (00000011 11111111b).
> >>>> In second iteration for dma {0xa000, 0x10000} there are 6 pages and
> >>>> these bits should be appended to previous byte. So get_user() that byte,
> >>>> then shift-OR rest of the bitmap, result should be: (11111111 11111111b)
> >>>>
> >>>> Without get_user() and shift-OR, resulting bitmap would be
> >>>> 111111 00000011 11111111b which would be wrong.  
> >>>
> >>> Seems like if we use a put_user() approach then we should look for
> >>> adjacent vfio_dmas within the same byte/word/dword before we push it to
> >>> the user to avoid this sort of inefficiency.
> >>>      
> >>
> >> Won't that add more complication to logic?  
> > 
> > I'm tempted to think it might be less complicated.
> >     
> >>>>> Also why do we need these access_ok() checks when we already checked
> >>>>> the range at the start of the ioctl?  
> >>>>
> >>>> Since pointer is updated runtime here, better to check that pointer
> >>>> before using that pointer.  
> >>>
> >>> Sorry, I still don't understand this, we check access_ok() with a
> >>> pointer and a length, therefore as long as we're incrementing the
> >>> pointer within that length, why do we need to retest?
> >>>      
> >>
> >> Ideally caller for put_user() and get_user() must check the pointer with
> >> access_ok() which is used as argument to these functions before calling
> >> this function. That makes sure that pointer is correct after pointer
> >> arithmetic. Maybe let's remove previous check of pointer and length,
> >> but keep these checks.  
> > 
> > So we don't trust that we can increment a pointer within a range that
> > we've already tested with access_ok() and expect it to still be ok?  I
> > think the point of having access_ok() and __put_user() is that we can
> > batch many __put_user() calls under a single access_ok() check.  I
> > don't see any justification here why if we already tested
> > access_ok(ptr, 2) that we still need to test access_ok(ptr + 0, 1) and
> > access_ok(ptr + 1, 1), and removing the initial test is clearly the
> > wrong optimization if we agree there is redundancy here.	
> >   
> 
> access_ok(ptr + x, 1), where x is variable, then x shouldn't be out of 
> range. If we go with initial test, then there should be check for x, 
> such that x is within range.

That logic should already exist though, we shouldn't be trying to fill
a bitmap beyond what the user requested and therefore what we've
already tested that it's sized for and we have access to.
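
As a rough user-space model of that argument (not kernel code): range_ok()
below is a hypothetical stand-in for the single access_ok() done at ioctl
entry, and the plain stores in the loop stand in for __put_user(). The
struct and function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of a user buffer: base pointer plus the length validated up front. */
struct checked_buf {
	uint8_t *base;
	size_t len;
};

/* Stand-in for one access_ok(ptr, len): is [off, off + n) inside the buffer? */
static int range_ok(const struct checked_buf *b, size_t off, size_t n)
{
	return off <= b->len && n <= b->len - off;
}

/*
 * Write nbytes of bitmap at offset off. One bounds check covers the whole
 * run; the per-byte stores inside it need no further validation.
 */
static int fill_bitmap(struct checked_buf *b, size_t off,
		       const uint8_t *src, size_t nbytes)
{
	size_t i;

	if (!range_ok(b, off, nbytes))
		return -1;
	for (i = 0; i < nbytes; i++)
		b->base[off + i] = src[i];	/* models __put_user() */
	return 0;
}
```

The only per-iteration requirement is that the offset stays within the
length that was checked, which the fill logic has to guarantee anyway.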
 
> >>>>>> +			if (ret)
> >>>>>> +				return ret;
> >>>>>> +		}
> >>>>>> +
> >>>>>> +		for (j = 0; j < bsize; j++, l++) {
> >>>>>> +			temp = temp |
> >>>>>> +			       (*((unsigned char *)dma->bitmap + j) << shift);  
> >>>>>
> >>>>> |=
> >>>>>         
> >>>>>> +			if (!access_ok((void __user *)bitmap + l,
> >>>>>> +					sizeof(unsigned char)))
> >>>>>> +				return -EINVAL;
> >>>>>> +
> >>>>>> +			ret = __put_user(temp, bitmap + l);
> >>>>>> +			if (ret)
> >>>>>> +				return ret;
> >>>>>> +			if (shift) {
> >>>>>> +				temp = *((unsigned char *)dma->bitmap + j) >>
> >>>>>> +					(BITS_PER_BYTE - shift);
> >>>>>> +			}  
> >>>>>
> >>>>> When shift == 0, temp just seems to accumulate bits that never get
> >>>>> cleared.
> >>>>>         
> >>>>
> >>>> Hope example above explains the shift logic.  
> >>>
> >>> But that example is when shift is non-zero.  When shift is zero, each
> >>> iteration of the loop just ORs in new bits to temp without ever
> >>> clearing the bits for the previous iteration.
> >>>
> >>>      
> >>
> >> Oh right, fixing it.
> >>  
> >>>>>> +		}
> >>>>>> +
> >>>>>> +		nbits += npages;
> >>>>>> +
> >>>>>> +		i = min(dma->iova + dma->size, iova + size);
> >>>>>> +		if (i >= iova + size)
> >>>>>> +			break;  
> >>>>>
> >>>>> So whether we error or succeed, we leave cruft in dma->bitmap for the
> >>>>> next pass.  It doesn't seem to make any sense why we pre-allocated the
> >>>>> bitmap, we might as well just allocate it on demand here.  Actually, if
> >>>>> we're not going to do a copy_to_user() for some range of the bitmap,
> >>>>> I'm not sure what its purpose is at all.  I think the big advantages
> >>>>> of the bitmap are that we can't amortize the cost across every pinned
> >>>>> page or DMA mapping, we don't need the overhead of tracking unmapped
> >>>>> vpfns, and we can use copy_to_user() to push the bitmap out.  We're not
> >>>>> getting any of those advantages here.
> >>>>>         
> >>>>
> >>>> That would still not work if dma range size is not multiples of 8 pages.
> >>>> See example above.  
> >>>
> >>> I don't understand this comment, what about the example above justifies
> >>> the bitmap?  
> >>
> >> copy_to_user() could be used if dma range size is not multiple of 8 pages.  
> > 
> > s/is not/is/ ?
> >   
> 
> My bad, you're right.
> 
> > And we expect that to be a far more common case, right?  I don't think
> > there are too many ranges for a guest that are only mapped in sub-32KB
> > chunks.
> >     
> >>>   As I understand the above algorithm, we find a vfio_dma
> >>> overlapping the request and populate the bitmap for that range.  Then
> >>> we go back and put_user() for each byte that we touched.  We could
> >>> instead simply work on a one byte buffer as we enumerate the requested
> >>> range and do a put_user() every time we reach the end of it and have bits
> >>> set. That would greatly simplify the above example.  But I would expect
> >>> that we're a) more likely to get asked for ranges covering a single
> >>> vfio_dma  
> >>
> >> QEMU asks for a single vfio_dma during each iteration.
> >>
> >> If we restrict this ABI to cover single vfio_dma only, then it
> >> simplifies the logic here. That was my original suggestion. Should we
> >> think about that again?  
> > 
> > But we currently allow unmaps that overlap multiple vfio_dmas as long
> > as no vfio_dma is bisected, so I think that implies that an unmap while
> > asking for the dirty bitmap has even further restricted semantics.  I'm
> > also reluctant to design an ABI around what happens to be the current
> > QEMU implementation.
> > 
> > If we take your example above, ranges {0x0000,0xa000} and
> > {0xa000,0x10000} ({start,end}), I think you're working with the
> > following two bitmaps in this implementation:
> > 
> > 00000011 11111111b
> > 00111111b
> > 
> > And we need to combine those into:
> > 
> > 11111111 11111111b
> > 
> > Right?
> > 
> > But it seems like that would be easier if the second bitmap was instead:
> > 
> > 11111100b
> > 
> > Then we wouldn't need to worry about the entire bitmap being shifted by
> > the bit offset within the byte, which limits our fixes to the boundary
> > byte and allows us to use copy_to_user() directly for the bulk of the
> > copy.  So how do we get there?
> > 
> > I think we start with allocating the vfio_dma bitmap to account for
> > this initial offset, so we calculate bitmap_base_iova as:
> >    (iova & ~((PAGE_SIZE << 3) - 1))
> > We then use bitmap_base_iova in calculating which bits to set.
> > 
> > The user needs to follow the same rules, and maybe this adds some value
> > to the user providing the bitmap size rather than the kernel
> > calculating it.  For example, if the user wanted the dirty bitmap for
> > the range {0xa000,0x10000} above, they'd provide at least a 1 byte
> > bitmap, but we'd return bit #2 set to indicate 0xa000 is dirty.
> > 
> > Effectively the user can ask for any iova range, but the buffer will be
> > filled relative to the zeroth bit of the bitmap following the above
> > bitmap_base_iova formula (and replacing PAGE_SIZE with the user
> > requested pgsize).  I'm tempted to make this explicit in the user
> > interface (ie. only allow bitmaps starting on aligned pages), but a
> > user is able to map and unmap single pages and we need to support
> > returning a dirty bitmap with an unmap, so I don't think we can do that.
> >   
> 
> Sigh, finding adjacent vfio_dmas within the same byte seems simpler than 
> this.
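
For reference, the bitmap_base_iova scheme quoted above can be sketched
in plain userspace C roughly as follows.  This is only an illustration,
not code from the patch: mark_range_dirty() is a hypothetical helper,
pgsize is assumed to be a power of two, and the caller is assumed to
provide a zeroed buffer that is large enough.

```c
#include <stdint.h>

/* iova represented by bit 0 of a byte-aligned dirty bitmap. */
static uint64_t bitmap_base_iova(uint64_t iova, uint64_t pgsize)
{
	/* One bitmap byte covers 8 pages, i.e. (pgsize << 3) bytes of iova. */
	return iova & ~((pgsize << 3) - 1);
}

/*
 * Hypothetical helper: mark every page in [start, end) dirty in a bitmap
 * whose bit 0 corresponds to bitmap_base_iova(start, pgsize).
 */
static void mark_range_dirty(uint8_t *bitmap, uint64_t start, uint64_t end,
			     uint64_t pgsize)
{
	uint64_t base = bitmap_base_iova(start, pgsize);
	uint64_t iova;

	for (iova = start; iova < end; iova += pgsize) {
		uint64_t bit = (iova - base) / pgsize;

		bitmap[bit / 8] |= (uint8_t)(1u << (bit % 8));
	}
}
```

With 4K pages, the range {0xa000,0x10000} gets a base of 0x8000, so the
first dirty page lands on bit #2 and the byte reads 11111100b, matching
the example in the quoted text.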

How does KVM do this?  My intent was that if all of our bitmaps share
the same alignment then we can merge the intersection and continue to
use copy_to_user() on either side.  However, if QEMU doesn't do the
same, it doesn't really help us.  Is QEMU stuck with an implementation
of only retrieving dirty bits per MemoryRegionSection exactly because
of this issue and therefore we can rely on it in our implementation as
well?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2020-02-17 20:55                 ` Alex Williamson
@ 2020-02-18  5:58                   ` Kirti Wankhede
  -1 siblings, 0 replies; 62+ messages in thread
From: Kirti Wankhede @ 2020-02-18  5:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

<snip>

>>>>>    As I understand the above algorithm, we find a vfio_dma
>>>>> overlapping the request and populate the bitmap for that range.  Then
>>>>> we go back and put_user() for each byte that we touched.  We could
>>>>> instead simply work on a one byte buffer as we enumerate the requested
>>>>> range and do a put_user() every time we reach the end of it and have bits
>>>>> set. That would greatly simplify the above example.  But I would expect
>>>>> that we're a) more likely to get asked for ranges covering a single
>>>>> vfio_dma
>>>>
>>>> QEMU ask for single vfio_dma during each iteration.
>>>>
>>>> If we restrict this ABI to cover single vfio_dma only, then it
>>>> simplifies the logic here. That was my original suggestion. Should we
>>>> think about that again?
>>>
>>> But we currently allow unmaps that overlap multiple vfio_dmas as long
>>> as no vfio_dma is bisected, so I think that implies that an unmap while
>>> asking for the dirty bitmap has even further restricted semantics.  I'm
>>> also reluctant to design an ABI around what happens to be the current
>>> QEMU implementation.
>>>
>>> If we take your example above, ranges {0x0000,0xa000} and
>>> {0xa000,0x10000} ({start,end}), I think you're working with the
>>> following two bitmaps in this implementation:
>>>
>>> 00000011 11111111b
>>> 00111111b
>>>
>>> And we need to combine those into:
>>>
>>> 11111111 11111111b
>>>
>>> Right?
>>>
>>> But it seems like that would be easier if the second bitmap was instead:
>>>
>>> 11111100b
>>>
>>> Then we wouldn't need to worry about the entire bitmap being shifted by
>>> the bit offset within the byte, which limits our fixes to the boundary
>>> byte and allows us to use copy_to_user() directly for the bulk of the
>>> copy.  So how do we get there?
>>>
>>> I think we start with allocating the vfio_dma bitmap to account for
>>> this initial offset, so we calculate bitmap_base_iova as:
>>>     (iova & ~((PAGE_SIZE << 3) - 1))
>>> We then use bitmap_base_iova in calculating which bits to set.
>>>
>>> The user needs to follow the same rules, and maybe this adds some value
>>> to the user providing the bitmap size rather than the kernel
>>> calculating it.  For example, if the user wanted the dirty bitmap for
>>> the range {0xa000,0x10000} above, they'd provide at least a 1 byte
>>> bitmap, but we'd return bit #2 set to indicate 0xa000 is dirty.
>>>
>>> Effectively the user can ask for any iova range, but the buffer will be
>>> filled relative to the zeroth bit of the bitmap following the above
>>> bitmap_base_iova formula (and replacing PAGE_SIZE with the user
>>> requested pgsize).  I'm tempted to make this explicit in the user
>>> interface (ie. only allow bitmaps starting on aligned pages), but a
>>> user is able to map and unmap single pages and we need to support
>>> returning a dirty bitmap with an unmap, so I don't think we can do that.
>>>    
>>
>> Sigh, finding adjacent vfio_dmas within the same byte seems simpler than
>> this.
> 
> How does KVM do this?  My intent was that if all of our bitmaps share
> the same alignment then we can merge the intersection and continue to
> use copy_to_user() on either side.  However, if QEMU doesn't do the
> same, it doesn't really help us.  Is QEMU stuck with an implementation
> of only retrieving dirty bits per MemoryRegionSection exactly because
> of this issue and therefore we can rely on it in our implementation as
> well?  Thanks,
> 

QEMU syncs dirty_bitmap per MemoryRegionSection. Within a
MemoryRegionSection there could be multiple KVMSlots. QEMU queries
dirty_bitmap per KVMSlot and marks dirty for each KVMSlot.
On the kernel side, the KVM_GET_DIRTY_LOG ioctl calls
kvm_get_dirty_log_protect(), which uses copy_to_user() to copy the bitmap
of that memslot.
vfio_dma is per MemoryRegionSection. We can rely on MemoryRegionSection
in our implementation. But to get the bitmap during unmap, we have to take
care of concatenating bitmaps.
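
A minimal userspace sketch of what "concatenating bitmaps" means here,
assuming the destination buffer is zeroed past its current bit count and
large enough for the result.  bitmap_append() is a hypothetical name,
and it copies one bit at a time for clarity rather than doing the
byte-wise shift-OR from the patch:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: append the first src_nbits bits of src to dst,
 * starting at bit position dst_nbits.  Returns the new bit count.
 */
static size_t bitmap_append(uint8_t *dst, size_t dst_nbits,
			    const uint8_t *src, size_t src_nbits)
{
	size_t i;

	for (i = 0; i < src_nbits; i++) {
		if (src[i / 8] & (1u << (i % 8))) {
			size_t bit = dst_nbits + i;

			dst[bit / 8] |= (uint8_t)(1u << (bit % 8));
		}
	}

	return dst_nbits + src_nbits;
}
```

Appending the 10-bit bitmap (00000011 11111111b) and then the 6-bit
bitmap (00111111b) from the earlier example gives 16 bits, all set.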

In QEMU, kvm_physical_sync_dirty_bitmap() has a comment at the point
where the bitmap size is calculated. The bitmap there is defined as
'void __user *dirty_bitmap', which is also the concern you raised, and
it could be handled similarly, as below.

         /* XXX bad kernel interface alert
          * For dirty bitmap, kernel allocates array of size aligned to
          * bits-per-long.  But for case when the kernel is 64bits and
          * the userspace is 32bits, userspace can't align to the same
          * bits-per-long, since sizeof(long) is different between kernel
          * and user space.  This way, userspace will provide buffer which
          * may be 4 bytes less than the kernel will use, resulting in
          * userspace memory corruption (which is not detectable by valgrind
          * too, in most cases).
          * So for now, let's align to 64 instead of HOST_LONG_BITS here, in
          * a hope that sizeof(long) won't become >8 any time soon.
          */
         if (!mem->dirty_bmap) {
             hwaddr bitmap_size = ALIGN(((mem->memory_size) >> 
TARGET_PAGE_BITS),
                                         /*HOST_LONG_BITS*/ 64) / 8;
             /* Allocate on the first log_sync, once and for all */
             mem->dirty_bmap = g_malloc0(bitmap_size);
         }
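
To make the size mismatch in that comment concrete, here is a small
standalone sketch.  dirty_bitmap_bytes_aligned() is a hypothetical
helper, not a QEMU or kernel function; long_bits stands for the width of
long on whichever side computes the size.

```c
#include <stddef.h>

/*
 * Hypothetical helper: bytes needed for an npages-bit bitmap when the
 * bit count is rounded up to a multiple of long_bits.
 */
static size_t dirty_bitmap_bytes_aligned(size_t npages, size_t long_bits)
{
	return ((npages + long_bits - 1) / long_bits) * long_bits / 8;
}
```

For a 10 page slot, a 32-bit userspace rounding to its own long would
size the buffer at 4 bytes, while a 64-bit kernel would write 8 bytes.
Always aligning to 64 on the user side avoids that.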

Thanks,
Kirti


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2020-02-18  5:58                   ` Kirti Wankhede
@ 2020-02-18 21:41                     ` Alex Williamson
  -1 siblings, 0 replies; 62+ messages in thread
From: Alex Williamson @ 2020-02-18 21:41 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Tue, 18 Feb 2020 11:28:53 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> <snip>
> 
> >>>>>    As I understand the above algorithm, we find a vfio_dma
> >>>>> overlapping the request and populate the bitmap for that range.  Then
> >>>>> we go back and put_user() for each byte that we touched.  We could
> >>>>> instead simply work on a one byte buffer as we enumerate the requested
> >>>>> range and do a put_user() every time we reach the end of it and have bits
> >>>>> set. That would greatly simplify the above example.  But I would expect
> >>>>> that we're a) more likely to get asked for ranges covering a single
> >>>>> vfio_dma  
> >>>>
> >>>> QEMU ask for single vfio_dma during each iteration.
> >>>>
> >>>> If we restrict this ABI to cover single vfio_dma only, then it
> >>>> simplifies the logic here. That was my original suggestion. Should we
> >>>> think about that again?  
> >>>
> >>> But we currently allow unmaps that overlap multiple vfio_dmas as long
> >>> as no vfio_dma is bisected, so I think that implies that an unmap while
> >>> asking for the dirty bitmap has even further restricted semantics.  I'm
> >>> also reluctant to design an ABI around what happens to be the current
> >>> QEMU implementation.
> >>>
> >>> If we take your example above, ranges {0x0000,0xa000} and
> >>> {0xa000,0x10000} ({start,end}), I think you're working with the
> >>> following two bitmaps in this implementation:
> >>>
> >>> 00000011 11111111b
> >>> 00111111b
> >>>
> >>> And we need to combine those into:
> >>>
> >>> 11111111 11111111b
> >>>
> >>> Right?
> >>>
> >>> But it seems like that would be easier if the second bitmap was instead:
> >>>
> >>> 11111100b
> >>>
> >>> Then we wouldn't need to worry about the entire bitmap being shifted by
> >>> the bit offset within the byte, which limits our fixes to the boundary
> >>> byte and allows us to use copy_to_user() directly for the bulk of the
> >>> copy.  So how do we get there?
> >>>
> >>> I think we start with allocating the vfio_dma bitmap to account for
> >>> this initial offset, so we calculate bitmap_base_iova as:
> >>>     (iova & ~((PAGE_SIZE << 3) - 1))
> >>> We then use bitmap_base_iova in calculating which bits to set.
> >>>
> >>> The user needs to follow the same rules, and maybe this adds some value
> >>> to the user providing the bitmap size rather than the kernel
> >>> calculating it.  For example, if the user wanted the dirty bitmap for
> >>> the range {0xa000,0x10000} above, they'd provide at least a 1 byte
> >>> bitmap, but we'd return bit #2 set to indicate 0xa000 is dirty.
> >>>
> >>> Effectively the user can ask for any iova range, but the buffer will be
> >>> filled relative to the zeroth bit of the bitmap following the above
> >>> bitmap_base_iova formula (and replacing PAGE_SIZE with the user
> >>> requested pgsize).  I'm tempted to make this explicit in the user
> >>> interface (ie. only allow bitmaps starting on aligned pages), but a
> >>> user is able to map and unmap single pages and we need to support
> >>> returning a dirty bitmap with an unmap, so I don't think we can do that.
> >>>      
> >>
> >> Sigh, finding adjacent vfio_dmas within the same byte seems simpler than
> >> this.  
> > 
> > How does KVM do this?  My intent was that if all of our bitmaps share
> > the same alignment then we can merge the intersection and continue to
> > use copy_to_user() on either side.  However, if QEMU doesn't do the
> > same, it doesn't really help us.  Is QEMU stuck with an implementation
> > of only retrieving dirty bits per MemoryRegionSection exactly because
> > of this issue and therefore we can rely on it in our implementation as
> > well?  Thanks,
> >   
> 
> QEMU syncs dirty_bitmap per MemoryRegionSection. Within a
> MemoryRegionSection there could be multiple KVMSlots. QEMU queries
> dirty_bitmap per KVMSlot and marks dirty for each KVMSlot.
> On the kernel side, the KVM_GET_DIRTY_LOG ioctl calls
> kvm_get_dirty_log_protect(), which uses copy_to_user() to copy the bitmap
> of that memslot.
> vfio_dma is per MemoryRegionSection. We can rely on MemoryRegionSection
> in our implementation. But to get the bitmap during unmap, we have to take
> care of concatenating bitmaps.

So KVM does not worry about bitmap alignment because the interface is
based on slots: a dirty bitmap can only be retrieved for a single,
entire slot.  We need VFIO_IOMMU_UNMAP_DMA to maintain its support for
spanning multiple vfio_dmas, but maybe we have some leeway in that we
don't need to support both multiple vfio_dmas and dirty bitmap at the
same time.  It seems like it would be a massive simplification if we
required an unmap with dirty bitmap to span exactly one vfio_dma,
right?  I don't see that we'd break any existing users with that.  It's
unfortunate that we can't have the flexibility of the existing calling
convention, but I think there's good reason for it here.  Our separate
dirty bitmap log reporting would follow the same semantics.  I think
this all aligns with how the MemoryListener works in QEMU right now,
correct?  For example, we wouldn't need any extra per MAP_DMA tracking
in QEMU like KVM has for its slots.
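
As a rough sketch of what that restriction could look like (userspace C,
with a hypothetical struct dma_range standing in for vfio_dma and a flat
array in place of the kernel's lookup), a request carrying a dirty bitmap
would be accepted only when it matches one mapping exactly:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct vfio_dma; only the range matters here. */
struct dma_range {
	uint64_t iova;
	uint64_t size;
};

/*
 * Return true only if {iova, size} matches exactly one tracked range,
 * i.e. the same start and the same length.  A request that spans two
 * ranges, or covers only part of one, is rejected.
 */
static bool unmap_bitmap_range_ok(const struct dma_range *maps, size_t nr,
				  uint64_t iova, uint64_t size)
{
	assert(maps != NULL || nr == 0);

	for (size_t i = 0; i < nr; i++) {
		if (maps[i].iova == iova && maps[i].size == size)
			return true;
	}
	return false;
}
```

With the two example ranges {0x0000,0xa000} and {0xa000,0x10000}, either
range alone would pass, while a single request for {0x0000,0x10000} would
be refused when a bitmap is requested.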

> In QEMU, in function kvm_physical_sync_dirty_bitmap() there is a comment 
> where bitmap size is calculated and bitmap is defined as 'void __user 
> *dirty_bitmap' which is also the concern you raised and could be handled 
> similarly as below.
> 
>          /* XXX bad kernel interface alert
>           * For dirty bitmap, kernel allocates array of size aligned to
>           * bits-per-long.  But for case when the kernel is 64bits and
>           * the userspace is 32bits, userspace can't align to the same
>           * bits-per-long, since sizeof(long) is different between kernel
>           * and user space.  This way, userspace will provide buffer which
>           * may be 4 bytes less than the kernel will use, resulting in
>           * userspace memory corruption (which is not detectable by valgrind
>           * too, in most cases).
>           * So for now, let's align to 64 instead of HOST_LONG_BITS here, in
>           * a hope that sizeof(long) won't become >8 any time soon.
>           */
>          if (!mem->dirty_bmap) {
>              hwaddr bitmap_size = ALIGN(((mem->memory_size) >> TARGET_PAGE_BITS),
>                                          /*HOST_LONG_BITS*/ 64) / 8;
>              /* Allocate on the first log_sync, once and for all */
>              mem->dirty_bmap = g_malloc0(bitmap_size);
>          }

Sort of.  The KVM ioctl seems to just pass a slot number and user
dirty bitmap pointer, so the size of the bitmap is inferred from the size
of the slot, but if both kernel and user round up to a multiple of
longs they might come up with different lengths.  QEMU therefore decides
to always round up the size for an LP64 based long.  Since you've
specified bitmap_size in our ioctl, the size agreement is explicit.
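
To make the mismatch concrete, here is a small sketch (bitmap_bytes() is
a hypothetical helper, not a QEMU or kernel function) of the size each
side computes when rounding the page count up to its own long:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round a page count up to a multiple of word_bits and return the
 * resulting bitmap size in bytes (one bit per page).
 */
static uint64_t bitmap_bytes(uint64_t npages, unsigned int word_bits)
{
	assert(word_bits != 0 && word_bits % 8 == 0);
	return (npages + word_bits - 1) / word_bits * (word_bits / 8);
}
```

For 65 pages, a 32-bit userspace would compute 12 bytes while a 64-bit
kernel would use 16, which is the 4 byte shortfall the QEMU comment
describes.  Rounding to 64 bits on both sides gives the same answer.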

The concern I had looks like it is addressed in KVM by placing the
void __user * pointer in a union with a u64:

struct kvm_dirty_log {
        __u32 slot;
        __u32 padding1;
        union {
                void __user *dirty_bitmap; /* one bit per page */
                __u64 padding2;
        };
};
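
A quick userspace check of that layout, as a sketch only (fixed-width
stdint types stand in for __u32/__u64, and struct dirty_log_layout is a
made-up name), suggests the union pins the structure at 16 bytes with
the pointer at offset 8 whatever the pointer width is:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace copy of the layout quoted above, not the kernel definition. */
struct dirty_log_layout {
	uint32_t slot;
	uint32_t padding1;
	union {
		void *dirty_bitmap;	/* 4 or 8 bytes depending on the ABI */
		uint64_t padding2;	/* forces the union to 8 bytes */
	};
};

/* Both checks are meant to hold for 32-bit and 64-bit builds alike. */
static_assert(sizeof(struct dirty_log_layout) == 16,
	      "structure size should not depend on pointer width");
static_assert(offsetof(struct dirty_log_layout, dirty_bitmap) == 8,
	      "pointer should start at offset 8");
```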

The kvm_vm_compat_ioctl() handler deals with this using its own private
structure:

struct compat_kvm_dirty_log {
        __u32 slot;
        __u32 padding1;
        union {
                compat_uptr_t dirty_bitmap; /* one bit per page */
                __u64 padding2;
        };
};

Which gets extracted via:

	log.dirty_bitmap = compat_ptr(compat_log.dirty_bitmap);

However, compat_ptr() has:

/*
 * A pointer passed in from user mode. This should not
 * be used for syscall parameters, just declare them
 * as pointers because the syscall entry code will have
 * appropriately converted them already.
 */
#ifndef compat_ptr
static inline void __user *compat_ptr(compat_uptr_t uptr)
{
        return (void __user *)(unsigned long)uptr;
}
#endif

So maybe we don't need to do anything special?  I'm tempted to think
the KVM handling is using a legacy mechanism, or that the padding in the union
was assumed not to be for that purpose.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2020-02-18 21:41                     ` Alex Williamson
@ 2020-02-19  4:21                       ` Kirti Wankhede
  -1 siblings, 0 replies; 62+ messages in thread
From: Kirti Wankhede @ 2020-02-19  4:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm



On 2/19/2020 3:11 AM, Alex Williamson wrote:
> On Tue, 18 Feb 2020 11:28:53 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> <snip>
>>
>>>>>>>     As I understand the above algorithm, we find a vfio_dma
>>>>>>> overlapping the request and populate the bitmap for that range.  Then
>>>>>>> we go back and put_user() for each byte that we touched.  We could
>>>>>>> instead simply work on a one byte buffer as we enumerate the requested
>>>>>>> range and do a put_user() every time we reach the end of it and have bits
>>>>>>> set. That would greatly simplify the above example.  But I would expect
>>>>>>> that we're a) more likely to get asked for ranges covering a single
>>>>>>> vfio_dma
>>>>>>
>>>>>> QEMU ask for single vfio_dma during each iteration.
>>>>>>
>>>>>> If we restrict this ABI to cover single vfio_dma only, then it
>>>>>> simplifies the logic here. That was my original suggestion. Should we
>>>>>> think about that again?
>>>>>
>>>>> But we currently allow unmaps that overlap multiple vfio_dmas as long
>>>>> as no vfio_dma is bisected, so I think that implies that an unmap while
>>>>> asking for the dirty bitmap has even further restricted semantics.  I'm
>>>>> also reluctant to design an ABI around what happens to be the current
>>>>> QEMU implementation.
>>>>>
>>>>> If we take your example above, ranges {0x0000,0xa000} and
>>>>> {0xa000,0x10000} ({start,end}), I think you're working with the
>>>>> following two bitmaps in this implementation:
>>>>>
>>>>> 00000011 11111111b
>>>>> 00111111b
>>>>>
>>>>> And we need to combine those into:
>>>>>
>>>>> 11111111 11111111b
>>>>>
>>>>> Right?
>>>>>
>>>>> But it seems like that would be easier if the second bitmap was instead:
>>>>>
>>>>> 11111100b
>>>>>
>>>>> Then we wouldn't need to worry about the entire bitmap being shifted by
>>>>> the bit offset within the byte, which limits our fixes to the boundary
>>>>> byte and allows us to use copy_to_user() directly for the bulk of the
>>>>> copy.  So how do we get there?
>>>>>
>>>>> I think we start with allocating the vfio_dma bitmap to account for
>>>>> this initial offset, so we calculate bitmap_base_iova as:
>>>>>      (iova & ~((PAGE_SIZE << 3) - 1))
>>>>> We then use bitmap_base_iova in calculating which bits to set.
>>>>>
>>>>> The user needs to follow the same rules, and maybe this adds some value
>>>>> to the user providing the bitmap size rather than the kernel
>>>>> calculating it.  For example, if the user wanted the dirty bitmap for
>>>>> the range {0xa000,0x10000} above, they'd provide at least a 1 byte
>>>>> bitmap, but we'd return bit #2 set to indicate 0xa000 is dirty.
>>>>>
>>>>> Effectively the user can ask for any iova range, but the buffer will be
>>>>> filled relative to the zeroth bit of the bitmap following the above
>>>>> bitmap_base_iova formula (and replacing PAGE_SIZE with the user
>>>>> requested pgsize).  I'm tempted to make this explicit in the user
>>>>> interface (ie. only allow bitmaps starting on aligned pages), but a
>>>>> user is able to map and unmap single pages and we need to support
>>>>> returning a dirty bitmap with an unmap, so I don't think we can do that.
>>>>>       
>>>>
>>>> Sigh, finding adjacent vfio_dmas within the same byte seems simpler than
>>>> this.
>>>
>>> How does KVM do this?  My intent was that if all of our bitmaps share
>>> the same alignment then we can merge the intersection and continue to
>>> use copy_to_user() on either side.  However, if QEMU doesn't do the
>>> same, it doesn't really help us.  Is QEMU stuck with an implementation
>>> of only retrieving dirty bits per MemoryRegionSection exactly because
>>> of this issue and therefore we can rely on it in our implementation as
>>> well?  Thanks,
>>>    
>>
>> QEMU syncs dirty_bitmap per MemoryRegionSection. Within a
>> MemoryRegionSection there could be multiple KVMSlots. QEMU queries
>> dirty_bitmap per KVMSlot and marks dirty for each KVMSlot.
>> On the kernel side, the KVM_GET_DIRTY_LOG ioctl calls
>> kvm_get_dirty_log_protect(), where it uses copy_to_user() to copy the bitmap
>> of that memslot.
>> vfio_dma is per MemoryRegionSection. We can rely on MemoryRegionSection
>> in our implementation. But to get the bitmap during unmap, we have to take
>> care of concatenating bitmaps.
> 
> So KVM does not worry about bitmap alignment because the interface is
> based on slots, a dirty bitmap can only be retrieved for a single,
> entire slot.  We need VFIO_IOMMU_UNMAP_DMA to maintain its support for
> spanning multiple vfio_dmas, but maybe we have some leeway that we
> don't need to support both multiple vfio_dmas and dirty bitmap at the
> same time.  It seems like it would be a massive simplification if we
> required an unmap with dirty bitmap to span exactly one vfio_dma,
> right? 

Yes.

> I don't see that we'd break any existing users with that, it's
> unfortunate that we can't have the flexibility of the existing calling
> convention, but I think there's good reason for it here.  Our separate
> dirty bitmap log reporting would follow the same semantics.  I think
> this all aligns with how the MemoryListener works in QEMU right now,
> correct?  For example we wouldn't need any extra per MAP_DMA tracking
> in QEMU like KVM has for its slots.
> 

That's right.
Should we go ahead with an implementation that returns the dirty bitmap
for exactly one vfio_dma, for both the GET_DIRTY ioctl and the unmap with
dirty bitmap ioctl? We can then add sanity checks to these ioctls
accordingly.

Thanks,
Kirti

>> In QEMU, in function kvm_physical_sync_dirty_bitmap() there is a comment
>> where bitmap size is calculated and bitmap is defined as 'void __user
>> *dirty_bitmap' which is also the concern you raised and could be handled
>> similarly as below.
>>
>>           /* XXX bad kernel interface alert
>>            * For dirty bitmap, kernel allocates array of size aligned to
>>            * bits-per-long.  But for case when the kernel is 64bits and
>>            * the userspace is 32bits, userspace can't align to the same
>>            * bits-per-long, since sizeof(long) is different between kernel
>>            * and user space.  This way, userspace will provide buffer which
>>            * may be 4 bytes less than the kernel will use, resulting in
>>            * userspace memory corruption (which is not detectable by valgrind
>>            * too, in most cases).
>>            * So for now, let's align to 64 instead of HOST_LONG_BITS here, in
>>            * a hope that sizeof(long) won't become >8 any time soon.
>>            */
>>           if (!mem->dirty_bmap) {
>>               hwaddr bitmap_size = ALIGN(((mem->memory_size) >> TARGET_PAGE_BITS),
>>                                           /*HOST_LONG_BITS*/ 64) / 8;
>>               /* Allocate on the first log_sync, once and for all */
>>               mem->dirty_bmap = g_malloc0(bitmap_size);
>>           }
> 
> Sort of, the KVM ioctl seems to just pass a slot number and user
> dirty bitmap pointer, so the size of the bitmap is inferred by the size
> of the slot, but if both kernel and user round up to a multiple of
> longs they might come up with different lengths.  QEMU therefore decides
> to always round up the size for an LP64 based long.  Since you've
> specified bitmap_size in our ioctl, the size agreement is explicit.
> 
> The concern I had looks like it is addressed in KVM by placing the
> void __user * pointer in a union with a u64:
> 
> struct kvm_dirty_log {
>          __u32 slot;
>          __u32 padding1;
>          union {
>                  void __user *dirty_bitmap; /* one bit per page */
>                  __u64 padding2;
>          };
> };
> 
> The kvm_vm_compat_ioctl() handler deals with this using its own private
> structure:
> 
> struct compat_kvm_dirty_log {
>          __u32 slot;
>          __u32 padding1;
>          union {
>                  compat_uptr_t dirty_bitmap; /* one bit per page */
>                  __u64 padding2;
>          };
> };
> 
> Which gets extracted via:
> 
> 	log.dirty_bitmap = compat_ptr(compat_log.dirty_bitmap);
> 
> However, compat_ptr() has:
> 
> /*
>   * A pointer passed in from user mode. This should not
>   * be used for syscall parameters, just declare them
>   * as pointers because the syscall entry code will have
>   * appropriately converted them already.
>   */
> #ifndef compat_ptr
> static inline void __user *compat_ptr(compat_uptr_t uptr)
> {
>          return (void __user *)(unsigned long)uptr;
> }
> #endif
> 
> So maybe we don't need to do anything special?  I'm tempted to think
> the KVM handling is using a legacy mechanism, or that the padding in the union
> was assumed not to be for that purpose.  Thanks,
> 
> Alex
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2020-02-19  4:21                       ` Kirti Wankhede
@ 2020-02-19  4:53                         ` Alex Williamson
  -1 siblings, 0 replies; 62+ messages in thread
From: Alex Williamson @ 2020-02-19  4:53 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Wed, 19 Feb 2020 09:51:32 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 2/19/2020 3:11 AM, Alex Williamson wrote:
> > On Tue, 18 Feb 2020 11:28:53 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> <snip>
> >>  
> >>>>>>>     As I understand the above algorithm, we find a vfio_dma
> >>>>>>> overlapping the request and populate the bitmap for that range.  Then
> >>>>>>> we go back and put_user() for each byte that we touched.  We could
> >>>>>>> instead simply work on a one byte buffer as we enumerate the requested
> >>>>>>> range and do a put_user() every time we reach the end of it and have bits
> >>>>>>> set. That would greatly simplify the above example.  But I would expect
> >>>>>>> that we're a) more likely to get asked for ranges covering a single
> >>>>>>> vfio_dma  
> >>>>>>
> >>>>>> QEMU ask for single vfio_dma during each iteration.
> >>>>>>
> >>>>>> If we restrict this ABI to cover single vfio_dma only, then it
> >>>>>> simplifies the logic here. That was my original suggestion. Should we
> >>>>>> think about that again?  
> >>>>>
> >>>>> But we currently allow unmaps that overlap multiple vfio_dmas as long
> >>>>> as no vfio_dma is bisected, so I think that implies that an unmap while
> >>>>> asking for the dirty bitmap has even further restricted semantics.  I'm
> >>>>> also reluctant to design an ABI around what happens to be the current
> >>>>> QEMU implementation.
> >>>>>
> >>>>> If we take your example above, ranges {0x0000,0xa000} and
> >>>>> {0xa000,0x10000} ({start,end}), I think you're working with the
> >>>>> following two bitmaps in this implementation:
> >>>>>
> >>>>> 00000011 11111111b
> >>>>> 00111111b
> >>>>>
> >>>>> And we need to combine those into:
> >>>>>
> >>>>> 11111111 11111111b
> >>>>>
> >>>>> Right?
> >>>>>
> >>>>> But it seems like that would be easier if the second bitmap was instead:
> >>>>>
> >>>>> 11111100b
> >>>>>
> >>>>> Then we wouldn't need to worry about the entire bitmap being shifted by
> >>>>> the bit offset within the byte, which limits our fixes to the boundary
> >>>>> byte and allows us to use copy_to_user() directly for the bulk of the
> >>>>> copy.  So how do we get there?
> >>>>>
> >>>>> I think we start with allocating the vfio_dma bitmap to account for
> >>>>> this initial offset, so we calculate bitmap_base_iova as:
> >>>>>      (iova & ~((PAGE_SIZE << 3) - 1))
> >>>>> We then use bitmap_base_iova in calculating which bits to set.
> >>>>>
> >>>>> The user needs to follow the same rules, and maybe this adds some value
> >>>>> to the user providing the bitmap size rather than the kernel
> >>>>> calculating it.  For example, if the user wanted the dirty bitmap for
> >>>>> the range {0xa000,0x10000} above, they'd provide at least a 1 byte
> >>>>> bitmap, but we'd return bit #2 set to indicate 0xa000 is dirty.
> >>>>>
> >>>>> Effectively the user can ask for any iova range, but the buffer will be
> >>>>> filled relative to the zeroth bit of the bitmap following the above
> >>>>> bitmap_base_iova formula (and replacing PAGE_SIZE with the user
> >>>>> requested pgsize).  I'm tempted to make this explicit in the user
> >>>>> interface (ie. only allow bitmaps starting on aligned pages), but a
> >>>>> user is able to map and unmap single pages and we need to support
> >>>>> returning a dirty bitmap with an unmap, so I don't think we can do that.
> >>>>>         
> >>>>
> >>>> Sigh, finding adjacent vfio_dmas within the same byte seems simpler than
> >>>> this.  
> >>>
> >>> How does KVM do this?  My intent was that if all of our bitmaps share
> >>> the same alignment then we can merge the intersection and continue to
> >>> use copy_to_user() on either side.  However, if QEMU doesn't do the
> >>> same, it doesn't really help us.  Is QEMU stuck with an implementation
> >>> of only retrieving dirty bits per MemoryRegionSection exactly because
> >>> of this issue and therefore we can rely on it in our implementation as
> >>> well?  Thanks,
> >>>      
> >>
> >> QEMU sync dirty_bitmap per MemoryRegionSection. Within
> >> MemoryRegionSection there could be multiple KVMSlots. QEMU queries
> >> dirty_bitmap per KVMSlot and mark dirty for each KVMSlot.
> >> On kernel side, KVM_GET_DIRTY_LOG ioctl calls
> >> kvm_get_dirty_log_protect(), where it uses copy_to_user() to copy bitmap
> >> of that memSlot.
> >> vfio_dma is per MemoryRegionSection. We can rely on MemoryRegionSection
> >> in our implementation. But to get bitmap during unmap, we have to take
> >> care of concatenating bitmaps.  
> > 
> > So KVM does not worry about bitmap alignment because the interface is
> > based on slots, a dirty bitmap can only be retrieved for a single,
> > entire slot.  We need VFIO_IOMMU_UNMAP_DMA to maintain its support for
> > spanning multiple vfio_dmas, but maybe we have some leeway that we
> > don't need to support both multiple vfio_dmas and dirty bitmap at the
> > same time.  It seems like it would be a massive simplification if we
> > required an unmap with dirty bitmap to span exactly one vfio_dma,
> > right?   
> 
> Yes.
> 
> > I don't see that we'd break any existing users with that, it's
> > unfortunate that we can't have the flexibility of the existing calling
> > convention, but I think there's good reason for it here.  Our separate
> > dirty bitmap log reporting would follow the same semantics.  I think
> > this all aligns with how the MemoryListener works in QEMU right now,
> > correct?  For example we wouldn't need any extra per MAP_DMA tracking
> > in QEMU like KVM has for its slots.
> >   
> 
> That's right.
> Should we go ahead with the implementation to get dirty bitmap for one 
> vfio_dma for GET_DIRTY ioctl and unmap with dirty ioctl? Accordingly we 
> can have sanity checks in these ioctls.

Yes, I'm convinced that bitmap alignment is sufficiently difficult, and
sufficiently unnecessary, that we should restrict the calling convention
of UNMAP_DMA, when using the dirty bitmap extension, to unmapping exactly
a single vfio_dma.
Thanks,

Alex
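
For reference, a minimal user-space sketch of the two approaches weighed in
this thread: the bitmap_base_iova alignment arithmetic from the quoted
example, and the simpler "exactly one vfio_dma" check agreed on above. This
is not code from the patch series. struct dma_range and all helper names are
hypothetical stand-ins, and the bit order (bit 0 is the least significant
bit of byte 0) is an assumption chosen to match the 11111100b example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: the IOVA represented by bit 0 of the bitmap,
 * following the formula (iova & ~((pgsize << 3) - 1)) from the thread. */
static uint64_t bitmap_base_iova(uint64_t iova, uint64_t pgsize)
{
	assert(pgsize && (pgsize & (pgsize - 1)) == 0); /* power of two */
	return iova & ~((pgsize << 3) - 1);
}

/* Hypothetical helper: bit index of the page at iova, counted from the
 * aligned base of the range that starts at range_start. */
static uint64_t dirty_bit_index(uint64_t iova, uint64_t range_start,
				uint64_t pgsize)
{
	return (iova - bitmap_base_iova(range_start, pgsize)) / pgsize;
}

/* Mark every page in [start, end) dirty. Bit N lives in byte N / 8 at
 * position N % 8 (assumed LSB-first order). */
static void mark_range_dirty(uint8_t *bitmap, uint64_t start, uint64_t end,
			     uint64_t pgsize)
{
	for (uint64_t iova = start; iova < end; iova += pgsize) {
		uint64_t bit = dirty_bit_index(iova, start, pgsize);

		bitmap[bit / 8] |= (uint8_t)(1u << (bit % 8));
	}
}

/* Hypothetical stand-in for the iova/size pair of a struct vfio_dma. */
struct dma_range {
	uint64_t iova;
	uint64_t size;
};

/* The agreed restriction: a request that wants a dirty bitmap must match
 * exactly one tracked mapping. Returns 1 on an exact match, else 0. */
static int matches_single_dma(const struct dma_range *dmas, size_t n,
			      uint64_t iova, uint64_t size)
{
	for (size_t i = 0; i < n; i++)
		if (dmas[i].iova == iova && dmas[i].size == size)
			return 1;
	return 0;
}
```

With 4K pages, mark_range_dirty() over {0xa000,0x10000} sets bits 2 through
7 of a one-byte bitmap, which is the 11111100b layout discussed above. With
the single-vfio_dma restriction, none of that offset handling is needed and
only a check along the lines of matches_single_dma() remains.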


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 1/7] vfio: KABI for migration interface for device state
  2020-02-07 19:42   ` Kirti Wankhede
@ 2020-02-27  8:58     ` Yan Zhao
  -1 siblings, 0 replies; 62+ messages in thread
From: Yan Zhao @ 2020-02-27  8:58 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, cjia, Tian, Kevin, Yang, Ziye, Liu, Changpeng,
	Liu, Yi L, mlevitsk, eskultet, cohuck, dgilbert, jonathan.davies,
	eauger, aik, pasic, felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue,
	Wang, Zhi A, qemu-devel, kvm

On Sat, Feb 08, 2020 at 03:42:28AM +0800, Kirti Wankhede wrote:
> - Defined MIGRATION region type and sub-type.
> 
> - Defined vfio_device_migration_info structure which will be placed at 0th
>   offset of migration region to get/set VFIO device related information.
>   Defined members of structure and usage on read/write access.
> 
> - Defined device states and state transition details.
> 
> - Defined sequence to be followed while saving and resuming VFIO device.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  include/uapi/linux/vfio.h | 208 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 208 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9e843a147ead..572242620ce9 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
>  #define VFIO_REGION_TYPE_GFX                    (1)
>  #define VFIO_REGION_TYPE_CCW			(2)
> +#define VFIO_REGION_TYPE_MIGRATION              (3)
>  
>  /* sub-types for VFIO_REGION_TYPE_PCI_* */
>  
> @@ -379,6 +380,213 @@ struct vfio_region_gfx_edid {
>  /* sub-types for VFIO_REGION_TYPE_CCW */
>  #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
>  
> +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
> +#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
> +
> +/*
> + * Structure vfio_device_migration_info is placed at 0th offset of
> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> + * information. Field accesses from this structure are only supported at their
> + * native width and alignment, otherwise the result is undefined and vendor
> + * drivers should return an error.
> + *
> + * device_state: (read/write)
> + *      - User application writes this field to inform vendor driver about the
> + *        device state to be transitioned to.
> + *      - Vendor driver should take necessary actions to change device state.
> + *        On successful transition to given state, vendor driver should return
> + *        success on write(device_state, state) system call. If device state
> + *        transition fails, vendor driver should return error, -EFAULT.
> + *      - On user application side, if device state transition fails, i.e. if
> + *        write(device_state, state) returns error, read device_state again to
> + *        determine the current state of the device from vendor driver.
> + *      - Vendor driver should return previous state of the device unless vendor
> + *        driver has encountered an internal error, in which case vendor driver
> + *        may report the device_state VFIO_DEVICE_STATE_ERROR.
> + *	- User application must use the device reset ioctl in order to recover
> + *	  the device from VFIO_DEVICE_STATE_ERROR state. If the device is
> + *	  indicated in a valid device state via reading device_state, the user
> + *	  application may decide attempt to transition the device to any valid
> + *	  state reachable from the current state or terminate itself.
> + *
> + *      device_state consists of 3 bits:
> + *      - If bit 0 set, indicates _RUNNING state. When it's clear, that
> + *	  indicates _STOP state. When device is changed to _STOP, driver should
> + *	  stop device before write() returns.
> + *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
> + *        should start gathering device state information which will be provided
> + *        to VFIO user application to save device's state.
> + *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
> + *        prepare to resume device, data provided through migration region
> + *        should be used to resume device.
> + *      Bits 3 - 31 are reserved for future use. In order to preserve them,
> + *	user application should perform read-modify-write operation on this
> + *	field when modifying the specified bits.
> + *
> + *  +------- _RESUMING
> + *  |+------ _SAVING
> + *  ||+----- _RUNNING
> + *  |||
> + *  000b => Device Stopped, not saving or resuming
> + *  001b => Device running state, default state
> + *  010b => Stop Device & save device state, stop-and-copy state
> + *  011b => Device running and save device state, pre-copy state
> + *  100b => Device stopped and device state is resuming
> + *  101b => Invalid state
> + *  110b => Error state
> + *  111b => Invalid state
> + *
> + * State transitions:
> + *
> + *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
> + *                (100b)     (001b)     (011b)        (010b)       (000b)
> + * 0. Running or Default state
> + *                             |
> + *
> + * 1. Normal Shutdown (optional)
> + *                             |------------------------------------->|
> + *
> + * 2. Save state or Suspend
> + *                             |------------------------->|---------->|
> + *
> + * 3. Save state during live migration
> + *                             |----------->|------------>|---------->|
> + *
> + * 4. Resuming
> + *                  |<---------|
> + *
> + * 5. Resumed
> + *                  |--------->|
> + *
> + * 0. Default state of VFIO device is _RUNNNG when user application starts.
> + * 1. During normal user application shutdown, vfio device state changes
> + *    from _RUNNING to _STOP. This is optional, user application may or may not
> + *    perform this state transition and vendor driver may not need.
> + * 2. When user application save state or suspend application, device state
> + *    transitions from _RUNNING to stop-and-copy state and then to _STOP.
> + *    On state transition from _RUNNING to stop-and-copy, driver must
> + *    stop device, save device state and send it to application through
> + *    migration region. Sequence to be followed for such transition is given
> + *    below.
> + * 3. In user application live migration, state transitions from _RUNNING
> + *    to pre-copy to stop-and-copy to _STOP.
> + *    On state transition from _RUNNING to pre-copy, driver should start
> + *    gathering device state while application is still running and send device
> + *    state data to application through migration region.
> + *    On state transition from pre-copy to stop-and-copy, driver must stop
> + *    device, save device state and send it to user application through
> + *    migration region.
> + *    Sequence to be followed for above two transitions is given below.
> + * 4. To start resuming phase, device state should be transitioned from
> + *    _RUNNING to _RESUMING state.
> + *    In _RESUMING state, driver should use received device state data through
> + *    migration region to resume device.
> + * 5. On providing saved device data to driver, application should change state
> + *    from _RESUMING to _RUNNING.
> + *
> + * pending bytes: (read only)
> + *      Number of pending bytes yet to be migrated from vendor driver
> + *
> + * data_offset: (read only)
> + *      User application should read data_offset in migration region from where
> + *      user application should read device data during _SAVING state or write
> + *      device data during _RESUMING state. See below for detail of sequence to
> + *      be followed.
> + *
> + * data_size: (read/write)
> + *      User application should read data_size to get size of data copied in
> + *      bytes in migration region during _SAVING state and write size of data
> + *      copied in bytes in migration region during _RESUMING state.
> + *
> + * Migration region looks like:
> + *  ------------------------------------------------------------------
> + * |vfio_device_migration_info|    data section                      |
> + * |                          |     ///////////////////////////////  |
> + * ------------------------------------------------------------------
> + *   ^                              ^
> + *  offset 0-trapped part        data_offset
> + *
> + * Structure vfio_device_migration_info is always followed by data section in
> + * the region, so data_offset will always be non-0. Offset from where data is
> + * copied is decided by kernel driver, data section can be trapped or mapped
> + * or partitioned, depending on how kernel driver defines data section.
> + * Data section partition can be defined as mapped by sparse mmap capability.
> + * If mmapped, then data_offset should be page aligned, where as initial section
> + * which contain vfio_device_migration_info structure might not end at offset
> + * which is page aligned. The user is not required to access via mmap regardless
> + * of the region mmap capabilities.
> + * Vendor driver should decide whether to partition data section and how to
> + * partition the data section. Vendor driver should return data_offset
> + * accordingly.
> + *
> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> + * and for _SAVING device state or stop-and-copy phase:
> + * a. read pending_bytes, indicates start of new iteration to get device data.
> + *    Repeatative read on pending_bytes at this stage should have no side
> + *    effect.
If the data section is mmapped into user space, the vendor driver cannot
know when the user application has finished reading the data. So if the
user application reads pending_bytes repeatedly, the vendor driver does
not know what value to return, unless it assumes that a read of data_size
signals that the data has been read. That is somewhat strange, because
data_size is read before the data itself.

For example, the vendor driver does not know how to handle the sequence
below:
1. read pending_bytes
2. read data_offset
3. read pending_bytes
4. read data_size

And what if user space follows the sequence below but never actually
reads the data?
1. read pending_bytes
2. read data_offset
3. read data_size
Thanks
Yan
 
> + *    If pending_bytes == 0, user application should not iterate to get data
> + *    for that device.
> + *    If pending_bytes > 0, go through below steps.
> + * b. read data_offset, indicates vendor driver to make data available through
> + *    data section. Vendor driver should return this read operation only after
> + *    data is available from (region + data_offset) to (region + data_offset +
> + *    data_size).
> + * c. read data_size, amount of data in bytes available through migration
> + *    region.
> + *    Read on data_offset and data_size should return offset and size of current
> + *    buffer if user application reads those more than once here.
> + * d. read data of data_size bytes from (region + data_offset) from migration
> + *    region.
> + * e. process data.
> + * f. read pending_bytes, this read operation indicates data from previous
> + *    iteration had read. If pending_bytes > 0, goto step b.
> + *
> + * If there is any error during the above sequence, vendor driver can return
> + * error code for next read()/write() operation, that will terminate the loop
> + * and user should take next necessary action, for example, fail migration or
> + * terminate user application.
> + *
> + * User application can transition from _SAVING|_RUNNING (pre-copy state) to
> + * _SAVING (stop-and-copy) state regardless of pending bytes.
> + * User application should iterate in _SAVING (stop-and-copy) until
> + * pending_bytes is 0.
> + *
> + * Sequence to be followed while _RESUMING device state:
> + * While data for this device is available, repeat below steps:
> + * a. read data_offset from where user application should write data.
> + * b. write data of data_size to migration region from data_offset. Data size
> + *    should be data packet size at source during _SAVING.
> + * c. write data_size which indicates vendor driver that data is written in
> + *    migration region. Vendor driver should read this data from migration
> + *    region and resume device's state.
> + *
> + * For user application, data is opaque. User application should write data in
> + * the same order as received and should of same transaction size at source.
> + */
> +
> +struct vfio_device_migration_info {
> +	__u32 device_state;         /* VFIO device state */
> +#define VFIO_DEVICE_STATE_STOP      (0)
> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> +				     VFIO_DEVICE_STATE_SAVING |  \
> +				     VFIO_DEVICE_STATE_RESUMING)
> +
> +#define VFIO_DEVICE_STATE_VALID(state) \
> +	(state & VFIO_DEVICE_STATE_RESUMING ? \
> +	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
> +
> +#define VFIO_DEVICE_STATE_ERROR			\
> +		(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING)
> +
> +	__u32 reserved;
> +	__u64 pending_bytes;
> +	__u64 data_offset;
> +	__u64 data_size;
> +} __attribute__((packed));
> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within
> -- 
> 2.7.0
> 
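
For reference, one pass over steps a to d of the quoted _SAVING sequence
might look like the sketch below on the user side. It is an illustration
under assumptions, not code from the patch or from QEMU. The struct layout
is copied from the quoted patch with fixed-width C types; read_u64(),
save_one_chunk() and MIG_FIELD() are hypothetical names; "region" is assumed
to be the migration region's offset within the device fd (for example as
reported by VFIO_DEVICE_GET_REGION_INFO); and the data section is assumed to
be read with pread() rather than through mmap.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Layout copied from the quoted patch, using fixed-width types. */
struct vfio_device_migration_info {
	uint32_t device_state;
	uint32_t reserved;
	uint64_t pending_bytes;
	uint64_t data_offset;
	uint64_t data_size;
} __attribute__((packed));

#define MIG_FIELD(f) offsetof(struct vfio_device_migration_info, f)

/* Hypothetical helper: read one 64-bit field at its native width. */
static int read_u64(int fd, off_t region, size_t field_off, uint64_t *val)
{
	ssize_t n = pread(fd, val, sizeof(*val), region + (off_t)field_off);

	return n == (ssize_t)sizeof(*val) ? 0 : -1;
}

/*
 * One pass over steps a-d. Returns the number of bytes copied into buf,
 * 0 when pending_bytes is 0 (nothing left to fetch), or -1 on error.
 */
static ssize_t save_one_chunk(int fd, off_t region, void *buf, size_t cap)
{
	uint64_t pending, offset, size;
	ssize_t n;

	assert(buf != NULL);

	/* a. read pending_bytes: starts a new iteration */
	if (read_u64(fd, region, MIG_FIELD(pending_bytes), &pending))
		return -1;
	if (pending == 0)
		return 0;

	/* b. read data_offset: driver makes the data available */
	if (read_u64(fd, region, MIG_FIELD(data_offset), &offset))
		return -1;

	/* c. read data_size: bytes available at data_offset */
	if (read_u64(fd, region, MIG_FIELD(data_size), &size))
		return -1;
	if (size > cap)
		return -1;

	/* d. read the data itself; the driver sees this read only because
	 * it goes through pread() instead of an mmap of the data section */
	n = pread(fd, buf, (size_t)size, region + (off_t)offset);
	return n == (ssize_t)size ? n : -1;
}
```

A caller would repeat save_one_chunk() until it returns 0, processing each
chunk in between (steps e and f). If step d were a memcpy() from an mmapped
data section, the driver would observe only the field reads in steps a to c,
which is the ambiguity raised in the inline comment above.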

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v12 Kernel 1/7] vfio: KABI for migration interface for device state
@ 2020-02-27  8:58     ` Yan Zhao
  0 siblings, 0 replies; 62+ messages in thread
From: Yan Zhao @ 2020-02-27  8:58 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Zhengxiao.zx, Tian, Kevin, Liu, Yi L, cjia, kvm, eskultet, Yang,
	Ziye, qemu-devel, cohuck, shuangtai.tst, dgilbert, Wang, Zhi A,
	mlevitsk, pasic, aik, alex.williamson, eauger, felipe,
	jonathan.davies, Liu, Changpeng, Ken.Xue

On Sat, Feb 08, 2020 at 03:42:28AM +0800, Kirti Wankhede wrote:
> - Defined MIGRATION region type and sub-type.
> 
> - Defined vfio_device_migration_info structure which will be placed at 0th
>   offset of migration region to get/set VFIO device related information.
>   Defined members of structure and usage on read/write access.
> 
> - Defined device states and state transition details.
> 
> - Defined sequence to be followed while saving and resuming VFIO device.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  include/uapi/linux/vfio.h | 208 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 208 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9e843a147ead..572242620ce9 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
>  #define VFIO_REGION_TYPE_GFX                    (1)
>  #define VFIO_REGION_TYPE_CCW			(2)
> +#define VFIO_REGION_TYPE_MIGRATION              (3)
>  
>  /* sub-types for VFIO_REGION_TYPE_PCI_* */
>  
> @@ -379,6 +380,213 @@ struct vfio_region_gfx_edid {
>  /* sub-types for VFIO_REGION_TYPE_CCW */
>  #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
>  
> +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
> +#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
> +
> +/*
> + * Structure vfio_device_migration_info is placed at 0th offset of
> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> + * information. Field accesses from this structure are only supported at their
> + * native width and alignment, otherwise the result is undefined and vendor
> + * drivers should return an error.
> + *
> + * device_state: (read/write)
> + *      - User application writes this field to inform vendor driver about the
> + *        device state to be transitioned to.
> + *      - Vendor driver should take necessary actions to change device state.
> + *        On successful transition to given state, vendor driver should return
> + *        success on write(device_state, state) system call. If device state
> + *        transition fails, vendor driver should return error, -EFAULT.
> + *      - On user application side, if device state transition fails, i.e. if
> + *        write(device_state, state) returns error, read device_state again to
> + *        determine the current state of the device from vendor driver.
> + *      - Vendor driver should return previous state of the device unless vendor
> + *        driver has encountered an internal error, in which case vendor driver
> + *        may report the device_state VFIO_DEVICE_STATE_ERROR.
> + *	- User application must use the device reset ioctl in order to recover
> + *	  the device from VFIO_DEVICE_STATE_ERROR state. If the device is
> + *	  indicated in a valid device state via reading device_state, the user
> + *	  application may decide attempt to transition the device to any valid
> + *	  state reachable from the current state or terminate itself.
> + *
> + *      device_state consists of 3 bits:
> + *      - If bit 0 set, indicates _RUNNING state. When it's clear, that
> + *	  indicates _STOP state. When device is changed to _STOP, driver should
> + *	  stop device before write() returns.
> + *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
> + *        should start gathering device state information which will be provided
> + *        to VFIO user application to save device's state.
> + *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
> + *        prepare to resume device, data provided through migration region
> + *        should be used to resume device.
> + *      Bits 3 - 31 are reserved for future use. In order to preserve them,
> + *	user application should perform read-modify-write operation on this
> + *	field when modifying the specified bits.
> + *
> + *  +------- _RESUMING
> + *  |+------ _SAVING
> + *  ||+----- _RUNNING
> + *  |||
> + *  000b => Device Stopped, not saving or resuming
> + *  001b => Device running state, default state
> + *  010b => Stop Device & save device state, stop-and-copy state
> + *  011b => Device running and save device state, pre-copy state
> + *  100b => Device stopped and device state is resuming
> + *  101b => Invalid state
> + *  110b => Error state
> + *  111b => Invalid state
> + *
> + * State transitions:
> + *
> + *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
> + *                (100b)     (001b)     (011b)        (010b)       (000b)
> + * 0. Running or Default state
> + *                             |
> + *
> + * 1. Normal Shutdown (optional)
> + *                             |------------------------------------->|
> + *
> + * 2. Save state or Suspend
> + *                             |------------------------->|---------->|
> + *
> + * 3. Save state during live migration
> + *                             |----------->|------------>|---------->|
> + *
> + * 4. Resuming
> + *                  |<---------|
> + *
> + * 5. Resumed
> + *                  |--------->|
> + *
> + * 0. Default state of VFIO device is _RUNNNG when user application starts.
> + * 1. During normal user application shutdown, vfio device state changes
> + *    from _RUNNING to _STOP. This is optional, user application may or may not
> + *    perform this state transition and vendor driver may not need.
> + * 2. When user application save state or suspend application, device state
> + *    transitions from _RUNNING to stop-and-copy state and then to _STOP.
> + *    On state transition from _RUNNING to stop-and-copy, driver must
> + *    stop device, save device state and send it to application through
> + *    migration region. Sequence to be followed for such transition is given
> + *    below.
> + * 3. In user application live migration, state transitions from _RUNNING
> + *    to pre-copy to stop-and-copy to _STOP.
> + *    On state transition from _RUNNING to pre-copy, driver should start
> + *    gathering device state while application is still running and send device
> + *    state data to application through migration region.
> + *    On state transition from pre-copy to stop-and-copy, driver must stop
> + *    device, save device state and send it to user application through
> + *    migration region.
> + *    Sequence to be followed for above two transitions is given below.
> + * 4. To start resuming phase, device state should be transitioned from
> + *    _RUNNING to _RESUMING state.
> + *    In _RESUMING state, driver should use received device state data through
> + *    migration region to resume device.
> + * 5. On providing saved device data to driver, application should change state
> + *    from _RESUMING to _RUNNING.
> + *
> + * pending_bytes: (read only)
> + *      Number of pending bytes yet to be migrated from vendor driver
> + *
> + * data_offset: (read only)
> + *      User application should read data_offset in migration region from where
> + *      user application should read device data during _SAVING state or write
> + *      device data during _RESUMING state. See below for detail of sequence to
> + *      be followed.
> + *
> + * data_size: (read/write)
> + *      User application should read data_size to get size of data copied in
> + *      bytes in migration region during _SAVING state and write size of data
> + *      copied in bytes in migration region during _RESUMING state.
> + *
> + * Migration region looks like:
> + *  ------------------------------------------------------------------
> + * |vfio_device_migration_info|    data section                      |
> + * |                          |     ///////////////////////////////  |
> + * ------------------------------------------------------------------
> + *   ^                              ^
> + *  offset 0-trapped part        data_offset
> + *
> + * Structure vfio_device_migration_info is always followed by data section in
> + * the region, so data_offset will always be non-0. Offset from where data is
> + * copied is decided by kernel driver, data section can be trapped or mapped
> + * or partitioned, depending on how kernel driver defines data section.
> + * Data section partition can be defined as mapped by sparse mmap capability.
> + * If mmapped, then data_offset should be page aligned, whereas initial section
> + * which contains vfio_device_migration_info structure might not end at offset
> + * which is page aligned. The user is not required to access via mmap regardless
> + * of the region mmap capabilities.
> + * Vendor driver should decide whether to partition data section and how to
> + * partition the data section. Vendor driver should return data_offset
> + * accordingly.
> + *
> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> + * and for _SAVING device state or stop-and-copy phase:
> + * a. read pending_bytes, indicates start of new iteration to get device data.
> + *    Repeated read on pending_bytes at this stage should have no side
> + *    effect.
If the data section is mmapped into user space, the vendor driver cannot
know when the user application has finished reading the data.
So if the user application reads pending_bytes repeatedly, the vendor
driver does not know what value to return, unless it assumes that a read
of data_size signals that the data has been read. That is somewhat strange,
because data_size is read before the data itself.

For example, the vendor driver does not know how to handle the sequence below:
1. read pending_bytes
2. read data_offset
3. read pending_bytes
4. read data_size

And what if user space reads in the sequence below but never actually
reads the data?
1. read pending_bytes
2. read data_offset
3. read data_size
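
To make the ambiguity concrete, here is a minimal sketch in plain C. It is
only one possible reading of steps a-f, not the patch's implementation:
struct mock_driver, CHUNK and the read_*() helpers are hypothetical names
that model the driver side in memory, assuming the driver treats a read of
pending_bytes as the only sign that the previous chunk was consumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 8u /* hypothetical data-section size for this sketch */

/* Hypothetical in-memory model of a vendor driver; not from the patch. */
struct mock_driver {
	const uint8_t *state;	/* device state to be exported */
	uint64_t total;		/* total bytes of device state */
	uint64_t consumed;	/* bytes the driver believes were read */
	uint64_t staged;	/* size of the chunk in buf, 0 if none */
	uint8_t buf[CHUNK];	/* stands in for the data section */
};

/* Steps a/f: reading pending_bytes retires any staged chunk. */
static uint64_t read_pending_bytes(struct mock_driver *d)
{
	d->consumed += d->staged;
	d->staged = 0;
	return d->total - d->consumed;
}

/* Step b: reading data_offset makes the next chunk available. */
static uint64_t read_data_offset(struct mock_driver *d)
{
	if (d->staged == 0) {
		uint64_t left = d->total - d->consumed;

		d->staged = left < CHUNK ? left : CHUNK;
		memcpy(d->buf, d->state + d->consumed, d->staged);
	}
	return 0; /* offset of the chunk inside buf in this model */
}

/* Step c: size of the currently staged chunk. */
static uint64_t read_data_size(const struct mock_driver *d)
{
	return d->staged;
}

/* User-side loop following steps a-f; returns bytes copied into out. */
static uint64_t save_all(struct mock_driver *d, uint8_t *out, uint64_t cap)
{
	uint64_t got = 0;

	while (read_pending_bytes(d) > 0) {
		uint64_t off = read_data_offset(d);
		uint64_t len = read_data_size(d);

		assert(off + len <= CHUNK);
		if (len > cap - got)
			break;	/* caller's buffer is full */
		memcpy(out + got, d->buf + off, len);	/* step d */
		got += len;
	}
	return got;
}
```

With this model the well-behaved loop in save_all() copies everything, but
the first sequence above (pending_bytes, data_offset, pending_bytes,
data_size) makes the driver retire a chunk that was never read, and the
following data_size read returns 0.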

Thanks
Yan
 
> + *    If pending_bytes == 0, user application should not iterate to get data
> + *    for that device.
> + *    If pending_bytes > 0, go through below steps.
> + * b. read data_offset, indicates vendor driver to make data available through
> + *    data section. Vendor driver should return this read operation only after
> + *    data is available from (region + data_offset) to (region + data_offset +
> + *    data_size).
> + * c. read data_size, amount of data in bytes available through migration
> + *    region.
> + *    Read on data_offset and data_size should return offset and size of current
> + *    buffer if user application reads those more than once here.
> + * d. read data of data_size bytes from (region + data_offset) from migration
> + *    region.
> + * e. process data.
> + * f. read pending_bytes, this read operation indicates data from previous
> + *    iteration has been read. If pending_bytes > 0, goto step b.
> + *
> + * If there is any error during the above sequence, vendor driver can return
> + * error code for next read()/write() operation, that will terminate the loop
> + * and user should take next necessary action, for example, fail migration or
> + * terminate user application.
> + *
> + * User application can transition from _SAVING|_RUNNING (pre-copy state) to
> + * _SAVING (stop-and-copy) state regardless of pending bytes.
> + * User application should iterate in _SAVING (stop-and-copy) until
> + * pending_bytes is 0.
> + *
> + * Sequence to be followed while _RESUMING device state:
> + * While data for this device is available, repeat below steps:
> + * a. read data_offset from where user application should write data.
> + * b. write data of data_size to migration region from data_offset. Data size
> + *    should be data packet size at source during _SAVING.
> + * c. write data_size which indicates vendor driver that data is written in
> + *    migration region. Vendor driver should read this data from migration
> + *    region and resume device's state.
> + *
> + * For user application, data is opaque. User application should write data in
> + * the same order as received and should be of same transaction size as at source.
> + */
> +
> +struct vfio_device_migration_info {
> +	__u32 device_state;         /* VFIO device state */
> +#define VFIO_DEVICE_STATE_STOP      (0)
> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> +				     VFIO_DEVICE_STATE_SAVING |  \
> +				     VFIO_DEVICE_STATE_RESUMING)
> +
> +#define VFIO_DEVICE_STATE_VALID(state) \
> +	(state & VFIO_DEVICE_STATE_RESUMING ? \
> +	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
> +
> +#define VFIO_DEVICE_STATE_ERROR			\
> +		(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING)
> +
> +	__u32 reserved;
> +	__u64 pending_bytes;
> +	__u64 data_offset;
> +	__u64 data_size;
> +} __attribute__((packed));
> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within
> -- 
> 2.7.0
> 
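
As a cross-check of the state table against the macros quoted above, the
following sketch copies the definitions from the patch and classifies all
eight bit combinations. state_is_valid() and state_name() are hypothetical
helpers added for illustration only; they are not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Definitions copied from the quoted patch. */
#define VFIO_DEVICE_STATE_STOP      (0)
#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
				     VFIO_DEVICE_STATE_SAVING |  \
				     VFIO_DEVICE_STATE_RESUMING)

#define VFIO_DEVICE_STATE_VALID(state) \
	(state & VFIO_DEVICE_STATE_RESUMING ? \
	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)

#define VFIO_DEVICE_STATE_ERROR			\
		(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING)

/* Hypothetical wrapper: evaluates the patch's validity macro once. */
static bool state_is_valid(uint32_t state)
{
	return VFIO_DEVICE_STATE_VALID(state);
}

/* Hypothetical helper: names follow the table in the quoted comment. */
static const char *state_name(uint32_t state)
{
	switch (state & VFIO_DEVICE_STATE_MASK) {
	case VFIO_DEVICE_STATE_STOP:
		return "_STOP";
	case VFIO_DEVICE_STATE_RUNNING:
		return "_RUNNING";
	case VFIO_DEVICE_STATE_SAVING:
		return "stop-and-copy";
	case VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING:
		return "pre-copy";
	case VFIO_DEVICE_STATE_RESUMING:
		return "_RESUMING";
	case VFIO_DEVICE_STATE_ERROR:
		return "error";
	default:
		return "invalid";
	}
}
```

Note that VFIO_DEVICE_STATE_VALID() rejects every combination that sets
_RESUMING together with another bit, so the error state (110b) is itself
reported as not valid, along with 101b and 111b.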


^ permalink raw reply	[flat|nested] 62+ messages in thread

* RE: [PATCH v12 Kernel 0/7] KABIs to support migration for VFIO devices
  2020-02-07 19:42 ` Kirti Wankhede
@ 2020-03-09  7:46   ` Zengtao (B)
  -1 siblings, 0 replies; 62+ messages in thread
From: Zengtao (B) @ 2020-03-09  7:46 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

Hi Kirti:

What kind of platform/IO are you using now to do the basic code
verification?

I just want to check whether I can verify it on my platform, and whether
any open IO cards are available.

Thanks.

Regards
Zengtao 

> -----Original Message-----
> From: kvm-owner@vger.kernel.org [mailto:kvm-owner@vger.kernel.org]
> On Behalf Of Kirti Wankhede
> Sent: Saturday, February 08, 2020 3:42 AM
> To: alex.williamson@redhat.com; cjia@nvidia.com
> Cc: kevin.tian@intel.com; ziye.yang@intel.com;
> changpeng.liu@intel.com; yi.l.liu@intel.com; mlevitsk@redhat.com;
> eskultet@redhat.com; cohuck@redhat.com; dgilbert@redhat.com;
> jonathan.davies@nutanix.com; eauger@redhat.com; aik@ozlabs.ru;
> pasic@linux.ibm.com; felipe@nutanix.com;
> Zhengxiao.zx@Alibaba-inc.com; shuangtai.tst@alibaba-inc.com;
> Ken.Xue@amd.com; zhi.a.wang@intel.com; yan.y.zhao@intel.com;
> qemu-devel@nongnu.org; kvm@vger.kernel.org; Kirti Wankhede
> Subject: [PATCH v12 Kernel 0/7] KABIs to support migration for VFIO
> devices
> 
> Hi,
> 
> This patch set adds:
> * New IOCTL VFIO_IOMMU_DIRTY_PAGES to get dirty pages bitmap with
>   respect to IOMMU container rather than per device. All pages pinned by
>   vendor driver through vfio_pin_pages external API has to be marked as
>   dirty during migration. When IOMMU capable device is present in the
>   container and all pages are pinned and mapped, then all pages are marked
>   dirty.
>   When there are CPU writes, CPU dirty page tracking can identify dirtied
>   pages, but any page pinned by vendor driver can also be written by
>   device. As of now there is no device which has hardware support for
>   dirty page tracking. So all pages which are pinned should be considered
>   as dirty.
>   This ioctl is also used to start/stop dirty pages tracking for pinned and
>   unpinned pages while migration is active.
> 
> * Updated IOCTL VFIO_IOMMU_UNMAP_DMA to get dirty pages bitmap before
>   unmapping IO virtual address range.
>   With vIOMMU, during pre-copy phase of migration, while CPUs are still
>   running, IO virtual address unmap can happen while device still keeping
>   reference of guest pfns. Those pages should be reported as dirty before
>   unmap, so that VFIO user space application can copy content of those
>   pages from source to destination.
> 
> * Patch 7 is proposed change to detect if IOMMU capable device driver is
>   smart to report pages to be marked dirty by pinning pages using
>   vfio_pin_pages() API.
> 
> 
> Yet TODO:
> Since there is no device which has hardware support for system memory
> dirty bitmap tracking, right now there is no other API from vendor driver
> to VFIO IOMMU module to report dirty pages. In future, when such hardware
> support will be implemented, an API will be required such that vendor
> driver could report dirty pages to VFIO module during migration phases.
> 
> Adding revision history from previous QEMU patch set to understand KABI
> changes done till now
> 
> v11 -> v12
> - Changed bitmap allocation in vfio_iommu_type1.
> - Remove atomicity of ref_count.
> - Updated comments for migration device state structure about error
>   reporting.
> - Nit picks from v11 reviews
> 
> v10 -> v11
> - Fix pin pages API to free vpfn if it is marked as unpinned tracking page.
> - Added proposal to detect if IOMMU capable device calls external pin pages
>   API to mark pages dirty.
> - Nit picks from v10 reviews
> 
> v9 -> v10:
> - Updated existing VFIO_IOMMU_UNMAP_DMA ioctl to get dirty pages bitmap
>   during unmap while migration is active
> - Added flag in VFIO_IOMMU_GET_INFO to indicate driver support dirty page
>   tracking.
> - If iommu_mapped, mark all pages dirty.
> - Added unpinned pages tracking while migration is active.
> - Updated comments for migration device state structure with bit
>   combination table and state transition details.
> 
> v8 -> v9:
> - Split patch set in 2 sets, Kernel and QEMU.
> - Dirty pages bitmap is queried from IOMMU container rather than from
>   vendor driver for per device. Added 2 ioctls to achieve this.
> 
> v7 -> v8:
> - Updated comments for KABI
> - Added BAR address validation check during PCI device's config space load
>   as suggested by Dr. David Alan Gilbert.
> - Changed vfio_migration_set_state() to set or clear device state flags.
> - Some nit fixes.
> 
> v6 -> v7:
> - Fix build failures.
> 
> v5 -> v6:
> - Fix build failure.
> 
> v4 -> v5:
> - Added descriptive comment about the sequence of access of members of
>   structure vfio_device_migration_info to be followed based on Alex's
>   suggestion
> - Updated get dirty pages sequence.
> - As per Cornelia Huck's suggestion, added callbacks to VFIODeviceOps to
>   get_object, save_config and load_config.
> - Fixed multiple nit picks.
> - Tested live migration with multiple vfio device assigned to a VM.
> 
> v3 -> v4:
> - Added one more bit for _RESUMING flag to be set explicitly.
> - data_offset field is read-only for user space application.
> - data_size is read for every iteration before reading data from migration,
>   that is removed assumption that data will be till end of migration
>   region.
> - If vendor driver supports mappable sparsed region, map those region
>   during setup state of save/load, similarly unmap those from cleanup
>   routines.
> - Handles race condition that causes data corruption in migration region
>   during save device state by adding mutex and serializing save_buffer and
>   get_dirty_pages routines.
> - Skip called get_dirty_pages routine for mapped MMIO region of device.
> - Added trace events.
> - Split into multiple functional patches.
> 
> v2 -> v3:
> - Removed enum of VFIO device states. Defined VFIO device state with 2
>   bits.
> - Re-structured vfio_device_migration_info to keep it minimal and defined
>   action on read and write access on its members.
> 
> v1 -> v2:
> - Defined MIGRATION region type and sub-type which should be used with
>   region type capability.
> - Re-structured vfio_device_migration_info. This structure will be placed
>   at 0th offset of migration region.
> - Replaced ioctl with read/write for trapped part of migration region.
> - Added both type of access support, trapped or mmapped, for data section
>   of the region.
> - Moved PCI device functions to pci file.
> - Added iteration to get dirty page bitmap until bitmap for all requested
>   pages are copied.
> 
> Thanks,
> Kirti
> 
> 
> Kirti Wankhede (7):
>   vfio: KABI for migration interface for device state
>   vfio iommu: Remove atomicity of ref_count of pinned pages
>   vfio iommu: Add ioctl definition for dirty pages tracking.
>   vfio iommu: Implementation of ioctl to for dirty pages tracking.
>   vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap
>   vfio iommu: Adds flag to indicate dirty pages tracking capability
>     support
>   vfio: Selective dirty page tracking if IOMMU backed device pins pages
> 
>  drivers/vfio/vfio.c             |  13 +-
>  drivers/vfio/vfio_iommu_type1.c | 435 +++++++++++++++++++++++++++++++++++++---
>  include/linux/vfio.h            |   4 +-
>  include/uapi/linux/vfio.h       | 267 +++++++++++++++++++++++-
>  4 files changed, 692 insertions(+), 27 deletions(-)
> 
> --
> 2.7.0



^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2020-03-09  7:47 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
2020-02-07 19:42 [PATCH v12 Kernel 0/7] KABIs to support migration for VFIO devices Kirti Wankhede
2020-02-07 19:42 ` Kirti Wankhede
2020-02-07 19:42 ` [PATCH v12 Kernel 1/7] vfio: KABI for migration interface for device state Kirti Wankhede
2020-02-07 19:42   ` Kirti Wankhede
2020-02-10 17:25   ` Alex Williamson
2020-02-10 17:25     ` Alex Williamson
2020-02-12 20:56     ` Kirti Wankhede
2020-02-12 20:56       ` Kirti Wankhede
2020-02-14 10:21   ` Cornelia Huck
2020-02-14 10:21     ` Cornelia Huck
2020-02-27  8:58   ` Yan Zhao
2020-02-27  8:58     ` Yan Zhao
2020-02-07 19:42 ` [PATCH v12 Kernel 2/7] vfio iommu: Remove atomicity of ref_count of pinned pages Kirti Wankhede
2020-02-07 19:42   ` Kirti Wankhede
2020-02-07 19:42 ` [PATCH v12 Kernel 3/7] vfio iommu: Add ioctl definition for dirty pages tracking Kirti Wankhede
2020-02-07 19:42   ` Kirti Wankhede
2020-02-07 19:42 ` [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to " Kirti Wankhede
2020-02-07 19:42   ` Kirti Wankhede
2020-02-10  9:49   ` Yan Zhao
2020-02-10  9:49     ` Yan Zhao
2020-02-10 19:44     ` Alex Williamson
2020-02-10 19:44       ` Alex Williamson
2020-02-11  2:52       ` Yan Zhao
2020-02-11  2:52         ` Yan Zhao
2020-02-11  3:45         ` Alex Williamson
2020-02-11  3:45           ` Alex Williamson
2020-02-11  4:11           ` Yan Zhao
2020-02-11  4:11             ` Yan Zhao
2020-02-10 17:25   ` Alex Williamson
2020-02-10 17:25     ` Alex Williamson
2020-02-12 20:56     ` Kirti Wankhede
2020-02-12 20:56       ` Kirti Wankhede
2020-02-12 23:13       ` Alex Williamson
2020-02-12 23:13         ` Alex Williamson
2020-02-13 20:11         ` Kirti Wankhede
2020-02-13 20:11           ` Kirti Wankhede
2020-02-13 23:20           ` Alex Williamson
2020-02-13 23:20             ` Alex Williamson
2020-02-17 19:13             ` Kirti Wankhede
2020-02-17 19:13               ` Kirti Wankhede
2020-02-17 20:55               ` Alex Williamson
2020-02-17 20:55                 ` Alex Williamson
2020-02-18  5:58                 ` Kirti Wankhede
2020-02-18  5:58                   ` Kirti Wankhede
2020-02-18 21:41                   ` Alex Williamson
2020-02-18 21:41                     ` Alex Williamson
2020-02-19  4:21                     ` Kirti Wankhede
2020-02-19  4:21                       ` Kirti Wankhede
2020-02-19  4:53                       ` Alex Williamson
2020-02-19  4:53                         ` Alex Williamson
2020-02-07 19:42 ` [PATCH v12 Kernel 5/7] vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap Kirti Wankhede
2020-02-07 19:42   ` Kirti Wankhede
2020-02-10 17:48   ` Alex Williamson
2020-02-10 17:48     ` Alex Williamson
2020-02-07 19:42 ` [PATCH v12 Kernel 6/7] vfio iommu: Adds flag to indicate dirty pages tracking capability support Kirti Wankhede
2020-02-07 19:42   ` Kirti Wankhede
2020-02-07 19:42 ` [PATCH v12 Kernel 7/7] vfio: Selective dirty page tracking if IOMMU backed device pins pages Kirti Wankhede
2020-02-07 19:42   ` Kirti Wankhede
2020-02-10 18:14   ` Alex Williamson
2020-02-10 18:14     ` Alex Williamson
2020-03-09  7:46 ` [PATCH v12 Kernel 0/7] KABIs to support migration for VFIO devices Zengtao (B)
2020-03-09  7:46   ` Zengtao (B)
