* [PATCH v11 Kernel 0/6] KABIs to support migration for VFIO devices
@ 2019-12-17 17:10 Kirti Wankhede
  2019-12-17 17:10 ` [PATCH v11 Kernel 1/6] vfio: KABI for migration interface for device state Kirti Wankhede
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Kirti Wankhede @ 2019-12-17 17:10 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

Hi,

This patch set adds:
* New IOCTL VFIO_IOMMU_DIRTY_PAGES to get the dirty pages bitmap with
  respect to the IOMMU container rather than per device. All pages pinned
  by the vendor driver through the vfio_pin_pages external API have to be
  marked as dirty during migration. When an IOMMU capable device is present
  in the container and all pages are pinned and mapped, then all pages are
  marked dirty.
  When there are CPU writes, CPU dirty page tracking can identify dirtied
  pages, but any page pinned by the vendor driver can also be written by
  the device. As of now there is no device with hardware support for dirty
  page tracking, so all pinned pages should be considered dirty.
  This ioctl is also used to start dirty pages tracking for unpinned pages
  while migration is active and the device is running. The tracked unpinned
  pages information is cleared when the dirty bitmap is read by the VFIO
  application, or when migration fails or is cancelled and unpinned pages
  tracking is stopped. (A minimal usage sketch follows this list.)
* Updated IOCTL VFIO_IOMMU_UNMAP_DMA to get the dirty pages bitmap before
  unmapping an IO virtual address range.
  With vIOMMU, during the pre-copy phase of migration, while CPUs are still
  running, an IO virtual address unmap can happen while the device still
  holds references to guest pfns. Those pages should be reported as dirty
  before the unmap, so that the VFIO user space application can copy the
  content of those pages from source to destination.

* Patch 6 is a proposed change to detect whether the driver of an IOMMU
  capable device is smart enough to report pages to be marked dirty by
  pinning them through the vfio_pin_pages() API.
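
A minimal user space sketch of the proposed VFIO_IOMMU_DIRTY_PAGES usage
(a sketch only: container_fd and the range arguments are placeholders, the
helper names are illustrative rather than part of the proposed KABI, and
error handling is omitted):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Start tracking of unpinned pages when migration becomes active. */
    int dirty_tracking_start(int container_fd)
    {
            struct vfio_iommu_type1_dirty_bitmap db = {
                    .argsz = sizeof(db),
                    .flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START,
            };

            return ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, &db);
    }

    /* Stop tracking, e.g. when migration fails or is cancelled. */
    int dirty_tracking_stop(int container_fd)
    {
            struct vfio_iommu_type1_dirty_bitmap db = {
                    .argsz = sizeof(db),
                    .flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP,
            };

            return ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, &db);
    }

    /*
     * Query the dirty bitmap for an IOVA range. The caller allocates and
     * zeroes 'bitmap': one bit per page of 'pgsize', 'bitmap_size' bytes.
     */
    int dirty_bitmap_get(int container_fd, uint64_t iova, uint64_t size,
                         uint64_t pgsize, void *bitmap, uint64_t bitmap_size)
    {
            struct vfio_iommu_type1_dirty_bitmap db = {
                    .argsz = sizeof(db),
                    .flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP,
                    .iova = iova,
                    .size = size,
                    .pgsize = pgsize,
                    .bitmap_size = bitmap_size,
                    .bitmap = bitmap,
            };

            return ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, &db);
    }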


Yet TODO:
Since there is no device with hardware support for system memory dirty
bitmap tracking, there is currently no other API from the vendor driver to
the VFIO IOMMU module for reporting dirty pages. In the future, when such
hardware support is implemented, an API will be required so that the vendor
driver can report dirty pages to the VFIO module during the migration phases.

Revision history from the previous QEMU patch set is included below to show
the KABI changes made so far.

v10 -> v11
- Fix pin pages API to free vpfn if it is marked as unpinned tracking page.
- Added proposal to detect if IOMMU capable device calls external pin pages
  API to mark pages dirty.
- Nit picks from v10 reviews

v9 -> v10:
- Updated existing VFIO_IOMMU_UNMAP_DMA ioctl to get dirty pages bitmap
  during unmap while migration is active
- Added a flag in VFIO_IOMMU_GET_INFO to indicate that the driver supports
  dirty page tracking.
- If iommu_mapped, mark all pages dirty.
- Added unpinned pages tracking while migration is active.
- Updated comments for migration device state structure with bit
  combination table and state transition details.

v8 -> v9:
- Split patch set in 2 sets, Kernel and QEMU.
- Dirty pages bitmap is queried from the IOMMU container rather than per
  device from the vendor driver. Added 2 ioctls to achieve this.

v7 -> v8:
- Updated comments for KABI
- Added BAR address validation check during PCI device's config space load
  as suggested by Dr. David Alan Gilbert.
- Changed vfio_migration_set_state() to set or clear device state flags.
- Some nit fixes.

v6 -> v7:
- Fix build failures.

v5 -> v6:
- Fix build failure.

v4 -> v5:
- Added a descriptive comment about the sequence in which members of
  structure vfio_device_migration_info should be accessed, based on Alex's
  suggestion.
- Updated get dirty pages sequence.
- As per Cornelia Huck's suggestion, added callbacks to VFIODeviceOps to
  get_object, save_config and load_config.
- Fixed multiple nit picks.
- Tested live migration with multiple vfio devices assigned to a VM.

v3 -> v4:
- Added one more bit for _RESUMING flag to be set explicitly.
- data_offset field is read-only for user space application.
- data_size is read on every iteration before reading data from the
  migration region; this removes the assumption that data extends to the
  end of the migration region.
- If the vendor driver supports mappable sparse regions, map those regions
  during the setup state of save/load and, similarly, unmap them from the
  cleanup routines.
- Handled a race condition that caused data corruption in the migration
  region during device state save by adding a mutex and serializing the
  save_buffer and get_dirty_pages routines.
- Skip the get_dirty_pages routine for mapped MMIO regions of the device.
- Added trace events.
- Split into multiple functional patches.

v2 -> v3:
- Removed enum of VFIO device states. Defined VFIO device state with 2
  bits.
- Re-structured vfio_device_migration_info to keep it minimal and defined
  action on read and write access on its members.

v1 -> v2:
- Defined MIGRATION region type and sub-type which should be used with
  region type capability.
- Re-structured vfio_device_migration_info. This structure will be placed
  at 0th offset of migration region.
- Replaced ioctl with read/write for trapped part of migration region.
- Added both type of access support, trapped or mmapped, for data section
  of the region.
- Moved PCI device functions to pci file.
- Added iteration to get the dirty page bitmap until the bitmap for all
  requested pages is copied.

Thanks,
Kirti

Kirti Wankhede (6):
  vfio: KABI for migration interface for device state
  vfio iommu: Add ioctl definition for dirty pages tracking.
  vfio iommu: Implementation of ioctl to for dirty pages tracking.
  vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap
  vfio iommu: Adds flag to indicate dirty pages tracking capability
    support
  vfio: Selective dirty page tracking if IOMMU backed device pins pages

 drivers/vfio/vfio.c             |  33 ++++
 drivers/vfio/vfio_iommu_type1.c | 324 +++++++++++++++++++++++++++++++++++++---
 include/linux/vfio.h            |   3 +-
 include/uapi/linux/vfio.h       | 247 +++++++++++++++++++++++++++++-
 4 files changed, 587 insertions(+), 20 deletions(-)

-- 
2.7.0



* [PATCH v11 Kernel 1/6] vfio: KABI for migration interface for device state
  2019-12-17 17:10 [PATCH v11 Kernel 0/6] KABIs to support migration for VFIO devices Kirti Wankhede
@ 2019-12-17 17:10 ` Kirti Wankhede
  2019-12-17 17:10 ` [PATCH v11 Kernel 2/6] vfio iommu: Add ioctl definition for dirty pages tracking Kirti Wankhede
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Kirti Wankhede @ 2019-12-17 17:10 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

- Defined MIGRATION region type and sub-type.

- Defined vfio_device_migration_info structure which will be placed at 0th
  offset of migration region to get/set VFIO device related information.
  Defined members of structure and usage on read/write access.

- Defined device states and added state transition details in the comment.

- Added sequence to be followed while saving and resuming VFIO device state
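
For illustration, a rough user space sketch of the _SAVING read sequence
documented in the header comment (region_offset, the file offset of the
migration region within the device fd, is a placeholder; buf is assumed
large enough and error handling is omitted):

    #include <stddef.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <linux/vfio.h>

    /* Read one iteration of device data while the device is in _SAVING. */
    ssize_t save_one_iteration(int device_fd, off_t region_offset, void *buf)
    {
            uint64_t pending, data_offset, data_size;

            /* a. pending_bytes: a new iteration of device data, 0 if none left */
            pread(device_fd, &pending, sizeof(pending), region_offset +
                  offsetof(struct vfio_device_migration_info, pending_bytes));
            if (!pending)
                    return 0;

            /* b. data_offset: where the driver exposes this iteration's data */
            pread(device_fd, &data_offset, sizeof(data_offset), region_offset +
                  offsetof(struct vfio_device_migration_info, data_offset));

            /* c. data_size: number of bytes available in the data section */
            pread(device_fd, &data_size, sizeof(data_size), region_offset +
                  offsetof(struct vfio_device_migration_info, data_size));

            /* d. read the data; the data section may also be mmap'ed if allowed */
            return pread(device_fd, buf, data_size, region_offset + data_offset);
    }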

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 include/uapi/linux/vfio.h | 187 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 187 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9e843a147ead..b7ac8f7c0d3c 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
 #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
 #define VFIO_REGION_TYPE_GFX                    (1)
 #define VFIO_REGION_TYPE_CCW			(2)
+#define VFIO_REGION_TYPE_MIGRATION              (3)
 
 /* sub-types for VFIO_REGION_TYPE_PCI_* */
 
@@ -379,6 +380,192 @@ struct vfio_region_gfx_edid {
 /* sub-types for VFIO_REGION_TYPE_CCW */
 #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
 
+/* sub-types for VFIO_REGION_TYPE_MIGRATION */
+#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
+
+/*
+ * Structure vfio_device_migration_info is placed at 0th offset of
+ * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
+ * information. Field accesses from this structure are only supported at their
+ * native width and alignment, otherwise the result is undefined and vendor
+ * drivers should return an error.
+ *
+ * device_state: (read/write)
+ *      Indicates to the vendor driver the state the VFIO device should be
+ *      transitioned to. If the device state transition fails, the write to
+ *      this field returns an error. It consists of 3 bits:
+ *      - If bit 0 set, indicates _RUNNING state. When it's clear, that
+ *	  indicates _STOP state. When device is changed to _STOP, driver should
+ *	  stop device before write() returns.
+ *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
+ *        should start gathering device state information which will be provided
+ *        to VFIO user space application to save device's state.
+ *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
+ *        prepare to resume device, data provided through migration region
+ *        should be used to resume device.
+ *      Bits 3 - 31 are reserved for future use. User should perform
+ *      read-modify-write operation on this field.
+ *
+ *  +------- _RESUMING
+ *  |+------ _SAVING
+ *  ||+----- _RUNNING
+ *  |||
+ *  000b => Device Stopped, not saving or resuming
+ *  001b => Device running state, default state
+ *  010b => Stop Device & save device state, stop-and-copy state
+ *  011b => Device running and save device state, pre-copy state
+ *  100b => Device stopped and device state is resuming
+ *  101b => Invalid state
+ *  110b => Invalid state
+ *  111b => Invalid state
+ *
+ * State transitions:
+ *
+ *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
+ *                (100b)     (001b)     (011b)        (010b)       (000b)
+ * 0. Running or Default state
+ *                             |
+ *
+ * 1. Normal Shutdown (optional)
+ *                             |------------------------------------->|
+ *
+ * 2. Save state or Suspend
+ *                             |------------------------->|---------->|
+ *
+ * 3. Save state during live migration
+ *                             |----------->|------------>|---------->|
+ *
+ * 4. Resuming
+ *                  |<---------|
+ *
+ * 5. Resumed
+ *                  |--------->|
+ *
+ * 0. Default state of VFIO device is _RUNNING when VFIO application starts.
+ * 1. During normal VFIO application shutdown, vfio device state changes
+ *    from _RUNNING to _STOP. This is optional, user space application may or
+ *    may not perform this state transition and vendor driver may not need.
+ * 2. When VFIO application save state or suspend application, VFIO device
+ *    state transition is from _RUNNING to stop-and-copy state and then to
+ *    _STOP.
+ *    On state transition from _RUNNING to stop-and-copy, driver must
+ *    stop device, save device state and send it to application through
+ *    migration region.
+ *    On _RUNNING to stop-and-copy state transition failure, application should
+ *    set VFIO device state to _RUNNING.
+ * 3. In VFIO application live migration, state transition is from _RUNNING
+ *    to pre-copy to stop-and-copy to _STOP.
+ *    On state transition from _RUNNING to pre-copy, driver should start
+ *    gathering device state while application is still running and send device
+ *    state data to application through migration region.
+ *    On state transition from pre-copy to stop-and-copy, driver must stop
+ *    device, save device state and send it to application through migration
+ *    region.
+ *    On any failure during any of these state transition, VFIO device state
+ *    should be set to _RUNNING.
+ * 4. To start resuming phase, VFIO device state should be transitioned from
+ *    _RUNNING to _RESUMING state.
+ *    In _RESUMING state, driver should use received device state data through
+ *    migration region to resume device.
+ *    On failure during this state transition, application should set _RUNNING
+ *    state.
+ * 5. On providing saved device data to driver, application should change state
+ *    from _RESUMING to _RUNNING.
+ *    On failure to transition to _RUNNING state, VFIO application should reset
+ *    the device and set _RUNNING state so that device doesn't remain in unknown
+ *    or bad state. On reset, driver must reset device and device should be
+ *    available in default initial state, _RUNNING.
+ *
+ * pending_bytes: (read only)
+ *      Number of pending bytes yet to be migrated from vendor driver
+ *
+ * data_offset: (read only)
+ *      User application should read data_offset in migration region from where
+ *      user application should read device data during _SAVING state or write
+ *      device data during _RESUMING state. See below for detail of sequence to
+ *      be followed.
+ *
+ * data_size: (read/write)
+ *      User application should read data_size to get size of data copied in
+ *      bytes in migration region during _SAVING state and write size of data
+ *      copied in bytes in migration region during _RESUMING state.
+ *
+ * Migration region looks like:
+ *  ------------------------------------------------------------------
+ * |vfio_device_migration_info|    data section                      |
+ * |                          |     ///////////////////////////////  |
+ * ------------------------------------------------------------------
+ *   ^                              ^
+ *  offset 0-trapped part        data_offset
+ *
+ * Structure vfio_device_migration_info is always followed by data section in
+ * the region, so data_offset will always be non-0. Offset from where data is
+ * copied is decided by kernel driver, data section can be trapped or mapped
+ * or partitioned, depending on how kernel driver defines data section.
+ * Data section partition can be defined as mapped by sparse mmap capability.
+ * If mmapped, then data_offset should be page aligned, whereas the initial
+ * section, which contains the vfio_device_migration_info structure, might not
+ * end at a page aligned offset. The user is not required to access via mmap regardless
+ * of the region mmap capabilities.
+ * Vendor driver should decide whether to partition data section and how to
+ * partition the data section. Vendor driver should return data_offset
+ * accordingly.
+ *
+ * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
+ * and for _SAVING device state or stop-and-copy phase:
+ * a. read pending_bytes, indicates start of new iteration to get device data.
+ *    If pending_bytes > 0, go through below steps.
+ * b. read data_offset, indicates kernel driver to make data available through
+ *    data section. Kernel driver should return this read operation only after
+ *    data is available from (region + data_offset) to (region + data_offset +
+ *    data_size).
+ * c. read data_size, amount of data in bytes available through migration
+ *    region.
+ * d. read data of data_size bytes from (region + data_offset) from migration
+ *    region.
+ * e. process data.
+ * f. read pending_bytes; this read operation indicates that data from the
+ *    previous iteration has been read. If pending_bytes > 0, goto step b.
+ *
+ * User can transition from _SAVING|_RUNNING (pre-copy state) to _SAVING
+ * (stop-and-copy) state regardless of pending bytes.
+ * User should iterate in _SAVING (stop-and-copy) until pending_bytes is 0.
+ *
+ * Sequence to be followed while _RESUMING device state:
+ * While data for this device is available, repeat below steps:
+ * a. read data_offset from where user application should write data.
+ * b. write data of data_size to migration region from data_offset. Data size
+ *    could be data packet size at source during _SAVING or migration region
+ *    data section size, whichever is less.
+ * c. write data_size which indicates vendor driver that data is written in
+ *    staging buffer. Vendor driver should read this data from migration
+ *    region and resume device's state.
+ *
+ * For user application, data is opaque. User should write data in the same
+ * order as received.
+ */
+
+struct vfio_device_migration_info {
+	__u32 device_state;         /* VFIO device state */
+#define VFIO_DEVICE_STATE_STOP      (0)
+#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
+#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
+#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
+#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
+				     VFIO_DEVICE_STATE_SAVING |  \
+				     VFIO_DEVICE_STATE_RESUMING)
+
+#define VFIO_DEVICE_STATE_INVALID_CASE1    (VFIO_DEVICE_STATE_SAVING | \
+					    VFIO_DEVICE_STATE_RESUMING)
+
+#define VFIO_DEVICE_STATE_INVALID_CASE2    (VFIO_DEVICE_STATE_RUNNING | \
+					    VFIO_DEVICE_STATE_RESUMING)
+	__u32 reserved;
+	__u64 pending_bytes;
+	__u64 data_offset;
+	__u64 data_size;
+} __attribute__((packed));
+
 /*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within
-- 
2.7.0



* [PATCH v11 Kernel 2/6] vfio iommu: Add ioctl definition for dirty pages tracking.
  2019-12-17 17:10 [PATCH v11 Kernel 0/6] KABIs to support migration for VFIO devices Kirti Wankhede
  2019-12-17 17:10 ` [PATCH v11 Kernel 1/6] vfio: KABI for migration interface for device state Kirti Wankhede
@ 2019-12-17 17:10 ` Kirti Wankhede
  2019-12-17 17:10 ` [PATCH v11 Kernel 3/6] vfio iommu: Implementation of ioctl to " Kirti Wankhede
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Kirti Wankhede @ 2019-12-17 17:10 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

The IOMMU container maintains a list of all pages pinned by the
vfio_pin_pages API. All pages pinned by the vendor driver through this API
should be considered as dirty during migration. When the container consists
of an IOMMU capable device and all pages are pinned and mapped, then all
pages are marked dirty.
Added support to start/stop unpinned pages tracking and to get the bitmap
of all dirtied pages for a requested IO virtual address range. Unpinned
page tracking is cleared either when the bitmap is read by the user
application or when unpinned page tracking is stopped.
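
As an illustration of the bitmap layout (one bit per page of the user
provided page size, starting at the requested iova), a small helper a user
application might use; this assumes a 64-bit little-endian host, where the
kernel's unsigned long bitmap maps onto 64-bit words:

    #include <stdbool.h>
    #include <stdint.h>

    /* Returns true if the page containing 'addr' is marked dirty. */
    bool page_is_dirty(const uint64_t *bitmap, uint64_t iova,
                       uint64_t pgsize, uint64_t addr)
    {
            uint64_t bit = (addr - iova) / pgsize;

            return bitmap[bit / 64] & (1ULL << (bit % 64));
    }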

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 include/uapi/linux/vfio.h | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index b7ac8f7c0d3c..8268634e7e08 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -981,6 +981,49 @@ struct vfio_iommu_type1_dma_unmap {
 #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
 #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
 
+/**
+ * VFIO_IOMMU_DIRTY_PAGES - _IOWR(VFIO_TYPE, VFIO_BASE + 17,
+ *                                     struct vfio_iommu_type1_dirty_bitmap)
+ * IOCTL is used for dirty pages tracking. Caller sets argsz, which is size of
+ * struct vfio_iommu_type1_dirty_bitmap. The caller sets a flag depending on
+ * which operation to perform, details as below:
+ *
+ * When IOCTL is called with VFIO_IOMMU_DIRTY_PAGES_FLAG_START set, indicates
+ * migration is active and IOMMU module should track pages which are being
+ * unpinned. Unpinned pages are tracked until bitmap for that range is queried
+ * or tracking is stopped by user application by setting
+ * VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP flag.
+ *
+ * When IOCTL is called with VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP set, indicates
+ * IOMMU should stop tracking unpinned pages and also free previously tracked
+ * unpinned pages data.
+ *
+ * When IOCTL is called with VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP flag set,
+ * IOCTL returns dirty pages bitmap for IOMMU container during migration for
+ * given IOVA range. User must allocate memory to get bitmap, zero the bitmap
+ * memory and set size of allocated memory in bitmap_size field. One bit is
+ * used to represent one page consecutively starting from iova offset. User
+ * should provide page size in 'pgsize'. Bit set in bitmap indicates page at
+ * that offset from iova is dirty.
+ *
+ * Only one flag should be set at a time.
+ *
+ */
+struct vfio_iommu_type1_dirty_bitmap {
+	__u32        argsz;
+	__u32        flags;
+#define VFIO_IOMMU_DIRTY_PAGES_FLAG_START	(1 << 0)
+#define VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP	(1 << 1)
+#define VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP	(1 << 2)
+	__u64        iova;		/* IO virtual address */
+	__u64        size;		/* Size of iova range */
+	__u64	     pgsize;		/* page size for bitmap */
+	__u64        bitmap_size;	/* in bytes */
+	void __user *bitmap;		/* one bit per page */
+};
+
+#define VFIO_IOMMU_DIRTY_PAGES             _IO(VFIO_TYPE, VFIO_BASE + 17)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.7.0



* [PATCH v11 Kernel 3/6] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2019-12-17 17:10 [PATCH v11 Kernel 0/6] KABIs to support migration for VFIO devices Kirti Wankhede
  2019-12-17 17:10 ` [PATCH v11 Kernel 1/6] vfio: KABI for migration interface for device state Kirti Wankhede
  2019-12-17 17:10 ` [PATCH v11 Kernel 2/6] vfio iommu: Add ioctl definition for dirty pages tracking Kirti Wankhede
@ 2019-12-17 17:10 ` Kirti Wankhede
  2019-12-17 22:12   ` Alex Williamson
  2019-12-17 17:10 ` [PATCH v11 Kernel 4/6] vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap Kirti Wankhede
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Kirti Wankhede @ 2019-12-17 17:10 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
- Start dirty pages tracking of unpinned pages while migration is active
  and the device is running, i.e. during the pre-copy phase.
- Stop dirty pages tracking of unpinned pages. This is required to stop
  tracking if migration failed or was cancelled during the pre-copy phase.
  The unpinned pages tracking data is cleared.
- Get the dirty pages bitmap. This stops unpinned dirty pages tracking and
  clears the unpinned pages information on bitmap read. This ioctl returns
  the bitmap of dirty pages; it is the user space application's
  responsibility to copy the content of dirty pages from source to
  destination during migration.
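
A note on sizing: as implemented below in verify_bitmap_size(), the kernel
expects bitmap_size to cover one bit per page of the requested range,
rounded up to whole unsigned longs. A user space helper computing that
minimum (assuming a 64-bit host) might look like:

    #include <stdint.h>

    /* Minimum bitmap_size in bytes for 'size' bytes of IOVA at 'pgsize'. */
    uint64_t dirty_bitmap_bytes(uint64_t size, uint64_t pgsize)
    {
            uint64_t npages = size / pgsize;

            /* one bit per page, rounded up to 64-bit words */
            return ((npages + 63) / 64) * 8;
    }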

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 drivers/vfio/vfio_iommu_type1.c | 218 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 209 insertions(+), 9 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..215aecb25453 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -70,6 +70,7 @@ struct vfio_iommu {
 	unsigned int		dma_avail;
 	bool			v2;
 	bool			nesting;
+	bool			dirty_page_tracking;
 };
 
 struct vfio_domain {
@@ -112,6 +113,7 @@ struct vfio_pfn {
 	dma_addr_t		iova;		/* Device address */
 	unsigned long		pfn;		/* Host pfn */
 	atomic_t		ref_count;
+	bool			unpinned;
 };
 
 struct vfio_regions {
@@ -244,6 +246,32 @@ static void vfio_remove_from_pfn_list(struct vfio_dma *dma,
 	kfree(vpfn);
 }
 
+static void vfio_remove_unpinned_from_pfn_list(struct vfio_dma *dma, bool warn)
+{
+	struct rb_node *n = rb_first(&dma->pfn_list);
+
+	for (; n; n = rb_next(n)) {
+		struct vfio_pfn *vpfn = rb_entry(n, struct vfio_pfn, node);
+
+		if (warn)
+			WARN_ON_ONCE(vpfn->unpinned);
+
+		if (vpfn->unpinned)
+			vfio_remove_from_pfn_list(dma, vpfn);
+	}
+}
+
+static void vfio_remove_unpinned_from_dma_list(struct vfio_iommu *iommu)
+{
+	struct rb_node *n = rb_first(&iommu->dma_list);
+
+	for (; n; n = rb_next(n)) {
+		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
+
+		vfio_remove_unpinned_from_pfn_list(dma, false);
+	}
+}
+
 static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
 					       unsigned long iova)
 {
@@ -254,13 +282,17 @@ static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
 	return vpfn;
 }
 
-static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
+static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn,
+				  bool dirty_tracking)
 {
 	int ret = 0;
 
 	if (atomic_dec_and_test(&vpfn->ref_count)) {
 		ret = put_pfn(vpfn->pfn, dma->prot);
-		vfio_remove_from_pfn_list(dma, vpfn);
+		if (dirty_tracking)
+			vpfn->unpinned = true;
+		else
+			vfio_remove_from_pfn_list(dma, vpfn);
 	}
 	return ret;
 }
@@ -504,7 +536,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
 }
 
 static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
-				    bool do_accounting)
+				    bool do_accounting, bool dirty_tracking)
 {
 	int unlocked;
 	struct vfio_pfn *vpfn = vfio_find_vpfn(dma, iova);
@@ -512,7 +544,10 @@ static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
 	if (!vpfn)
 		return 0;
 
-	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn);
+	if (vpfn->unpinned)
+		return 0;
+
+	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn, dirty_tracking);
 
 	if (do_accounting)
 		vfio_lock_acct(dma, -unlocked, true);
@@ -571,8 +606,12 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 
 		vpfn = vfio_iova_get_vfio_pfn(dma, iova);
 		if (vpfn) {
-			phys_pfn[i] = vpfn->pfn;
-			continue;
+			if (vpfn->unpinned)
+				vfio_remove_from_pfn_list(dma, vpfn);
+			else {
+				phys_pfn[i] = vpfn->pfn;
+				continue;
+			}
 		}
 
 		remote_vaddr = dma->vaddr + iova - dma->iova;
@@ -583,7 +622,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 
 		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
 		if (ret) {
-			vfio_unpin_page_external(dma, iova, do_accounting);
+			vfio_unpin_page_external(dma, iova, do_accounting,
+						 false);
 			goto pin_unwind;
 		}
 	}
@@ -598,7 +638,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 
 		iova = user_pfn[j] << PAGE_SHIFT;
 		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
-		vfio_unpin_page_external(dma, iova, do_accounting);
+		vfio_unpin_page_external(dma, iova, do_accounting, false);
 		phys_pfn[j] = 0;
 	}
 pin_done:
@@ -632,7 +672,8 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
 		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
 		if (!dma)
 			goto unpin_exit;
-		vfio_unpin_page_external(dma, iova, do_accounting);
+		vfio_unpin_page_external(dma, iova, do_accounting,
+					 iommu->dirty_page_tracking);
 	}
 
 unpin_exit:
@@ -850,6 +891,88 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
 	return bitmap;
 }
 
+/*
+ * start_iova is the reference from where bitmaping started. This is called
+ * from DMA_UNMAP where start_iova can be different than iova
+ */
+
+static void vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
+				  size_t size, uint64_t pgsize,
+				  dma_addr_t start_iova, unsigned long *bitmap)
+{
+	struct vfio_dma *dma;
+	dma_addr_t i = iova;
+	unsigned long pgshift = __ffs(pgsize);
+
+	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
+		/* mark all pages dirty if all pages are pinned and mapped. */
+		if (dma->iommu_mapped) {
+			dma_addr_t iova_limit;
+
+			iova_limit = (dma->iova + dma->size) < (iova + size) ?
+				     (dma->iova + dma->size) : (iova + size);
+
+			for (; i < iova_limit; i += pgsize) {
+				unsigned int start;
+
+				start = (i - start_iova) >> pgshift;
+
+				__bitmap_set(bitmap, start, 1);
+			}
+			if (i >= iova + size)
+				return;
+		} else {
+			struct rb_node *n = rb_first(&dma->pfn_list);
+			bool found = false;
+
+			for (; n; n = rb_next(n)) {
+				struct vfio_pfn *vpfn = rb_entry(n,
+							struct vfio_pfn, node);
+				if (vpfn->iova >= i) {
+					found = true;
+					break;
+				}
+			}
+
+			if (!found) {
+				i += dma->size;
+				continue;
+			}
+
+			for (; n; n = rb_next(n)) {
+				unsigned int start;
+				struct vfio_pfn *vpfn = rb_entry(n,
+							struct vfio_pfn, node);
+
+				if (vpfn->iova >= iova + size)
+					return;
+
+				start = (vpfn->iova - start_iova) >> pgshift;
+
+				__bitmap_set(bitmap, start, 1);
+
+				i = vpfn->iova + pgsize;
+			}
+		}
+		vfio_remove_unpinned_from_pfn_list(dma, false);
+	}
+}
+
+static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
+{
+	long bsize;
+
+	if (!bitmap_size || bitmap_size > SIZE_MAX)
+		return -EINVAL;
+
+	bsize = ALIGN(npages, BITS_PER_LONG) / sizeof(unsigned long);
+
+	if (bitmap_size < bsize)
+		return -EINVAL;
+
+	return bsize;
+}
+
 static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 			     struct vfio_iommu_type1_dma_unmap *unmap)
 {
@@ -2297,6 +2420,83 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 
 		return copy_to_user((void __user *)arg, &unmap, minsz) ?
 			-EFAULT : 0;
+	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
+		struct vfio_iommu_type1_dirty_bitmap range;
+		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
+				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
+				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
+		int ret;
+
+		if (!iommu->v2)
+			return -EACCES;
+
+		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
+				    bitmap);
+
+		if (copy_from_user(&range, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (range.argsz < minsz || range.flags & ~mask)
+			return -EINVAL;
+
+		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
+			iommu->dirty_page_tracking = true;
+			return 0;
+		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
+			iommu->dirty_page_tracking = false;
+
+			mutex_lock(&iommu->lock);
+			vfio_remove_unpinned_from_dma_list(iommu);
+			mutex_unlock(&iommu->lock);
+			return 0;
+
+		} else if (range.flags &
+				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
+			uint64_t iommu_pgmask;
+			unsigned long pgshift = __ffs(range.pgsize);
+			unsigned long *bitmap;
+			long bsize;
+
+			iommu_pgmask =
+			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
+
+			if (((range.pgsize - 1) & iommu_pgmask) !=
+			    (range.pgsize - 1))
+				return -EINVAL;
+
+			if (range.iova & iommu_pgmask)
+				return -EINVAL;
+			if (!range.size || range.size > SIZE_MAX)
+				return -EINVAL;
+			if (range.iova + range.size < range.iova)
+				return -EINVAL;
+
+			bsize = verify_bitmap_size(range.size >> pgshift,
+						   range.bitmap_size);
+			if (bsize < 0)
+				return bsize;
+
+			bitmap = kmalloc(bsize, GFP_KERNEL);
+			if (!bitmap)
+				return -ENOMEM;
+
+			ret = copy_from_user(bitmap,
+			     (void __user *)range.bitmap, bsize) ? -EFAULT : 0;
+			if (ret)
+				goto bitmap_exit;
+
+			iommu->dirty_page_tracking = false;
+			mutex_lock(&iommu->lock);
+			vfio_iova_dirty_bitmap(iommu, range.iova, range.size,
+					     range.pgsize, range.iova, bitmap);
+			mutex_unlock(&iommu->lock);
+
+			ret = copy_to_user((void __user *)range.bitmap, bitmap,
+					   range.bitmap_size) ? -EFAULT : 0;
+bitmap_exit:
+			kfree(bitmap);
+			return ret;
+		}
 	}
 
 	return -ENOTTY;
-- 
2.7.0



* [PATCH v11 Kernel 4/6] vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap
  2019-12-17 17:10 [PATCH v11 Kernel 0/6] KABIs to support migration for VFIO devices Kirti Wankhede
                   ` (2 preceding siblings ...)
  2019-12-17 17:10 ` [PATCH v11 Kernel 3/6] vfio iommu: Implementation of ioctl to " Kirti Wankhede
@ 2019-12-17 17:10 ` Kirti Wankhede
  2019-12-17 22:55   ` Alex Williamson
  2019-12-17 17:10 ` [PATCH v11 Kernel 5/6] vfio iommu: Adds flag to indicate dirty pages tracking capability support Kirti Wankhede
  2019-12-17 17:10 ` [PATCH v11 Kernel 6/6] vfio: Selective dirty page tracking if IOMMU backed device pins pages Kirti Wankhede
  5 siblings, 1 reply; 20+ messages in thread
From: Kirti Wankhede @ 2019-12-17 17:10 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

Pages pinned by an external interface for a requested IO virtual address
range might get unpinned and unmapped while migration is active and the
device is still running, that is, in the pre-copy phase while the guest
driver can still access those pages. The host device can write to these
pages while they are mapped. Such pages should be marked dirty so that
after migration the guest driver can still complete the operation.

To get the bitmap during unmap, the user should set the flag
VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP. The bitmap memory should be allocated
and zeroed by the user space application, and the bitmap size and page size
should be set by the user application.
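
A sketch of the updated unmap call from user space (container_fd and the
iova/size/pgsize arguments are placeholders; the bitmap is allocated, zeroed
and sized by the caller; error handling is omitted):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Unmap an IOVA range and collect the dirty bitmap for that range. */
    int unmap_and_get_dirty(int container_fd, uint64_t iova, uint64_t size,
                            uint64_t pgsize, void *bitmap, uint64_t bitmap_size)
    {
            struct vfio_iommu_type1_dma_unmap unmap = {
                    .argsz = sizeof(unmap),
                    .flags = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP,
                    .iova = iova,
                    .size = size,
                    .bitmap_pgsize = pgsize,
                    .bitmap_size = bitmap_size,
                    .bitmap = bitmap,
            };

            return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
    }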

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 drivers/vfio/vfio_iommu_type1.c | 63 ++++++++++++++++++++++++++++++++++++-----
 include/uapi/linux/vfio.h       | 12 ++++++++
 2 files changed, 68 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 215aecb25453..101c2b1e72b4 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -974,7 +974,8 @@ static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
 }
 
 static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
-			     struct vfio_iommu_type1_dma_unmap *unmap)
+			     struct vfio_iommu_type1_dma_unmap *unmap,
+			     unsigned long *bitmap)
 {
 	uint64_t mask;
 	struct vfio_dma *dma, *dma_last = NULL;
@@ -1049,6 +1050,15 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 		if (dma->task->mm != current->mm)
 			break;
 
+		if ((unmap->flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
+		    (dma_last != dma))
+			vfio_iova_dirty_bitmap(iommu, dma->iova, dma->size,
+					     unmap->bitmap_pgsize, unmap->iova,
+					     bitmap);
+		else
+			vfio_remove_unpinned_from_pfn_list(dma, true);
+
+
 		if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
 			struct vfio_iommu_type1_dma_unmap nb_unmap;
 
@@ -1074,6 +1084,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 						    &nb_unmap);
 			goto again;
 		}
+
 		unmapped += dma->size;
 		vfio_remove_dma(iommu, dma);
 	}
@@ -2404,22 +2415,60 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 
 	} else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
 		struct vfio_iommu_type1_dma_unmap unmap;
-		long ret;
+		unsigned long *bitmap = NULL;
+		long ret, bsize;
 
 		minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
 
-		if (copy_from_user(&unmap, (void __user *)arg, minsz))
+		if (copy_from_user(&unmap, (void __user *)arg, sizeof(unmap)))
 			return -EFAULT;
 
-		if (unmap.argsz < minsz || unmap.flags)
+		if (unmap.argsz < minsz ||
+		    unmap.flags & ~VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)
 			return -EINVAL;
 
-		ret = vfio_dma_do_unmap(iommu, &unmap);
+		if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
+			unsigned long pgshift = __ffs(unmap.bitmap_pgsize);
+			uint64_t iommu_pgmask =
+			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
+
+			if (((unmap.bitmap_pgsize - 1) & iommu_pgmask) !=
+			     (unmap.bitmap_pgsize - 1))
+				return -EINVAL;
+
+			bsize = verify_bitmap_size(unmap.size >> pgshift,
+						   unmap.bitmap_size);
+			if (bsize < 0)
+				return bsize;
+
+			bitmap = kmalloc(bsize, GFP_KERNEL);
+			if (!bitmap)
+				return -ENOMEM;
+
+			if (copy_from_user(bitmap, (void __user *)unmap.bitmap,
+					   bsize)) {
+				ret = -EFAULT;
+				goto unmap_exit;
+			}
+		}
+
+		ret = vfio_dma_do_unmap(iommu, &unmap, bitmap);
 		if (ret)
-			return ret;
+			goto unmap_exit;
 
-		return copy_to_user((void __user *)arg, &unmap, minsz) ?
+		if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
+			if (copy_to_user((void __user *)unmap.bitmap, bitmap,
+					  bsize)) {
+				ret = -EFAULT;
+				goto unmap_exit;
+			}
+		}
+
+		ret = copy_to_user((void __user *)arg, &unmap, minsz) ?
 			-EFAULT : 0;
+unmap_exit:
+		kfree(bitmap);
+		return ret;
 	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
 		struct vfio_iommu_type1_dirty_bitmap range;
 		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 8268634e7e08..e8e044c4974d 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -964,12 +964,24 @@ struct vfio_iommu_type1_dma_map {
  * field.  No guarantee is made to the user that arbitrary unmaps of iova
  * or size different from those used in the original mapping call will
  * succeed.
+ * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get dirty bitmap
+ * before unmapping IO virtual addresses. When this flag is set, the user should
+ * allocate memory to hold the bitmap, zero that memory, and set the size of the
+ * allocated memory in the bitmap_size field. One bit in the bitmap represents one
+ * page, of the user provided page size in 'bitmap_pgsize', consecutively starting
+ * from the iova offset. A set bit indicates that the page at that
+ * offset from iova is dirty. Bitmap of pages in the range of unmapped size is
+ * returned in bitmap.
  */
 struct vfio_iommu_type1_dma_unmap {
 	__u32	argsz;
 	__u32	flags;
+#define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
 	__u64	iova;				/* IO virtual address */
 	__u64	size;				/* Size of mapping (bytes) */
+	__u64        bitmap_pgsize;		/* page size for bitmap */
+	__u64        bitmap_size;               /* in bytes */
+	void __user *bitmap;                    /* one bit per page */
 };
 
 #define VFIO_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
-- 
2.7.0



* [PATCH v11 Kernel 5/6] vfio iommu: Adds flag to indicate dirty pages tracking capability support
  2019-12-17 17:10 [PATCH v11 Kernel 0/6] KABIs to support migration for VFIO devices Kirti Wankhede
                   ` (3 preceding siblings ...)
  2019-12-17 17:10 ` [PATCH v11 Kernel 4/6] vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap Kirti Wankhede
@ 2019-12-17 17:10 ` Kirti Wankhede
  2019-12-17 17:10 ` [PATCH v11 Kernel 6/6] vfio: Selective dirty page tracking if IOMMU backed device pins pages Kirti Wankhede
  5 siblings, 0 replies; 20+ messages in thread
From: Kirti Wankhede @ 2019-12-17 17:10 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

Flag VFIO_IOMMU_INFO_DIRTY_PGS in VFIO_IOMMU_GET_INFO indicates that the
driver supports dirty pages tracking.
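
A sketch of how a user space application might probe for this capability
(container_fd is a placeholder; error handling is omitted):

    #include <stdbool.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Returns true if the IOMMU driver advertises dirty page tracking. */
    bool dirty_tracking_supported(int container_fd)
    {
            struct vfio_iommu_type1_info info = { .argsz = sizeof(info) };

            if (ioctl(container_fd, VFIO_IOMMU_GET_INFO, &info))
                    return false;

            return info.flags & VFIO_IOMMU_INFO_DIRTY_PGS;
    }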

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 drivers/vfio/vfio_iommu_type1.c | 3 ++-
 include/uapi/linux/vfio.h       | 5 +++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 101c2b1e72b4..68d8ed3b2665 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2368,7 +2368,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 			info.cap_offset = 0; /* output, no-recopy necessary */
 		}
 
-		info.flags = VFIO_IOMMU_INFO_PGSIZES;
+		info.flags = VFIO_IOMMU_INFO_PGSIZES |
+			     VFIO_IOMMU_INFO_DIRTY_PGS;
 
 		info.iova_pgsizes = vfio_pgsize_bitmap(iommu);
 
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index e8e044c4974d..bdd07e8429e3 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -907,8 +907,9 @@ struct vfio_device_ioeventfd {
 struct vfio_iommu_type1_info {
 	__u32	argsz;
 	__u32	flags;
-#define VFIO_IOMMU_INFO_PGSIZES (1 << 0)	/* supported page sizes info */
-#define VFIO_IOMMU_INFO_CAPS	(1 << 1)	/* Info supports caps */
+#define VFIO_IOMMU_INFO_PGSIZES   (1 << 0) /* supported page sizes info */
+#define VFIO_IOMMU_INFO_CAPS      (1 << 1) /* Info supports caps */
+#define VFIO_IOMMU_INFO_DIRTY_PGS (1 << 2) /* supports dirty page tracking */
 	__u64	iova_pgsizes;	/* Bitmap of supported page sizes */
 	__u32   cap_offset;	/* Offset within info struct of first cap */
 };
-- 
2.7.0



* [PATCH v11 Kernel 6/6] vfio: Selective dirty page tracking if IOMMU backed device pins pages
  2019-12-17 17:10 [PATCH v11 Kernel 0/6] KABIs to support migration for VFIO devices Kirti Wankhede
                   ` (4 preceding siblings ...)
  2019-12-17 17:10 ` [PATCH v11 Kernel 5/6] vfio iommu: Adds flag to indicate dirty pages tracking capability support Kirti Wankhede
@ 2019-12-17 17:10 ` Kirti Wankhede
  2019-12-18  0:12   ` Alex Williamson
  5 siblings, 1 reply; 20+ messages in thread
From: Kirti Wankhede @ 2019-12-17 17:10 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm, Kirti Wankhede

Track dirty pages reporting capability for each vfio_device by setting the
capability flag on calling vfio_pin_pages() for that device.

In the vfio_iommu_type1 module, while creating the dirty pages bitmap, check
if an IOMMU backed device is present in the container. If an IOMMU backed
device is present, then check the dirty pages reporting capability of each
vfio device in the container. If all vfio devices are capable of reporting
dirty pages by pinning pages through the external API, then create the
bitmap from pinned pages only. If an IOMMU backed device is present in the
container and any one device is not able to report dirty pages, then mark
all pages as dirty.
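
For illustration, a sketch of how a vendor driver becomes dirty-page capable
simply by pinning guest pages through the external API (dev, gfn and hpfn
are placeholders; this is not a complete driver):

    #include <linux/iommu.h>
    #include <linux/vfio.h>

    /*
     * Pinning through vfio_pin_pages() sets the per-device dirty pages
     * reporting capability introduced in this patch; returns the number of
     * pages pinned or a negative errno.
     */
    static int example_pin_one(struct device *dev, unsigned long gfn,
                               unsigned long *hpfn)
    {
            return vfio_pin_pages(dev, &gfn, 1, IOMMU_READ | IOMMU_WRITE, hpfn);
    }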

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 drivers/vfio/vfio.c             | 33 +++++++++++++++++++++++++++++++
 drivers/vfio/vfio_iommu_type1.c | 44 +++++++++++++++++++++++++++++++++++++++--
 include/linux/vfio.h            |  3 ++-
 3 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index c8482624ca34..9d2fbe09768a 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -96,6 +96,8 @@ struct vfio_device {
 	struct vfio_group		*group;
 	struct list_head		group_next;
 	void				*device_data;
+	/* dirty pages reporting capable */
+	bool				dirty_pages_cap;
 };
 
 #ifdef CONFIG_VFIO_NOIOMMU
@@ -1866,6 +1868,29 @@ int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr, int num_irqs,
 }
 EXPORT_SYMBOL(vfio_set_irqs_validate_and_prepare);
 
+int vfio_device_is_dirty_reporting_capable(struct device *dev, bool *cap)
+{
+	struct vfio_device *device;
+	struct vfio_group *group;
+
+	if (!dev || !cap)
+		return -EINVAL;
+
+	group = vfio_group_get_from_dev(dev);
+	if (!group)
+		return -ENODEV;
+
+	device = vfio_group_get_device(group, dev);
+	if (!device)
+		return -ENODEV;
+
+	*cap = device->dirty_pages_cap;
+	vfio_device_put(device);
+	vfio_group_put(group);
+	return 0;
+}
+EXPORT_SYMBOL(vfio_device_is_dirty_reporting_capable);
+
 /*
  * Pin a set of guest PFNs and return their associated host PFNs for local
  * domain only.
@@ -1907,6 +1932,14 @@ int vfio_pin_pages(struct device *dev, unsigned long *user_pfn, int npage,
 	else
 		ret = -ENOTTY;
 
+	if (ret > 0) {
+		struct vfio_device *device = vfio_group_get_device(group, dev);
+
+		if (device) {
+			device->dirty_pages_cap = true;
+			vfio_device_put(device);
+		}
+	}
 	vfio_group_try_dissolve_container(group);
 
 err_pin_pages:
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 68d8ed3b2665..ef56f31f4e73 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -891,6 +891,39 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
 	return bitmap;
 }
 
+static int vfio_is_dirty_pages_reporting_capable(struct device *dev, void *data)
+{
+	bool new;
+	int ret;
+
+	ret = vfio_device_is_dirty_reporting_capable(dev, &new);
+	if (ret)
+		return ret;
+
+	*(bool *)data = *(bool *)data && new;
+
+	return 0;
+}
+
+static bool vfio_dirty_pages_reporting_capable(struct vfio_iommu *iommu)
+{
+	struct vfio_domain *d;
+	struct vfio_group *g;
+	bool capable = true;
+	int ret;
+
+	list_for_each_entry(d, &iommu->domain_list, next) {
+		list_for_each_entry(g, &d->group_list, next) {
+			ret = iommu_group_for_each_dev(g->iommu_group, &capable,
+					vfio_is_dirty_pages_reporting_capable);
+			if (ret)
+				return false;
+		}
+	}
+
+	return capable;
+}
+
 /*
  * start_iova is the reference from where bitmaping started. This is called
  * from DMA_UNMAP where start_iova can be different than iova
@@ -903,10 +936,17 @@ static void vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
 	struct vfio_dma *dma;
 	dma_addr_t i = iova;
 	unsigned long pgshift = __ffs(pgsize);
+	bool dirty_report_cap = true;
+
+	if (IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
+		dirty_report_cap = vfio_dirty_pages_reporting_capable(iommu);
 
 	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
-		/* mark all pages dirty if all pages are pinned and mapped. */
-		if (dma->iommu_mapped) {
+		/*
+		 * mark all pages dirty if any IOMMU capable device is not able
+		 * to report dirty pages and all pages are pinned and mapped.
+		 */
+		if (!dirty_report_cap && dma->iommu_mapped) {
 			dma_addr_t iova_limit;
 
 			iova_limit = (dma->iova + dma->size) < (iova + size) ?
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index e42a711a2800..ed3832ea10a1 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -148,7 +148,8 @@ extern int vfio_info_add_capability(struct vfio_info_cap *caps,
 extern int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr,
 					      int num_irqs, int max_irq_type,
 					      size_t *data_size);
-
+extern int vfio_device_is_dirty_reporting_capable(struct device *dev,
+						  bool *cap);
 struct pci_dev;
 #if IS_ENABLED(CONFIG_VFIO_SPAPR_EEH)
 extern void vfio_spapr_pci_eeh_open(struct pci_dev *pdev);
-- 
2.7.0



* Re: [PATCH v11 Kernel 3/6] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2019-12-17 17:10 ` [PATCH v11 Kernel 3/6] vfio iommu: Implementation of ioctl to " Kirti Wankhede
@ 2019-12-17 22:12   ` Alex Williamson
  2020-01-07 20:07     ` Kirti Wankhede
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Williamson @ 2019-12-17 22:12 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Tue, 17 Dec 2019 22:40:48 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
> - Start unpinned pages dirty pages tracking while migration is active and
>   device is running, i.e. during pre-copy phase.
> - Stop unpinned pages dirty pages tracking. This is required to stop
>   unpinned dirty pages tracking if migration failed or cancelled during
>   pre-copy phase. Unpinned pages tracking is clear.
> - Get dirty pages bitmap. Stop unpinned dirty pages tracking and clear
>   unpinned pages information on bitmap read. This ioctl returns bitmap of
>   dirty pages, its user space application responsibility to copy content
>   of dirty pages from source to destination during migration.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 218 ++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 209 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 2ada8e6cdb88..215aecb25453 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -70,6 +70,7 @@ struct vfio_iommu {
>  	unsigned int		dma_avail;
>  	bool			v2;
>  	bool			nesting;
> +	bool			dirty_page_tracking;
>  };
>  
>  struct vfio_domain {
> @@ -112,6 +113,7 @@ struct vfio_pfn {
>  	dma_addr_t		iova;		/* Device address */
>  	unsigned long		pfn;		/* Host pfn */
>  	atomic_t		ref_count;
> +	bool			unpinned;

Doesn't this duplicate ref_count == 0?

>  };
>  
>  struct vfio_regions {
> @@ -244,6 +246,32 @@ static void vfio_remove_from_pfn_list(struct vfio_dma *dma,
>  	kfree(vpfn);
>  }
>  
> +static void vfio_remove_unpinned_from_pfn_list(struct vfio_dma *dma, bool warn)
> +{
> +	struct rb_node *n = rb_first(&dma->pfn_list);
> +
> +	for (; n; n = rb_next(n)) {
> +		struct vfio_pfn *vpfn = rb_entry(n, struct vfio_pfn, node);
> +
> +		if (warn)
> +			WARN_ON_ONCE(vpfn->unpinned);

This option isn't used within this patch, perhaps better to add with
its use case, but it seems this presents both a denial of service via
kernel tainting and an undocumented feature/bug.  As I interpret its
use within the next patch, this generates a warning if the user
unmapped the IOVA with dirty pages present, without using the dirty
bitmap extension of the unmap call.  Our job is not to babysit the
user, if they don't care to look at the dirty bitmap, that's their
prerogative.  Drop this warning and the function arg.

> +
> +		if (vpfn->unpinned)

if (!atomic_read(&vpfn->ref_count))

> +			vfio_remove_from_pfn_list(dma, vpfn);
> +	}
> +}
> +
> +static void vfio_remove_unpinned_from_dma_list(struct vfio_iommu *iommu)
> +{
> +	struct rb_node *n = rb_first(&iommu->dma_list);
> +
> +	for (; n; n = rb_next(n)) {
> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> +
> +		vfio_remove_unpinned_from_pfn_list(dma, false);
> +	}
> +}
> +
>  static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
>  					       unsigned long iova)
>  {
> @@ -254,13 +282,17 @@ static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
>  	return vpfn;
>  }
>  
> -static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
> +static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn,
> +				  bool dirty_tracking)
>  {
>  	int ret = 0;
>  
>  	if (atomic_dec_and_test(&vpfn->ref_count)) {
>  		ret = put_pfn(vpfn->pfn, dma->prot);
> -		vfio_remove_from_pfn_list(dma, vpfn);
> +		if (dirty_tracking)
> +			vpfn->unpinned = true;
> +		else
> +			vfio_remove_from_pfn_list(dma, vpfn);

This can also simply use ref_count.  BTW, checking the locking, I think
->ref_count is only manipulated under iommu->lock, therefore the atomic
ops are probably overkill.

>  	}
>  	return ret;
>  }
> @@ -504,7 +536,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
>  }
>  
>  static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
> -				    bool do_accounting)
> +				    bool do_accounting, bool dirty_tracking)
>  {
>  	int unlocked;
>  	struct vfio_pfn *vpfn = vfio_find_vpfn(dma, iova);
> @@ -512,7 +544,10 @@ static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
>  	if (!vpfn)
>  		return 0;
>  
> -	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn);
> +	if (vpfn->unpinned)
> +		return 0;

Combine with above, if (!vpfn || !vpfn->ref_count)

> +
> +	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn, dirty_tracking);
>  
>  	if (do_accounting)
>  		vfio_lock_acct(dma, -unlocked, true);
> @@ -571,8 +606,12 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  
>  		vpfn = vfio_iova_get_vfio_pfn(dma, iova);
>  		if (vpfn) {
> -			phys_pfn[i] = vpfn->pfn;
> -			continue;
> +			if (vpfn->unpinned)
> +				vfio_remove_from_pfn_list(dma, vpfn);

This seems inefficient, we have an allocated vpfn at the right places
in the list, wouldn't it be better to repin the page?

> +			else {
> +				phys_pfn[i] = vpfn->pfn;
> +				continue;
> +			}
>  		}
>  
>  		remote_vaddr = dma->vaddr + iova - dma->iova;
> @@ -583,7 +622,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  
>  		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
>  		if (ret) {
> -			vfio_unpin_page_external(dma, iova, do_accounting);
> +			vfio_unpin_page_external(dma, iova, do_accounting,
> +						 false);
>  			goto pin_unwind;
>  		}
>  	}
> @@ -598,7 +638,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  
>  		iova = user_pfn[j] << PAGE_SHIFT;
>  		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
> -		vfio_unpin_page_external(dma, iova, do_accounting);
> +		vfio_unpin_page_external(dma, iova, do_accounting, false);
>  		phys_pfn[j] = 0;
>  	}
>  pin_done:
> @@ -632,7 +672,8 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
>  		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
>  		if (!dma)
>  			goto unpin_exit;
> -		vfio_unpin_page_external(dma, iova, do_accounting);
> +		vfio_unpin_page_external(dma, iova, do_accounting,
> +					 iommu->dirty_page_tracking);
>  	}
>  
>  unpin_exit:
> @@ -850,6 +891,88 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
>  	return bitmap;
>  }
>  
> +/*
> + * start_iova is the reference from where bitmaping started. This is called
> + * from DMA_UNMAP where start_iova can be different than iova
> + */
> +
> +static void vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> +				  size_t size, uint64_t pgsize,
> +				  dma_addr_t start_iova, unsigned long *bitmap)
> +{
> +	struct vfio_dma *dma;
> +	dma_addr_t i = iova;
> +	unsigned long pgshift = __ffs(pgsize);
> +
> +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
> +		/* mark all pages dirty if all pages are pinned and mapped. */
> +		if (dma->iommu_mapped) {
> +			dma_addr_t iova_limit;
> +
> +			iova_limit = (dma->iova + dma->size) < (iova + size) ?
> +				     (dma->iova + dma->size) : (iova + size);

min(dma->iova + dma->size, iova + size);

> +
> +			for (; i < iova_limit; i += pgsize) {
> +				unsigned int start;
> +
> +				start = (i - start_iova) >> pgshift;
> +
> +				__bitmap_set(bitmap, start, 1);

Why __bitmap_set() rather than bitmap_set()?  Also why not try to take
advantage of the number of bits arg?

> +			}
> +			if (i >= iova + size)
> +				return;

This skips the removed unpinned callback at the end of the loop,
leaving unnecessarily tracked, unpinned vpfns.

> +		} else {
> +			struct rb_node *n = rb_first(&dma->pfn_list);
> +			bool found = false;
> +
> +			for (; n; n = rb_next(n)) {
> +				struct vfio_pfn *vpfn = rb_entry(n,
> +							struct vfio_pfn, node);
> +				if (vpfn->iova >= i) {

This doesn't look right, how does a vpfn with .iova > i necessarily
contain i?

> +					found = true;
> +					break;
> +				}
> +			}
> +
> +			if (!found) {
> +				i += dma->size;
> +				continue;
> +			}
> +
> +			for (; n; n = rb_next(n)) {
> +				unsigned int start;
> +				struct vfio_pfn *vpfn = rb_entry(n,
> +							struct vfio_pfn, node);
> +
> +				if (vpfn->iova >= iova + size)
> +					return;
> +
> +				start = (vpfn->iova - start_iova) >> pgshift;
> +
> +				__bitmap_set(bitmap, start, 1);

Don't we need to iterate over the vfpn relative to pgsize?  Oh, I
see below that pgsize is the minimum user advertised size, which is at
least PAGE_SIZE, to maybe not.  Same bitmap_set() question as above
though.

> +
> +				i = vpfn->iova + pgsize;
> +			}
> +		}
> +		vfio_remove_unpinned_from_pfn_list(dma, false);
> +	}
> +}
> +
> +static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
> +{
> +	long bsize;
> +
> +	if (!bitmap_size || bitmap_size > SIZE_MAX)
> +		return -EINVAL;
> +
> +	bsize = ALIGN(npages, BITS_PER_LONG) / sizeof(unsigned long);
> +
> +	if (bitmap_size < bsize)
> +		return -EINVAL;
> +
> +	return bsize;
> +}
> +
>  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  			     struct vfio_iommu_type1_dma_unmap *unmap)
>  {
> @@ -2297,6 +2420,83 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  
>  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>  			-EFAULT : 0;
> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
> +		struct vfio_iommu_type1_dirty_bitmap range;
> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> +		int ret;
> +
> +		if (!iommu->v2)
> +			return -EACCES;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
> +				    bitmap);
> +
> +		if (copy_from_user(&range, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (range.argsz < minsz || range.flags & ~mask)
> +			return -EINVAL;
> +

flags should be sanitized further, invalid combinations should be
rejected.  For example, if a user provides STOP|GET_BITMAP it should
either populate the bitmap AND turn off tracking, or error.  It's not
acceptable to turn off tracking and silently ignore GET_BITMAP.
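
For example, a minimal version of the check being asked for, assuming (as
agreed later in the thread) that exactly one of the three flags must be set
per call:

	/* exactly one of START/STOP/GET_BITMAP must be set */
	if (!range.flags || (range.flags & (range.flags - 1)))
		return -EINVAL;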

> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
> +			iommu->dirty_page_tracking = true;
> +			return 0;
> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
> +			iommu->dirty_page_tracking = false;
> +
> +			mutex_lock(&iommu->lock);
> +			vfio_remove_unpinned_from_dma_list(iommu);
> +			mutex_unlock(&iommu->lock);
> +			return 0;
> +
> +		} else if (range.flags &
> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
> +			uint64_t iommu_pgmask;
> +			unsigned long pgshift = __ffs(range.pgsize);
> +			unsigned long *bitmap;
> +			long bsize;
> +
> +			iommu_pgmask =
> +			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> +
> +			if (((range.pgsize - 1) & iommu_pgmask) !=
> +			    (range.pgsize - 1))
> +				return -EINVAL;
> +
> +			if (range.iova & iommu_pgmask)
> +				return -EINVAL;
> +			if (!range.size || range.size > SIZE_MAX)
> +				return -EINVAL;
> +			if (range.iova + range.size < range.iova)
> +				return -EINVAL;
> +
> +			bsize = verify_bitmap_size(range.size >> pgshift,
> +						   range.bitmap_size);
> +			if (bsize < 0)
> +				return ret;
> +
> +			bitmap = kmalloc(bsize, GFP_KERNEL);

I think I remember mentioning previously that we cannot allocate an
arbitrary buffer on behalf of the user, it's far too easy for them to
kill the kernel that way.  kmalloc is also limited in what it can
alloc.  Can't we use the user buffer directly or only work on a part of
it at a time?

> +			if (!bitmap)
> +				return -ENOMEM;
> +
> +			ret = copy_from_user(bitmap,
> +			     (void __user *)range.bitmap, bsize) ? -EFAULT : 0;
> +			if (ret)
> +				goto bitmap_exit;
> +
> +			iommu->dirty_page_tracking = false;

a) This is done outside of the mutex and susceptible to races, b) why
is this done?

Thanks,
Alex

> +			mutex_lock(&iommu->lock);
> +			vfio_iova_dirty_bitmap(iommu, range.iova, range.size,
> +					     range.pgsize, range.iova, bitmap);
> +			mutex_unlock(&iommu->lock);
> +
> +			ret = copy_to_user((void __user *)range.bitmap, bitmap,
> +					   range.bitmap_size) ? -EFAULT : 0;
> +bitmap_exit:
> +			kfree(bitmap);
> +			return ret;
> +		}
>  	}
>  
>  	return -ENOTTY;


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v11 Kernel 4/6] vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap
  2019-12-17 17:10 ` [PATCH v11 Kernel 4/6] vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap Kirti Wankhede
@ 2019-12-17 22:55   ` Alex Williamson
  0 siblings, 0 replies; 20+ messages in thread
From: Alex Williamson @ 2019-12-17 22:55 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Tue, 17 Dec 2019 22:40:49 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Pages pinned by the external interface for a requested IO virtual address
> range might get unpinned and unmapped while migration is active and the
> device is still running, that is, in the pre-copy phase while the guest
> driver can still access those pages. The host device can write to these
> pages while they are mapped. Such pages should be marked dirty so that
> after migration the guest driver can still complete the operation.
> 
> To get the bitmap during unmap, the user should set the flag
> VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP. The bitmap memory should be allocated
> and zeroed by the user space application, and the bitmap size and page size
> should be set by the user application.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 63 ++++++++++++++++++++++++++++++++++++-----
>  include/uapi/linux/vfio.h       | 12 ++++++++
>  2 files changed, 68 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 215aecb25453..101c2b1e72b4 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -974,7 +974,8 @@ static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
>  }
>  
>  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> -			     struct vfio_iommu_type1_dma_unmap *unmap)
> +			     struct vfio_iommu_type1_dma_unmap *unmap,
> +			     unsigned long *bitmap)
>  {
>  	uint64_t mask;
>  	struct vfio_dma *dma, *dma_last = NULL;
> @@ -1049,6 +1050,15 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  		if (dma->task->mm != current->mm)
>  			break;
>  
> +		if ((unmap->flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
> +		    (dma_last != dma))
> +			vfio_iova_dirty_bitmap(iommu, dma->iova, dma->size,
> +					     unmap->bitmap_pgsize, unmap->iova,
> +					     bitmap);
> +		else
> +			vfio_remove_unpinned_from_pfn_list(dma, true);
> +
> +
>  		if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
>  			struct vfio_iommu_type1_dma_unmap nb_unmap;
>  
> @@ -1074,6 +1084,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  						    &nb_unmap);
>  			goto again;
>  		}
> +
>  		unmapped += dma->size;
>  		vfio_remove_dma(iommu, dma);
>  	}
> @@ -2404,22 +2415,60 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  
>  	} else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
>  		struct vfio_iommu_type1_dma_unmap unmap;
> -		long ret;
> +		unsigned long *bitmap = NULL;
> +		long ret, bsize;
>  
>  		minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
>  
> -		if (copy_from_user(&unmap, (void __user *)arg, minsz))
> +		if (copy_from_user(&unmap, (void __user *)arg, sizeof(unmap)))

If we only require minsz, how are we going to copy sizeof(unmap)?  This
breaks existing userspace.  You'll need to copy the remainder of the
user data after validating that they've provided it.
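
One possible shape for that two-step copy, sketched here rather than taken
from any posted patch:

	if (copy_from_user(&unmap, (void __user *)arg, minsz))
		return -EFAULT;

	if (unmap.argsz < minsz ||
	    unmap.flags & ~VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)
		return -EINVAL;

	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
		/* the user claims the extended fields, so argsz must cover them */
		if (unmap.argsz < sizeof(unmap))
			return -EINVAL;
		if (copy_from_user((char *)&unmap + minsz,
				   (void __user *)(arg + minsz),
				   sizeof(unmap) - minsz))
			return -EFAULT;
	}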

>  			return -EFAULT;
>  
> -		if (unmap.argsz < minsz || unmap.flags)
> +		if (unmap.argsz < minsz ||
> +		    unmap.flags & ~VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)
>  			return -EINVAL;
>  
> -		ret = vfio_dma_do_unmap(iommu, &unmap);
> +		if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
> +			unsigned long pgshift = __ffs(unmap.bitmap_pgsize);
> +			uint64_t iommu_pgmask =
> +			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> +
> +			if (((unmap.bitmap_pgsize - 1) & iommu_pgmask) !=
> +			     (unmap.bitmap_pgsize - 1))
> +				return -EINVAL;
> +
> +			bsize = verify_bitmap_size(unmap.size >> pgshift,
> +						   unmap.bitmap_size);
> +			if (bsize < 0)
> +				return bsize;
> +
> +			bitmap = kmalloc(bsize, GFP_KERNEL);

Same allocation that we cannot do.  Thanks,

Alex

> +			if (!bitmap)
> +				return -ENOMEM;
> +
> +			if (copy_from_user(bitmap, (void __user *)unmap.bitmap,
> +					   bsize)) {
> +				ret = -EFAULT;
> +				goto unmap_exit;
> +			}
> +		}
> +
> +		ret = vfio_dma_do_unmap(iommu, &unmap, bitmap);
>  		if (ret)
> -			return ret;
> +			goto unmap_exit;
>  
> -		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> +		if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
> +			if (copy_to_user((void __user *)unmap.bitmap, bitmap,
> +					  bsize)) {
> +				ret = -EFAULT;
> +				goto unmap_exit;
> +			}
> +		}
> +
> +		ret = copy_to_user((void __user *)arg, &unmap, minsz) ?
>  			-EFAULT : 0;
> +unmap_exit:
> +		kfree(bitmap);
> +		return ret;
>  	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
>  		struct vfio_iommu_type1_dirty_bitmap range;
>  		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 8268634e7e08..e8e044c4974d 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -964,12 +964,24 @@ struct vfio_iommu_type1_dma_map {
>   * field.  No guarantee is made to the user that arbitrary unmaps of iova
>   * or size different from those used in the original mapping call will
>   * succeed.
> + * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get dirty bitmap
> + * before unmapping IO virtual addresses. When this flag is set, user should
> + * allocate memory to get bitmap, clear the bitmap memory by setting zero and
> + * should set size of allocated memory in bitmap_size field. One bit in bitmap
> + * represents per page , page of user provided page size in 'bitmap_pgsize',
> + * consecutively starting from iova offset. Bit set indicates page at that
> + * offset from iova is dirty. Bitmap of pages in the range of unmapped size is
> + * returned in bitmap.
>   */
>  struct vfio_iommu_type1_dma_unmap {
>  	__u32	argsz;
>  	__u32	flags;
> +#define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
>  	__u64	iova;				/* IO virtual address */
>  	__u64	size;				/* Size of mapping (bytes) */
> +	__u64        bitmap_pgsize;		/* page size for bitmap */
> +	__u64        bitmap_size;               /* in bytes */
> +	void __user *bitmap;                    /* one bit per page */
>  };
>  
>  #define VFIO_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v11 Kernel 6/6] vfio: Selective dirty page tracking if IOMMU backed device pins pages
  2019-12-17 17:10 ` [PATCH v11 Kernel 6/6] vfio: Selective dirty page tracking if IOMMU backed device pins pages Kirti Wankhede
@ 2019-12-18  0:12   ` Alex Williamson
  2020-01-07 20:45     ` Kirti Wankhede
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Williamson @ 2019-12-18  0:12 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Tue, 17 Dec 2019 22:40:51 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Track dirty pages reporting capability for each vfio_device by setting the
> capability flag on calling vfio_pin_pages() for that device.
> 
> In the vfio_iommu_type1 module, while creating the dirty pages bitmap, check
> whether an IOMMU backed device is present in the container. If an IOMMU
> backed device is present, then check the dirty pages reporting capability of
> each vfio device in the container. If all vfio devices are capable of
> reporting dirty pages by pinning pages through the external API, then create
> a bitmap of pinned pages only. If an IOMMU backed device is present in the
> container and any one device is not able to report dirty pages, then mark
> all pages as dirty.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  drivers/vfio/vfio.c             | 33 +++++++++++++++++++++++++++++++
>  drivers/vfio/vfio_iommu_type1.c | 44 +++++++++++++++++++++++++++++++++++++++--
>  include/linux/vfio.h            |  3 ++-
>  3 files changed, 77 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index c8482624ca34..9d2fbe09768a 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -96,6 +96,8 @@ struct vfio_device {
>  	struct vfio_group		*group;
>  	struct list_head		group_next;
>  	void				*device_data;
> +	/* dirty pages reporting capable */
> +	bool				dirty_pages_cap;
>  };
>  
>  #ifdef CONFIG_VFIO_NOIOMMU
> @@ -1866,6 +1868,29 @@ int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr, int num_irqs,
>  }
>  EXPORT_SYMBOL(vfio_set_irqs_validate_and_prepare);
>  
> +int vfio_device_is_dirty_reporting_capable(struct device *dev, bool *cap)
> +{
> +	struct vfio_device *device;
> +	struct vfio_group *group;
> +
> +	if (!dev || !cap)
> +		return -EINVAL;
> +
> +	group = vfio_group_get_from_dev(dev);
> +	if (!group)
> +		return -ENODEV;
> +
> +	device = vfio_group_get_device(group, dev);
> +	if (!device)
> +		return -ENODEV;
> +
> +	*cap = device->dirty_pages_cap;
> +	vfio_device_put(device);
> +	vfio_group_put(group);
> +	return 0;
> +}
> +EXPORT_SYMBOL(vfio_device_is_dirty_reporting_capable);

I'd suggest this just return true/false and any error condition simply
be part of the false case.
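
That is, something along these lines (a sketch reusing the names from the
quoted patch):

	bool vfio_device_is_dirty_reporting_capable(struct device *dev)
	{
		struct vfio_device *device;
		struct vfio_group *group;
		bool cap = false;

		if (!dev)
			return false;

		group = vfio_group_get_from_dev(dev);
		if (!group)
			return false;

		device = vfio_group_get_device(group, dev);
		if (device) {
			cap = device->dirty_pages_cap;
			vfio_device_put(device);
		}
		vfio_group_put(group);

		return cap;
	}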

> +
>  /*
>   * Pin a set of guest PFNs and return their associated host PFNs for local
>   * domain only.
> @@ -1907,6 +1932,14 @@ int vfio_pin_pages(struct device *dev, unsigned long *user_pfn, int npage,
>  	else
>  		ret = -ENOTTY;
>  
> +	if (ret > 0) {
> +		struct vfio_device *device = vfio_group_get_device(group, dev);
> +
> +		if (device) {
> +			device->dirty_pages_cap = true;
> +			vfio_device_put(device);
> +		}
> +	}

I think we'd want to trivially rework vfio_pin_pages() to use
vfio_device_get_from_dev() instead of vfio_group_get_from_dev(), then
we have access to the group via device->group.  Then vfio_device_put()
would be common in the return path.

>  	vfio_group_try_dissolve_container(group);
>  
>  err_pin_pages:
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 68d8ed3b2665..ef56f31f4e73 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -891,6 +891,39 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
>  	return bitmap;
>  }
>  
> +static int vfio_is_dirty_pages_reporting_capable(struct device *dev, void *data)
> +{
> +	bool new;
> +	int ret;
> +
> +	ret = vfio_device_is_dirty_reporting_capable(dev, &new);
> +	if (ret)
> +		return ret;
> +
> +	*(bool *)data = *(bool *)data && new;
> +
> +	return 0;
> +}
> +
> +static bool vfio_dirty_pages_reporting_capable(struct vfio_iommu *iommu)
> +{
> +	struct vfio_domain *d;
> +	struct vfio_group *g;
> +	bool capable = true;
> +	int ret;
> +
> +	list_for_each_entry(d, &iommu->domain_list, next) {
> +		list_for_each_entry(g, &d->group_list, next) {
> +			ret = iommu_group_for_each_dev(g->iommu_group, &capable,
> +					vfio_is_dirty_pages_reporting_capable);
> +			if (ret)
> +				return false;

This will fail when there are devices within the IOMMU group that are
not represented as vfio_devices.  My original suggestion was:

On Thu, 14 Nov 2019 14:06:25 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:
> I think it does so by pinning pages.  Is it acceptable that if the
> vendor driver pins any pages, then from that point forward we consider
> the IOMMU group dirty page scope to be limited to pinned pages?  There
> are complications around non-singleton IOMMU groups, but I think we're
> already leaning towards that being a non-worthwhile problem to solve.
> So if we require that only singleton IOMMU groups can pin pages and we

We could tag vfio_groups as singleton at vfio_add_group_dev() time with
an iommu_group_for_each_dev() walk so that we can cache the value on
the struct vfio_group.  vfio_group_nb_add_dev() could update this if
the IOMMU group composition changes.  vfio_pin_pages() could return
-EINVAL if (!group->is_singleton).

> pass the IOMMU group as a parameter to
> vfio_iommu_driver_ops.pin_pages(), then the type1 backend can set a
> flag on its local vfio_group struct to indicate dirty page scope is
> limited to pinned pages.

ie. vfio_iommu_type1_unpin_pages() calls find_iommu_group() on each
domain in domain_list and the external_domain using the struct
iommu_group pointer provided by vfio-core.  We set a new attribute on
the vfio_group to indicate that vfio_group has (at some point) pinned
pages.

>  We might want to keep a flag on the
> vfio_iommu struct to indicate if all of the vfio_groups for each
> vfio_domain in the vfio_iommu.domain_list dirty page scope limited to
> pinned pages as an optimization to avoid walking lists too often.  Then
> we could test if vfio_iommu.domain_list is not empty and this new flag
> does not limit the dirty page scope, then everything within each
> vfio_dma is considered dirty.

So at the point where we change vfio_group.has_pinned_pages from false
to true, or a group is added or removed, we walk all the groups in the
vfio_iommu and if they all have has_pinned_pages set, we can set a
vfio_iommu.pinned_page_dirty_scope flag to true.  If that flag is
already true on page pinning, we can skip the lookup.
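
Roughly, the walk described above (a sketch only; the helper name is made up,
has_pinned_pages/pinned_page_dirty_scope are the fields suggested above, and
the external_domain case is omitted for brevity):

	static void update_pinned_page_dirty_scope(struct vfio_iommu *iommu)
	{
		struct vfio_domain *d;
		struct vfio_group *g;

		list_for_each_entry(d, &iommu->domain_list, next) {
			list_for_each_entry(g, &d->group_list, next) {
				/* any group that never pinned widens the dirty scope */
				if (!g->has_pinned_pages) {
					iommu->pinned_page_dirty_scope = false;
					return;
				}
			}
		}
		iommu->pinned_page_dirty_scope = true;
	}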

I still like this approach better, it doesn't require a callback from
type1 to vfio-core and it doesn't require a heavy weight walking for
group devices and vfio data structures every time we fill a bitmap.
Did you run into issues trying to implement this approach?  Thanks,

Alex

> +		}
> +	}
> +
> +	return capable;
> +}
> +
>  /*
>   * start_iova is the reference from where bitmaping started. This is called
>   * from DMA_UNMAP where start_iova can be different than iova
> @@ -903,10 +936,17 @@ static void vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
>  	struct vfio_dma *dma;
>  	dma_addr_t i = iova;
>  	unsigned long pgshift = __ffs(pgsize);
> +	bool dirty_report_cap = true;
> +
> +	if (IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
> +		dirty_report_cap = vfio_dirty_pages_reporting_capable(iommu);
>  
>  	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
> -		/* mark all pages dirty if all pages are pinned and mapped. */
> -		if (dma->iommu_mapped) {
> +		/*
> +		 * mark all pages dirty if any IOMMU capable device is not able
> +		 * to report dirty pages and all pages are pinned and mapped.
> +		 */
> +		if (!dirty_report_cap && dma->iommu_mapped) {
>  			dma_addr_t iova_limit;
>  
>  			iova_limit = (dma->iova + dma->size) < (iova + size) ?
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index e42a711a2800..ed3832ea10a1 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -148,7 +148,8 @@ extern int vfio_info_add_capability(struct vfio_info_cap *caps,
>  extern int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr,
>  					      int num_irqs, int max_irq_type,
>  					      size_t *data_size);
> -
> +extern int vfio_device_is_dirty_reporting_capable(struct device *dev,
> +						  bool *cap);
>  struct pci_dev;
>  #if IS_ENABLED(CONFIG_VFIO_SPAPR_EEH)
>  extern void vfio_spapr_pci_eeh_open(struct pci_dev *pdev);


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v11 Kernel 3/6] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2019-12-17 22:12   ` Alex Williamson
@ 2020-01-07 20:07     ` Kirti Wankhede
  2020-01-07 22:02       ` Alex Williamson
  0 siblings, 1 reply; 20+ messages in thread
From: Kirti Wankhede @ 2020-01-07 20:07 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm



On 12/18/2019 3:42 AM, Alex Williamson wrote:
> On Tue, 17 Dec 2019 22:40:48 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
>> - Start unpinned pages dirty pages tracking while migration is active and
>>    device is running, i.e. during pre-copy phase.
>> - Stop unpinned pages dirty pages tracking. This is required to stop
>>    unpinned dirty pages tracking if migration failed or cancelled during
>>    pre-copy phase. Unpinned pages tracking is clear.
>> - Get dirty pages bitmap. Stop unpinned dirty pages tracking and clear
>>    unpinned pages information on bitmap read. This ioctl returns bitmap of
>>    dirty pages, its user space application responsibility to copy content
>>    of dirty pages from source to destination during migration.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   drivers/vfio/vfio_iommu_type1.c | 218 ++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 209 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index 2ada8e6cdb88..215aecb25453 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -70,6 +70,7 @@ struct vfio_iommu {
>>   	unsigned int		dma_avail;
>>   	bool			v2;
>>   	bool			nesting;
>> +	bool			dirty_page_tracking;
>>   };
>>   
>>   struct vfio_domain {
>> @@ -112,6 +113,7 @@ struct vfio_pfn {
>>   	dma_addr_t		iova;		/* Device address */
>>   	unsigned long		pfn;		/* Host pfn */
>>   	atomic_t		ref_count;
>> +	bool			unpinned;
> 
> Doesn't this duplicate ref_count == 0?
> 

Yes, actually. Removing unpinned.

>>   };
>>   
>>   struct vfio_regions {
>> @@ -244,6 +246,32 @@ static void vfio_remove_from_pfn_list(struct vfio_dma *dma,
>>   	kfree(vpfn);
>>   }
>>   
>> +static void vfio_remove_unpinned_from_pfn_list(struct vfio_dma *dma, bool warn)
>> +{
>> +	struct rb_node *n = rb_first(&dma->pfn_list);
>> +
>> +	for (; n; n = rb_next(n)) {
>> +		struct vfio_pfn *vpfn = rb_entry(n, struct vfio_pfn, node);
>> +
>> +		if (warn)
>> +			WARN_ON_ONCE(vpfn->unpinned);
> 
> This option isn't used within this patch, perhaps better to add with
> its use case, but it seems this presents both a denial of service via
> kernel tainting and an undocumented feature/bug.  As I interpret its
> use within the next patch, this generates a warning if the user
> unmapped the IOVA with dirty pages present, without using the dirty
> bitmap extension of the unmap call.  Our job is not to babysit the
> user, if they don't care to look at the dirty bitmap, that's their
> prerogative.  Drop this warning and the function arg.
> 

I was trying to be extra cautious. Dropping this warning.

>> +
>> +		if (vpfn->unpinned)
> 
> if (!atomic_read(&vpfn->ref_count))
> 

Ok.

>> +			vfio_remove_from_pfn_list(dma, vpfn);
>> +	}
>> +}
>> +
>> +static void vfio_remove_unpinned_from_dma_list(struct vfio_iommu *iommu)
>> +{
>> +	struct rb_node *n = rb_first(&iommu->dma_list);
>> +
>> +	for (; n; n = rb_next(n)) {
>> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
>> +
>> +		vfio_remove_unpinned_from_pfn_list(dma, false);
>> +	}
>> +}
>> +
>>   static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
>>   					       unsigned long iova)
>>   {
>> @@ -254,13 +282,17 @@ static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
>>   	return vpfn;
>>   }
>>   
>> -static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
>> +static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn,
>> +				  bool dirty_tracking)
>>   {
>>   	int ret = 0;
>>   
>>   	if (atomic_dec_and_test(&vpfn->ref_count)) {
>>   		ret = put_pfn(vpfn->pfn, dma->prot);
>> -		vfio_remove_from_pfn_list(dma, vpfn);
>> +		if (dirty_tracking)
>> +			vpfn->unpinned = true;
>> +		else
>> +			vfio_remove_from_pfn_list(dma, vpfn);
> 
> This can also simply use ref_count.  BTW, checking the locking, I think
> ->ref_count is only manipulated under iommu->lock, therefore the atomic
> ops are probably overkill.
> 

Yes, I'll create a separate commit to remove the atomic ops.

>>   	}
>>   	return ret;
>>   }
>> @@ -504,7 +536,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
>>   }
>>   
>>   static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
>> -				    bool do_accounting)
>> +				    bool do_accounting, bool dirty_tracking)
>>   {
>>   	int unlocked;
>>   	struct vfio_pfn *vpfn = vfio_find_vpfn(dma, iova);
>> @@ -512,7 +544,10 @@ static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
>>   	if (!vpfn)
>>   		return 0;
>>   
>> -	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn);
>> +	if (vpfn->unpinned)
>> +		return 0;
> 
> Combine with above, if (!vpfn || !vpfn->ref_count)
> 

Yes.

>> +
>> +	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn, dirty_tracking);
>>   
>>   	if (do_accounting)
>>   		vfio_lock_acct(dma, -unlocked, true);
>> @@ -571,8 +606,12 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>   
>>   		vpfn = vfio_iova_get_vfio_pfn(dma, iova);
>>   		if (vpfn) {
>> -			phys_pfn[i] = vpfn->pfn;
>> -			continue;
>> +			if (vpfn->unpinned)
>> +				vfio_remove_from_pfn_list(dma, vpfn);
> 
> This seems inefficient, we have an allocated vpfn at the right places
> in the list, wouldn't it be better to repin the page?
> 

vfio_pin_page_external() takes care of pinning and accounting as well.


>> +			else {
>> +				phys_pfn[i] = vpfn->pfn;
>> +				continue;
>> +			}
>>   		}
>>   
>>   		remote_vaddr = dma->vaddr + iova - dma->iova;
>> @@ -583,7 +622,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>   
>>   		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
>>   		if (ret) {
>> -			vfio_unpin_page_external(dma, iova, do_accounting);
>> +			vfio_unpin_page_external(dma, iova, do_accounting,
>> +						 false);
>>   			goto pin_unwind;
>>   		}
>>   	}
>> @@ -598,7 +638,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>   
>>   		iova = user_pfn[j] << PAGE_SHIFT;
>>   		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
>> -		vfio_unpin_page_external(dma, iova, do_accounting);
>> +		vfio_unpin_page_external(dma, iova, do_accounting, false);
>>   		phys_pfn[j] = 0;
>>   	}
>>   pin_done:
>> @@ -632,7 +672,8 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
>>   		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
>>   		if (!dma)
>>   			goto unpin_exit;
>> -		vfio_unpin_page_external(dma, iova, do_accounting);
>> +		vfio_unpin_page_external(dma, iova, do_accounting,
>> +					 iommu->dirty_page_tracking);
>>   	}
>>   
>>   unpin_exit:
>> @@ -850,6 +891,88 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
>>   	return bitmap;
>>   }
>>   
>> +/*
>> + * start_iova is the reference from where bitmaping started. This is called
>> + * from DMA_UNMAP where start_iova can be different than iova
>> + */
>> +
>> +static void vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
>> +				  size_t size, uint64_t pgsize,
>> +				  dma_addr_t start_iova, unsigned long *bitmap)
>> +{
>> +	struct vfio_dma *dma;
>> +	dma_addr_t i = iova;
>> +	unsigned long pgshift = __ffs(pgsize);
>> +
>> +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
>> +		/* mark all pages dirty if all pages are pinned and mapped. */
>> +		if (dma->iommu_mapped) {
>> +			dma_addr_t iova_limit;
>> +
>> +			iova_limit = (dma->iova + dma->size) < (iova + size) ?
>> +				     (dma->iova + dma->size) : (iova + size);
> 
> min(dma->iova + dma->size, iova + size);
> 
>> +
>> +			for (; i < iova_limit; i += pgsize) {
>> +				unsigned int start;
>> +
>> +				start = (i - start_iova) >> pgshift;
>> +
>> +				__bitmap_set(bitmap, start, 1);
> 
> Why __bitmap_set() rather than bitmap_set()?  Also why not try to take
> advantage of the number of bits arg?
> 
bitmap_set() can be used; I hadn't checked it earlier.
Yes, we can take advantage of nbits; updating it.


>> +			}
>> +			if (i >= iova + size)
>> +				return;
> 
> This skips the removed unpinned callback at the end of the loop,
> leaving unnecessarily tracked, unpinned vpfns.
> 

Right, fixing it.

>> +		} else {
>> +			struct rb_node *n = rb_first(&dma->pfn_list);
>> +			bool found = false;
>> +
>> +			for (; n; n = rb_next(n)) {
>> +				struct vfio_pfn *vpfn = rb_entry(n,
>> +							struct vfio_pfn, node);
>> +				if (vpfn->iova >= i) {
> 
> This doesn't look right, how does a vpfn with .iova > i necessarily
> contain i?
> 

i might not be equal to dma->iova.
Also, the page at iova == i might not be pinned, but there might be pages
pinned between i and iova + size, so find the first vpfn node whose
vpfn->iova >= i.

>> +					found = true;
>> +					break;
>> +				}
>> +			}
>> +
>> +			if (!found) {
>> +				i += dma->size;
>> +				continue;
>> +			}
>> +
>> +			for (; n; n = rb_next(n)) {
>> +				unsigned int start;
>> +				struct vfio_pfn *vpfn = rb_entry(n,
>> +							struct vfio_pfn, node);
>> +
>> +				if (vpfn->iova >= iova + size)
>> +					return;
>> +
>> +				start = (vpfn->iova - start_iova) >> pgshift;
>> +
>> +				__bitmap_set(bitmap, start, 1);
> 
> Don't we need to iterate over the vfpn relative to pgsize?  Oh, I
> see below that pgsize is the minimum user advertised size, which is at
> least PAGE_SIZE, to maybe not.  Same bitmap_set() question as above
> though.
> 

Changing to bitmap_set.

>> +
>> +				i = vpfn->iova + pgsize;
>> +			}
>> +		}
>> +		vfio_remove_unpinned_from_pfn_list(dma, false);
>> +	}
>> +}
>> +
>> +static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
>> +{
>> +	long bsize;
>> +
>> +	if (!bitmap_size || bitmap_size > SIZE_MAX)
>> +		return -EINVAL;
>> +
>> +	bsize = ALIGN(npages, BITS_PER_LONG) / sizeof(unsigned long);
>> +
>> +	if (bitmap_size < bsize)
>> +		return -EINVAL;
>> +
>> +	return bsize;
>> +}
>> +
>>   static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>   			     struct vfio_iommu_type1_dma_unmap *unmap)
>>   {
>> @@ -2297,6 +2420,83 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>>   
>>   		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>>   			-EFAULT : 0;
>> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
>> +		struct vfio_iommu_type1_dirty_bitmap range;
>> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
>> +		int ret;
>> +
>> +		if (!iommu->v2)
>> +			return -EACCES;
>> +
>> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
>> +				    bitmap);
>> +
>> +		if (copy_from_user(&range, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (range.argsz < minsz || range.flags & ~mask)
>> +			return -EINVAL;
>> +
> 
> flags should be sanitized further, invalid combinations should be
> rejected.  For example, if a user provides STOP|GET_BITMAP it should
> either populate the bitmap AND turn off tracking, or error.  It's not
> acceptable to turn off tracking and silently ignore GET_BITMAP.
> 

OK, adding a check so that only one flag may be set at a time.

>> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
>> +			iommu->dirty_page_tracking = true;
>> +			return 0;
>> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
>> +			iommu->dirty_page_tracking = false;
>> +
>> +			mutex_lock(&iommu->lock);
>> +			vfio_remove_unpinned_from_dma_list(iommu);
>> +			mutex_unlock(&iommu->lock);
>> +			return 0;
>> +
>> +		} else if (range.flags &
>> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
>> +			uint64_t iommu_pgmask;
>> +			unsigned long pgshift = __ffs(range.pgsize);
>> +			unsigned long *bitmap;
>> +			long bsize;
>> +
>> +			iommu_pgmask =
>> +			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
>> +
>> +			if (((range.pgsize - 1) & iommu_pgmask) !=
>> +			    (range.pgsize - 1))
>> +				return -EINVAL;
>> +
>> +			if (range.iova & iommu_pgmask)
>> +				return -EINVAL;
>> +			if (!range.size || range.size > SIZE_MAX)
>> +				return -EINVAL;
>> +			if (range.iova + range.size < range.iova)
>> +				return -EINVAL;
>> +
>> +			bsize = verify_bitmap_size(range.size >> pgshift,
>> +						   range.bitmap_size);
>> +			if (bsize < 0)
>> +				return ret;
>> +
>> +			bitmap = kmalloc(bsize, GFP_KERNEL);
> 
> I think I remember mentioning previously that we cannot allocate an
> arbitrary buffer on behalf of the user, it's far too easy for them to
> kill the kernel that way.  kmalloc is also limited in what it can
> alloc.  

That's the reason I added verify_bitmap_size(), so that the size is verified.

> Can't we use the user buffer directly or only work on a part of
> it at a time?
> 

without copy_from_user(), how?


>> +			if (!bitmap)
>> +				return -ENOMEM;
>> +
>> +			ret = copy_from_user(bitmap,
>> +			     (void __user *)range.bitmap, bsize) ? -EFAULT : 0;
>> +			if (ret)
>> +				goto bitmap_exit;
>> +
>> +			iommu->dirty_page_tracking = false;
> 
> a) This is done outside of the mutex and susceptible to races,

moving inside lock

> b) why is this done?
Once the bitmap is read, dirty_page_tracking should be stopped, right?

Thanks,
Kirti

> 
> Thanks,
> Alex
> 
>> +			mutex_lock(&iommu->lock);
>> +			vfio_iova_dirty_bitmap(iommu, range.iova, range.size,
>> +					     range.pgsize, range.iova, bitmap);
>> +			mutex_unlock(&iommu->lock);
>> +
>> +			ret = copy_to_user((void __user *)range.bitmap, bitmap,
>> +					   range.bitmap_size) ? -EFAULT : 0;
>> +bitmap_exit:
>> +			kfree(bitmap);
>> +			return ret;
>> +		}
>>   	}
>>   
>>   	return -ENOTTY;
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v11 Kernel 6/6] vfio: Selective dirty page tracking if IOMMU backed device pins pages
  2019-12-18  0:12   ` Alex Williamson
@ 2020-01-07 20:45     ` Kirti Wankhede
  2020-01-08  0:09       ` Alex Williamson
  0 siblings, 1 reply; 20+ messages in thread
From: Kirti Wankhede @ 2020-01-07 20:45 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm



On 12/18/2019 5:42 AM, Alex Williamson wrote:
> On Tue, 17 Dec 2019 22:40:51 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 

<snip>

> 
> This will fail when there are devices within the IOMMU group that are
> not represented as vfio_devices.  My original suggestion was:
> 
> On Thu, 14 Nov 2019 14:06:25 -0700
> Alex Williamson <alex.williamson@redhat.com> wrote:
>> I think it does so by pinning pages.  Is it acceptable that if the
>> vendor driver pins any pages, then from that point forward we consider
>> the IOMMU group dirty page scope to be limited to pinned pages?  There
>> are complications around non-singleton IOMMU groups, but I think we're
>> already leaning towards that being a non-worthwhile problem to solve.
>> So if we require that only singleton IOMMU groups can pin pages and we
> 
> We could tag vfio_groups as singleton at vfio_add_group_dev() time with
> an iommu_group_for_each_dev() walk so that we can cache the value on
> the struct vfio_group. 

I don't think iommu_group_for_each_dev() is required. Checking
group->device_list in vfio_add_group_dev() to see whether there is more
than one device should work, right?

         list_for_each_entry(vdev, &group->device_list, group_next) {
                 if (group->is_singleton) {
                         group->is_singleton = false;
                         break;
                 } else {
                         group->is_singleton = true;
                 }
         }


> vfio_group_nb_add_dev() could update this if
> the IOMMU group composition changes. 

I don't see vfio_group_nb_add_dev() calling vfio_add_group_dev() (?).
If the is_singleton check is taken care of in vfio_group_nb_add_dev(), which
is the only place where vfio_group is allocated, that should work, I think.


> vfio_pin_pages() could return
> -EINVAL if (!group->is_singleton).
> 
>> pass the IOMMU group as a parameter to
>> vfio_iommu_driver_ops.pin_pages(), then the type1 backend can set a
>> flag on its local vfio_group struct to indicate dirty page scope is
>> limited to pinned pages.
> 
> ie. vfio_iommu_type1_unpin_pages() calls find_iommu_group() on each
> domain in domain_list and the external_domain using the struct
> iommu_group pointer provided by vfio-core.  We set a new attribute on
> the vfio_group to indicate that vfio_group has (at some point) pinned
> pages.
> 
>>   We might want to keep a flag on the
>> vfio_iommu struct to indicate if all of the vfio_groups for each
>> vfio_domain in the vfio_iommu.domain_list dirty page scope limited to
>> pinned pages as an optimization to avoid walking lists too often.  Then
>> we could test if vfio_iommu.domain_list is not empty and this new flag
>> does not limit the dirty page scope, then everything within each
>> vfio_dma is considered dirty.
> 
> So at the point where we change vfio_group.has_pinned_pages from false
> to true, or a group is added or removed, we walk all the groups in the
> vfio_iommu and if they all have has_pinned_pages set, we can set a
> vfio_iommu.pinned_page_dirty_scope flag to true.  If that flag is
> already true on page pinning, we can skip the lookup.
> 
> I still like this approach better, it doesn't require a callback from
> type1 to vfio-core and it doesn't require a heavy weight walking for
> group devices and vfio data structures every time we fill a bitmap.
> Did you run into issues trying to implement this approach? 

Thanks for the elaborate steps.
This works; I'm changing the last commit accordingly.

Thanks,
Kirti


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v11 Kernel 3/6] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2020-01-07 20:07     ` Kirti Wankhede
@ 2020-01-07 22:02       ` Alex Williamson
  2020-01-08 20:01         ` Kirti Wankhede
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Williamson @ 2020-01-07 22:02 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Wed, 8 Jan 2020 01:37:03 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 12/18/2019 3:42 AM, Alex Williamson wrote:
> > On Tue, 17 Dec 2019 22:40:48 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
> >> - Start unpinned pages dirty pages tracking while migration is active and
> >>    device is running, i.e. during pre-copy phase.
> >> - Stop unpinned pages dirty pages tracking. This is required to stop
> >>    unpinned dirty pages tracking if migration failed or cancelled during
> >>    pre-copy phase. Unpinned pages tracking is clear.
> >> - Get dirty pages bitmap. Stop unpinned dirty pages tracking and clear
> >>    unpinned pages information on bitmap read. This ioctl returns bitmap of
> >>    dirty pages, its user space application responsibility to copy content
> >>    of dirty pages from source to destination during migration.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>   drivers/vfio/vfio_iommu_type1.c | 218 ++++++++++++++++++++++++++++++++++++++--
> >>   1 file changed, 209 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >> index 2ada8e6cdb88..215aecb25453 100644
> >> --- a/drivers/vfio/vfio_iommu_type1.c
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >> @@ -70,6 +70,7 @@ struct vfio_iommu {
> >>   	unsigned int		dma_avail;
> >>   	bool			v2;
> >>   	bool			nesting;
> >> +	bool			dirty_page_tracking;
> >>   };
> >>   
> >>   struct vfio_domain {
> >> @@ -112,6 +113,7 @@ struct vfio_pfn {
> >>   	dma_addr_t		iova;		/* Device address */
> >>   	unsigned long		pfn;		/* Host pfn */
> >>   	atomic_t		ref_count;
> >> +	bool			unpinned;  
> > 
> > Doesn't this duplicate ref_count == 0?
> >   
> 
> Yes, actually. Removing unpinned.
> 
> >>   };
> >>   
> >>   struct vfio_regions {
> >> @@ -244,6 +246,32 @@ static void vfio_remove_from_pfn_list(struct vfio_dma *dma,
> >>   	kfree(vpfn);
> >>   }
> >>   
> >> +static void vfio_remove_unpinned_from_pfn_list(struct vfio_dma *dma, bool warn)
> >> +{
> >> +	struct rb_node *n = rb_first(&dma->pfn_list);
> >> +
> >> +	for (; n; n = rb_next(n)) {
> >> +		struct vfio_pfn *vpfn = rb_entry(n, struct vfio_pfn, node);
> >> +
> >> +		if (warn)
> >> +			WARN_ON_ONCE(vpfn->unpinned);  
> > 
> > This option isn't used within this patch, perhaps better to add with
> > its use case, but it seems this presents both a denial of service via
> > kernel tainting and an undocumented feature/bug.  As I interpret its
> > use within the next patch, this generates a warning if the user
> > unmapped the IOVA with dirty pages present, without using the dirty
> > bitmap extension of the unmap call.  Our job is not to babysit the
> > user, if they don't care to look at the dirty bitmap, that's their
> > prerogative.  Drop this warning and the function arg.
> >   
> 
> I was trying to be extra cautious. Dropping this warning.
> 
> >> +
> >> +		if (vpfn->unpinned)  
> > 
> > if (!atomic_read(&vpfn->ref_count))
> >   
> 
> Ok.
> 
> >> +			vfio_remove_from_pfn_list(dma, vpfn);
> >> +	}
> >> +}
> >> +
> >> +static void vfio_remove_unpinned_from_dma_list(struct vfio_iommu *iommu)
> >> +{
> >> +	struct rb_node *n = rb_first(&iommu->dma_list);
> >> +
> >> +	for (; n; n = rb_next(n)) {
> >> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> >> +
> >> +		vfio_remove_unpinned_from_pfn_list(dma, false);
> >> +	}
> >> +}
> >> +
> >>   static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
> >>   					       unsigned long iova)
> >>   {
> >> @@ -254,13 +282,17 @@ static struct vfio_pfn *vfio_iova_get_vfio_pfn(struct vfio_dma *dma,
> >>   	return vpfn;
> >>   }
> >>   
> >> -static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
> >> +static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn,
> >> +				  bool dirty_tracking)
> >>   {
> >>   	int ret = 0;
> >>   
> >>   	if (atomic_dec_and_test(&vpfn->ref_count)) {
> >>   		ret = put_pfn(vpfn->pfn, dma->prot);
> >> -		vfio_remove_from_pfn_list(dma, vpfn);
> >> +		if (dirty_tracking)
> >> +			vpfn->unpinned = true;
> >> +		else
> >> +			vfio_remove_from_pfn_list(dma, vpfn);  
> > 
> > This can also simply use ref_count.  BTW, checking the locking, I think  
> > ->ref_count is only manipulated under iommu->lock, therefore the atomic  
> > ops are probably overkill.
> >   
> 
> Yes, I'll create a separate commit to remove the atomic ops.
> 
> >>   	}
> >>   	return ret;
> >>   }
> >> @@ -504,7 +536,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
> >>   }
> >>   
> >>   static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
> >> -				    bool do_accounting)
> >> +				    bool do_accounting, bool dirty_tracking)
> >>   {
> >>   	int unlocked;
> >>   	struct vfio_pfn *vpfn = vfio_find_vpfn(dma, iova);
> >> @@ -512,7 +544,10 @@ static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
> >>   	if (!vpfn)
> >>   		return 0;
> >>   
> >> -	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn);
> >> +	if (vpfn->unpinned)
> >> +		return 0;  
> > 
> > Combine with above, if (!vpfn || !vpfn->ref_count)
> >   
> 
> Yes.
> 
> >> +
> >> +	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn, dirty_tracking);
> >>   
> >>   	if (do_accounting)
> >>   		vfio_lock_acct(dma, -unlocked, true);
> >> @@ -571,8 +606,12 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>   
> >>   		vpfn = vfio_iova_get_vfio_pfn(dma, iova);
> >>   		if (vpfn) {
> >> -			phys_pfn[i] = vpfn->pfn;
> >> -			continue;
> >> +			if (vpfn->unpinned)
> >> +				vfio_remove_from_pfn_list(dma, vpfn);  
> > 
> > This seems inefficient, we have an allocated vpfn at the right places
> > in the list, wouldn't it be better to repin the page?
> >   
> 
> vfio_pin_page_external() takes care of pinning and accounting as well.

Yes, but could we call vfio_pin_page_external() without {unlinking,
freeing} and {re-allocating, linking} on either side of it since it's
already in the list?  That's the inefficient part.  Maybe at least a
TODO comment?

> >> +			else {
> >> +				phys_pfn[i] = vpfn->pfn;
> >> +				continue;
> >> +			}
> >>   		}
> >>   
> >>   		remote_vaddr = dma->vaddr + iova - dma->iova;
> >> @@ -583,7 +622,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>   
> >>   		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> >>   		if (ret) {
> >> -			vfio_unpin_page_external(dma, iova, do_accounting);
> >> +			vfio_unpin_page_external(dma, iova, do_accounting,
> >> +						 false);
> >>   			goto pin_unwind;
> >>   		}
> >>   	}
> >> @@ -598,7 +638,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>   
> >>   		iova = user_pfn[j] << PAGE_SHIFT;
> >>   		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
> >> -		vfio_unpin_page_external(dma, iova, do_accounting);
> >> +		vfio_unpin_page_external(dma, iova, do_accounting, false);
> >>   		phys_pfn[j] = 0;
> >>   	}
> >>   pin_done:
> >> @@ -632,7 +672,8 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
> >>   		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
> >>   		if (!dma)
> >>   			goto unpin_exit;
> >> -		vfio_unpin_page_external(dma, iova, do_accounting);
> >> +		vfio_unpin_page_external(dma, iova, do_accounting,
> >> +					 iommu->dirty_page_tracking);
> >>   	}
> >>   
> >>   unpin_exit:
> >> @@ -850,6 +891,88 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
> >>   	return bitmap;
> >>   }
> >>   
> >> +/*
> >> + * start_iova is the reference from where bitmaping started. This is called
> >> + * from DMA_UNMAP where start_iova can be different than iova
> >> + */
> >> +
> >> +static void vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> >> +				  size_t size, uint64_t pgsize,
> >> +				  dma_addr_t start_iova, unsigned long *bitmap)
> >> +{
> >> +	struct vfio_dma *dma;
> >> +	dma_addr_t i = iova;
> >> +	unsigned long pgshift = __ffs(pgsize);
> >> +
> >> +	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
> >> +		/* mark all pages dirty if all pages are pinned and mapped. */
> >> +		if (dma->iommu_mapped) {
> >> +			dma_addr_t iova_limit;
> >> +
> >> +			iova_limit = (dma->iova + dma->size) < (iova + size) ?
> >> +				     (dma->iova + dma->size) : (iova + size);  
> > 
> > min(dma->iova + dma->size, iova + size);
> >   
> >> +
> >> +			for (; i < iova_limit; i += pgsize) {
> >> +				unsigned int start;
> >> +
> >> +				start = (i - start_iova) >> pgshift;
> >> +
> >> +				__bitmap_set(bitmap, start, 1);  
> > 
> > Why __bitmap_set() rather than bitmap_set()?  Also why not try to take
> > advantage of the number of bits arg?
> >   
> bitmap_set() can be used; I hadn't checked it earlier.
> Yes, we can take advantage of nbits; updating it.
> 
> 
> >> +			}
> >> +			if (i >= iova + size)
> >> +				return;  
> > 
> > This skips the removed unpinned callback at the end of the loop,
> > leaving unnecessarily tracked, unpinned vpfns.
> >   
> 
> Right, fixing it.
> 
> >> +		} else {
> >> +			struct rb_node *n = rb_first(&dma->pfn_list);
> >> +			bool found = false;
> >> +
> >> +			for (; n; n = rb_next(n)) {
> >> +				struct vfio_pfn *vpfn = rb_entry(n,
> >> +							struct vfio_pfn, node);
> >> +				if (vpfn->iova >= i) {  
> > 
> > This doesn't look right, how does a vpfn with .iova > i necessarily
> > contain i?
> >   
> 
> i might not be equal to dma->iova.
> Also, the page at iova == i might not be pinned, but there might be pages
> pinned between i and iova + size, so find the first vpfn node whose
> vpfn->iova >= i.

Ok, I think my concern is resolved when it falls through to the test
below where we make sure the vpfn isn't past the end of the range.

> >> +					found = true;
> >> +					break;
> >> +				}
> >> +			}
> >> +
> >> +			if (!found) {
> >> +				i += dma->size;
> >> +				continue;
> >> +			}
> >> +
> >> +			for (; n; n = rb_next(n)) {
> >> +				unsigned int start;
> >> +				struct vfio_pfn *vpfn = rb_entry(n,
> >> +							struct vfio_pfn, node);
> >> +
> >> +				if (vpfn->iova >= iova + size)
> >> +					return;
> >> +
> >> +				start = (vpfn->iova - start_iova) >> pgshift;
> >> +
> >> +				__bitmap_set(bitmap, start, 1);  
> > 
> > Don't we need to iterate over the vfpn relative to pgsize?  Oh, I
> > see below that pgsize is the minimum user advertised size, which is at
> > least PAGE_SIZE, to maybe not.  Same bitmap_set() question as above
> > though.
> >   
> 
> Changing to bitmap_set.
> 
> >> +
> >> +				i = vpfn->iova + pgsize;
> >> +			}
> >> +		}
> >> +		vfio_remove_unpinned_from_pfn_list(dma, false);
> >> +	}
> >> +}
> >> +
> >> +static long verify_bitmap_size(unsigned long npages, unsigned long bitmap_size)
> >> +{
> >> +	long bsize;
> >> +
> >> +	if (!bitmap_size || bitmap_size > SIZE_MAX)
> >> +		return -EINVAL;
> >> +
> >> +	bsize = ALIGN(npages, BITS_PER_LONG) / sizeof(unsigned long);
> >> +
> >> +	if (bitmap_size < bsize)
> >> +		return -EINVAL;
> >> +
> >> +	return bsize;
> >> +}
> >> +
> >>   static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> >>   			     struct vfio_iommu_type1_dma_unmap *unmap)
> >>   {
> >> @@ -2297,6 +2420,83 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >>   
> >>   		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> >>   			-EFAULT : 0;
> >> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
> >> +		struct vfio_iommu_type1_dirty_bitmap range;
> >> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
> >> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
> >> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> >> +		int ret;
> >> +
> >> +		if (!iommu->v2)
> >> +			return -EACCES;
> >> +
> >> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
> >> +				    bitmap);
> >> +
> >> +		if (copy_from_user(&range, (void __user *)arg, minsz))
> >> +			return -EFAULT;
> >> +
> >> +		if (range.argsz < minsz || range.flags & ~mask)
> >> +			return -EINVAL;
> >> +  
> > 
> > flags should be sanitized further, invalid combinations should be
> > rejected.  For example, if a user provides STOP|GET_BITMAP it should
> > either populate the bitmap AND turn off tracking, or error.  It's not
> > acceptable to turn off tracking and silently ignore GET_BITMAP.
> >   
> 
> OK, adding a check so that only one flag may be set at a time.
> 
> >> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
> >> +			iommu->dirty_page_tracking = true;
> >> +			return 0;
> >> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
> >> +			iommu->dirty_page_tracking = false;
> >> +
> >> +			mutex_lock(&iommu->lock);
> >> +			vfio_remove_unpinned_from_dma_list(iommu);
> >> +			mutex_unlock(&iommu->lock);
> >> +			return 0;
> >> +
> >> +		} else if (range.flags &
> >> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
> >> +			uint64_t iommu_pgmask;
> >> +			unsigned long pgshift = __ffs(range.pgsize);
> >> +			unsigned long *bitmap;
> >> +			long bsize;
> >> +
> >> +			iommu_pgmask =
> >> +			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> >> +
> >> +			if (((range.pgsize - 1) & iommu_pgmask) !=
> >> +			    (range.pgsize - 1))
> >> +				return -EINVAL;
> >> +
> >> +			if (range.iova & iommu_pgmask)
> >> +				return -EINVAL;
> >> +			if (!range.size || range.size > SIZE_MAX)
> >> +				return -EINVAL;
> >> +			if (range.iova + range.size < range.iova)
> >> +				return -EINVAL;
> >> +
> >> +			bsize = verify_bitmap_size(range.size >> pgshift,
> >> +						   range.bitmap_size);
> >> +			if (bsize < 0)
> >> +				return ret;
> >> +
> >> +			bitmap = kmalloc(bsize, GFP_KERNEL);  
> > 
> > I think I remember mentioning previously that we cannot allocate an
> > arbitrary buffer on behalf of the user, it's far too easy for them to
> > kill the kernel that way.  kmalloc is also limited in what it can
> > alloc.    
> 
> That's the reason I added verify_bitmap_size(), so that the size is verified.

That's only a consistency test; it only verifies that the user claims
to provide a bitmap sized sufficiently for the range they're trying to
request.  range.size is limited to SIZE_MAX, so 2^64 bytes, divided by
page size for 2^52 bits, divided by 8 bits per byte for 2^49 bytes
(512TB) of bitmap that we'd try to kmalloc.  kmalloc is good for a
couple MB AIUI.  Meanwhile the user doesn't actually need to allocate
that bitmap in order to crash the kernel.

> > Can't we use the user buffer directly or only work on a part of
> > it at a time?
> >   
> 
> without copy_from_user(), how?

For starters, what's the benefit of copying the bitmap from the user
in the first place?  We presume the data is zero'd and if it's not,
that's the user's bug to sort out (we just need to define the API to
specify that).  Therefore the copy_from_user() is unnecessary anyway and
we really just need to copy_to_user() for any places we're setting
bits.  We could just walk through the range with an unsigned long
bitmap buffer, writing it out to userspace any time we reach the end
with bits set, zeroing it and shifting it as a window to the user
buffer.  We could improve batching by allocating a larger buffer in the
kernel, with a kernel defined maximum size and performing the same
windowing scheme.

I don't know if there's a way to directly map the user buffer rather
than copy_to_user(), but I thought I'd ask in case there's some obvious
efficiency improvement to be had there.
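
In its simplest, unbatched form, working directly on the user buffer could
look like the sketch below (the helper name is made up); the batching and
windowing described above would build on the same idea:

	static int vfio_user_bitmap_set(unsigned long __user *ubitmap,
					unsigned long nr)
	{
		unsigned long __user *p = ubitmap + nr / BITS_PER_LONG;
		unsigned long val;

		/* read-modify-write one word so bits set earlier are preserved */
		if (get_user(val, p))
			return -EFAULT;
		val |= 1UL << (nr % BITS_PER_LONG);
		return put_user(val, p);
	}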

> >> +			if (!bitmap)
> >> +				return -ENOMEM;
> >> +
> >> +			ret = copy_from_user(bitmap,
> >> +			     (void __user *)range.bitmap, bsize) ? -EFAULT : 0;
> >> +			if (ret)
> >> +				goto bitmap_exit;
> >> +
> >> +			iommu->dirty_page_tracking = false;  
> > 
> > a) This is done outside of the mutex and susceptible to races,  
> 
> moving inside lock
> 
> > b) why is this done?  
> Once the bitmap is read, dirty_page_tracking should be stopped, right?

Absolutely not.  Dirty bit page tracking should remain enabled until
the user turns it off, doing otherwise precludes both iterative and
partial dirty page collection from userspace.  I think both of those
are fundamentally required of this interface.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v11 Kernel 6/6] vfio: Selective dirty page tracking if IOMMU backed device pins pages
  2020-01-07 20:45     ` Kirti Wankhede
@ 2020-01-08  0:09       ` Alex Williamson
  2020-01-08 20:52         ` Kirti Wankhede
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Williamson @ 2020-01-08  0:09 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Wed, 8 Jan 2020 02:15:01 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 12/18/2019 5:42 AM, Alex Williamson wrote:
> > On Tue, 17 Dec 2019 22:40:51 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> 
> <snip>
> 
> > 
> > This will fail when there are devices within the IOMMU group that are
> > not represented as vfio_devices.  My original suggestion was:
> > 
> > On Thu, 14 Nov 2019 14:06:25 -0700
> > Alex Williamson <alex.williamson@redhat.com> wrote:  
> >> I think it does so by pinning pages.  Is it acceptable that if the
> >> vendor driver pins any pages, then from that point forward we consider
> >> the IOMMU group dirty page scope to be limited to pinned pages?  There
> >> are complications around non-singleton IOMMU groups, but I think we're
> >> already leaning towards that being a non-worthwhile problem to solve.
> >> So if we require that only singleton IOMMU groups can pin pages and we  
> > 
> > We could tag vfio_groups as singleton at vfio_add_group_dev() time with
> > an iommu_group_for_each_dev() walk so that we can cache the value on
> > the struct vfio_group.   
> 
> I don't think iommu_group_for_each_dev() is required. Checking
> group->device_list in vfio_add_group_dev() to see whether there is more
> than one device should work, right?
> 
>          list_for_each_entry(vdev, &group->device_list, group_next) {
>                  if (group->is_singleton) {
>                          group->is_singleton = false;
>                          break;
>                  } else {
>                          group->is_singleton = true;
>                  }
>          }

Hmm, I think you're taking a different approach to this than I was
thinking.  Re-reading my previous comments, the fact that vfio.c and
vfio_iommu_type1.c each have their own private struct vfio_group
makes things rather unclear.  I was intending to use the struct
iommu_group as the object vfio.c provides to type1.c to associate the
pinning.  This would require that not only the vfio view of devices in
the group to be singleton, but also the actual iommu group to be
singleton.  Otherwise the set of devices vfio.c has in the group might
only be a subset of the group.  Maybe a merger of the approaches is
easier though.

Tracking whether the vfio.c view of a group is singleton is even easier
than above, we could simply add a device_count field to vfio_group,
increment it in vfio_group_create_device() and decrement it in
vfio_device_release().  vfio_pin_pages() could return error if
device_count is not 1.  We could still add the iommu_group pointer to
the type1 pin_pages callback, but perhaps type1 simply assumes that the
group is singleton when pin pages is called and it's vfio.c's
responsibility to maintain that group as singleton once pages have been
pinned.  vfio.c would therefore also need to set a field on the
vfio_group if pages have been pinned such that vfio_add_group_dev()
could return error if a new device attempts to join the group.  We'd
need to make sure that field is cleared when the group is released from
use and pay attention to races that might occur between adding devices
to a group and pinning pages.
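
A rough sketch of what that could look like (names below are made up for
illustration, this is not actual vfio code, and locking is elided):

    struct vfio_group {
            /* ... existing fields ... */
            int     device_count;           /* ++ in vfio_group_create_device(),
                                             * -- in vfio_device_release() */
            bool    has_pinned_pages;       /* set once pages are pinned */
    };

    /* in vfio_pin_pages(): only a singleton (vfio view) group may pin */
    if (group->device_count != 1)
            return -EINVAL;
    group->has_pinned_pages = true;

    /* in vfio_add_group_dev(): don't let a group grow once it has pinned pages */
    if (group->has_pinned_pages)
            return -EBUSY;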

> > vfio_group_nb_add_dev() could update this if
> > the IOMMU group composition changes.   
> 
> I don't see vfio_group_nb_add_dev() calls vfio_add_group_dev() (?)
> If checking is_singleton is taken care in vfio_group_nb_add_dev(), which 
> is the only place where vfio_group is allocated, that should work, I think.

This was relative to maintaining that the iommu group itself is
singleton, not just the vfio view of the group.  If we use the latter
as our basis, then you're right, we shouldn't need this, but vfio.c would
need to enforce that the group remains singleton if it has pinned
pages.  Does that make sense?  Thanks,

Alex

> > vfio_pin_pages() could return
> > -EINVAL if (!group->is_singleton).
> >   
> >> pass the IOMMU group as a parameter to
> >> vfio_iommu_driver_ops.pin_pages(), then the type1 backend can set a
> >> flag on its local vfio_group struct to indicate dirty page scope is
> >> limited to pinned pages.  
> > 
> > ie. vfio_iommu_type1_unpin_pages() calls find_iommu_group() on each
> > domain in domain_list and the external_domain using the struct
> > iommu_group pointer provided by vfio-core.  We set a new attribute on
> > the vfio_group to indicate that vfio_group has (at some point) pinned
> > pages.
> >   
> >>   We might want to keep a flag on the
> >> vfio_iommu struct to indicate if all of the vfio_groups for each
> >> vfio_domain in the vfio_iommu.domain_list dirty page scope limited to
> >> pinned pages as an optimization to avoid walking lists too often.  Then
> >> we could test if vfio_iommu.domain_list is not empty and this new flag
> >> does not limit the dirty page scope, then everything within each
> >> vfio_dma is considered dirty.  
> > 
> > So at the point where we change vfio_group.has_pinned_pages from false
> > to true, or a group is added or removed, we walk all the groups in the
> > vfio_iommu and if they all have has_pinned_pages set, we can set a
> > vfio_iommu.pinned_page_dirty_scope flag to true.  If that flag is
> > already true on page pinning, we can skip the lookup.
> > 
> > I still like this approach better, it doesn't require a callback from
> > type1 to vfio-core and it doesn't require a heavy weight walking for
> > group devices and vfio data structures every time we fill a bitmap.
> > Did you run into issues trying to implement this approach?   
> 
> Thanks for elaborative steps.
> This works. Changing this last commit.
> 
> Thanks,
> Kirti
> 



* Re: [PATCH v11 Kernel 3/6] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2020-01-07 22:02       ` Alex Williamson
@ 2020-01-08 20:01         ` Kirti Wankhede
  2020-01-08 22:29           ` Alex Williamson
  0 siblings, 1 reply; 20+ messages in thread
From: Kirti Wankhede @ 2020-01-08 20:01 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm



On 1/8/2020 3:32 AM, Alex Williamson wrote:
> On Wed, 8 Jan 2020 01:37:03 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 

<snip>

>>>> +
>>>> +	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn, dirty_tracking);
>>>>    
>>>>    	if (do_accounting)
>>>>    		vfio_lock_acct(dma, -unlocked, true);
>>>> @@ -571,8 +606,12 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>>>    
>>>>    		vpfn = vfio_iova_get_vfio_pfn(dma, iova);
>>>>    		if (vpfn) {
>>>> -			phys_pfn[i] = vpfn->pfn;
>>>> -			continue;
>>>> +			if (vpfn->unpinned)
>>>> +				vfio_remove_from_pfn_list(dma, vpfn);
>>>
>>> This seems inefficient, we have an allocated vpfn at the right places
>>> in the list, wouldn't it be better to repin the page?
>>>    
>>
>> vfio_pin_page_external() takes care of pinning and accounting as well.
> 
> Yes, but could we call vfio_pin_page_external() without {unlinking,
> freeing} and {re-allocating, linking} on either side of it since it's
> already in the list?  That's the inefficient part.  Maybe at least a
> TODO comment?
> 

Changing it as below:

                 vpfn = vfio_iova_get_vfio_pfn(dma, iova);
                 if (vpfn) {
-                       phys_pfn[i] = vpfn->pfn;
-                       continue;
+                       if (vpfn->ref_count > 1) {
+                               phys_pfn[i] = vpfn->pfn;
+                               continue;
+                       }
                 }

                 remote_vaddr = dma->vaddr + iova - dma->iova;
                 ret = vfio_pin_page_external(dma, remote_vaddr, &phys_pfn[i],
                                              do_accounting);
                 if (ret)
                         goto pin_unwind;
-
-               ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
-               if (ret) {
-                       vfio_unpin_page_external(dma, iova, do_accounting);
-                       goto pin_unwind;
-               }
+               if (!vpfn) {
+                       ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
+                       if (ret) {
+                               vfio_unpin_page_external(dma, iova,
+                                                        do_accounting, false);
+                               goto pin_unwind;
+                       }
+               } else
+                       vpfn->pfn = phys_pfn[i];
         }




>>>> +			else {
>>>> +				phys_pfn[i] = vpfn->pfn;
>>>> +				continue;
>>>> +			}
>>>>    		}
>>>>    
>>>>    		remote_vaddr = dma->vaddr + iova - dma->iova;
>>>> @@ -583,7 +622,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>>>    
>>>>    		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
>>>>    		if (ret) {
>>>> -			vfio_unpin_page_external(dma, iova, do_accounting);
>>>> +			vfio_unpin_page_external(dma, iova, do_accounting,
>>>> +						 false);
>>>>    			goto pin_unwind;
>>>>    		}
>>>>    	}

<snip>

>>
>>>> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
>>>> +			iommu->dirty_page_tracking = true;
>>>> +			return 0;
>>>> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
>>>> +			iommu->dirty_page_tracking = false;
>>>> +
>>>> +			mutex_lock(&iommu->lock);
>>>> +			vfio_remove_unpinned_from_dma_list(iommu);
>>>> +			mutex_unlock(&iommu->lock);
>>>> +			return 0;
>>>> +
>>>> +		} else if (range.flags &
>>>> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
>>>> +			uint64_t iommu_pgmask;
>>>> +			unsigned long pgshift = __ffs(range.pgsize);
>>>> +			unsigned long *bitmap;
>>>> +			long bsize;
>>>> +
>>>> +			iommu_pgmask =
>>>> +			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
>>>> +
>>>> +			if (((range.pgsize - 1) & iommu_pgmask) !=
>>>> +			    (range.pgsize - 1))
>>>> +				return -EINVAL;
>>>> +
>>>> +			if (range.iova & iommu_pgmask)
>>>> +				return -EINVAL;
>>>> +			if (!range.size || range.size > SIZE_MAX)
>>>> +				return -EINVAL;
>>>> +			if (range.iova + range.size < range.iova)
>>>> +				return -EINVAL;
>>>> +
>>>> +			bsize = verify_bitmap_size(range.size >> pgshift,
>>>> +						   range.bitmap_size);
>>>> +			if (bsize < 0)
>>>> +				return ret;
>>>> +
>>>> +			bitmap = kmalloc(bsize, GFP_KERNEL);
>>>
>>> I think I remember mentioning previously that we cannot allocate an
>>> arbitrary buffer on behalf of the user, it's far too easy for them to
>>> kill the kernel that way.  kmalloc is also limited in what it can
>>> alloc.
>>
>> That's the reason added verify_bitmap_size(), so that size is verified
> 
> That's only a consistency test, it only verifies that the user claims
> to provide a bitmap sized sufficiently for the range they're trying to
> request.  range.size is limited to SIZE_MAX, so 2^64, divided by page
> size for 2^52 bits, 8bits per byte for 2^49 bytes of bitmap that we'd
> try to kmalloc (512TB).  kmalloc is good for a couple MB AIUI.
> Meanwhile the user doesn't actually need to allocate that bitmap in
> order to crash the kernel.
> 
>>> Can't we use the user buffer directly or only work on a part of
>>> it at a time?
>>>    
>>
>> without copy_from_user(), how?
> 
> For starters, what's the benefit of copying the bitmap from the user
> in the first place?  We presume the data is zero'd and if it's not,
> that's the user's bug to sort out (we just need to define the API to
> specify that).  Therefore the copy_from_user() is unnecessary anyway and
> we really just need to copy_to_user() for any places we're setting
> bits.  We could just walk through the range with an unsigned long
> bitmap buffer, writing it out to userspace any time we reach the end
> with bits set, zeroing it and shifting it as a window to the user
> buffer.  We could improve batching by allocating a larger buffer in the
> kernel, with a kernel defined maximum size and performing the same
> windowing scheme.
> 

Ok removing copy_from_user().
But AFAIK, calling copy_to_user() multiple times is not efficient in 
terms of performance.

I checked virt/kvm/kvm_main.c: __kvm_set_memory_region(), where 
dirty_bitmap is allocated, has generic sanity checks (a user space address 
check, a memory overflow check and a KVM_MEM_MAX_NR_PAGES limit), as below. 
I'll add an access_ok() check; I already added the overflow check.

         /* General sanity checks */
         if (mem->memory_size & (PAGE_SIZE - 1))
                 goto out;

         if (!access_ok((void __user *)(unsigned long)mem->userspace_addr,
                         mem->memory_size))
                 goto out;

         if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
                 goto out;

         if (npages > KVM_MEM_MAX_NR_PAGES)
                 goto out;


Where KVM_MEM_MAX_NR_PAGES is:

/*
  * Some of the bitops functions do not support too long bitmaps.
  * This number must be determined not to exceed such limits.
  */
#define KVM_MEM_MAX_NR_PAGES ((1UL << 31) - 1)

But we can't use the KVM-specific KVM_MEM_MAX_NR_PAGES check in the vfio
module. Should we define a similar limit in the vfio module instead of SIZE_MAX?

> I don't know if there's a way to directly map the user buffer rather
> than copy_to_user(), but I thought I'd ask in case there's some obvious
> efficiency improvement to be had there.
> 
>>>> +			if (!bitmap)
>>>> +				return -ENOMEM;
>>>> +
>>>> +			ret = copy_from_user(bitmap,
>>>> +			     (void __user *)range.bitmap, bsize) ? -EFAULT : 0;
>>>> +			if (ret)
>>>> +				goto bitmap_exit;
>>>> +
>>>> +			iommu->dirty_page_tracking = false;
>>>
>>> a) This is done outside of the mutex and susceptible to races,
>>
>> moving inside lock
>>
>>> b) why is this done?
>> once bitmap is read, dirty_page_tracking should be stopped. Right?
> 
> Absolutely not.  Dirty bit page tracking should remain enabled until
> the user turns it off, doing otherwise precludes both iterative and
> partial dirty page collection from userspace.  I think both of those
> are fundamentally required of this interface.  Thanks,
> 

OK.

Thanks,
Kirti


* Re: [PATCH v11 Kernel 6/6] vfio: Selective dirty page tracking if IOMMU backed device pins pages
  2020-01-08  0:09       ` Alex Williamson
@ 2020-01-08 20:52         ` Kirti Wankhede
  2020-01-08 22:59           ` Alex Williamson
  0 siblings, 1 reply; 20+ messages in thread
From: Kirti Wankhede @ 2020-01-08 20:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm



On 1/8/2020 5:39 AM, Alex Williamson wrote:
> On Wed, 8 Jan 2020 02:15:01 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 12/18/2019 5:42 AM, Alex Williamson wrote:
>>> On Tue, 17 Dec 2019 22:40:51 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>    
>>
>> <snip>
>>
>>>
>>> This will fail when there are devices within the IOMMU group that are
>>> not represented as vfio_devices.  My original suggestion was:
>>>
>>> On Thu, 14 Nov 2019 14:06:25 -0700
>>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>>> I think it does so by pinning pages.  Is it acceptable that if the
>>>> vendor driver pins any pages, then from that point forward we consider
>>>> the IOMMU group dirty page scope to be limited to pinned pages?  There
>>>> are complications around non-singleton IOMMU groups, but I think we're
>>>> already leaning towards that being a non-worthwhile problem to solve.
>>>> So if we require that only singleton IOMMU groups can pin pages and we
>>>
>>> We could tag vfio_groups as singleton at vfio_add_group_dev() time with
>>> an iommu_group_for_each_dev() walk so that we can cache the value on
>>> the struct vfio_group.
>>
>> I don't think iommu_group_for_each_dev() is required. Checking
>> group->device_list in vfio_add_group_dev() if there are more than one
>> device should work, right?
>>
>>           list_for_each_entry(vdev, &group->device_list, group_next) {
>>                   if (group->is_singleton) {
>>                           group->is_singleton = false;
>>                           break;
>>                   } else {
>>                           group->is_singleton = true;
>>                   }
>>           }
> 
> Hmm, I think you're taking a different approach to this than I was
> thinking.  Re-reading my previous comments, the fact that both vfio.c
> and vfio_iommu_type1.c each have their own private struct vfio_group
> makes things rather unclear.  I was intending to use the struct
> iommu_group as the object vfio.c provides to type1.c to associate the
> pinning.  This would require that not only the vfio view of devices in
> the group to be singleton, but also the actual iommu group to be
> singleton.  Otherwise the set of devices vfio.c has in the group might
> only be a subset of the group.  Maybe a merger of the approaches is
> easier though.
> 
> Tracking whether the vfio.c view of a group is singleton is even easier
> than above, we could simply add a device_count field to vfio_group,
> increment it in vfio_group_create_device() and decrement it in
> vfio_device_release().  vfio_pin_pages() could return error if
> device_count is not 1.  We could still add the iommu_group pointer to
> the type1 pin_pages callback, but perhaps type1 simply assumes that the
> group is singleton when pin pages is called and it's vfio.c's
> responsibility to maintain that group as singleton once pages have been
> pinned.  vfio.c would therefore also need to set a field on the
> vfio_group if pages have been pinned such that vfio_add_group_dev()
> could return error if a new device attempts to join the group.  We'd
> need to make sure that field is cleared when the group is released from
> use and pay attention to races that might occur between adding devices
> to a group and pinning pages.
> 

Thinking aloud: could adding the singleton check cause issues in the near 
future? Maybe support for p2p and direct RDMA will be added for mdev 
devices later. In that case the two devices would need to be in the same 
iommu_domain, but in different iommu_groups - is that understanding 
correct?

>>> vfio_group_nb_add_dev() could update this if
>>> the IOMMU group composition changes.
>>
>> I don't see vfio_group_nb_add_dev() calls vfio_add_group_dev() (?)
>> If checking is_singleton is taken care in vfio_group_nb_add_dev(), which
>> is the only place where vfio_group is allocated, that should work, I think.
> 
> This was relative to maintaining that the iommu group itself is
> singleton, not just the vfio view of the group.  If we use the latter
> as our basis, then you're right, we should need this, but vfio.c would
> need to enforce that the group remains singleton if it has pinned
> pages.  Does that make sense?  Thanks,
> 

Which route should be taken - iommu_group view or vfio.c group view?

Thanks,
Kirti

> Alex
> 
>>> vfio_pin_pages() could return
>>> -EINVAL if (!group->is_singleton).
>>>    
>>>> pass the IOMMU group as a parameter to
>>>> vfio_iommu_driver_ops.pin_pages(), then the type1 backend can set a
>>>> flag on its local vfio_group struct to indicate dirty page scope is
>>>> limited to pinned pages.
>>>
>>> ie. vfio_iommu_type1_unpin_pages() calls find_iommu_group() on each
>>> domain in domain_list and the external_domain using the struct
>>> iommu_group pointer provided by vfio-core.  We set a new attribute on
>>> the vfio_group to indicate that vfio_group has (at some point) pinned
>>> pages.
>>>    
>>>>    We might want to keep a flag on the
>>>> vfio_iommu struct to indicate if all of the vfio_groups for each
>>>> vfio_domain in the vfio_iommu.domain_list dirty page scope limited to
>>>> pinned pages as an optimization to avoid walking lists too often.  Then
>>>> we could test if vfio_iommu.domain_list is not empty and this new flag
>>>> does not limit the dirty page scope, then everything within each
>>>> vfio_dma is considered dirty.
>>>
>>> So at the point where we change vfio_group.has_pinned_pages from false
>>> to true, or a group is added or removed, we walk all the groups in the
>>> vfio_iommu and if they all have has_pinned_pages set, we can set a
>>> vfio_iommu.pinned_page_dirty_scope flag to true.  If that flag is
>>> already true on page pinning, we can skip the lookup.
>>>
>>> I still like this approach better, it doesn't require a callback from
>>> type1 to vfio-core and it doesn't require a heavy weight walking for
>>> group devices and vfio data structures every time we fill a bitmap.
>>> Did you run into issues trying to implement this approach?
>>
>> Thanks for elaborative steps.
>> This works. Changing this last commit.
>>
>> Thanks,
>> Kirti
>>
> 


* Re: [PATCH v11 Kernel 3/6] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2020-01-08 20:01         ` Kirti Wankhede
@ 2020-01-08 22:29           ` Alex Williamson
  2020-01-09 13:29             ` Kirti Wankhede
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Williamson @ 2020-01-08 22:29 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Thu, 9 Jan 2020 01:31:16 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 1/8/2020 3:32 AM, Alex Williamson wrote:
> > On Wed, 8 Jan 2020 01:37:03 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> 
> <snip>
> 
> >>>> +
> >>>> +	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn, dirty_tracking);
> >>>>    
> >>>>    	if (do_accounting)
> >>>>    		vfio_lock_acct(dma, -unlocked, true);
> >>>> @@ -571,8 +606,12 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>>>    
> >>>>    		vpfn = vfio_iova_get_vfio_pfn(dma, iova);
> >>>>    		if (vpfn) {
> >>>> -			phys_pfn[i] = vpfn->pfn;
> >>>> -			continue;
> >>>> +			if (vpfn->unpinned)
> >>>> +				vfio_remove_from_pfn_list(dma, vpfn);  
> >>>
> >>> This seems inefficient, we have an allocated vpfn at the right places
> >>> in the list, wouldn't it be better to repin the page?
> >>>      
> >>
> >> vfio_pin_page_external() takes care of pinning and accounting as well.  
> > 
> > Yes, but could we call vfio_pin_page_external() without {unlinking,
> > freeing} and {re-allocating, linking} on either side of it since it's
> > already in the list?  That's the inefficient part.  Maybe at least a
> > TODO comment?
> >   
> 
> Changing it as below:
> 
>                  vpfn = vfio_iova_get_vfio_pfn(dma, iova);
>                  if (vpfn) {
> -                       phys_pfn[i] = vpfn->pfn;
> -                       continue;
> +                       if (vpfn->ref_count > 1) {
> +                               phys_pfn[i] = vpfn->pfn;
> +                               continue;
> +                       }
>                  }
> 
>                  remote_vaddr = dma->vaddr + iova - dma->iova;
>                  ret = vfio_pin_page_external(dma, remote_vaddr, 
> &phys_pfn[i],
>                                               do_accounting);
>                  if (ret)
>                          goto pin_unwind;
> -
> -               ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> -               if (ret) {
> -                       vfio_unpin_page_external(dma, iova, do_accounting);
> -                       goto pin_unwind;
> -               }
> +               if (!vpfn) {
> +                       ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> +                       if (ret) {
> +                               vfio_unpin_page_external(dma, iova,
> +                                                        do_accounting, 
> false);
> +                               goto pin_unwind;
> +                       }
> +               } else
> +                       vpfn->pfn = phys_pfn[i];
>          }
> 
> 
> 
> 
> >>>> +			else {
> >>>> +				phys_pfn[i] = vpfn->pfn;
> >>>> +				continue;
> >>>> +			}
> >>>>    		}
> >>>>    
> >>>>    		remote_vaddr = dma->vaddr + iova - dma->iova;
> >>>> @@ -583,7 +622,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>>>    
> >>>>    		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> >>>>    		if (ret) {
> >>>> -			vfio_unpin_page_external(dma, iova, do_accounting);
> >>>> +			vfio_unpin_page_external(dma, iova, do_accounting,
> >>>> +						 false);
> >>>>    			goto pin_unwind;
> >>>>    		}
> >>>>    	}  
> 
> <snip>
> 
> >>  
> >>>> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
> >>>> +			iommu->dirty_page_tracking = true;
> >>>> +			return 0;
> >>>> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
> >>>> +			iommu->dirty_page_tracking = false;
> >>>> +
> >>>> +			mutex_lock(&iommu->lock);
> >>>> +			vfio_remove_unpinned_from_dma_list(iommu);
> >>>> +			mutex_unlock(&iommu->lock);
> >>>> +			return 0;
> >>>> +
> >>>> +		} else if (range.flags &
> >>>> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
> >>>> +			uint64_t iommu_pgmask;
> >>>> +			unsigned long pgshift = __ffs(range.pgsize);
> >>>> +			unsigned long *bitmap;
> >>>> +			long bsize;
> >>>> +
> >>>> +			iommu_pgmask =
> >>>> +			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> >>>> +
> >>>> +			if (((range.pgsize - 1) & iommu_pgmask) !=
> >>>> +			    (range.pgsize - 1))
> >>>> +				return -EINVAL;
> >>>> +
> >>>> +			if (range.iova & iommu_pgmask)
> >>>> +				return -EINVAL;
> >>>> +			if (!range.size || range.size > SIZE_MAX)
> >>>> +				return -EINVAL;
> >>>> +			if (range.iova + range.size < range.iova)
> >>>> +				return -EINVAL;
> >>>> +
> >>>> +			bsize = verify_bitmap_size(range.size >> pgshift,
> >>>> +						   range.bitmap_size);
> >>>> +			if (bsize < 0)
> >>>> +				return ret;
> >>>> +
> >>>> +			bitmap = kmalloc(bsize, GFP_KERNEL);  
> >>>
> >>> I think I remember mentioning previously that we cannot allocate an
> >>> arbitrary buffer on behalf of the user, it's far too easy for them to
> >>> kill the kernel that way.  kmalloc is also limited in what it can
> >>> alloc.  
> >>
> >> That's the reason added verify_bitmap_size(), so that size is verified  
> > 
> > That's only a consistency test, it only verifies that the user claims
> > to provide a bitmap sized sufficiently for the range they're trying to
> > request.  range.size is limited to SIZE_MAX, so 2^64, divided by page
> > size for 2^52 bits, 8bits per byte for 2^49 bytes of bitmap that we'd
> > try to kmalloc (512TB).  kmalloc is good for a couple MB AIUI.
> > Meanwhile the user doesn't actually need to allocate that bitmap in
> > order to crash the kernel.
> >   
> >>> Can't we use the user buffer directly or only work on a part of
> >>> it at a time?
> >>>      
> >>
> >> without copy_from_user(), how?  
> > 
> > For starters, what's the benefit of copying the bitmap from the user
> > in the first place?  We presume the data is zero'd and if it's not,
> > that's the user's bug to sort out (we just need to define the API to
> > specify that).  Therefore the copy_from_user() is unnecessary anyway and
> > we really just need to copy_to_user() for any places we're setting
> > bits.  We could just walk through the range with an unsigned long
> > bitmap buffer, writing it out to userspace any time we reach the end
> > with bits set, zeroing it and shifting it as a window to the user
> > buffer.  We could improve batching by allocating a larger buffer in the
> > kernel, with a kernel defined maximum size and performing the same
> > windowing scheme.
> >   
> 
> Ok removing copy_from_user().
> But AFAIK, calling copy_to_user() multiple times is not efficient in 
> terms of performance.

Right, but even with a modestly sized internal buffer for batching we
can cover quite a large address space.  128MB for a 4KB buffer, 32GB
with 1MB buffer.  __put_user() is more lightweight than copy_to_user(),
I wonder where the inflection point is in batching the latter versus
more iterations of the former.
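
(Rough arithmetic behind those numbers, assuming 4KB pages: a 4KB buffer
holds 4096 * 8 = 32768 bits, i.e. 32768 pages, covering 32768 * 4KB =
128MB of IOVA space; a 1MB buffer is 256 times larger, so it covers
256 * 128MB = 32GB.)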

> Checked code in virt/kvm/kvm_main.c: __kvm_set_memory_region() where 
> dirty_bitmap is allocated, that has generic checks, user space address 
> check, memory overflow check and KVM_MEM_MAX_NR_PAGES as below. I'll add 
> access_ok check. I already added overflow check.
> 
>          /* General sanity checks */
>          if (mem->memory_size & (PAGE_SIZE - 1))
>                  goto out;
> 
>         !access_ok((void __user *)(unsigned long)mem->userspace_addr,
>                          mem->memory_size)))
> 
>          if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
>                  goto out;
> 
>          if (npages > KVM_MEM_MAX_NR_PAGES)
>                  goto out;
> 
> 
> Where KVM_MEM_MAX_NR_PAGES is:
> 
> /*
>   * Some of the bitops functions do not support too long bitmaps.
>   * This number must be determined not to exceed such limits.
>   */
> #define KVM_MEM_MAX_NR_PAGES ((1UL << 31) - 1)
> 
> But we can't use KVM specific KVM_MEM_MAX_NR_PAGES check in vfio module.
> Should we define similar limit in vfio module instead of SIZE_MAX?

If we have ranges that are up to 2^31 pages, that's still 2^28 bytes.
Does it seem reasonable to have a kernel interface that potentially
allocates 256MB of kernel space with kmalloc accessible to users?  That
still seems like a DoS attack vector to me, especially since the user
doesn't need to be able to map that much memory (8TB) to access it.

I notice that KVM allocates the bitmap (kvzalloc) relative to the actual
size of the memory slot when dirty logging is enabled, maybe that's the
right approach rather than walking vpfn lists and maintaining unpinned
vpfns for the purposes of tracking.  For example, when dirty logging is
enabled, walk all vfio_dmas and allocate a dirty bitmap anywhere the
vpfn list is not empty and walk the vpfn list to set dirty bits in the
bitmap.  When new pages are pinned, allocate a bitmap if not already
present and set the dirty bit.  When unpinned, update the vpfn list but
leave the dirty bit set.  When the dirty bitmap is read, copy out the
current bitmap to the user, memset it to zero, then re-walk the vpfn
list to set currently dirty pages.  A vfio_dma without a dirty bitmap
would consider the entire range dirty.  At least that way the overhead
of the bitmap is just that, overhead rather than a future exploit.
Does this seem like a better approach?  Thanks,
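
A condensed sketch of that per-vfio_dma scheme (the field and helper names
below are made up for illustration, locking and error handling elided):

    struct vfio_dma {
            /* ... existing fields ... */
            unsigned long *dirty_bitmap;    /* one bit per page, allocated only
                                             * while dirty logging is enabled */
    };

    /* on pin: allocate lazily, then mark the page dirty */
    if (!dma->dirty_bitmap)
            dma->dirty_bitmap = kvzalloc(dirty_bitmap_bytes(dma), GFP_KERNEL);
    set_bit(pfn_index(dma, iova), dma->dirty_bitmap);

    /* on unpin: update the vpfn list but leave the dirty bit set */

    /* on bitmap read: report, clear, then re-mark what is still pinned */
    copy_bitmap_to_user(dma, user_bitmap);                  /* hypothetical */
    memset(dma->dirty_bitmap, 0, dirty_bitmap_bytes(dma));
    for_each_pinned_vpfn(dma, vpfn)                         /* hypothetical walk */
            set_bit(pfn_index(dma, vpfn->iova), dma->dirty_bitmap);

    /* a vfio_dma with no dirty_bitmap is reported as entirely dirty */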

Alex



* Re: [PATCH v11 Kernel 6/6] vfio: Selective dirty page tracking if IOMMU backed device pins pages
  2020-01-08 20:52         ` Kirti Wankhede
@ 2020-01-08 22:59           ` Alex Williamson
  0 siblings, 0 replies; 20+ messages in thread
From: Alex Williamson @ 2020-01-08 22:59 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Thu, 9 Jan 2020 02:22:26 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 1/8/2020 5:39 AM, Alex Williamson wrote:
> > On Wed, 8 Jan 2020 02:15:01 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 12/18/2019 5:42 AM, Alex Williamson wrote:  
> >>> On Tue, 17 Dec 2019 22:40:51 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>      
> >>
> >> <snip>
> >>  
> >>>
> >>> This will fail when there are devices within the IOMMU group that are
> >>> not represented as vfio_devices.  My original suggestion was:
> >>>
> >>> On Thu, 14 Nov 2019 14:06:25 -0700
> >>> Alex Williamson <alex.williamson@redhat.com> wrote:  
> >>>> I think it does so by pinning pages.  Is it acceptable that if the
> >>>> vendor driver pins any pages, then from that point forward we consider
> >>>> the IOMMU group dirty page scope to be limited to pinned pages?  There
> >>>> are complications around non-singleton IOMMU groups, but I think we're
> >>>> already leaning towards that being a non-worthwhile problem to solve.
> >>>> So if we require that only singleton IOMMU groups can pin pages and we  
> >>>
> >>> We could tag vfio_groups as singleton at vfio_add_group_dev() time with
> >>> an iommu_group_for_each_dev() walk so that we can cache the value on
> >>> the struct vfio_group.  
> >>
> >> I don't think iommu_group_for_each_dev() is required. Checking
> >> group->device_list in vfio_add_group_dev() if there are more than one
> >> device should work, right?
> >>
> >>           list_for_each_entry(vdev, &group->device_list, group_next) {
> >>                   if (group->is_singleton) {
> >>                           group->is_singleton = false;
> >>                           break;
> >>                   } else {
> >>                           group->is_singleton = true;
> >>                   }
> >>           }  
> > 
> > Hmm, I think you're taking a different approach to this than I was
> > thinking.  Re-reading my previous comments, the fact that both vfio.c
> > and vfio_iommu_type1.c each have their own private struct vfio_group
> > makes things rather unclear.  I was intending to use the struct
> > iommu_group as the object vfio.c provides to type1.c to associate the
> > pinning.  This would require that not only the vfio view of devices in
> > the group to be singleton, but also the actual iommu group to be
> > singleton.  Otherwise the set of devices vfio.c has in the group might
> > only be a subset of the group.  Maybe a merger of the approaches is
> > easier though.
> > 
> > Tracking whether the vfio.c view of a group is singleton is even easier
> > than above, we could simply add a device_count field to vfio_group,
> > increment it in vfio_group_create_device() and decrement it in
> > vfio_device_release().  vfio_pin_pages() could return error if
> > device_count is not 1.  We could still add the iommu_group pointer to
> > the type1 pin_pages callback, but perhaps type1 simply assumes that the
> > group is singleton when pin pages is called and it's vfio.c's
> > responsibility to maintain that group as singleton once pages have been
> > pinned.  vfio.c would therefore also need to set a field on the
> > vfio_group if pages have been pinned such that vfio_add_group_dev()
> > could return error if a new device attempts to join the group.  We'd
> > need to make sure that field is cleared when the group is released from
> > use and pay attention to races that might occur between adding devices
> > to a group and pinning pages.
> >   
> 
> Thinking aloud, will adding singleton check could cause issues in near 
> future? - may be in future support for p2p and direct RDMA will be added 
> for mdev devices. In that case the two devices should be in same 
> iommu_domain, but should be in different iommu_group - is that 
> understanding correct?

The ACS redirection stuff is the only thing that actually changes iommu
grouping relative to p2p/RDMA and that's specifically for untranslated
DMA, aiui.  If we wanted translated p2p DMA to be routed downstream of
the IOMMU we'd need to enable ACS direct translation.  In any case,
those are about shortening the path for p2p between devices.  We
actually don't even need devices to be in the same iommu domain to
allow p2p, we only need the iommu domain for each respective device to
map the mmio spaces of the other device.  I don't think we're doing
anything here that would cause us trouble later in this space, but it's
also just a policy decision, we wouldn't be breaking ABI to change the
implementation later.
 
> >>> vfio_group_nb_add_dev() could update this if
> >>> the IOMMU group composition changes.  
> >>
> >> I don't see vfio_group_nb_add_dev() calls vfio_add_group_dev() (?)
> >> If checking is_singleton is taken care in vfio_group_nb_add_dev(), which
> >> is the only place where vfio_group is allocated, that should work, I think.  
> > 
> > This was relative to maintaining that the iommu group itself is
> > singleton, not just the vfio view of the group.  If we use the latter
> > as our basis, then you're right, we should need this, but vfio.c would
> > need to enforce that the group remains singleton if it has pinned
> > pages.  Does that make sense?  Thanks,
> >   
> 
> Which route should be taken - iommu_group view or vfio.c group view?

The latter seems easier, more flexible, and lower overhead as far as I
can see.  Thanks,

Alex



* Re: [PATCH v11 Kernel 3/6] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2020-01-08 22:29           ` Alex Williamson
@ 2020-01-09 13:29             ` Kirti Wankhede
  2020-01-09 14:53               ` Alex Williamson
  0 siblings, 1 reply; 20+ messages in thread
From: Kirti Wankhede @ 2020-01-09 13:29 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm



On 1/9/2020 3:59 AM, Alex Williamson wrote:
> On Thu, 9 Jan 2020 01:31:16 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 1/8/2020 3:32 AM, Alex Williamson wrote:
>>> On Wed, 8 Jan 2020 01:37:03 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>    
>>
>> <snip>
>>
>>>>>> +
>>>>>> +	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn, dirty_tracking);
>>>>>>     
>>>>>>     	if (do_accounting)
>>>>>>     		vfio_lock_acct(dma, -unlocked, true);
>>>>>> @@ -571,8 +606,12 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>>>>>     
>>>>>>     		vpfn = vfio_iova_get_vfio_pfn(dma, iova);
>>>>>>     		if (vpfn) {
>>>>>> -			phys_pfn[i] = vpfn->pfn;
>>>>>> -			continue;
>>>>>> +			if (vpfn->unpinned)
>>>>>> +				vfio_remove_from_pfn_list(dma, vpfn);
>>>>>
>>>>> This seems inefficient, we have an allocated vpfn at the right places
>>>>> in the list, wouldn't it be better to repin the page?
>>>>>       
>>>>
>>>> vfio_pin_page_external() takes care of pinning and accounting as well.
>>>
>>> Yes, but could we call vfio_pin_page_external() without {unlinking,
>>> freeing} and {re-allocating, linking} on either side of it since it's
>>> already in the list?  That's the inefficient part.  Maybe at least a
>>> TODO comment?
>>>    
>>
>> Changing it as below:
>>
>>                   vpfn = vfio_iova_get_vfio_pfn(dma, iova);
>>                   if (vpfn) {
>> -                       phys_pfn[i] = vpfn->pfn;
>> -                       continue;
>> +                       if (vpfn->ref_count > 1) {
>> +                               phys_pfn[i] = vpfn->pfn;
>> +                               continue;
>> +                       }
>>                   }
>>
>>                   remote_vaddr = dma->vaddr + iova - dma->iova;
>>                   ret = vfio_pin_page_external(dma, remote_vaddr,
>> &phys_pfn[i],
>>                                                do_accounting);
>>                   if (ret)
>>                           goto pin_unwind;
>> -
>> -               ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
>> -               if (ret) {
>> -                       vfio_unpin_page_external(dma, iova, do_accounting);
>> -                       goto pin_unwind;
>> -               }
>> +               if (!vpfn) {
>> +                       ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
>> +                       if (ret) {
>> +                               vfio_unpin_page_external(dma, iova,
>> +                                                        do_accounting,
>> false);
>> +                               goto pin_unwind;
>> +                       }
>> +               } else
>> +                       vpfn->pfn = phys_pfn[i];
>>           }
>>
>>
>>
>>
>>>>>> +			else {
>>>>>> +				phys_pfn[i] = vpfn->pfn;
>>>>>> +				continue;
>>>>>> +			}
>>>>>>     		}
>>>>>>     
>>>>>>     		remote_vaddr = dma->vaddr + iova - dma->iova;
>>>>>> @@ -583,7 +622,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>>>>>     
>>>>>>     		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
>>>>>>     		if (ret) {
>>>>>> -			vfio_unpin_page_external(dma, iova, do_accounting);
>>>>>> +			vfio_unpin_page_external(dma, iova, do_accounting,
>>>>>> +						 false);
>>>>>>     			goto pin_unwind;
>>>>>>     		}
>>>>>>     	}
>>
>> <snip>
>>
>>>>   
>>>>>> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
>>>>>> +			iommu->dirty_page_tracking = true;
>>>>>> +			return 0;
>>>>>> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
>>>>>> +			iommu->dirty_page_tracking = false;
>>>>>> +
>>>>>> +			mutex_lock(&iommu->lock);
>>>>>> +			vfio_remove_unpinned_from_dma_list(iommu);
>>>>>> +			mutex_unlock(&iommu->lock);
>>>>>> +			return 0;
>>>>>> +
>>>>>> +		} else if (range.flags &
>>>>>> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
>>>>>> +			uint64_t iommu_pgmask;
>>>>>> +			unsigned long pgshift = __ffs(range.pgsize);
>>>>>> +			unsigned long *bitmap;
>>>>>> +			long bsize;
>>>>>> +
>>>>>> +			iommu_pgmask =
>>>>>> +			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
>>>>>> +
>>>>>> +			if (((range.pgsize - 1) & iommu_pgmask) !=
>>>>>> +			    (range.pgsize - 1))
>>>>>> +				return -EINVAL;
>>>>>> +
>>>>>> +			if (range.iova & iommu_pgmask)
>>>>>> +				return -EINVAL;
>>>>>> +			if (!range.size || range.size > SIZE_MAX)
>>>>>> +				return -EINVAL;
>>>>>> +			if (range.iova + range.size < range.iova)
>>>>>> +				return -EINVAL;
>>>>>> +
>>>>>> +			bsize = verify_bitmap_size(range.size >> pgshift,
>>>>>> +						   range.bitmap_size);
>>>>>> +			if (bsize < 0)
>>>>>> +				return ret;
>>>>>> +
>>>>>> +			bitmap = kmalloc(bsize, GFP_KERNEL);
>>>>>
>>>>> I think I remember mentioning previously that we cannot allocate an
>>>>> arbitrary buffer on behalf of the user, it's far too easy for them to
>>>>> kill the kernel that way.  kmalloc is also limited in what it can
>>>>> alloc.
>>>>
>>>> That's the reason added verify_bitmap_size(), so that size is verified
>>>
>>> That's only a consistency test, it only verifies that the user claims
>>> to provide a bitmap sized sufficiently for the range they're trying to
>>> request.  range.size is limited to SIZE_MAX, so 2^64, divided by page
>>> size for 2^52 bits, 8bits per byte for 2^49 bytes of bitmap that we'd
>>> try to kmalloc (512TB).  kmalloc is good for a couple MB AIUI.
>>> Meanwhile the user doesn't actually need to allocate that bitmap in
>>> order to crash the kernel.
>>>    
>>>>> Can't we use the user buffer directly or only work on a part of
>>>>> it at a time?
>>>>>       
>>>>
>>>> without copy_from_user(), how?
>>>
>>> For starters, what's the benefit of copying the bitmap from the user
>>> in the first place?  We presume the data is zero'd and if it's not,
>>> that's the user's bug to sort out (we just need to define the API to
>>> specify that).  Therefore the copy_from_user() is unnecessary anyway and
>>> we really just need to copy_to_user() for any places we're setting
>>> bits.  We could just walk through the range with an unsigned long
>>> bitmap buffer, writing it out to userspace any time we reach the end
>>> with bits set, zeroing it and shifting it as a window to the user
>>> buffer.  We could improve batching by allocating a larger buffer in the
>>> kernel, with a kernel defined maximum size and performing the same
>>> windowing scheme.
>>>    
>>
>> Ok removing copy_from_user().
>> But AFAIK, calling copy_to_user() multiple times is not efficient in
>> terms of performance.
> 
> Right, but even with a modestly sized internal buffer for batching we
> can cover quite a large address space.  128MB for a 4KB buffer, 32GB
> with 1MB buffer.  __put_user() is more lightweight than copy_to_user(),
> I wonder where the inflection point is in batching the latter versus
> more iterations of the former.
> 
>> Checked code in virt/kvm/kvm_main.c: __kvm_set_memory_region() where
>> dirty_bitmap is allocated, that has generic checks, user space address
>> check, memory overflow check and KVM_MEM_MAX_NR_PAGES as below. I'll add
>> access_ok check. I already added overflow check.
>>
>>           /* General sanity checks */
>>           if (mem->memory_size & (PAGE_SIZE - 1))
>>                   goto out;
>>
>>          !access_ok((void __user *)(unsigned long)mem->userspace_addr,
>>                           mem->memory_size)))
>>
>>           if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
>>                   goto out;
>>
>>           if (npages > KVM_MEM_MAX_NR_PAGES)
>>                   goto out;
>>
>>
>> Where KVM_MEM_MAX_NR_PAGES is:
>>
>> /*
>>    * Some of the bitops functions do not support too long bitmaps.
>>    * This number must be determined not to exceed such limits.
>>    */
>> #define KVM_MEM_MAX_NR_PAGES ((1UL << 31) - 1)
>>
>> But we can't use KVM specific KVM_MEM_MAX_NR_PAGES check in vfio module.
>> Should we define similar limit in vfio module instead of SIZE_MAX?
> 
> If we have ranges that are up to 2^31 pages, that's still 2^28 bytes.
> Does it seem reasonable to have a kernel interface that potentially
> allocates 256MB of kernel space with kmalloc accessible to users?  That
> still seems like a DoS attack vector to me, especially since the user
> doesn't need to be able to map that much memory (8TB) to access it.
> 
> I notice that KVM allocate the bitmap (kvzalloc) relative to the actual
> size of the memory slot when dirty logging is enabled, maybe that's the
> right approach rather than walking vpfn lists and maintaining unpinned
> vpfns for the purposes of tracking.  For example, when dirty logging is
> enabled, walk all vfio_dmas and allocate a dirty bitmap anywhere the
> vpfn list is not empty and walk the vpfn list to set dirty bits in the
> bitmap. 

The bitmap will be allocated per vfio_dma, not per 
VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP request, right?

> When new pages are pinned, allocate a bitmap if not already
> present and set the dirty bit.  When unpinned, update the vpfn list but
> leave the dirty bit set.  When the dirty bitmap is read, copy out the
> current bitmap to the user, memset it to zero, then re-walk the vpfn
> list to set currently dirty pages.

Why is the re-walk required again? Pinning/unpinning and reporting dirty 
pages are done holding iommu->lock, so there shouldn't be a race condition.

>  A vfio_dma without a dirty bitmap
> would consider the entire range dirty.

That will depend on the (!iommu->pinned_page_dirty_scope && 
dma->iommu_mapped) condition to mark the entire range dirty.
Here, even if the vpfn list is empty, memory for the dirty_bitmap needs to 
be allocated, all bits memset to 1, and then copied out with copy_to_user().

If we go with this approach, then I think there should be a restriction 
that the bitmap is requested according to the way mappings were created: 
multiple mappings can be clubbed together, but a request shouldn't bisect 
a mapping - similar to the unmap restriction.

Thanks,
Kirti

>  At least that way the overhead
> of the bitmap is just that, overhead rather than a future exploit.
> Does this seem like a better approach?  Thanks,
> 
> Alex
> 


* Re: [PATCH v11 Kernel 3/6] vfio iommu: Implementation of ioctl to for dirty pages tracking.
  2020-01-09 13:29             ` Kirti Wankhede
@ 2020-01-09 14:53               ` Alex Williamson
  0 siblings, 0 replies; 20+ messages in thread
From: Alex Williamson @ 2020-01-09 14:53 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, kevin.tian, ziye.yang, changpeng.liu, yi.l.liu, mlevitsk,
	eskultet, cohuck, dgilbert, jonathan.davies, eauger, aik, pasic,
	felipe, Zhengxiao.zx, shuangtai.tst, Ken.Xue, zhi.a.wang,
	yan.y.zhao, qemu-devel, kvm

On Thu, 9 Jan 2020 18:59:40 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 1/9/2020 3:59 AM, Alex Williamson wrote:
> > On Thu, 9 Jan 2020 01:31:16 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 1/8/2020 3:32 AM, Alex Williamson wrote:  
> >>> On Wed, 8 Jan 2020 01:37:03 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>      
> >>
> >> <snip>
> >>  
> >>>>>> +
> >>>>>> +	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn, dirty_tracking);
> >>>>>>     
> >>>>>>     	if (do_accounting)
> >>>>>>     		vfio_lock_acct(dma, -unlocked, true);
> >>>>>> @@ -571,8 +606,12 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>>>>>     
> >>>>>>     		vpfn = vfio_iova_get_vfio_pfn(dma, iova);
> >>>>>>     		if (vpfn) {
> >>>>>> -			phys_pfn[i] = vpfn->pfn;
> >>>>>> -			continue;
> >>>>>> +			if (vpfn->unpinned)
> >>>>>> +				vfio_remove_from_pfn_list(dma, vpfn);  
> >>>>>
> >>>>> This seems inefficient, we have an allocated vpfn at the right places
> >>>>> in the list, wouldn't it be better to repin the page?
> >>>>>         
> >>>>
> >>>> vfio_pin_page_external() takes care of pinning and accounting as well.  
> >>>
> >>> Yes, but could we call vfio_pin_page_external() without {unlinking,
> >>> freeing} and {re-allocating, linking} on either side of it since it's
> >>> already in the list?  That's the inefficient part.  Maybe at least a
> >>> TODO comment?
> >>>      
> >>
> >> Changing it as below:
> >>
> >>                   vpfn = vfio_iova_get_vfio_pfn(dma, iova);
> >>                   if (vpfn) {
> >> -                       phys_pfn[i] = vpfn->pfn;
> >> -                       continue;
> >> +                       if (vpfn->ref_count > 1) {
> >> +                               phys_pfn[i] = vpfn->pfn;
> >> +                               continue;
> >> +                       }
> >>                   }
> >>
> >>                   remote_vaddr = dma->vaddr + iova - dma->iova;
> >>                   ret = vfio_pin_page_external(dma, remote_vaddr,
> >> &phys_pfn[i],
> >>                                                do_accounting);
> >>                   if (ret)
> >>                           goto pin_unwind;
> >> -
> >> -               ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> >> -               if (ret) {
> >> -                       vfio_unpin_page_external(dma, iova, do_accounting);
> >> -                       goto pin_unwind;
> >> -               }
> >> +               if (!vpfn) {
> >> +                       ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> >> +                       if (ret) {
> >> +                               vfio_unpin_page_external(dma, iova,
> >> +                                                        do_accounting,
> >> false);
> >> +                               goto pin_unwind;
> >> +                       }
> >> +               } else
> >> +                       vpfn->pfn = phys_pfn[i];
> >>           }
> >>
> >>
> >>
> >>  
> >>>>>> +			else {
> >>>>>> +				phys_pfn[i] = vpfn->pfn;
> >>>>>> +				continue;
> >>>>>> +			}
> >>>>>>     		}
> >>>>>>     
> >>>>>>     		remote_vaddr = dma->vaddr + iova - dma->iova;
> >>>>>> @@ -583,7 +622,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>>>>>     
> >>>>>>     		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> >>>>>>     		if (ret) {
> >>>>>> -			vfio_unpin_page_external(dma, iova, do_accounting);
> >>>>>> +			vfio_unpin_page_external(dma, iova, do_accounting,
> >>>>>> +						 false);
> >>>>>>     			goto pin_unwind;
> >>>>>>     		}
> >>>>>>     	}  
> >>
> >> <snip>
> >>  
> >>>>     
> >>>>>> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
> >>>>>> +			iommu->dirty_page_tracking = true;
> >>>>>> +			return 0;
> >>>>>> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
> >>>>>> +			iommu->dirty_page_tracking = false;
> >>>>>> +
> >>>>>> +			mutex_lock(&iommu->lock);
> >>>>>> +			vfio_remove_unpinned_from_dma_list(iommu);
> >>>>>> +			mutex_unlock(&iommu->lock);
> >>>>>> +			return 0;
> >>>>>> +
> >>>>>> +		} else if (range.flags &
> >>>>>> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
> >>>>>> +			uint64_t iommu_pgmask;
> >>>>>> +			unsigned long pgshift = __ffs(range.pgsize);
> >>>>>> +			unsigned long *bitmap;
> >>>>>> +			long bsize;
> >>>>>> +
> >>>>>> +			iommu_pgmask =
> >>>>>> +			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> >>>>>> +
> >>>>>> +			if (((range.pgsize - 1) & iommu_pgmask) !=
> >>>>>> +			    (range.pgsize - 1))
> >>>>>> +				return -EINVAL;
> >>>>>> +
> >>>>>> +			if (range.iova & iommu_pgmask)
> >>>>>> +				return -EINVAL;
> >>>>>> +			if (!range.size || range.size > SIZE_MAX)
> >>>>>> +				return -EINVAL;
> >>>>>> +			if (range.iova + range.size < range.iova)
> >>>>>> +				return -EINVAL;
> >>>>>> +
> >>>>>> +			bsize = verify_bitmap_size(range.size >> pgshift,
> >>>>>> +						   range.bitmap_size);
> >>>>>> +			if (bsize < 0)
> >>>>>> +				return ret;
> >>>>>> +
> >>>>>> +			bitmap = kmalloc(bsize, GFP_KERNEL);  
> >>>>>
> >>>>> I think I remember mentioning previously that we cannot allocate an
> >>>>> arbitrary buffer on behalf of the user, it's far too easy for them to
> >>>>> kill the kernel that way.  kmalloc is also limited in what it can
> >>>>> alloc.  
> >>>>
> >>>> That's the reason added verify_bitmap_size(), so that size is verified  
> >>>
> >>> That's only a consistency test, it only verifies that the user claims
> >>> to provide a bitmap sized sufficiently for the range they're trying to
> >>> request.  range.size is limited to SIZE_MAX, so 2^64, divided by page
> >>> size for 2^52 bits, 8bits per byte for 2^49 bytes of bitmap that we'd
> >>> try to kmalloc (512TB).  kmalloc is good for a couple MB AIUI.
> >>> Meanwhile the user doesn't actually need to allocate that bitmap in
> >>> order to crash the kernel.
> >>>      
> >>>>> Can't we use the user buffer directly or only work on a part of
> >>>>> it at a time?
> >>>>>         
> >>>>
> >>>> without copy_from_user(), how?  
> >>>
> >>> For starters, what's the benefit of copying the bitmap from the user
> >>> in the first place?  We presume the data is zero'd and if it's not,
> >>> that's the user's bug to sort out (we just need to define the API to
> >>> specify that).  Therefore the copy_from_user() is unnecessary anyway and
> >>> we really just need to copy_to_user() for any places we're setting
> >>> bits.  We could just walk through the range with an unsigned long
> >>> bitmap buffer, writing it out to userspace any time we reach the end
> >>> with bits set, zeroing it and shifting it as a window to the user
> >>> buffer.  We could improve batching by allocating a larger buffer in the
> >>> kernel, with a kernel defined maximum size and performing the same
> >>> windowing scheme.
> >>>      
> >>
> >> Ok removing copy_from_user().
> >> But AFAIK, calling copy_to_user() multiple times is not efficient in
> >> terms of performance.  
> > 
> > Right, but even with a modestly sized internal buffer for batching we
> > can cover quite a large address space.  128MB for a 4KB buffer, 32GB
> > with 1MB buffer.  __put_user() is more lightweight than copy_to_user(),
> > I wonder where the inflection point is in batching the latter versus
> > more iterations of the former.
> >   
> >> Checked code in virt/kvm/kvm_main.c: __kvm_set_memory_region() where
> >> dirty_bitmap is allocated, that has generic checks, user space address
> >> check, memory overflow check and KVM_MEM_MAX_NR_PAGES as below. I'll add
> >> access_ok check. I already added overflow check.
> >>
> >>           /* General sanity checks */
> >>           if (mem->memory_size & (PAGE_SIZE - 1))
> >>                   goto out;
> >>
> >>          !access_ok((void __user *)(unsigned long)mem->userspace_addr,
> >>                           mem->memory_size)))
> >>
> >>           if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> >>                   goto out;
> >>
> >>           if (npages > KVM_MEM_MAX_NR_PAGES)
> >>                   goto out;
> >>
> >>
> >> Where KVM_MEM_MAX_NR_PAGES is:
> >>
> >> /*
> >>    * Some of the bitops functions do not support too long bitmaps.
> >>    * This number must be determined not to exceed such limits.
> >>    */
> >> #define KVM_MEM_MAX_NR_PAGES ((1UL << 31) - 1)
> >>
> >> But we can't use KVM specific KVM_MEM_MAX_NR_PAGES check in vfio module.
> >> Should we define similar limit in vfio module instead of SIZE_MAX?  
> > 
> > If we have ranges that are up to 2^31 pages, that's still 2^28 bytes.
> > Does it seem reasonable to have a kernel interface that potentially
> > allocates 256MB of kernel space with kmalloc accessible to users?  That
> > still seems like a DoS attack vector to me, especially since the user
> > doesn't need to be able to map that much memory (8TB) to access it.
> > 
> > I notice that KVM allocate the bitmap (kvzalloc) relative to the actual
> > size of the memory slot when dirty logging is enabled, maybe that's the
> > right approach rather than walking vpfn lists and maintaining unpinned
> > vpfns for the purposes of tracking.  For example, when dirty logging is
> > enabled, walk all vfio_dmas and allocate a dirty bitmap anywhere the
> > vpfn list is not empty and walk the vpfn list to set dirty bits in the
> > bitmap.   
> 
> Bitmap will be allocated per vfio_dma, not as per 
> VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP request, right?

Per vfio_dma when dirty logging is enabled, ie. between
VFIO_IOMMU_DIRTY_PAGES_FLAG_START and VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP.

> > When new pages are pinned, allocate a bitmap if not already
> > present and set the dirty bit.  When unpinned, update the vpfn list but
> > leave the dirty bit set.  When the dirty bitmap is read, copy out the
> > current bitmap to the user, memset it to zero, then re-walk the vpfn
> > list to set currently dirty pages.  
> 
> Why re-walk is required again? Pinning /unpinning or reporting dirty 
> pages are done holding iommu->lock, there shouldn't be race condition.

In order to "remember" that a page was dirtied, I proposed above that
we don't change the bitmap when a page is unpinned.  We can "forget"
that a page was dirtied if it's no longer pinned and we've told the
user about it.  Therefore we need to purge the not-currently-pinned
pages from the bitmap and rebuild it.

> >  A vfio_dma without a dirty bitmap
> > would consider the entire range dirty.  
> 
> That will depend on (!iommu->pinned_page_dirty_scope && 
> dma->iommu_mapped) condition to mark entire range dirty.

I assumed we wouldn't bother to maintain a bitmap unless these
conditions are already met.

> Here even if vpfn list is empty, memory for dirty_bitmap need to be 
> allocated, memset all bits to 1, then copy_to_user().

I was assuming we might use a __put_user() loop to fill such ranges
rather than waste memory tracking fully populated bitmaps.
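
For example, a fully dirty range could be reported with something like
this (sketch only; it assumes access_ok() has already been checked and
that the range starts on an unsigned-long boundary of the user bitmap):

    /* user_bitmap/start_bit/nbits describe the fully dirty range (illustrative) */
    unsigned long __user *p = user_bitmap + (start_bit / BITS_PER_LONG);
    unsigned long nlongs = DIV_ROUND_UP(nbits, BITS_PER_LONG);
    unsigned long i;

    /* set every bit covering the range; no kernel-side bitmap needed */
    for (i = 0; i < nlongs; i++) {
            if (__put_user(~0UL, p + i))
                    return -EFAULT;
    }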

> If we go with this approach, then I think there should be restriction to 
> get bitmap as per the way mappings were created, multiple mappings can 
> be clubbed together, but shouldn't bisect the mappings - similar to 
> unmap restriction.

Why?  It seems like it's just some pointer arithmetic to copy out the
section of the bitmap that the user requests.  Thanks,

Alex



Thread overview: 20+ messages
2019-12-17 17:10 [PATCH v11 Kernel 0/6] KABIs to support migration for VFIO devices Kirti Wankhede
2019-12-17 17:10 ` [PATCH v11 Kernel 1/6] vfio: KABI for migration interface for device state Kirti Wankhede
2019-12-17 17:10 ` [PATCH v11 Kernel 2/6] vfio iommu: Add ioctl definition for dirty pages tracking Kirti Wankhede
2019-12-17 17:10 ` [PATCH v11 Kernel 3/6] vfio iommu: Implementation of ioctl to " Kirti Wankhede
2019-12-17 22:12   ` Alex Williamson
2020-01-07 20:07     ` Kirti Wankhede
2020-01-07 22:02       ` Alex Williamson
2020-01-08 20:01         ` Kirti Wankhede
2020-01-08 22:29           ` Alex Williamson
2020-01-09 13:29             ` Kirti Wankhede
2020-01-09 14:53               ` Alex Williamson
2019-12-17 17:10 ` [PATCH v11 Kernel 4/6] vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap Kirti Wankhede
2019-12-17 22:55   ` Alex Williamson
2019-12-17 17:10 ` [PATCH v11 Kernel 5/6] vfio iommu: Adds flag to indicate dirty pages tracking capability support Kirti Wankhede
2019-12-17 17:10 ` [PATCH v11 Kernel 6/6] vfio: Selective dirty page tracking if IOMMU backed device pins pages Kirti Wankhede
2019-12-18  0:12   ` Alex Williamson
2020-01-07 20:45     ` Kirti Wankhede
2020-01-08  0:09       ` Alex Williamson
2020-01-08 20:52         ` Kirti Wankhede
2020-01-08 22:59           ` Alex Williamson
