* [PATCH V2 vfio 00/11] Add device DMA logging support for mlx5 driver
@ 2022-07-14  8:12 Yishai Hadas
  2022-07-14  8:12 ` [PATCH V2 vfio 01/11] net/mlx5: Introduce ifc bits for page tracker Yishai Hadas
                   ` (11 more replies)
  0 siblings, 12 replies; 52+ messages in thread
From: Yishai Hadas @ 2022-07-14  8:12 UTC (permalink / raw)
  To: alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins, leonro,
	yishaih, maorg, cohuck

This series adds device DMA logging uAPIs and their implementation as
part of mlx5 driver.

DMA logging allows a device to internally record what DMAs the device is
initiating and report them back to userspace. It is part of the VFIO
migration infrastructure that allows implementing dirty page tracking
during the pre copy phase of live migration. Only DMA WRITEs are logged,
and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.

The uAPIs are based on the FEATURE ioctl, as introduced earlier by the
RFC below [1], and follow the notes that were discussed on the mailing
list.

It includes:
- A PROBE option to detect if the device supports DMA logging.
- A SET option to start device DMA logging over given IOVA ranges.
- A GET option to read back and clear the device DMA log.
- A SET option to stop device DMA logging that was previously started.

Extra details are provided in the relevant patches in the series.
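
To make the flow concrete, here is a rough (untested) userspace sketch of
starting DMA logging over a single IOVA range with the FEATURE ioctl. The
structures and feature names are the ones added in patch 3; the helper
itself, the buffer layout and the lack of error handling are only
illustrative:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Illustrative helper, not part of the series. */
    static int start_dma_logging(int device_fd, uint64_t iova,
                                 uint64_t length, uint64_t page_size)
    {
            struct vfio_device_feature_dma_logging_range range = {
                    .iova = iova,
                    .length = length,
            };
            /* vfio_device_feature header followed by the control payload */
            uint8_t buf[sizeof(struct vfio_device_feature) +
                        sizeof(struct vfio_device_feature_dma_logging_control)]
                        __attribute__((aligned(8))) = {};
            struct vfio_device_feature *feature = (void *)buf;
            struct vfio_device_feature_dma_logging_control *control =
                    (void *)feature->data;

            feature->argsz = sizeof(buf);
            feature->flags = VFIO_DEVICE_FEATURE_SET |
                             VFIO_DEVICE_FEATURE_DMA_LOGGING_START;
            control->page_size = page_size;
            control->num_ranges = 1;
            control->ranges = (uintptr_t)&range;

            return ioctl(device_fd, VFIO_DEVICE_FEATURE, feature);
    }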

In addition, the series adds infrastructure for managing an IOVA bitmap,
contributed by Joao Martins.

It abstracts how an IOVA range is represented in a bitmap granulated by
a given page_size. It does all the heavy lifting of translating the user
pointers into the kernel addresses backing that user memory, and finally
performs the bitmap ops to set the relevant bits.

This functionality will be used as part of the IOMMUFD series for system
IOMMU dirty tracking.

Finally, the series adds the mlx5 implementation of the DMA logging
APIs, based on its device specification.

The matching qemu changes can be previewed here [2].
They come on top of the v2 migration protocol patches that were already
sent to the mailing list.

A few notes:
- The first 2 patches were already sent separately; as the series relies
  on them, they are included here as well.

- As this series touches mlx5_core parts, we may need to send the
  net/mlx5 patches as a pull request to VFIO before acceptance to avoid
  conflicts.

[1] https://lore.kernel.org/all/20220501123301.127279-1-yishaih@nvidia.com/T/
[2] https://github.com/avihai1122/qemu/commits/device_dirty_tracking

Changes from V1: https://lore.kernel.org/netdev/202207052209.x00Iykkp-lkp@intel.com/T/

- Patch #6: Fix an issue reported by the kernel test robot: select
  INTERVAL_TREE for VFIO.

Changes from V0: https://lore.kernel.org/netdev/202207011231.1oPQhSzo-lkp@intel.com/T/

- Drop the first 2 patches that Alex merged already.
- Fix an issue reported by the kernel test robot, based on Jason's
  suggestion.
- Some improvements from Joao for his IOVA bitmap patch to be
  cleaner/simpler. It includes the below:
    * Rename iova_bitmap_array_length to iova_bitmap_iova_to_index.
    * Rename iova_bitmap_index_to_length to iova_bitmap_index_to_iova.
    * Change iova_bitmap_iova_to_index to take an iova_bitmap_iter
      as an argument to pair with iova_bitmap_index_to_length.
    * Make iova_bitmap_iter_done() use >= instead of
      subtraction+comparison. This fixes the iova_bitmap_iter_done()
      return value, as it previously returned true when not done.
    * Remove iova_bitmap_iter_length().
    * Simplify the overcomplicated trailing-end check in
      iova_bitmap_length().
    * Convert all sizeof(u64) into sizeof(*iter->data).
    * Use u64 __user for ::data instead of void in both the struct and
      the initialization of iova_bitmap.

Yishai

Jason Gunthorpe (1):
  vfio: Move vfio.c to vfio_main.c

Joao Martins (1):
  vfio: Add an IOVA bitmap support

Yishai Hadas (9):
  net/mlx5: Introduce ifc bits for page tracker
  net/mlx5: Query ADV_VIRTUALIZATION capabilities
  vfio: Introduce DMA logging uAPIs
  vfio: Introduce the DMA logging feature support
  vfio/mlx5: Init QP based resources for dirty tracking
  vfio/mlx5: Create and destroy page tracker object
  vfio/mlx5: Report dirty pages from tracker
  vfio/mlx5: Manage error scenarios on tracker
  vfio/mlx5: Set the driver DMA logging callbacks

 drivers/net/ethernet/mellanox/mlx5/core/fw.c  |   6 +
 .../net/ethernet/mellanox/mlx5/core/main.c    |   1 +
 drivers/vfio/Kconfig                          |   1 +
 drivers/vfio/Makefile                         |   4 +
 drivers/vfio/iova_bitmap.c                    | 164 +++
 drivers/vfio/pci/mlx5/cmd.c                   | 995 +++++++++++++++++-
 drivers/vfio/pci/mlx5/cmd.h                   |  63 +-
 drivers/vfio/pci/mlx5/main.c                  |   9 +-
 drivers/vfio/pci/vfio_pci_core.c              |   5 +
 drivers/vfio/{vfio.c => vfio_main.c}          | 161 +++
 include/linux/iova_bitmap.h                   |  46 +
 include/linux/mlx5/device.h                   |   9 +
 include/linux/mlx5/mlx5_ifc.h                 |  79 +-
 include/linux/vfio.h                          |  21 +-
 include/uapi/linux/vfio.h                     |  79 ++
 15 files changed, 1625 insertions(+), 18 deletions(-)
 create mode 100644 drivers/vfio/iova_bitmap.c
 rename drivers/vfio/{vfio.c => vfio_main.c} (93%)
 create mode 100644 include/linux/iova_bitmap.h

-- 
2.18.1



* [PATCH V2 vfio 01/11] net/mlx5: Introduce ifc bits for page tracker
  2022-07-14  8:12 [PATCH V2 vfio 00/11] Add device DMA logging support for mlx5 driver Yishai Hadas
@ 2022-07-14  8:12 ` Yishai Hadas
  2022-07-21  8:28   ` Tian, Kevin
  2022-07-14  8:12 ` [PATCH V2 vfio 02/11] net/mlx5: Query ADV_VIRTUALIZATION capabilities Yishai Hadas
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 52+ messages in thread
From: Yishai Hadas @ 2022-07-14  8:12 UTC (permalink / raw)
  To: alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins, leonro,
	yishaih, maorg, cohuck

Introduce the ifc bits needed to enable use of the page tracker.

A page tracker is a dirty page tracking object used by the device to
report the tracking log.
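
As a rough illustration (not part of this patch; the real creation flow
arrives with the page tracker patches later in the series), the new
layouts are meant to be driven through the usual MLX5_SET()/MLX5_ADDR_OF()
accessors, with vhca_id/log_page_size/num_ranges standing in for locally
computed values:

    u32 in[MLX5_ST_SZ_DW(create_page_track_obj_in)] = {};
    void *obj = MLX5_ADDR_OF(create_page_track_obj_in, in, obj_context);

    MLX5_SET(general_obj_in_cmd_hdr, in, opcode,
             MLX5_CMD_OP_CREATE_GENERAL_OBJECT);
    MLX5_SET(general_obj_in_cmd_hdr, in, obj_type, MLX5_OBJ_TYPE_PAGE_TRACK);
    MLX5_SET(page_track, obj, vhca_id, vhca_id);
    MLX5_SET(page_track, obj, log_page_size, log_page_size);
    MLX5_SET(page_track, obj, num_ranges, num_ranges);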

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 include/linux/mlx5/mlx5_ifc.h | 79 ++++++++++++++++++++++++++++++++++-
 1 file changed, 78 insertions(+), 1 deletion(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index fd7d083a34d3..b2d56fea6a09 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -89,6 +89,7 @@ enum {
 	MLX5_OBJ_TYPE_VIRTIO_NET_Q = 0x000d,
 	MLX5_OBJ_TYPE_VIRTIO_Q_COUNTERS = 0x001c,
 	MLX5_OBJ_TYPE_MATCH_DEFINER = 0x0018,
+	MLX5_OBJ_TYPE_PAGE_TRACK = 0x46,
 	MLX5_OBJ_TYPE_MKEY = 0xff01,
 	MLX5_OBJ_TYPE_QP = 0xff02,
 	MLX5_OBJ_TYPE_PSV = 0xff03,
@@ -1711,7 +1712,9 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 	u8         max_geneve_tlv_options[0x8];
 	u8         reserved_at_568[0x3];
 	u8         max_geneve_tlv_option_data_len[0x5];
-	u8         reserved_at_570[0x10];
+	u8         reserved_at_570[0x9];
+	u8         adv_virtualization[0x1];
+	u8         reserved_at_57a[0x6];
 
 	u8	   reserved_at_580[0xb];
 	u8	   log_max_dci_stream_channels[0x5];
@@ -11668,4 +11671,78 @@ struct mlx5_ifc_load_vhca_state_out_bits {
 	u8         reserved_at_40[0x40];
 };
 
+struct mlx5_ifc_adv_virtualization_cap_bits {
+	u8         reserved_at_0[0x3];
+	u8         pg_track_log_max_num[0x5];
+	u8         pg_track_max_num_range[0x8];
+	u8         pg_track_log_min_addr_space[0x8];
+	u8         pg_track_log_max_addr_space[0x8];
+
+	u8         reserved_at_20[0x3];
+	u8         pg_track_log_min_msg_size[0x5];
+	u8         pg_track_log_max_msg_size[0x8];
+	u8         pg_track_log_min_page_size[0x8];
+	u8         pg_track_log_max_page_size[0x8];
+
+	u8         reserved_at_40[0x7c0];
+};
+
+struct mlx5_ifc_page_track_report_entry_bits {
+	u8         dirty_address_high[0x20];
+
+	u8         dirty_address_low[0x20];
+};
+
+enum {
+	MLX5_PAGE_TRACK_STATE_TRACKING,
+	MLX5_PAGE_TRACK_STATE_REPORTING,
+	MLX5_PAGE_TRACK_STATE_ERROR,
+};
+
+struct mlx5_ifc_page_track_range_bits {
+	u8         start_address[0x40];
+
+	u8         length[0x40];
+};
+
+struct mlx5_ifc_page_track_bits {
+	u8         modify_field_select[0x40];
+
+	u8         reserved_at_40[0x10];
+	u8         vhca_id[0x10];
+
+	u8         reserved_at_60[0x20];
+
+	u8         state[0x4];
+	u8         track_type[0x4];
+	u8         log_addr_space_size[0x8];
+	u8         reserved_at_90[0x3];
+	u8         log_page_size[0x5];
+	u8         reserved_at_98[0x3];
+	u8         log_msg_size[0x5];
+
+	u8         reserved_at_a0[0x8];
+	u8         reporting_qpn[0x18];
+
+	u8         reserved_at_c0[0x18];
+	u8         num_ranges[0x8];
+
+	u8         reserved_at_e0[0x20];
+
+	u8         range_start_address[0x40];
+
+	u8         length[0x40];
+
+	struct     mlx5_ifc_page_track_range_bits track_range[0];
+};
+
+struct mlx5_ifc_create_page_track_obj_in_bits {
+	struct mlx5_ifc_general_obj_in_cmd_hdr_bits general_obj_in_cmd_hdr;
+	struct mlx5_ifc_page_track_bits obj_context;
+};
+
+struct mlx5_ifc_modify_page_track_obj_in_bits {
+	struct mlx5_ifc_general_obj_in_cmd_hdr_bits general_obj_in_cmd_hdr;
+	struct mlx5_ifc_page_track_bits obj_context;
+};
 #endif /* MLX5_IFC_H */
-- 
2.18.1



* [PATCH V2 vfio 02/11] net/mlx5: Query ADV_VIRTUALIZATION capabilities
  2022-07-14  8:12 [PATCH V2 vfio 00/11] Add device DMA logging support for mlx5 driver Yishai Hadas
  2022-07-14  8:12 ` [PATCH V2 vfio 01/11] net/mlx5: Introduce ifc bits for page tracker Yishai Hadas
@ 2022-07-14  8:12 ` Yishai Hadas
  2022-07-14  8:12 ` [PATCH V2 vfio 03/11] vfio: Introduce DMA logging uAPIs Yishai Hadas
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: Yishai Hadas @ 2022-07-14  8:12 UTC (permalink / raw)
  To: alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins, leonro,
	yishaih, maorg, cohuck

Query the ADV_VIRTUALIZATION capabilities, which provide information
about advanced virtualization related features.

The current capabilities refer to the page tracker object, which is used
for tracking the pages that are dirtied by the device.
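
Once cached, a consumer (such as the vfio driver added later in this
series) can read individual fields through the new macro; an illustrative
snippet, not part of this patch:

    if (MLX5_CAP_GEN(mdev, adv_virtualization)) {
            u8 max_ranges = MLX5_CAP_ADV_VIRTUALIZATION(mdev,
                                    pg_track_max_num_range);
            u8 log_max_page_size = MLX5_CAP_ADV_VIRTUALIZATION(mdev,
                                    pg_track_log_max_page_size);
            /* size the tracker ranges / page size accordingly */
    }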

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/fw.c   | 6 ++++++
 drivers/net/ethernet/mellanox/mlx5/core/main.c | 1 +
 include/linux/mlx5/device.h                    | 9 +++++++++
 3 files changed, 16 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fw.c b/drivers/net/ethernet/mellanox/mlx5/core/fw.c
index cfb8bedba512..45b9891b7947 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fw.c
@@ -273,6 +273,12 @@ int mlx5_query_hca_caps(struct mlx5_core_dev *dev)
 			return err;
 	}
 
+	if (MLX5_CAP_GEN(dev, adv_virtualization)) {
+		err = mlx5_core_get_caps(dev, MLX5_CAP_ADV_VIRTUALIZATION);
+		if (err)
+			return err;
+	}
+
 	return 0;
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index c9b4e50a593e..5ecaaee2624c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1432,6 +1432,7 @@ static const int types[] = {
 	MLX5_CAP_IPSEC,
 	MLX5_CAP_PORT_SELECTION,
 	MLX5_CAP_DEV_SHAMPO,
+	MLX5_CAP_ADV_VIRTUALIZATION,
 };
 
 static void mlx5_hca_caps_free(struct mlx5_core_dev *dev)
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index 604b85dd770a..96ea0c1796f8 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -1204,6 +1204,7 @@ enum mlx5_cap_type {
 	MLX5_CAP_DEV_SHAMPO = 0x1d,
 	MLX5_CAP_GENERAL_2 = 0x20,
 	MLX5_CAP_PORT_SELECTION = 0x25,
+	MLX5_CAP_ADV_VIRTUALIZATION = 0x26,
 	/* NUM OF CAP Types */
 	MLX5_CAP_NUM
 };
@@ -1369,6 +1370,14 @@ enum mlx5_qcam_feature_groups {
 	MLX5_GET(port_selection_cap, \
 		 mdev->caps.hca[MLX5_CAP_PORT_SELECTION]->max, cap)
 
+#define MLX5_CAP_ADV_VIRTUALIZATION(mdev, cap) \
+	MLX5_GET(adv_virtualization_cap, \
+		 mdev->caps.hca[MLX5_CAP_ADV_VIRTUALIZATION]->cur, cap)
+
+#define MLX5_CAP_ADV_VIRTUALIZATION_MAX(mdev, cap) \
+	MLX5_GET(adv_virtualization_cap, \
+		 mdev->caps.hca[MLX5_CAP_ADV_VIRTUALIZATION]->max, cap)
+
 #define MLX5_CAP_FLOWTABLE_PORT_SELECTION(mdev, cap) \
 	MLX5_CAP_PORT_SELECTION(mdev, flow_table_properties_port_selection.cap)
 
-- 
2.18.1



* [PATCH V2 vfio 03/11] vfio: Introduce DMA logging uAPIs
  2022-07-14  8:12 [PATCH V2 vfio 00/11] Add device DMA logging support for mlx5 driver Yishai Hadas
  2022-07-14  8:12 ` [PATCH V2 vfio 01/11] net/mlx5: Introduce ifc bits for page tracker Yishai Hadas
  2022-07-14  8:12 ` [PATCH V2 vfio 02/11] net/mlx5: Query ADV_VIRTUALIZATION capabilities Yishai Hadas
@ 2022-07-14  8:12 ` Yishai Hadas
  2022-07-18 22:29   ` Alex Williamson
  2022-07-21  8:45   ` Tian, Kevin
  2022-07-14  8:12 ` [PATCH V2 vfio 04/11] vfio: Move vfio.c to vfio_main.c Yishai Hadas
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 52+ messages in thread
From: Yishai Hadas @ 2022-07-14  8:12 UTC (permalink / raw)
  To: alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins, leonro,
	yishaih, maorg, cohuck

DMA logging allows a device to internally record what DMAs the device is
initiating and report them back to userspace. It is part of the VFIO
migration infrastructure that allows implementing dirty page tracking
during the pre copy phase of live migration. Only DMA WRITEs are logged,
and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.

This patch introduces the uAPIs involved in DMA logging.

It uses the FEATURE ioctl with its GET/SET/PROBE options as below.

It exposes a PROBE option to detect if the device supports DMA logging.
It exposes a SET option to start device DMA logging over given IOVA
ranges.
It exposes a SET option to stop device DMA logging that was previously
started.
It exposes a GET option to read back and clear the device DMA log.

Extra details for each option are documented in vfio.h.
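
For illustration only (the helper name, buffer sizing and error handling
are not part of this patch), reading back the dirty log from userspace
could look roughly like:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* bitmap must be zeroed by the caller and hold one bit per page_size unit */
    static int read_dma_log(int device_fd, uint64_t iova, uint64_t length,
                            uint64_t page_size, uint64_t *bitmap)
    {
            uint8_t buf[sizeof(struct vfio_device_feature) +
                        sizeof(struct vfio_device_feature_dma_logging_report)]
                        __attribute__((aligned(8))) = {};
            struct vfio_device_feature *feature = (void *)buf;
            struct vfio_device_feature_dma_logging_report *report =
                    (void *)feature->data;

            feature->argsz = sizeof(buf);
            feature->flags = VFIO_DEVICE_FEATURE_GET |
                             VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT;
            report->iova = iova;
            report->length = length;
            report->page_size = page_size;
            report->bitmap = (uintptr_t)bitmap;

            return ioctl(device_fd, VFIO_DEVICE_FEATURE, feature);
    }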

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 include/uapi/linux/vfio.h | 79 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 733a1cddde30..81475c3e7c92 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -986,6 +986,85 @@ enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
 };
 
+/*
+ * Upon VFIO_DEVICE_FEATURE_SET start device DMA logging.
+ * VFIO_DEVICE_FEATURE_PROBE can be used to detect if the device supports
+ * DMA logging.
+ *
+ * DMA logging allows a device to internally record what DMAs the device is
+ * initiating and report them back to userspace. It is part of the VFIO
+ * migration infrastructure that allows implementing dirty page tracking
+ * during the pre copy phase of live migration. Only DMA WRITEs are logged,
+ * and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
+ *
+ * When DMA logging is started a range of IOVAs to monitor is provided and the
+ * device can optimize its logging to cover only the IOVA range given. Each
+ * DMA that the device initiates inside the range will be logged by the device
+ * for later retrieval.
+ *
+ * page_size is an input that hints what tracking granularity the device
+ * should try to achieve. If the device cannot do the hinted page size then it
+ * should pick the next closest page size it supports. On output the device
+ * will return the page size it selected.
+ *
+ * ranges is a pointer to an array of
+ * struct vfio_device_feature_dma_logging_range.
+ */
+struct vfio_device_feature_dma_logging_control {
+	__aligned_u64 page_size;
+	__u32 num_ranges;
+	__u32 __reserved;
+	__aligned_u64 ranges;
+};
+
+struct vfio_device_feature_dma_logging_range {
+	__aligned_u64 iova;
+	__aligned_u64 length;
+};
+
+#define VFIO_DEVICE_FEATURE_DMA_LOGGING_START 3
+
+/*
+ * Upon VFIO_DEVICE_FEATURE_SET stop device DMA logging that was started
+ * by VFIO_DEVICE_FEATURE_DMA_LOGGING_START
+ */
+#define VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP 4
+
+/*
+ * Upon VFIO_DEVICE_FEATURE_GET read back and clear the device DMA log
+ *
+ * Query the device's DMA log for written pages within the given IOVA range.
+ * During querying the log is cleared for the IOVA range.
+ *
+ * bitmap is a pointer to an array of u64s that will hold the output bitmap
+ * with 1 bit reporting a page_size unit of IOVA. The mapping of IOVA to bits
+ * is given by:
+ *  bitmap[(addr - iova)/page_size] & (1ULL << (addr % 64))
+ *
+ * The input page_size can be any power of two value and does not have to
+ * match the value given to VFIO_DEVICE_FEATURE_DMA_LOGGING_START. The driver
+ * will format its internal logging to match the reporting page size, possibly
+ * by replicating bits if the internal page size is lower than requested.
+ *
+ * Bits will be updated in bitmap using atomic or to allow userspace to
+ * combine bitmaps from multiple trackers together. Therefore userspace must
+ * zero the bitmap before doing any reports.
+ *
+ * If any error is returned userspace should assume that the dirty log is
+ * corrupted and restart.
+ *
+ * If DMA logging is not enabled, an error will be returned.
+ *
+ */
+struct vfio_device_feature_dma_logging_report {
+	__aligned_u64 iova;
+	__aligned_u64 length;
+	__aligned_u64 page_size;
+	__aligned_u64 bitmap;
+};
+
+#define VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT 5
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.18.1



* [PATCH V2 vfio 04/11] vfio: Move vfio.c to vfio_main.c
  2022-07-14  8:12 [PATCH V2 vfio 00/11] Add device DMA logging support for mlx5 driver Yishai Hadas
                   ` (2 preceding siblings ...)
  2022-07-14  8:12 ` [PATCH V2 vfio 03/11] vfio: Introduce DMA logging uAPIs Yishai Hadas
@ 2022-07-14  8:12 ` Yishai Hadas
  2022-07-14  8:12 ` [PATCH V2 vfio 05/11] vfio: Add an IOVA bitmap support Yishai Hadas
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: Yishai Hadas @ 2022-07-14  8:12 UTC (permalink / raw)
  To: alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins, leonro,
	yishaih, maorg, cohuck

From: Jason Gunthorpe <jgg@nvidia.com>

If a source file has the same name as a module then kbuild only supports
a single source file in the module.

Rename vfio.c to vfio_main.c so that we can have more than one .c file
in vfio.ko.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>

---
 drivers/vfio/Makefile                | 2 ++
 drivers/vfio/{vfio.c => vfio_main.c} | 0
 2 files changed, 2 insertions(+)
 rename drivers/vfio/{vfio.c => vfio_main.c} (100%)

diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index fee73f3d9480..1a32357592e3 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1,6 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0
 vfio_virqfd-y := virqfd.o
 
+vfio-y += vfio_main.o
+
 obj-$(CONFIG_VFIO) += vfio.o
 obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o
 obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio_main.c
similarity index 100%
rename from drivers/vfio/vfio.c
rename to drivers/vfio/vfio_main.c
-- 
2.18.1



* [PATCH V2 vfio 05/11] vfio: Add an IOVA bitmap support
  2022-07-14  8:12 [PATCH V2 vfio 00/11] Add device DMA logging support for mlx5 driver Yishai Hadas
                   ` (3 preceding siblings ...)
  2022-07-14  8:12 ` [PATCH V2 vfio 04/11] vfio: Move vfio.c to vfio_main.c Yishai Hadas
@ 2022-07-14  8:12 ` Yishai Hadas
  2022-07-18 22:30   ` Alex Williamson
  2022-07-19 19:01   ` Alex Williamson
  2022-07-14  8:12 ` [PATCH V2 vfio 06/11] vfio: Introduce the DMA logging feature support Yishai Hadas
                   ` (6 subsequent siblings)
  11 siblings, 2 replies; 52+ messages in thread
From: Yishai Hadas @ 2022-07-14  8:12 UTC (permalink / raw)
  To: alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins, leonro,
	yishaih, maorg, cohuck

From: Joao Martins <joao.m.martins@oracle.com>

The new facility adds a set of wrappers that abstract how an IOVA range
is represented in a bitmap granulated by a given page_size. It does all
the heavy lifting of translating the user pointers into the kernel
addresses backing that user memory, and finally performs the bitmap ops
to set the relevant bits.

The formula for the bitmap is:

   data[(iova / page_size) / 64] & (1ULL << (iova % 64))

where 64 is the number of bits in an unsigned long (depending on the arch).

An example usage of these helpers for a given @iova, @page_size, @length
and __user @data:

	iova_bitmap_init(&iter.dirty, iova, __ffs(page_size));
	ret = iova_bitmap_iter_init(&iter, iova, length, data);
	if (ret)
		return -ENOMEM;

	for (; !iova_bitmap_iter_done(&iter);
	     iova_bitmap_iter_advance(&iter)) {
		ret = iova_bitmap_iter_get(&iter);
		if (ret)
			break;
		if (dirty)
			iova_bitmap_set(iova_bitmap_iova(&iter),
					iova_bitmap_iova_length(&iter),
					&iter.dirty);

		iova_bitmap_iter_put(&iter);

		if (ret)
			break;
	}

	iova_bitmap_iter_free(&iter);

The facility is intended to be used for user bitmaps representing IOVAs
dirtied by the IOMMU (via IOMMUFD) and by PCI devices (via vfio-pci).

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/Makefile       |   6 +-
 drivers/vfio/iova_bitmap.c  | 164 ++++++++++++++++++++++++++++++++++++
 include/linux/iova_bitmap.h |  46 ++++++++++
 3 files changed, 214 insertions(+), 2 deletions(-)
 create mode 100644 drivers/vfio/iova_bitmap.c
 create mode 100644 include/linux/iova_bitmap.h

diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 1a32357592e3..1d6cad32d366 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1,9 +1,11 @@
 # SPDX-License-Identifier: GPL-2.0
 vfio_virqfd-y := virqfd.o
 
-vfio-y += vfio_main.o
-
 obj-$(CONFIG_VFIO) += vfio.o
+
+vfio-y := vfio_main.o \
+          iova_bitmap.o \
+
 obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o
 obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
 obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
diff --git a/drivers/vfio/iova_bitmap.c b/drivers/vfio/iova_bitmap.c
new file mode 100644
index 000000000000..9ad1533a6aec
--- /dev/null
+++ b/drivers/vfio/iova_bitmap.c
@@ -0,0 +1,164 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2022, Oracle and/or its affiliates.
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include <linux/iova_bitmap.h>
+
+static unsigned long iova_bitmap_iova_to_index(struct iova_bitmap_iter *iter,
+					       unsigned long iova_length)
+{
+	unsigned long pgsize = 1 << iter->dirty.pgshift;
+
+	return DIV_ROUND_UP(iova_length, BITS_PER_TYPE(*iter->data) * pgsize);
+}
+
+static unsigned long iova_bitmap_index_to_iova(struct iova_bitmap_iter *iter,
+					       unsigned long index)
+{
+	unsigned long pgshift = iter->dirty.pgshift;
+
+	return (index * sizeof(*iter->data) * BITS_PER_BYTE) << pgshift;
+}
+
+static unsigned long iova_bitmap_iter_left(struct iova_bitmap_iter *iter)
+{
+	unsigned long left = iter->count - iter->offset;
+
+	left = min_t(unsigned long, left,
+		     (iter->dirty.npages << PAGE_SHIFT) / sizeof(*iter->data));
+
+	return left;
+}
+
+/*
+ * Input argument of number of bits to bitmap_set() is unsigned integer, which
+ * further casts to signed integer for unaligned multi-bit operation,
+ * __bitmap_set().
+ * Then maximum bitmap size supported is 2^31 bits divided by 2^3 bits/byte,
+ * that is 2^28 (256 MB) which maps to 2^31 * 2^12 = 2^43 (8TB) on 4K page
+ * system.
+ */
+int iova_bitmap_iter_init(struct iova_bitmap_iter *iter,
+			  unsigned long iova, unsigned long length,
+			  u64 __user *data)
+{
+	struct iova_bitmap *dirty = &iter->dirty;
+
+	iter->data = data;
+	iter->offset = 0;
+	iter->count = iova_bitmap_iova_to_index(iter, length);
+	iter->iova = iova;
+	iter->length = length;
+	dirty->pages = (struct page **)__get_free_page(GFP_KERNEL);
+
+	return !dirty->pages ? -ENOMEM : 0;
+}
+
+void iova_bitmap_iter_free(struct iova_bitmap_iter *iter)
+{
+	struct iova_bitmap *dirty = &iter->dirty;
+
+	if (dirty->pages) {
+		free_page((unsigned long)dirty->pages);
+		dirty->pages = NULL;
+	}
+}
+
+bool iova_bitmap_iter_done(struct iova_bitmap_iter *iter)
+{
+	return iter->offset >= iter->count;
+}
+
+unsigned long iova_bitmap_length(struct iova_bitmap_iter *iter)
+{
+	unsigned long max_iova = iter->dirty.iova + iter->length;
+	unsigned long left = iova_bitmap_iter_left(iter);
+	unsigned long iova = iova_bitmap_iova(iter);
+
+	left = iova_bitmap_index_to_iova(iter, left);
+	if (iova + left > max_iova)
+		left -= ((iova + left) - max_iova);
+
+	return left;
+}
+
+unsigned long iova_bitmap_iova(struct iova_bitmap_iter *iter)
+{
+	unsigned long skip = iter->offset;
+
+	return iter->iova + iova_bitmap_index_to_iova(iter, skip);
+}
+
+void iova_bitmap_iter_advance(struct iova_bitmap_iter *iter)
+{
+	unsigned long length = iova_bitmap_length(iter);
+
+	iter->offset += iova_bitmap_iova_to_index(iter, length);
+}
+
+void iova_bitmap_iter_put(struct iova_bitmap_iter *iter)
+{
+	struct iova_bitmap *dirty = &iter->dirty;
+
+	if (dirty->npages)
+		unpin_user_pages(dirty->pages, dirty->npages);
+}
+
+int iova_bitmap_iter_get(struct iova_bitmap_iter *iter)
+{
+	struct iova_bitmap *dirty = &iter->dirty;
+	unsigned long npages;
+	u64 __user *addr;
+	long ret;
+
+	npages = DIV_ROUND_UP((iter->count - iter->offset) *
+			      sizeof(*iter->data), PAGE_SIZE);
+	npages = min(npages,  PAGE_SIZE / sizeof(struct page *));
+	addr = iter->data + iter->offset;
+	ret = pin_user_pages_fast((unsigned long)addr, npages,
+				  FOLL_WRITE, dirty->pages);
+	if (ret <= 0)
+		return ret;
+
+	dirty->npages = (unsigned long)ret;
+	dirty->iova = iova_bitmap_iova(iter);
+	dirty->start_offset = offset_in_page(addr);
+	return 0;
+}
+
+void iova_bitmap_init(struct iova_bitmap *bitmap,
+		      unsigned long base, unsigned long pgshift)
+{
+	memset(bitmap, 0, sizeof(*bitmap));
+	bitmap->iova = base;
+	bitmap->pgshift = pgshift;
+}
+
+unsigned int iova_bitmap_set(struct iova_bitmap *dirty,
+			     unsigned long iova,
+			     unsigned long length)
+{
+	unsigned long nbits, offset, start_offset, idx, size, *kaddr;
+
+	nbits = max(1UL, length >> dirty->pgshift);
+	offset = (iova - dirty->iova) >> dirty->pgshift;
+	idx = offset / (PAGE_SIZE * BITS_PER_BYTE);
+	offset = offset % (PAGE_SIZE * BITS_PER_BYTE);
+	start_offset = dirty->start_offset;
+
+	while (nbits > 0) {
+		kaddr = kmap_local_page(dirty->pages[idx]) + start_offset;
+		size = min(PAGE_SIZE * BITS_PER_BYTE - offset, nbits);
+		bitmap_set(kaddr, offset, size);
+		kunmap_local(kaddr - start_offset);
+		start_offset = offset = 0;
+		nbits -= size;
+		idx++;
+	}
+
+	return nbits;
+}
+EXPORT_SYMBOL_GPL(iova_bitmap_set);
+
diff --git a/include/linux/iova_bitmap.h b/include/linux/iova_bitmap.h
new file mode 100644
index 000000000000..c474c351634a
--- /dev/null
+++ b/include/linux/iova_bitmap.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2022, Oracle and/or its affiliates.
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#ifndef _IOVA_BITMAP_H_
+#define _IOVA_BITMAP_H_
+
+#include <linux/highmem.h>
+#include <linux/mm.h>
+#include <linux/uio.h>
+
+struct iova_bitmap {
+	unsigned long iova;
+	unsigned long pgshift;
+	unsigned long start_offset;
+	unsigned long npages;
+	struct page **pages;
+};
+
+struct iova_bitmap_iter {
+	struct iova_bitmap dirty;
+	u64 __user *data;
+	size_t offset;
+	size_t count;
+	unsigned long iova;
+	unsigned long length;
+};
+
+int iova_bitmap_iter_init(struct iova_bitmap_iter *iter, unsigned long iova,
+			  unsigned long length, u64 __user *data);
+void iova_bitmap_iter_free(struct iova_bitmap_iter *iter);
+bool iova_bitmap_iter_done(struct iova_bitmap_iter *iter);
+unsigned long iova_bitmap_length(struct iova_bitmap_iter *iter);
+unsigned long iova_bitmap_iova(struct iova_bitmap_iter *iter);
+void iova_bitmap_iter_advance(struct iova_bitmap_iter *iter);
+int iova_bitmap_iter_get(struct iova_bitmap_iter *iter);
+void iova_bitmap_iter_put(struct iova_bitmap_iter *iter);
+void iova_bitmap_init(struct iova_bitmap *bitmap,
+		      unsigned long base, unsigned long pgshift);
+unsigned int iova_bitmap_set(struct iova_bitmap *dirty,
+			     unsigned long iova,
+			     unsigned long length);
+
+#endif
-- 
2.18.1



* [PATCH V2 vfio 06/11] vfio: Introduce the DMA logging feature support
  2022-07-14  8:12 [PATCH V2 vfio 00/11] Add device DMA logging support for mlx5 driver Yishai Hadas
                   ` (4 preceding siblings ...)
  2022-07-14  8:12 ` [PATCH V2 vfio 05/11] vfio: Add an IOVA bitmap support Yishai Hadas
@ 2022-07-14  8:12 ` Yishai Hadas
  2022-07-18 22:30   ` Alex Williamson
  2022-07-14  8:12 ` [PATCH V2 vfio 07/11] vfio/mlx5: Init QP based resources for dirty tracking Yishai Hadas
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 52+ messages in thread
From: Yishai Hadas @ 2022-07-14  8:12 UTC (permalink / raw)
  To: alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins, leonro,
	yishaih, maorg, cohuck

Introduce the DMA logging feature support in the vfio core layer.

It includes the processing of the device start/stop/report DMA logging
UAPIs and calling the relevant driver 'op' to do the work.

Specifically,
Upon start, the core translates the given input ranges into an interval
tree, checks for unexpectedly overlapping or non-aligned ranges and then
passes the translated input to the driver to start tracking the given
ranges.

Upon report, the core translates the given input user space bitmap and
page size into an IOVA kernel bitmap iterator. Then it iterates over it
and calls the driver to set the corresponding bits for the dirtied pages
in a specific IOVA range.

Upon stop, the driver is called to stop the previously started tracking.

The next patches in the series will introduce the mlx5 driver
implementation of the logging ops.
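
A driver opting into this support fills the new ops and points ->log_ops
at them before registering the device; schematically (this is not the
actual mlx5 code that follows later in the series):

    static const struct vfio_log_ops my_dev_log_ops = {
            .log_start = my_dev_log_start,
            .log_stop = my_dev_log_stop,
            .log_read_and_clear = my_dev_log_read_and_clear,
    };

    /* before vfio_pci_core_register_device() / vfio_register_group_dev() */
    my_device->vdev.log_ops = &my_dev_log_ops;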

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/Kconfig             |   1 +
 drivers/vfio/pci/vfio_pci_core.c |   5 +
 drivers/vfio/vfio_main.c         | 161 +++++++++++++++++++++++++++++++
 include/linux/vfio.h             |  21 +++-
 4 files changed, 186 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 6130d00252ed..86c381ceb9a1 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -3,6 +3,7 @@ menuconfig VFIO
 	tristate "VFIO Non-Privileged userspace driver framework"
 	select IOMMU_API
 	select VFIO_IOMMU_TYPE1 if MMU && (X86 || S390 || ARM || ARM64)
+	select INTERVAL_TREE
 	help
 	  VFIO provides a framework for secure userspace device drivers.
 	  See Documentation/driver-api/vfio.rst for more details.
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 2efa06b1fafa..b6dabf398251 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1862,6 +1862,11 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
 			return -EINVAL;
 	}
 
+	if (vdev->vdev.log_ops && !(vdev->vdev.log_ops->log_start &&
+	    vdev->vdev.log_ops->log_stop &&
+	    vdev->vdev.log_ops->log_read_and_clear))
+		return -EINVAL;
+
 	/*
 	 * Prevent binding to PFs with VFs enabled, the VFs might be in use
 	 * by the host or other users.  We cannot capture the VFs if they
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index bd84ca7c5e35..2414d827e3c8 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -32,6 +32,8 @@
 #include <linux/vfio.h>
 #include <linux/wait.h>
 #include <linux/sched/signal.h>
+#include <linux/interval_tree.h>
+#include <linux/iova_bitmap.h>
 #include "vfio.h"
 
 #define DRIVER_VERSION	"0.3"
@@ -1603,6 +1605,153 @@ static int vfio_ioctl_device_feature_migration(struct vfio_device *device,
 	return 0;
 }
 
+#define LOG_MAX_RANGES 1024
+
+static int
+vfio_ioctl_device_feature_logging_start(struct vfio_device *device,
+					u32 flags, void __user *arg,
+					size_t argsz)
+{
+	size_t minsz =
+		offsetofend(struct vfio_device_feature_dma_logging_control,
+			    ranges);
+	struct vfio_device_feature_dma_logging_range __user *ranges;
+	struct vfio_device_feature_dma_logging_control control;
+	struct vfio_device_feature_dma_logging_range range;
+	struct rb_root_cached root = RB_ROOT_CACHED;
+	struct interval_tree_node *nodes;
+	u32 nnodes;
+	int i, ret;
+
+	if (!device->log_ops)
+		return -ENOTTY;
+
+	ret = vfio_check_feature(flags, argsz,
+				 VFIO_DEVICE_FEATURE_SET,
+				 sizeof(control));
+	if (ret != 1)
+		return ret;
+
+	if (copy_from_user(&control, arg, minsz))
+		return -EFAULT;
+
+	nnodes = control.num_ranges;
+	if (!nnodes || nnodes > LOG_MAX_RANGES)
+		return -EINVAL;
+
+	ranges = u64_to_user_ptr(control.ranges);
+	nodes = kmalloc_array(nnodes, sizeof(struct interval_tree_node),
+			      GFP_KERNEL);
+	if (!nodes)
+		return -ENOMEM;
+
+	for (i = 0; i < nnodes; i++) {
+		if (copy_from_user(&range, &ranges[i], sizeof(range))) {
+			ret = -EFAULT;
+			goto end;
+		}
+		if (!IS_ALIGNED(range.iova, control.page_size) ||
+		    !IS_ALIGNED(range.length, control.page_size)) {
+			ret = -EINVAL;
+			goto end;
+		}
+		nodes[i].start = range.iova;
+		nodes[i].last = range.iova + range.length - 1;
+		if (interval_tree_iter_first(&root, nodes[i].start,
+					     nodes[i].last)) {
+			/* Range overlapping */
+			ret = -EINVAL;
+			goto end;
+		}
+		interval_tree_insert(nodes + i, &root);
+	}
+
+	ret = device->log_ops->log_start(device, &root, nnodes,
+					 &control.page_size);
+	if (ret)
+		goto end;
+
+	if (copy_to_user(arg, &control, sizeof(control))) {
+		ret = -EFAULT;
+		device->log_ops->log_stop(device);
+	}
+
+end:
+	kfree(nodes);
+	return ret;
+}
+
+static int
+vfio_ioctl_device_feature_logging_stop(struct vfio_device *device,
+				       u32 flags, void __user *arg,
+				       size_t argsz)
+{
+	int ret;
+
+	if (!device->log_ops)
+		return -ENOTTY;
+
+	ret = vfio_check_feature(flags, argsz,
+				 VFIO_DEVICE_FEATURE_SET, 0);
+	if (ret != 1)
+		return ret;
+
+	return device->log_ops->log_stop(device);
+}
+
+static int
+vfio_ioctl_device_feature_logging_report(struct vfio_device *device,
+					 u32 flags, void __user *arg,
+					 size_t argsz)
+{
+	size_t minsz =
+		offsetofend(struct vfio_device_feature_dma_logging_report,
+			    bitmap);
+	struct vfio_device_feature_dma_logging_report report;
+	struct iova_bitmap_iter iter;
+	int ret;
+
+	if (!device->log_ops)
+		return -ENOTTY;
+
+	ret = vfio_check_feature(flags, argsz,
+				 VFIO_DEVICE_FEATURE_GET,
+				 sizeof(report));
+	if (ret != 1)
+		return ret;
+
+	if (copy_from_user(&report, arg, minsz))
+		return -EFAULT;
+
+	if (report.page_size < PAGE_SIZE)
+		return -EINVAL;
+
+	iova_bitmap_init(&iter.dirty, report.iova, ilog2(report.page_size));
+	ret = iova_bitmap_iter_init(&iter, report.iova, report.length,
+				    u64_to_user_ptr(report.bitmap));
+	if (ret)
+		return ret;
+
+	for (; !iova_bitmap_iter_done(&iter);
+	     iova_bitmap_iter_advance(&iter)) {
+		ret = iova_bitmap_iter_get(&iter);
+		if (ret)
+			break;
+
+		ret = device->log_ops->log_read_and_clear(device,
+			iova_bitmap_iova(&iter),
+			iova_bitmap_length(&iter), &iter.dirty);
+
+		iova_bitmap_iter_put(&iter);
+
+		if (ret)
+			break;
+	}
+
+	iova_bitmap_iter_free(&iter);
+	return ret;
+}
+
 static int vfio_ioctl_device_feature(struct vfio_device *device,
 				     struct vfio_device_feature __user *arg)
 {
@@ -1636,6 +1785,18 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
 		return vfio_ioctl_device_feature_mig_device_state(
 			device, feature.flags, arg->data,
 			feature.argsz - minsz);
+	case VFIO_DEVICE_FEATURE_DMA_LOGGING_START:
+		return vfio_ioctl_device_feature_logging_start(
+			device, feature.flags, arg->data,
+			feature.argsz - minsz);
+	case VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP:
+		return vfio_ioctl_device_feature_logging_stop(
+			device, feature.flags, arg->data,
+			feature.argsz - minsz);
+	case VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT:
+		return vfio_ioctl_device_feature_logging_report(
+			device, feature.flags, arg->data,
+			feature.argsz - minsz);
 	default:
 		if (unlikely(!device->ops->device_feature))
 			return -EINVAL;
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 4d26e149db81..feed84d686ec 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -14,6 +14,7 @@
 #include <linux/workqueue.h>
 #include <linux/poll.h>
 #include <uapi/linux/vfio.h>
+#include <linux/iova_bitmap.h>
 
 struct kvm;
 
@@ -33,10 +34,11 @@ struct vfio_device {
 	struct device *dev;
 	const struct vfio_device_ops *ops;
 	/*
-	 * mig_ops is a static property of the vfio_device which must be set
-	 * prior to registering the vfio_device.
+	 * mig_ops/log_ops is a static property of the vfio_device which must
+	 * be set prior to registering the vfio_device.
 	 */
 	const struct vfio_migration_ops *mig_ops;
+	const struct vfio_log_ops *log_ops;
 	struct vfio_group *group;
 	struct vfio_device_set *dev_set;
 	struct list_head dev_set_list;
@@ -104,6 +106,21 @@ struct vfio_migration_ops {
 				   enum vfio_device_mig_state *curr_state);
 };
 
+/**
+ * @log_start: Optional callback to ask the device start DMA logging.
+ * @log_stop: Optional callback to ask the device stop DMA logging.
+ * @log_read_and_clear: Optional callback to ask the device read
+ *         and clear the dirty DMAs in some given range.
+ */
+struct vfio_log_ops {
+	int (*log_start)(struct vfio_device *device,
+		struct rb_root_cached *ranges, u32 nnodes, u64 *page_size);
+	int (*log_stop)(struct vfio_device *device);
+	int (*log_read_and_clear)(struct vfio_device *device,
+		unsigned long iova, unsigned long length,
+		struct iova_bitmap *dirty);
+};
+
 /**
  * vfio_check_feature - Validate user input for the VFIO_DEVICE_FEATURE ioctl
  * @flags: Arg from the device_feature op
-- 
2.18.1



* [PATCH V2 vfio 07/11] vfio/mlx5: Init QP based resources for dirty tracking
  2022-07-14  8:12 [PATCH V2 vfio 00/11] Add device DMA logging support for mlx5 driver Yishai Hadas
                   ` (5 preceding siblings ...)
  2022-07-14  8:12 ` [PATCH V2 vfio 06/11] vfio: Introduce the DMA logging feature support Yishai Hadas
@ 2022-07-14  8:12 ` Yishai Hadas
  2022-07-14  8:12 ` [PATCH V2 vfio 08/11] vfio/mlx5: Create and destroy page tracker object Yishai Hadas
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: Yishai Hadas @ 2022-07-14  8:12 UTC (permalink / raw)
  To: alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins, leonro,
	yishaih, maorg, cohuck

Init the QP based resources for dirty tracking, to be used when logging
is started.

This includes:
Creating the host and firmware RC QPs and moving each of them to its
expected state based on the device specification.

Creating the resources that are needed by both QPs, such as the UAR and
PD.

Creating the host receive side resources, such as the MKEY, CQ and
receive WQEs.

The above resources are cleaned up when logging is stopped.

The tracker object that will be introduced by the next patches will use
those resources.
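
For orientation, the setup order implemented by mlx5vf_start_page_tracker()
in this patch is roughly as follows (error handling and sizing omitted):

    tracker->uar = mlx5_get_uars_page(mdev);
    mlx5_core_alloc_pd(mdev, &tracker->pdn);
    mlx5vf_create_cq(mdev, tracker, max_recv_wr);
    host_qp = mlx5vf_create_rc_qp(mdev, tracker, max_recv_wr);
    mlx5vf_alloc_qp_recv_resources(mdev, host_qp, tracker->pdn, rq_size);
    fw_qp = mlx5vf_create_rc_qp(mdev, tracker, 0);    /* zero-length RQ */
    mlx5vf_activate_qp(mdev, host_qp, fw_qp->qpn, true);
    mlx5vf_activate_qp(mdev, fw_qp, host_qp->qpn, false);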

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/pci/mlx5/cmd.c | 595 +++++++++++++++++++++++++++++++++++-
 drivers/vfio/pci/mlx5/cmd.h |  53 ++++
 2 files changed, 636 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index dd5d7bfe0a49..0a362796d567 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -7,6 +7,8 @@
 
 static int mlx5vf_cmd_get_vhca_id(struct mlx5_core_dev *mdev, u16 function_id,
 				  u16 *vhca_id);
+static void
+_mlx5vf_free_page_tracker_resources(struct mlx5vf_pci_core_device *mvdev);
 
 int mlx5vf_cmd_suspend_vhca(struct mlx5vf_pci_core_device *mvdev, u16 op_mod)
 {
@@ -72,19 +74,22 @@ static int mlx5fv_vf_event(struct notifier_block *nb,
 	struct mlx5vf_pci_core_device *mvdev =
 		container_of(nb, struct mlx5vf_pci_core_device, nb);
 
-	mutex_lock(&mvdev->state_mutex);
 	switch (event) {
 	case MLX5_PF_NOTIFY_ENABLE_VF:
+		mutex_lock(&mvdev->state_mutex);
 		mvdev->mdev_detach = false;
+		mlx5vf_state_mutex_unlock(mvdev);
 		break;
 	case MLX5_PF_NOTIFY_DISABLE_VF:
-		mlx5vf_disable_fds(mvdev);
+		mlx5vf_cmd_close_migratable(mvdev);
+		mutex_lock(&mvdev->state_mutex);
 		mvdev->mdev_detach = true;
+		mlx5vf_state_mutex_unlock(mvdev);
 		break;
 	default:
 		break;
 	}
-	mlx5vf_state_mutex_unlock(mvdev);
+
 	return 0;
 }
 
@@ -95,6 +100,7 @@ void mlx5vf_cmd_close_migratable(struct mlx5vf_pci_core_device *mvdev)
 
 	mutex_lock(&mvdev->state_mutex);
 	mlx5vf_disable_fds(mvdev);
+	_mlx5vf_free_page_tracker_resources(mvdev);
 	mlx5vf_state_mutex_unlock(mvdev);
 }
 
@@ -188,11 +194,13 @@ static int mlx5vf_cmd_get_vhca_id(struct mlx5_core_dev *mdev, u16 function_id,
 	return ret;
 }
 
-static int _create_state_mkey(struct mlx5_core_dev *mdev, u32 pdn,
-			      struct mlx5_vf_migration_file *migf, u32 *mkey)
+static int _create_mkey(struct mlx5_core_dev *mdev, u32 pdn,
+			struct mlx5_vf_migration_file *migf,
+			struct mlx5_vhca_recv_buf *recv_buf,
+			u32 *mkey)
 {
-	size_t npages = DIV_ROUND_UP(migf->total_length, PAGE_SIZE);
-	struct sg_dma_page_iter dma_iter;
+	size_t npages = migf ? DIV_ROUND_UP(migf->total_length, PAGE_SIZE) :
+				recv_buf->npages;
 	int err = 0, inlen;
 	__be64 *mtt;
 	void *mkc;
@@ -209,8 +217,17 @@ static int _create_state_mkey(struct mlx5_core_dev *mdev, u32 pdn,
 		 DIV_ROUND_UP(npages, 2));
 	mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, in, klm_pas_mtt);
 
-	for_each_sgtable_dma_page(&migf->table.sgt, &dma_iter, 0)
-		*mtt++ = cpu_to_be64(sg_page_iter_dma_address(&dma_iter));
+	if (migf) {
+		struct sg_dma_page_iter dma_iter;
+
+		for_each_sgtable_dma_page(&migf->table.sgt, &dma_iter, 0)
+			*mtt++ = cpu_to_be64(sg_page_iter_dma_address(&dma_iter));
+	} else {
+		int i;
+
+		for (i = 0; i < npages; i++)
+			*mtt++ = cpu_to_be64(recv_buf->dma_addrs[i]);
+	}
 
 	mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
 	MLX5_SET(mkc, mkc, access_mode_1_0, MLX5_MKC_ACCESS_MODE_MTT);
@@ -223,7 +240,8 @@ static int _create_state_mkey(struct mlx5_core_dev *mdev, u32 pdn,
 	MLX5_SET(mkc, mkc, qpn, 0xffffff);
 	MLX5_SET(mkc, mkc, log_page_size, PAGE_SHIFT);
 	MLX5_SET(mkc, mkc, translations_octword_size, DIV_ROUND_UP(npages, 2));
-	MLX5_SET64(mkc, mkc, len, migf->total_length);
+	MLX5_SET64(mkc, mkc, len,
+		   migf ? migf->total_length : (npages * PAGE_SIZE));
 	err = mlx5_core_create_mkey(mdev, mkey, in, inlen);
 	kvfree(in);
 	return err;
@@ -297,7 +315,7 @@ int mlx5vf_cmd_save_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 	if (err)
 		goto err_dma_map;
 
-	err = _create_state_mkey(mdev, pdn, migf, &mkey);
+	err = _create_mkey(mdev, pdn, migf, NULL, &mkey);
 	if (err)
 		goto err_create_mkey;
 
@@ -369,7 +387,7 @@ int mlx5vf_cmd_load_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 	if (err)
 		goto err_reg;
 
-	err = _create_state_mkey(mdev, pdn, migf, &mkey);
+	err = _create_mkey(mdev, pdn, migf, NULL, &mkey);
 	if (err)
 		goto err_mkey;
 
@@ -391,3 +409,556 @@ int mlx5vf_cmd_load_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 	mutex_unlock(&migf->lock);
 	return err;
 }
+
+static int alloc_cq_frag_buf(struct mlx5_core_dev *mdev,
+			     struct mlx5_vhca_cq_buf *buf, int nent,
+			     int cqe_size)
+{
+	struct mlx5_frag_buf *frag_buf = &buf->frag_buf;
+	u8 log_wq_stride = 6 + (cqe_size == 128 ? 1 : 0);
+	u8 log_wq_sz = ilog2(cqe_size);
+	int err;
+
+	err = mlx5_frag_buf_alloc_node(mdev, nent * cqe_size, frag_buf,
+				       mdev->priv.numa_node);
+	if (err)
+		return err;
+
+	mlx5_init_fbc(frag_buf->frags, log_wq_stride, log_wq_sz, &buf->fbc);
+	buf->cqe_size = cqe_size;
+	buf->nent = nent;
+	return 0;
+}
+
+static void init_cq_frag_buf(struct mlx5_vhca_cq_buf *buf)
+{
+	struct mlx5_cqe64 *cqe64;
+	void *cqe;
+	int i;
+
+	for (i = 0; i < buf->nent; i++) {
+		cqe = mlx5_frag_buf_get_wqe(&buf->fbc, i);
+		cqe64 = buf->cqe_size == 64 ? cqe : cqe + 64;
+		cqe64->op_own = MLX5_CQE_INVALID << 4;
+	}
+}
+
+static void mlx5vf_destroy_cq(struct mlx5_core_dev *mdev,
+			      struct mlx5_vhca_cq *cq)
+{
+	mlx5_core_destroy_cq(mdev, &cq->mcq);
+	mlx5_frag_buf_free(mdev, &cq->buf.frag_buf);
+	mlx5_db_free(mdev, &cq->db);
+}
+
+static int mlx5vf_create_cq(struct mlx5_core_dev *mdev,
+			    struct mlx5_vhca_page_tracker *tracker,
+			    size_t ncqe)
+{
+	int cqe_size = cache_line_size() == 128 ? 128 : 64;
+	u32 out[MLX5_ST_SZ_DW(create_cq_out)];
+	struct mlx5_vhca_cq *cq;
+	int inlen, err, eqn;
+	void *cqc, *in;
+	__be64 *pas;
+	int vector;
+
+	cq = &tracker->cq;
+	ncqe = roundup_pow_of_two(ncqe);
+	err = mlx5_db_alloc_node(mdev, &cq->db, mdev->priv.numa_node);
+	if (err)
+		return err;
+
+	cq->ncqe = ncqe;
+	cq->mcq.set_ci_db = cq->db.db;
+	cq->mcq.arm_db = cq->db.db + 1;
+	cq->mcq.cqe_sz = cqe_size;
+	err = alloc_cq_frag_buf(mdev, &cq->buf, ncqe, cqe_size);
+	if (err)
+		goto err_db_free;
+
+	init_cq_frag_buf(&cq->buf);
+	inlen = MLX5_ST_SZ_BYTES(create_cq_in) +
+		MLX5_FLD_SZ_BYTES(create_cq_in, pas[0]) *
+		cq->buf.frag_buf.npages;
+	in = kvzalloc(inlen, GFP_KERNEL);
+	if (!in) {
+		err = -ENOMEM;
+		goto err_buff;
+	}
+
+	vector = raw_smp_processor_id() % mlx5_comp_vectors_count(mdev);
+	err = mlx5_vector2eqn(mdev, vector, &eqn);
+	if (err)
+		goto err_vec;
+
+	cqc = MLX5_ADDR_OF(create_cq_in, in, cq_context);
+	MLX5_SET(cqc, cqc, log_cq_size, ilog2(ncqe));
+	MLX5_SET(cqc, cqc, c_eqn_or_apu_element, eqn);
+	MLX5_SET(cqc, cqc, uar_page, tracker->uar->index);
+	MLX5_SET(cqc, cqc, log_page_size, cq->buf.frag_buf.page_shift -
+		 MLX5_ADAPTER_PAGE_SHIFT);
+	MLX5_SET64(cqc, cqc, dbr_addr, cq->db.dma);
+	pas = (__be64 *)MLX5_ADDR_OF(create_cq_in, in, pas);
+	mlx5_fill_page_frag_array(&cq->buf.frag_buf, pas);
+	err = mlx5_core_create_cq(mdev, &cq->mcq, in, inlen, out, sizeof(out));
+	if (err)
+		goto err_vec;
+
+	kvfree(in);
+	return 0;
+
+err_vec:
+	kvfree(in);
+err_buff:
+	mlx5_frag_buf_free(mdev, &cq->buf.frag_buf);
+err_db_free:
+	mlx5_db_free(mdev, &cq->db);
+	return err;
+}
+
+static struct mlx5_vhca_qp *
+mlx5vf_create_rc_qp(struct mlx5_core_dev *mdev,
+		    struct mlx5_vhca_page_tracker *tracker, u32 max_recv_wr)
+{
+	u32 out[MLX5_ST_SZ_DW(create_qp_out)] = {};
+	struct mlx5_vhca_qp *qp;
+	u8 log_rq_stride;
+	u8 log_rq_sz;
+	void *qpc;
+	int inlen;
+	void *in;
+	int err;
+
+	qp = kzalloc(sizeof(*qp), GFP_KERNEL);
+	if (!qp)
+		return ERR_PTR(-ENOMEM);
+
+	qp->rq.wqe_cnt = roundup_pow_of_two(max_recv_wr);
+	log_rq_stride = ilog2(MLX5_SEND_WQE_DS);
+	log_rq_sz = ilog2(qp->rq.wqe_cnt);
+	err = mlx5_db_alloc_node(mdev, &qp->db, mdev->priv.numa_node);
+	if (err)
+		goto err_free;
+
+	if (max_recv_wr) {
+		err = mlx5_frag_buf_alloc_node(mdev,
+			wq_get_byte_sz(log_rq_sz, log_rq_stride),
+			&qp->buf, mdev->priv.numa_node);
+		if (err)
+			goto err_db_free;
+		mlx5_init_fbc(qp->buf.frags, log_rq_stride, log_rq_sz, &qp->rq.fbc);
+	}
+
+	qp->rq.db = &qp->db.db[MLX5_RCV_DBR];
+	inlen = MLX5_ST_SZ_BYTES(create_qp_in) +
+		MLX5_FLD_SZ_BYTES(create_qp_in, pas[0]) *
+		qp->buf.npages;
+	in = kvzalloc(inlen, GFP_KERNEL);
+	if (!in) {
+		err = -ENOMEM;
+		goto err_in;
+	}
+
+	qpc = MLX5_ADDR_OF(create_qp_in, in, qpc);
+	MLX5_SET(qpc, qpc, st, MLX5_QP_ST_RC);
+	MLX5_SET(qpc, qpc, pm_state, MLX5_QP_PM_MIGRATED);
+	MLX5_SET(qpc, qpc, pd, tracker->pdn);
+	MLX5_SET(qpc, qpc, uar_page, tracker->uar->index);
+	MLX5_SET(qpc, qpc, log_page_size,
+		 qp->buf.page_shift - MLX5_ADAPTER_PAGE_SHIFT);
+	MLX5_SET(qpc, qpc, ts_format, mlx5_get_qp_default_ts(mdev));
+	if (MLX5_CAP_GEN(mdev, cqe_version) == 1)
+		MLX5_SET(qpc, qpc, user_index, 0xFFFFFF);
+	MLX5_SET(qpc, qpc, no_sq, 1);
+	if (max_recv_wr) {
+		MLX5_SET(qpc, qpc, cqn_rcv, tracker->cq.mcq.cqn);
+		MLX5_SET(qpc, qpc, log_rq_stride, log_rq_stride - 4);
+		MLX5_SET(qpc, qpc, log_rq_size, log_rq_sz);
+		MLX5_SET(qpc, qpc, rq_type, MLX5_NON_ZERO_RQ);
+		MLX5_SET64(qpc, qpc, dbr_addr, qp->db.dma);
+		mlx5_fill_page_frag_array(&qp->buf,
+					  (__be64 *)MLX5_ADDR_OF(create_qp_in,
+								 in, pas));
+	} else {
+		MLX5_SET(qpc, qpc, rq_type, MLX5_ZERO_LEN_RQ);
+	}
+
+	MLX5_SET(create_qp_in, in, opcode, MLX5_CMD_OP_CREATE_QP);
+	err = mlx5_cmd_exec(mdev, in, inlen, out, sizeof(out));
+	kvfree(in);
+	if (err)
+		goto err_in;
+
+	qp->qpn = MLX5_GET(create_qp_out, out, qpn);
+	return qp;
+
+err_in:
+	if (max_recv_wr)
+		mlx5_frag_buf_free(mdev, &qp->buf);
+err_db_free:
+	mlx5_db_free(mdev, &qp->db);
+err_free:
+	kfree(qp);
+	return ERR_PTR(err);
+}
+
+static void mlx5vf_post_recv(struct mlx5_vhca_qp *qp)
+{
+	struct mlx5_wqe_data_seg *data;
+	unsigned int ix;
+
+	WARN_ON(qp->rq.pc - qp->rq.cc >= qp->rq.wqe_cnt);
+	ix = qp->rq.pc & (qp->rq.wqe_cnt - 1);
+	data = mlx5_frag_buf_get_wqe(&qp->rq.fbc, ix);
+	data->byte_count = cpu_to_be32(qp->max_msg_size);
+	data->lkey = cpu_to_be32(qp->recv_buf.mkey);
+	data->addr = cpu_to_be64(qp->recv_buf.next_rq_offset);
+	qp->rq.pc++;
+	/* Make sure that descriptors are written before doorbell record. */
+	dma_wmb();
+	*qp->rq.db = cpu_to_be32(qp->rq.pc & 0xffff);
+}
+
+static int mlx5vf_activate_qp(struct mlx5_core_dev *mdev,
+			      struct mlx5_vhca_qp *qp, u32 remote_qpn,
+			      bool host_qp)
+{
+	u32 init_in[MLX5_ST_SZ_DW(rst2init_qp_in)] = {};
+	u32 rtr_in[MLX5_ST_SZ_DW(init2rtr_qp_in)] = {};
+	u32 rts_in[MLX5_ST_SZ_DW(rtr2rts_qp_in)] = {};
+	void *qpc;
+	int ret;
+
+	/* Init */
+	qpc = MLX5_ADDR_OF(rst2init_qp_in, init_in, qpc);
+	MLX5_SET(qpc, qpc, primary_address_path.vhca_port_num, 1);
+	MLX5_SET(qpc, qpc, pm_state, MLX5_QPC_PM_STATE_MIGRATED);
+	MLX5_SET(qpc, qpc, rre, 1);
+	MLX5_SET(qpc, qpc, rwe, 1);
+	MLX5_SET(rst2init_qp_in, init_in, opcode, MLX5_CMD_OP_RST2INIT_QP);
+	MLX5_SET(rst2init_qp_in, init_in, qpn, qp->qpn);
+	ret = mlx5_cmd_exec_in(mdev, rst2init_qp, init_in);
+	if (ret)
+		return ret;
+
+	if (host_qp) {
+		struct mlx5_vhca_recv_buf *recv_buf = &qp->recv_buf;
+		int i;
+
+		for (i = 0; i < qp->rq.wqe_cnt; i++) {
+			mlx5vf_post_recv(qp);
+			recv_buf->next_rq_offset += qp->max_msg_size;
+		}
+	}
+
+	/* RTR */
+	qpc = MLX5_ADDR_OF(init2rtr_qp_in, rtr_in, qpc);
+	MLX5_SET(init2rtr_qp_in, rtr_in, qpn, qp->qpn);
+	MLX5_SET(qpc, qpc, mtu, IB_MTU_4096);
+	MLX5_SET(qpc, qpc, log_msg_max, MLX5_CAP_GEN(mdev, log_max_msg));
+	MLX5_SET(qpc, qpc, remote_qpn, remote_qpn);
+	MLX5_SET(qpc, qpc, primary_address_path.vhca_port_num, 1);
+	MLX5_SET(qpc, qpc, primary_address_path.fl, 1);
+	MLX5_SET(qpc, qpc, min_rnr_nak, 1);
+	MLX5_SET(init2rtr_qp_in, rtr_in, opcode, MLX5_CMD_OP_INIT2RTR_QP);
+	MLX5_SET(init2rtr_qp_in, rtr_in, qpn, qp->qpn);
+	ret = mlx5_cmd_exec_in(mdev, init2rtr_qp, rtr_in);
+	if (ret || host_qp)
+		return ret;
+
+	/* RTS */
+	qpc = MLX5_ADDR_OF(rtr2rts_qp_in, rts_in, qpc);
+	MLX5_SET(rtr2rts_qp_in, rts_in, qpn, qp->qpn);
+	MLX5_SET(qpc, qpc, retry_count, 7);
+	MLX5_SET(qpc, qpc, rnr_retry, 7); /* Infinite retry if RNR NACK */
+	MLX5_SET(qpc, qpc, primary_address_path.ack_timeout, 0x8); /* ~1ms */
+	MLX5_SET(rtr2rts_qp_in, rts_in, opcode, MLX5_CMD_OP_RTR2RTS_QP);
+	MLX5_SET(rtr2rts_qp_in, rts_in, qpn, qp->qpn);
+
+	return mlx5_cmd_exec_in(mdev, rtr2rts_qp, rts_in);
+}
+
+static void mlx5vf_destroy_qp(struct mlx5_core_dev *mdev,
+			      struct mlx5_vhca_qp *qp)
+{
+	u32 in[MLX5_ST_SZ_DW(destroy_qp_in)] = {};
+
+	MLX5_SET(destroy_qp_in, in, opcode, MLX5_CMD_OP_DESTROY_QP);
+	MLX5_SET(destroy_qp_in, in, qpn, qp->qpn);
+	mlx5_cmd_exec_in(mdev, destroy_qp, in);
+
+	mlx5_frag_buf_free(mdev, &qp->buf);
+	mlx5_db_free(mdev, &qp->db);
+	kfree(qp);
+}
+
+static void free_recv_pages(struct mlx5_vhca_recv_buf *recv_buf)
+{
+	int i;
+
+	/* Undo alloc_pages_bulk_array() */
+	for (i = 0; i < recv_buf->npages; i++)
+		__free_page(recv_buf->page_list[i]);
+
+	kvfree(recv_buf->page_list);
+}
+
+static int alloc_recv_pages(struct mlx5_vhca_recv_buf *recv_buf,
+			    unsigned int npages)
+{
+	unsigned int filled = 0, done = 0;
+	int i;
+
+	recv_buf->page_list = kvcalloc(npages, sizeof(*recv_buf->page_list),
+				       GFP_KERNEL);
+	if (!recv_buf->page_list)
+		return -ENOMEM;
+
+	for (;;) {
+		filled = alloc_pages_bulk_array(GFP_KERNEL, npages - done,
+						recv_buf->page_list + done);
+		if (!filled)
+			goto err;
+
+		done += filled;
+		if (done == npages)
+			break;
+	}
+
+	recv_buf->npages = npages;
+	return 0;
+
+err:
+	for (i = 0; i < npages; i++) {
+		if (recv_buf->page_list[i])
+			__free_page(recv_buf->page_list[i]);
+	}
+
+	kvfree(recv_buf->page_list);
+	return -ENOMEM;
+}
+
+static int register_dma_recv_pages(struct mlx5_core_dev *mdev,
+				   struct mlx5_vhca_recv_buf *recv_buf)
+{
+	int i, j;
+
+	recv_buf->dma_addrs = kvcalloc(recv_buf->npages,
+				       sizeof(*recv_buf->dma_addrs),
+				       GFP_KERNEL);
+	if (!recv_buf->dma_addrs)
+		return -ENOMEM;
+
+	for (i = 0; i < recv_buf->npages; i++) {
+		recv_buf->dma_addrs[i] = dma_map_page(mdev->device,
+						      recv_buf->page_list[i],
+						      0, PAGE_SIZE,
+						      DMA_FROM_DEVICE);
+		if (dma_mapping_error(mdev->device, recv_buf->dma_addrs[i]))
+			goto error;
+	}
+	return 0;
+
+error:
+	for (j = 0; j < i; j++)
+		dma_unmap_single(mdev->device, recv_buf->dma_addrs[j],
+				 PAGE_SIZE, DMA_FROM_DEVICE);
+
+	kvfree(recv_buf->dma_addrs);
+	return -ENOMEM;
+}
+
+static void unregister_dma_recv_pages(struct mlx5_core_dev *mdev,
+				      struct mlx5_vhca_recv_buf *recv_buf)
+{
+	int i;
+
+	for (i = 0; i < recv_buf->npages; i++)
+		dma_unmap_single(mdev->device, recv_buf->dma_addrs[i],
+				 PAGE_SIZE, DMA_FROM_DEVICE);
+
+	kvfree(recv_buf->dma_addrs);
+}
+
+static void mlx5vf_free_qp_recv_resources(struct mlx5_core_dev *mdev,
+					  struct mlx5_vhca_qp *qp)
+{
+	struct mlx5_vhca_recv_buf *recv_buf = &qp->recv_buf;
+
+	mlx5_core_destroy_mkey(mdev, recv_buf->mkey);
+	unregister_dma_recv_pages(mdev, recv_buf);
+	free_recv_pages(&qp->recv_buf);
+}
+
+static int mlx5vf_alloc_qp_recv_resources(struct mlx5_core_dev *mdev,
+					  struct mlx5_vhca_qp *qp, u32 pdn,
+					  u64 rq_size)
+{
+	unsigned int npages = DIV_ROUND_UP_ULL(rq_size, PAGE_SIZE);
+	struct mlx5_vhca_recv_buf *recv_buf = &qp->recv_buf;
+	int err;
+
+	err = alloc_recv_pages(recv_buf, npages);
+	if (err < 0)
+		return err;
+
+	err = register_dma_recv_pages(mdev, recv_buf);
+	if (err)
+		goto end;
+
+	err = _create_mkey(mdev, pdn, NULL, recv_buf, &recv_buf->mkey);
+	if (err)
+		goto err_create_mkey;
+
+	return 0;
+
+err_create_mkey:
+	unregister_dma_recv_pages(mdev, recv_buf);
+end:
+	free_recv_pages(recv_buf);
+	return err;
+}
+
+static void
+_mlx5vf_free_page_tracker_resources(struct mlx5vf_pci_core_device *mvdev)
+{
+	struct mlx5_vhca_page_tracker *tracker = &mvdev->tracker;
+	struct mlx5_core_dev *mdev = mvdev->mdev;
+
+	lockdep_assert_held(&mvdev->state_mutex);
+
+	if (!mvdev->log_active)
+		return;
+
+	WARN_ON(mvdev->mdev_detach);
+
+	mlx5vf_destroy_qp(mdev, tracker->fw_qp);
+	mlx5vf_free_qp_recv_resources(mdev, tracker->host_qp);
+	mlx5vf_destroy_qp(mdev, tracker->host_qp);
+	mlx5vf_destroy_cq(mdev, &tracker->cq);
+	mlx5_core_dealloc_pd(mdev, tracker->pdn);
+	mlx5_put_uars_page(mdev, tracker->uar);
+	mvdev->log_active = false;
+}
+
+int mlx5vf_stop_page_tracker(struct vfio_device *vdev)
+{
+	struct mlx5vf_pci_core_device *mvdev = container_of(
+		vdev, struct mlx5vf_pci_core_device, core_device.vdev);
+
+	mutex_lock(&mvdev->state_mutex);
+	if (!mvdev->log_active)
+		goto end;
+
+	_mlx5vf_free_page_tracker_resources(mvdev);
+	mvdev->log_active = false;
+end:
+	mlx5vf_state_mutex_unlock(mvdev);
+	return 0;
+}
+
+int mlx5vf_start_page_tracker(struct vfio_device *vdev,
+			      struct rb_root_cached *ranges, u32 nnodes,
+			      u64 *page_size)
+{
+	struct mlx5vf_pci_core_device *mvdev = container_of(
+		vdev, struct mlx5vf_pci_core_device, core_device.vdev);
+	struct mlx5_vhca_page_tracker *tracker = &mvdev->tracker;
+	u8 log_tracked_page = ilog2(*page_size);
+	struct mlx5_vhca_qp *host_qp;
+	struct mlx5_vhca_qp *fw_qp;
+	struct mlx5_core_dev *mdev;
+	u32 max_msg_size = PAGE_SIZE;
+	u64 rq_size = SZ_2M;
+	u32 max_recv_wr;
+	int err;
+
+	mutex_lock(&mvdev->state_mutex);
+	if (mvdev->mdev_detach) {
+		err = -ENOTCONN;
+		goto end;
+	}
+
+	if (mvdev->log_active) {
+		err = -EINVAL;
+		goto end;
+	}
+
+	mdev = mvdev->mdev;
+	memset(tracker, 0, sizeof(*tracker));
+	tracker->uar = mlx5_get_uars_page(mdev);
+	if (IS_ERR(tracker->uar)) {
+		err = PTR_ERR(tracker->uar);
+		goto end;
+	}
+
+	err = mlx5_core_alloc_pd(mdev, &tracker->pdn);
+	if (err)
+		goto err_uar;
+
+	max_recv_wr = DIV_ROUND_UP_ULL(rq_size, max_msg_size);
+	err = mlx5vf_create_cq(mdev, tracker, max_recv_wr);
+	if (err)
+		goto err_dealloc_pd;
+
+	host_qp = mlx5vf_create_rc_qp(mdev, tracker, max_recv_wr);
+	if (IS_ERR(host_qp)) {
+		err = PTR_ERR(host_qp);
+		goto err_cq;
+	}
+
+	host_qp->max_msg_size = max_msg_size;
+	if (log_tracked_page < MLX5_CAP_ADV_VIRTUALIZATION(mdev,
+				pg_track_log_min_page_size)) {
+		log_tracked_page = MLX5_CAP_ADV_VIRTUALIZATION(mdev,
+				pg_track_log_min_page_size);
+	} else if (log_tracked_page > MLX5_CAP_ADV_VIRTUALIZATION(mdev,
+				pg_track_log_max_page_size)) {
+		log_tracked_page = MLX5_CAP_ADV_VIRTUALIZATION(mdev,
+				pg_track_log_max_page_size);
+	}
+
+	host_qp->tracked_page_size = (1ULL << log_tracked_page);
+	err = mlx5vf_alloc_qp_recv_resources(mdev, host_qp, tracker->pdn,
+					     rq_size);
+	if (err)
+		goto err_host_qp;
+
+	fw_qp = mlx5vf_create_rc_qp(mdev, tracker, 0);
+	if (IS_ERR(fw_qp)) {
+		err = PTR_ERR(fw_qp);
+		goto err_recv_resources;
+	}
+
+	err = mlx5vf_activate_qp(mdev, host_qp, fw_qp->qpn, true);
+	if (err)
+		goto err_activate;
+
+	err = mlx5vf_activate_qp(mdev, fw_qp, host_qp->qpn, false);
+	if (err)
+		goto err_activate;
+
+	tracker->host_qp = host_qp;
+	tracker->fw_qp = fw_qp;
+	*page_size = host_qp->tracked_page_size;
+	mvdev->log_active = true;
+	mlx5vf_state_mutex_unlock(mvdev);
+	return 0;
+
+err_activate:
+	mlx5vf_destroy_qp(mdev, fw_qp);
+err_recv_resources:
+	mlx5vf_free_qp_recv_resources(mdev, host_qp);
+err_host_qp:
+	mlx5vf_destroy_qp(mdev, host_qp);
+err_cq:
+	mlx5vf_destroy_cq(mdev, &tracker->cq);
+err_dealloc_pd:
+	mlx5_core_dealloc_pd(mdev, tracker->pdn);
+err_uar:
+	mlx5_put_uars_page(mdev, tracker->uar);
+end:
+	mlx5vf_state_mutex_unlock(mvdev);
+	return err;
+}
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index 8208f4701a90..e71ec017bf04 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -9,6 +9,8 @@
 #include <linux/kernel.h>
 #include <linux/vfio_pci_core.h>
 #include <linux/mlx5/driver.h>
+#include <linux/mlx5/cq.h>
+#include <linux/mlx5/qp.h>
 
 struct mlx5vf_async_data {
 	struct mlx5_async_work cb_work;
@@ -39,6 +41,52 @@ struct mlx5_vf_migration_file {
 	struct mlx5vf_async_data async_data;
 };
 
+struct mlx5_vhca_cq_buf {
+	struct mlx5_frag_buf_ctrl fbc;
+	struct mlx5_frag_buf frag_buf;
+	int cqe_size;
+	int nent;
+};
+
+struct mlx5_vhca_cq {
+	struct mlx5_vhca_cq_buf buf;
+	struct mlx5_db db;
+	struct mlx5_core_cq mcq;
+	size_t ncqe;
+};
+
+struct mlx5_vhca_recv_buf {
+	u32 npages;
+	struct page **page_list;
+	dma_addr_t *dma_addrs;
+	u32 next_rq_offset;
+	u32 mkey;
+};
+
+struct mlx5_vhca_qp {
+	struct mlx5_frag_buf buf;
+	struct mlx5_db db;
+	struct mlx5_vhca_recv_buf recv_buf;
+	u32 tracked_page_size;
+	u32 max_msg_size;
+	u32 qpn;
+	struct {
+		unsigned int pc;
+		unsigned int cc;
+		unsigned int wqe_cnt;
+		__be32 *db;
+		struct mlx5_frag_buf_ctrl fbc;
+	} rq;
+};
+
+struct mlx5_vhca_page_tracker {
+	u32 pdn;
+	struct mlx5_uars_page *uar;
+	struct mlx5_vhca_cq cq;
+	struct mlx5_vhca_qp *host_qp;
+	struct mlx5_vhca_qp *fw_qp;
+};
+
 struct mlx5vf_pci_core_device {
 	struct vfio_pci_core_device core_device;
 	int vf_id;
@@ -46,6 +94,7 @@ struct mlx5vf_pci_core_device {
 	u8 migrate_cap:1;
 	u8 deferred_reset:1;
 	u8 mdev_detach:1;
+	u8 log_active:1;
 	/* protect migration state */
 	struct mutex state_mutex;
 	enum vfio_device_mig_state mig_state;
@@ -53,6 +102,7 @@ struct mlx5vf_pci_core_device {
 	spinlock_t reset_lock;
 	struct mlx5_vf_migration_file *resuming_migf;
 	struct mlx5_vf_migration_file *saving_migf;
+	struct mlx5_vhca_page_tracker tracker;
 	struct workqueue_struct *cb_wq;
 	struct notifier_block nb;
 	struct mlx5_core_dev *mdev;
@@ -73,4 +123,7 @@ int mlx5vf_cmd_load_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 void mlx5vf_state_mutex_unlock(struct mlx5vf_pci_core_device *mvdev);
 void mlx5vf_disable_fds(struct mlx5vf_pci_core_device *mvdev);
 void mlx5vf_mig_file_cleanup_cb(struct work_struct *_work);
+int mlx5vf_start_page_tracker(struct vfio_device *vdev,
+		struct rb_root_cached *ranges, u32 nnodes, u64 *page_size);
+int mlx5vf_stop_page_tracker(struct vfio_device *vdev);
 #endif /* MLX5_VFIO_CMD_H */
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH V2 vfio 08/11] vfio/mlx5: Create and destroy page tracker object
  2022-07-14  8:12 [PATCH V2 vfio 00/11] Add device DMA logging support for mlx5 driver Yishai Hadas
                   ` (6 preceding siblings ...)
  2022-07-14  8:12 ` [PATCH V2 vfio 07/11] vfio/mlx5: Init QP based resources for dirty tracking Yishai Hadas
@ 2022-07-14  8:12 ` Yishai Hadas
  2022-07-14  8:12 ` [PATCH V2 vfio 09/11] vfio/mlx5: Report dirty pages from tracker Yishai Hadas
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: Yishai Hadas @ 2022-07-14  8:12 UTC (permalink / raw)
  To: alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins, leonro,
	yishaih, maorg, cohuck

Add support for creating and destroying the page tracker object.

This object is used to control/report the device dirty pages.

As part of creating the tracker, the device capability for the maximum
number of ranges must be considered, and the requested ranges
adapted/combined accordingly.
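
For illustration, below is a simplified, array-based sketch of the
smallest-gap combining idea; the patch itself implements the same idea
over an interval tree, and the names here are hypothetical:

#include <limits.h>

struct range { unsigned long start, last; };

/*
 * Sketch only: merge the two neighbouring ranges separated by the
 * smallest gap until no more than max_num_range ranges remain.
 * 'r' holds 'nr' sorted, non-overlapping ranges.
 */
static void combine_ranges_sketch(struct range *r, int *nr, int max_num_range)
{
	while (*nr > max_num_range) {
		unsigned long min_gap = ULONG_MAX;
		int i, min_i = 1;

		/* find the two neighbouring ranges with the smallest gap */
		for (i = 1; i < *nr; i++) {
			unsigned long gap = r[i].start - r[i - 1].last;

			if (gap < min_gap) {
				min_gap = gap;
				min_i = i;
			}
		}

		/* merge them: extend the previous range over the gap */
		r[min_i - 1].last = r[min_i].last;
		for (i = min_i; i < *nr - 1; i++)
			r[i] = r[i + 1];
		(*nr)--;
	}
}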

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/pci/mlx5/cmd.c | 147 ++++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/mlx5/cmd.h |   1 +
 2 files changed, 148 insertions(+)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index 0a362796d567..f1cad96af6ab 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -410,6 +410,148 @@ int mlx5vf_cmd_load_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 	return err;
 }
 
+static void combine_ranges(struct rb_root_cached *root, u32 cur_nodes,
+			   u32 req_nodes)
+{
+	struct interval_tree_node *prev, *curr, *comb_start, *comb_end;
+	unsigned long min_gap;
+	unsigned long curr_gap;
+
+	/* Special shortcut when a single range is required */
+	if (req_nodes == 1) {
+		unsigned long last;
+
+		curr = comb_start = interval_tree_iter_first(root, 0, ULONG_MAX);
+		while (curr) {
+			last = curr->last;
+			prev = curr;
+			curr = interval_tree_iter_next(curr, 0, ULONG_MAX);
+			if (prev != comb_start)
+				interval_tree_remove(prev, root);
+		}
+		comb_start->last = last;
+		return;
+	}
+
+	/* Combine ranges which have the smallest gap */
+	while (cur_nodes > req_nodes) {
+		prev = NULL;
+		min_gap = ULONG_MAX;
+		curr = interval_tree_iter_first(root, 0, ULONG_MAX);
+		while (curr) {
+			if (prev) {
+				curr_gap = curr->start - prev->last;
+				if (curr_gap < min_gap) {
+					min_gap = curr_gap;
+					comb_start = prev;
+					comb_end = curr;
+				}
+			}
+			prev = curr;
+			curr = interval_tree_iter_next(curr, 0, ULONG_MAX);
+		}
+		comb_start->last = comb_end->last;
+		interval_tree_remove(comb_end, root);
+		cur_nodes--;
+	}
+}
+
+static int mlx5vf_create_tracker(struct mlx5_core_dev *mdev,
+				 struct mlx5vf_pci_core_device *mvdev,
+				 struct rb_root_cached *ranges, u32 nnodes)
+{
+	int max_num_range =
+		MLX5_CAP_ADV_VIRTUALIZATION(mdev, pg_track_max_num_range);
+	struct mlx5_vhca_page_tracker *tracker = &mvdev->tracker;
+	int record_size = MLX5_ST_SZ_BYTES(page_track_range);
+	u32 out[MLX5_ST_SZ_DW(general_obj_out_cmd_hdr)] = {};
+	struct interval_tree_node *node = NULL;
+	u64 total_ranges_len = 0;
+	u32 num_ranges = nnodes;
+	u8 log_addr_space_size;
+	void *range_list_ptr;
+	void *obj_context;
+	void *cmd_hdr;
+	int inlen;
+	void *in;
+	int err;
+	int i;
+
+	if (num_ranges > max_num_range) {
+		combine_ranges(ranges, nnodes, max_num_range);
+		num_ranges = max_num_range;
+	}
+
+	inlen = MLX5_ST_SZ_BYTES(create_page_track_obj_in) +
+				 record_size * num_ranges;
+	in = kzalloc(inlen, GFP_KERNEL);
+	if (!in)
+		return -ENOMEM;
+
+	cmd_hdr = MLX5_ADDR_OF(create_page_track_obj_in, in,
+			       general_obj_in_cmd_hdr);
+	MLX5_SET(general_obj_in_cmd_hdr, cmd_hdr, opcode,
+		 MLX5_CMD_OP_CREATE_GENERAL_OBJECT);
+	MLX5_SET(general_obj_in_cmd_hdr, cmd_hdr, obj_type,
+		 MLX5_OBJ_TYPE_PAGE_TRACK);
+	obj_context = MLX5_ADDR_OF(create_page_track_obj_in, in, obj_context);
+	MLX5_SET(page_track, obj_context, vhca_id, mvdev->vhca_id);
+	MLX5_SET(page_track, obj_context, track_type, 1);
+	MLX5_SET(page_track, obj_context, log_page_size,
+		 ilog2(tracker->host_qp->tracked_page_size));
+	MLX5_SET(page_track, obj_context, log_msg_size,
+		 ilog2(tracker->host_qp->max_msg_size));
+	MLX5_SET(page_track, obj_context, reporting_qpn, tracker->fw_qp->qpn);
+	MLX5_SET(page_track, obj_context, num_ranges, num_ranges);
+
+	range_list_ptr = MLX5_ADDR_OF(page_track, obj_context, track_range);
+	node = interval_tree_iter_first(ranges, 0, ULONG_MAX);
+	for (i = 0; i < num_ranges; i++) {
+		void *addr_range_i_base = range_list_ptr + record_size * i;
+		unsigned long length = node->last - node->start;
+
+		MLX5_SET64(page_track_range, addr_range_i_base, start_address,
+			   node->start);
+		MLX5_SET64(page_track_range, addr_range_i_base, length, length);
+		total_ranges_len += length;
+		node = interval_tree_iter_next(node, 0, ULONG_MAX);
+	}
+
+	WARN_ON(node);
+	log_addr_space_size = ilog2(total_ranges_len);
+	if (log_addr_space_size <
+	    (MLX5_CAP_ADV_VIRTUALIZATION(mdev, pg_track_log_min_addr_space)) ||
+	    log_addr_space_size >
+	    (MLX5_CAP_ADV_VIRTUALIZATION(mdev, pg_track_log_max_addr_space))) {
+		err = -EOPNOTSUPP;
+		goto out;
+	}
+
+	MLX5_SET(page_track, obj_context, log_addr_space_size,
+		 log_addr_space_size);
+	err = mlx5_cmd_exec(mdev, in, inlen, out, sizeof(out));
+	if (err)
+		goto out;
+
+	tracker->id = MLX5_GET(general_obj_out_cmd_hdr, out, obj_id);
+out:
+	kfree(in);
+	return err;
+}
+
+static int mlx5vf_cmd_destroy_tracker(struct mlx5_core_dev *mdev,
+				      u32 tracker_id)
+{
+	u32 in[MLX5_ST_SZ_DW(general_obj_in_cmd_hdr)] = {};
+	u32 out[MLX5_ST_SZ_DW(general_obj_out_cmd_hdr)] = {};
+
+	MLX5_SET(general_obj_in_cmd_hdr, in, opcode, MLX5_CMD_OP_DESTROY_GENERAL_OBJECT);
+	MLX5_SET(general_obj_in_cmd_hdr, in, obj_type, MLX5_OBJ_TYPE_PAGE_TRACK);
+	MLX5_SET(general_obj_in_cmd_hdr, in, obj_id, tracker_id);
+
+	return mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
+}
+
 static int alloc_cq_frag_buf(struct mlx5_core_dev *mdev,
 			     struct mlx5_vhca_cq_buf *buf, int nent,
 			     int cqe_size)
@@ -833,6 +975,7 @@ _mlx5vf_free_page_tracker_resources(struct mlx5vf_pci_core_device *mvdev)
 
 	WARN_ON(mvdev->mdev_detach);
 
+	mlx5vf_cmd_destroy_tracker(mdev, tracker->id);
 	mlx5vf_destroy_qp(mdev, tracker->fw_qp);
 	mlx5vf_free_qp_recv_resources(mdev, tracker->host_qp);
 	mlx5vf_destroy_qp(mdev, tracker->host_qp);
@@ -941,6 +1084,10 @@ int mlx5vf_start_page_tracker(struct vfio_device *vdev,
 
 	tracker->host_qp = host_qp;
 	tracker->fw_qp = fw_qp;
+	err = mlx5vf_create_tracker(mdev, mvdev, ranges, nnodes);
+	if (err)
+		goto err_activate;
+
 	*page_size = host_qp->tracked_page_size;
 	mvdev->log_active = true;
 	mlx5vf_state_mutex_unlock(mvdev);
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index e71ec017bf04..658925ba5459 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -80,6 +80,7 @@ struct mlx5_vhca_qp {
 };
 
 struct mlx5_vhca_page_tracker {
+	u32 id;
 	u32 pdn;
 	struct mlx5_uars_page *uar;
 	struct mlx5_vhca_cq cq;
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH V2 vfio 09/11] vfio/mlx5: Report dirty pages from tracker
  2022-07-14  8:12 [PATCH V2 vfio 00/11] Add device DMA logging support for mlx5 driver Yishai Hadas
                   ` (7 preceding siblings ...)
  2022-07-14  8:12 ` [PATCH V2 vfio 08/11] vfio/mlx5: Create and destroy page tracker object Yishai Hadas
@ 2022-07-14  8:12 ` Yishai Hadas
  2022-07-14  8:12 ` [PATCH V2 vfio 10/11] vfio/mlx5: Manage error scenarios on tracker Yishai Hadas
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: Yishai Hadas @ 2022-07-14  8:12 UTC (permalink / raw)
  To: alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins, leonro,
	yishaih, maorg, cohuck

Report dirty pages from tracker.

It includes:
Querying for dirty pages in a given IOVA range. This is done by
moving the tracker into the reporting state and supplying the
required range.

Using the CQ event completion mechanism to be notified once data is
ready on the CQ/QP to be processed.

Once data is available, turn on the corresponding bits in the bitmap.

This functionality will be used as part of the 'log_read_and_clear'
driver callback in the next patches.
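
As a rough illustration of the last step, this is one way a single
dirty IOVA reported by the device maps to a bit in the user-supplied
bitmap, assuming the conventional one-bit-per-page layout (a
userspace-style helper that is not part of the series):

#include <stdint.h>

/*
 * Illustration only: mark one device-reported dirty IOVA in a bitmap
 * whose first bit corresponds to 'base_iova' and whose granularity is
 * 'page_size'.
 */
static void mark_dirty(uint64_t *bitmap, uint64_t base_iova,
		       uint64_t page_size, uint64_t dirty_iova)
{
	uint64_t bit = (dirty_iova - base_iova) / page_size;

	bitmap[bit / 64] |= 1ULL << (bit % 64);
}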

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/pci/mlx5/cmd.c | 191 ++++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/mlx5/cmd.h |   4 +
 2 files changed, 195 insertions(+)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index f1cad96af6ab..fa9ddd926500 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -5,6 +5,8 @@
 
 #include "cmd.h"
 
+enum { CQ_OK = 0, CQ_EMPTY = -1, CQ_POLL_ERR = -2 };
+
 static int mlx5vf_cmd_get_vhca_id(struct mlx5_core_dev *mdev, u16 function_id,
 				  u16 *vhca_id);
 static void
@@ -157,6 +159,7 @@ void mlx5vf_cmd_set_migratable(struct mlx5vf_pci_core_device *mvdev,
 		VFIO_MIGRATION_STOP_COPY |
 		VFIO_MIGRATION_P2P;
 	mvdev->core_device.vdev.mig_ops = mig_ops;
+	init_completion(&mvdev->tracker_comp);
 
 end:
 	mlx5_vf_put_core_dev(mvdev->mdev);
@@ -552,6 +555,29 @@ static int mlx5vf_cmd_destroy_tracker(struct mlx5_core_dev *mdev,
 	return mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
 }
 
+static int mlx5vf_cmd_modify_tracker(struct mlx5_core_dev *mdev,
+				     u32 tracker_id, unsigned long iova,
+				     unsigned long length, u32 tracker_state)
+{
+	u32 in[MLX5_ST_SZ_DW(modify_page_track_obj_in)] = {};
+	u32 out[MLX5_ST_SZ_DW(general_obj_out_cmd_hdr)] = {};
+	void *obj_context;
+	void *cmd_hdr;
+
+	cmd_hdr = MLX5_ADDR_OF(modify_page_track_obj_in, in, general_obj_in_cmd_hdr);
+	MLX5_SET(general_obj_in_cmd_hdr, cmd_hdr, opcode, MLX5_CMD_OP_MODIFY_GENERAL_OBJECT);
+	MLX5_SET(general_obj_in_cmd_hdr, cmd_hdr, obj_type, MLX5_OBJ_TYPE_PAGE_TRACK);
+	MLX5_SET(general_obj_in_cmd_hdr, cmd_hdr, obj_id, tracker_id);
+
+	obj_context = MLX5_ADDR_OF(modify_page_track_obj_in, in, obj_context);
+	MLX5_SET64(page_track, obj_context, modify_field_select, 0x3);
+	MLX5_SET64(page_track, obj_context, range_start_address, iova);
+	MLX5_SET64(page_track, obj_context, length, length);
+	MLX5_SET(page_track, obj_context, state, tracker_state);
+
+	return mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
+}
+
 static int alloc_cq_frag_buf(struct mlx5_core_dev *mdev,
 			     struct mlx5_vhca_cq_buf *buf, int nent,
 			     int cqe_size)
@@ -593,6 +619,16 @@ static void mlx5vf_destroy_cq(struct mlx5_core_dev *mdev,
 	mlx5_db_free(mdev, &cq->db);
 }
 
+static void mlx5vf_cq_complete(struct mlx5_core_cq *mcq,
+			       struct mlx5_eqe *eqe)
+{
+	struct mlx5vf_pci_core_device *mvdev =
+		container_of(mcq, struct mlx5vf_pci_core_device,
+			     tracker.cq.mcq);
+
+	complete(&mvdev->tracker_comp);
+}
+
 static int mlx5vf_create_cq(struct mlx5_core_dev *mdev,
 			    struct mlx5_vhca_page_tracker *tracker,
 			    size_t ncqe)
@@ -643,10 +679,13 @@ static int mlx5vf_create_cq(struct mlx5_core_dev *mdev,
 	MLX5_SET64(cqc, cqc, dbr_addr, cq->db.dma);
 	pas = (__be64 *)MLX5_ADDR_OF(create_cq_in, in, pas);
 	mlx5_fill_page_frag_array(&cq->buf.frag_buf, pas);
+	cq->mcq.comp = mlx5vf_cq_complete;
 	err = mlx5_core_create_cq(mdev, &cq->mcq, in, inlen, out, sizeof(out));
 	if (err)
 		goto err_vec;
 
+	mlx5_cq_arm(&cq->mcq, MLX5_CQ_DB_REQ_NOT, tracker->uar->map,
+		    cq->mcq.cons_index);
 	kvfree(in);
 	return 0;
 
@@ -1109,3 +1148,155 @@ int mlx5vf_start_page_tracker(struct vfio_device *vdev,
 	mlx5vf_state_mutex_unlock(mvdev);
 	return err;
 }
+
+static void
+set_report_output(u32 size, int index, struct mlx5_vhca_qp *qp,
+		  struct iova_bitmap *dirty)
+{
+	u32 entry_size = MLX5_ST_SZ_BYTES(page_track_report_entry);
+	u32 nent = size / entry_size;
+	struct page *page;
+	u64 addr;
+	u64 *buf;
+	int i;
+
+	if (WARN_ON(index >= qp->recv_buf.npages ||
+		    (nent > qp->max_msg_size / entry_size)))
+		return;
+
+	page = qp->recv_buf.page_list[index];
+	buf = kmap_local_page(page);
+	for (i = 0; i < nent; i++) {
+		addr = MLX5_GET(page_track_report_entry, buf + i,
+				dirty_address_low);
+		addr |= (u64)MLX5_GET(page_track_report_entry, buf + i,
+				      dirty_address_high) << 32;
+		iova_bitmap_set(dirty, addr, qp->tracked_page_size);
+	}
+	kunmap_local(buf);
+}
+
+static void
+mlx5vf_rq_cqe(struct mlx5_vhca_qp *qp, struct mlx5_cqe64 *cqe,
+	      struct iova_bitmap *dirty, int *tracker_status)
+{
+	u32 size;
+	int ix;
+
+	qp->rq.cc++;
+	*tracker_status = be32_to_cpu(cqe->immediate) >> 28;
+	size = be32_to_cpu(cqe->byte_cnt);
+	ix = be16_to_cpu(cqe->wqe_counter) & (qp->rq.wqe_cnt - 1);
+
+	/* zero length CQE, no data */
+	WARN_ON(!size && *tracker_status == MLX5_PAGE_TRACK_STATE_REPORTING);
+	if (size)
+		set_report_output(size, ix, qp, dirty);
+
+	qp->recv_buf.next_rq_offset = ix * qp->max_msg_size;
+	mlx5vf_post_recv(qp);
+}
+
+static void *get_cqe(struct mlx5_vhca_cq *cq, int n)
+{
+	return mlx5_frag_buf_get_wqe(&cq->buf.fbc, n);
+}
+
+static struct mlx5_cqe64 *get_sw_cqe(struct mlx5_vhca_cq *cq, int n)
+{
+	void *cqe = get_cqe(cq, n & (cq->ncqe - 1));
+	struct mlx5_cqe64 *cqe64;
+
+	cqe64 = (cq->mcq.cqe_sz == 64) ? cqe : cqe + 64;
+
+	if (likely(get_cqe_opcode(cqe64) != MLX5_CQE_INVALID) &&
+	    !((cqe64->op_own & MLX5_CQE_OWNER_MASK) ^ !!(n & (cq->ncqe)))) {
+		return cqe64;
+	} else {
+		return NULL;
+	}
+}
+
+static int
+mlx5vf_cq_poll_one(struct mlx5_vhca_cq *cq, struct mlx5_vhca_qp *qp,
+		   struct iova_bitmap *dirty, int *tracker_status)
+{
+	struct mlx5_cqe64 *cqe;
+	u8 opcode;
+
+	cqe = get_sw_cqe(cq, cq->mcq.cons_index);
+	if (!cqe)
+		return CQ_EMPTY;
+
+	++cq->mcq.cons_index;
+	/*
+	 * Make sure we read CQ entry contents after we've checked the
+	 * ownership bit.
+	 */
+	rmb();
+	opcode = get_cqe_opcode(cqe);
+	switch (opcode) {
+	case MLX5_CQE_RESP_SEND_IMM:
+		mlx5vf_rq_cqe(qp, cqe, dirty, tracker_status);
+		return CQ_OK;
+	default:
+		return CQ_POLL_ERR;
+	}
+}
+
+int mlx5vf_tracker_read_and_clear(struct vfio_device *vdev, unsigned long iova,
+				  unsigned long length,
+				  struct iova_bitmap *dirty)
+{
+	struct mlx5vf_pci_core_device *mvdev = container_of(
+		vdev, struct mlx5vf_pci_core_device, core_device.vdev);
+	struct mlx5_vhca_page_tracker *tracker = &mvdev->tracker;
+	struct mlx5_vhca_cq *cq = &tracker->cq;
+	struct mlx5_core_dev *mdev;
+	int poll_err, err;
+
+	mutex_lock(&mvdev->state_mutex);
+	if (!mvdev->log_active) {
+		err = -EINVAL;
+		goto end;
+	}
+
+	if (mvdev->mdev_detach) {
+		err = -ENOTCONN;
+		goto end;
+	}
+
+	mdev = mvdev->mdev;
+	err = mlx5vf_cmd_modify_tracker(mdev, tracker->id, iova, length,
+					MLX5_PAGE_TRACK_STATE_REPORTING);
+	if (err)
+		goto end;
+
+	tracker->status = MLX5_PAGE_TRACK_STATE_REPORTING;
+	while (tracker->status == MLX5_PAGE_TRACK_STATE_REPORTING) {
+		poll_err = mlx5vf_cq_poll_one(cq, tracker->host_qp, dirty,
+					      &tracker->status);
+		if (poll_err == CQ_EMPTY) {
+			mlx5_cq_arm(&cq->mcq, MLX5_CQ_DB_REQ_NOT, tracker->uar->map,
+				    cq->mcq.cons_index);
+			poll_err = mlx5vf_cq_poll_one(cq, tracker->host_qp,
+						      dirty, &tracker->status);
+			if (poll_err == CQ_EMPTY) {
+				wait_for_completion(&mvdev->tracker_comp);
+				continue;
+			}
+		}
+		if (poll_err == CQ_POLL_ERR) {
+			err = -EIO;
+			goto end;
+		}
+		mlx5_cq_set_ci(&cq->mcq);
+	}
+
+	if (tracker->status == MLX5_PAGE_TRACK_STATE_ERROR)
+		err = -EIO;
+
+end:
+	mlx5vf_state_mutex_unlock(mvdev);
+	return err;
+}
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index 658925ba5459..fa1f9ab4d3d0 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -86,6 +86,7 @@ struct mlx5_vhca_page_tracker {
 	struct mlx5_vhca_cq cq;
 	struct mlx5_vhca_qp *host_qp;
 	struct mlx5_vhca_qp *fw_qp;
+	int status;
 };
 
 struct mlx5vf_pci_core_device {
@@ -96,6 +97,7 @@ struct mlx5vf_pci_core_device {
 	u8 deferred_reset:1;
 	u8 mdev_detach:1;
 	u8 log_active:1;
+	struct completion tracker_comp;
 	/* protect migration state */
 	struct mutex state_mutex;
 	enum vfio_device_mig_state mig_state;
@@ -127,4 +129,6 @@ void mlx5vf_mig_file_cleanup_cb(struct work_struct *_work);
 int mlx5vf_start_page_tracker(struct vfio_device *vdev,
 		struct rb_root_cached *ranges, u32 nnodes, u64 *page_size);
 int mlx5vf_stop_page_tracker(struct vfio_device *vdev);
+int mlx5vf_tracker_read_and_clear(struct vfio_device *vdev, unsigned long iova,
+			unsigned long length, struct iova_bitmap *dirty);
 #endif /* MLX5_VFIO_CMD_H */
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH V2 vfio 10/11] vfio/mlx5: Manage error scenarios on tracker
  2022-07-14  8:12 [PATCH V2 vfio 00/11] Add device DMA logging support for mlx5 driver Yishai Hadas
                   ` (8 preceding siblings ...)
  2022-07-14  8:12 ` [PATCH V2 vfio 09/11] vfio/mlx5: Report dirty pages from tracker Yishai Hadas
@ 2022-07-14  8:12 ` Yishai Hadas
  2022-07-14  8:12 ` [PATCH V2 vfio 11/11] vfio/mlx5: Set the driver DMA logging callbacks Yishai Hadas
  2022-07-21  8:26 ` [PATCH V2 vfio 00/11] Add device DMA logging support for mlx5 driver Tian, Kevin
  11 siblings, 0 replies; 52+ messages in thread
From: Yishai Hadas @ 2022-07-14  8:12 UTC (permalink / raw)
  To: alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins, leonro,
	yishaih, maorg, cohuck

Handle async error events and the health/recovery flow so that the
tracker is safely stopped upon error scenarios.
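
At its core the handling relies on a simple pattern: mark the tracker
as being in error and wake any reader that may be sleeping on the CQ
completion, so the report loop bails out instead of blocking. A minimal
sketch of that pattern, with hypothetical names:

#include <linux/completion.h>
#include <linux/types.h>

/*
 * Sketch only: error paths set a flag and complete the same completion
 * the report loop sleeps on, so the loop wakes up, observes the error
 * and returns instead of blocking forever.
 */
struct tracker_sketch {
	bool is_err;
	struct completion comp;
};

static void tracker_sketch_set_error(struct tracker_sketch *t)
{
	t->is_err = true;
	complete(&t->comp);
}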

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/pci/mlx5/cmd.c | 61 +++++++++++++++++++++++++++++++++++--
 drivers/vfio/pci/mlx5/cmd.h |  2 ++
 2 files changed, 61 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index fa9ddd926500..3e92b4d92be2 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -70,6 +70,13 @@ int mlx5vf_cmd_query_vhca_migration_state(struct mlx5vf_pci_core_device *mvdev,
 	return 0;
 }
 
+static void set_tracker_error(struct mlx5vf_pci_core_device *mvdev)
+{
+	/* Mark the tracker under an error and wake it up if it's running */
+	mvdev->tracker.is_err = true;
+	complete(&mvdev->tracker_comp);
+}
+
 static int mlx5fv_vf_event(struct notifier_block *nb,
 			   unsigned long event, void *data)
 {
@@ -100,6 +107,8 @@ void mlx5vf_cmd_close_migratable(struct mlx5vf_pci_core_device *mvdev)
 	if (!mvdev->migrate_cap)
 		return;
 
+	/* Must be done outside the lock to let it progress */
+	set_tracker_error(mvdev);
 	mutex_lock(&mvdev->state_mutex);
 	mlx5vf_disable_fds(mvdev);
 	_mlx5vf_free_page_tracker_resources(mvdev);
@@ -619,6 +628,47 @@ static void mlx5vf_destroy_cq(struct mlx5_core_dev *mdev,
 	mlx5_db_free(mdev, &cq->db);
 }
 
+static void mlx5vf_cq_event(struct mlx5_core_cq *mcq, enum mlx5_event type)
+{
+	if (type != MLX5_EVENT_TYPE_CQ_ERROR)
+		return;
+
+	set_tracker_error(container_of(mcq, struct mlx5vf_pci_core_device,
+				       tracker.cq.mcq));
+}
+
+static int mlx5vf_event_notifier(struct notifier_block *nb, unsigned long type,
+				 void *data)
+{
+	struct mlx5_vhca_page_tracker *tracker =
+		mlx5_nb_cof(nb, struct mlx5_vhca_page_tracker, nb);
+	struct mlx5vf_pci_core_device *mvdev = container_of(
+		tracker, struct mlx5vf_pci_core_device, tracker);
+	struct mlx5_eqe *eqe = data;
+	u8 event_type = (u8)type;
+	u8 queue_type;
+	int qp_num;
+
+	switch (event_type) {
+	case MLX5_EVENT_TYPE_WQ_CATAS_ERROR:
+	case MLX5_EVENT_TYPE_WQ_ACCESS_ERROR:
+	case MLX5_EVENT_TYPE_WQ_INVAL_REQ_ERROR:
+		queue_type = eqe->data.qp_srq.type;
+		if (queue_type != MLX5_EVENT_QUEUE_TYPE_QP)
+			break;
+		qp_num = be32_to_cpu(eqe->data.qp_srq.qp_srq_n) & 0xffffff;
+		if (qp_num != tracker->host_qp->qpn &&
+		    qp_num != tracker->fw_qp->qpn)
+			break;
+		set_tracker_error(mvdev);
+		break;
+	default:
+		break;
+	}
+
+	return NOTIFY_OK;
+}
+
 static void mlx5vf_cq_complete(struct mlx5_core_cq *mcq,
 			       struct mlx5_eqe *eqe)
 {
@@ -680,6 +730,7 @@ static int mlx5vf_create_cq(struct mlx5_core_dev *mdev,
 	pas = (__be64 *)MLX5_ADDR_OF(create_cq_in, in, pas);
 	mlx5_fill_page_frag_array(&cq->buf.frag_buf, pas);
 	cq->mcq.comp = mlx5vf_cq_complete;
+	cq->mcq.event = mlx5vf_cq_event;
 	err = mlx5_core_create_cq(mdev, &cq->mcq, in, inlen, out, sizeof(out));
 	if (err)
 		goto err_vec;
@@ -1014,6 +1065,7 @@ _mlx5vf_free_page_tracker_resources(struct mlx5vf_pci_core_device *mvdev)
 
 	WARN_ON(mvdev->mdev_detach);
 
+	mlx5_eq_notifier_unregister(mdev, &tracker->nb);
 	mlx5vf_cmd_destroy_tracker(mdev, tracker->id);
 	mlx5vf_destroy_qp(mdev, tracker->fw_qp);
 	mlx5vf_free_qp_recv_resources(mdev, tracker->host_qp);
@@ -1127,6 +1179,8 @@ int mlx5vf_start_page_tracker(struct vfio_device *vdev,
 	if (err)
 		goto err_activate;
 
+	MLX5_NB_INIT(&tracker->nb, mlx5vf_event_notifier, NOTIFY_ANY);
+	mlx5_eq_notifier_register(mdev, &tracker->nb);
 	*page_size = host_qp->tracked_page_size;
 	mvdev->log_active = true;
 	mlx5vf_state_mutex_unlock(mvdev);
@@ -1273,7 +1327,8 @@ int mlx5vf_tracker_read_and_clear(struct vfio_device *vdev, unsigned long iova,
 		goto end;
 
 	tracker->status = MLX5_PAGE_TRACK_STATE_REPORTING;
-	while (tracker->status == MLX5_PAGE_TRACK_STATE_REPORTING) {
+	while (tracker->status == MLX5_PAGE_TRACK_STATE_REPORTING &&
+	       !tracker->is_err) {
 		poll_err = mlx5vf_cq_poll_one(cq, tracker->host_qp, dirty,
 					      &tracker->status);
 		if (poll_err == CQ_EMPTY) {
@@ -1294,8 +1349,10 @@ int mlx5vf_tracker_read_and_clear(struct vfio_device *vdev, unsigned long iova,
 	}
 
 	if (tracker->status == MLX5_PAGE_TRACK_STATE_ERROR)
-		err = -EIO;
+		tracker->is_err = true;
 
+	if (tracker->is_err)
+		err = -EIO;
 end:
 	mlx5vf_state_mutex_unlock(mvdev);
 	return err;
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index fa1f9ab4d3d0..8b0ae40c620c 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -82,10 +82,12 @@ struct mlx5_vhca_qp {
 struct mlx5_vhca_page_tracker {
 	u32 id;
 	u32 pdn;
+	u8 is_err:1;
 	struct mlx5_uars_page *uar;
 	struct mlx5_vhca_cq cq;
 	struct mlx5_vhca_qp *host_qp;
 	struct mlx5_vhca_qp *fw_qp;
+	struct mlx5_nb nb;
 	int status;
 };
 
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH V2 vfio 11/11] vfio/mlx5: Set the driver DMA logging callbacks
  2022-07-14  8:12 [PATCH V2 vfio 00/11] Add device DMA logging support for mlx5 driver Yishai Hadas
                   ` (9 preceding siblings ...)
  2022-07-14  8:12 ` [PATCH V2 vfio 10/11] vfio/mlx5: Manage error scenarios on tracker Yishai Hadas
@ 2022-07-14  8:12 ` Yishai Hadas
  2022-07-21  8:26 ` [PATCH V2 vfio 00/11] Add device DMA logging support for mlx5 driver Tian, Kevin
  11 siblings, 0 replies; 52+ messages in thread
From: Yishai Hadas @ 2022-07-14  8:12 UTC (permalink / raw)
  To: alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins, leonro,
	yishaih, maorg, cohuck

Now that everything is ready, set the driver DMA logging callbacks if
they are supported by the device.
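
For context, a hedged userspace sketch of how these callbacks end up
being exercised through the DMA logging FEATURE uAPI added earlier in
the series (illustration only, error handling omitted; the real
consumer is the matching QEMU series):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Illustration only: start device DMA logging on a single IOVA range
 * via VFIO_DEVICE_FEATURE.  'device_fd' is an open VFIO device fd and
 * the values are examples.
 */
static int start_dma_logging(int device_fd, uint64_t iova, uint64_t length)
{
	struct vfio_device_feature_dma_logging_range range = {
		.iova = iova,
		.length = length,
	};
	struct vfio_device_feature_dma_logging_control control = {
		.page_size = 4096,
		.num_ranges = 1,
		.ranges = (uintptr_t)&range,
	};
	size_t sz = sizeof(struct vfio_device_feature) + sizeof(control);
	struct vfio_device_feature *feature = calloc(1, sz);
	int ret;

	if (!feature)
		return -1;

	feature->argsz = sz;
	feature->flags = VFIO_DEVICE_FEATURE_SET |
			 VFIO_DEVICE_FEATURE_DMA_LOGGING_START;
	memcpy(feature->data, &control, sizeof(control));

	ret = ioctl(device_fd, VFIO_DEVICE_FEATURE, feature);
	free(feature);
	return ret;
}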

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/pci/mlx5/cmd.c  | 5 ++++-
 drivers/vfio/pci/mlx5/cmd.h  | 3 ++-
 drivers/vfio/pci/mlx5/main.c | 9 ++++++++-
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index 3e92b4d92be2..c604b70437a5 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -126,7 +126,8 @@ void mlx5vf_cmd_remove_migratable(struct mlx5vf_pci_core_device *mvdev)
 }
 
 void mlx5vf_cmd_set_migratable(struct mlx5vf_pci_core_device *mvdev,
-			       const struct vfio_migration_ops *mig_ops)
+			       const struct vfio_migration_ops *mig_ops,
+			       const struct vfio_log_ops *log_ops)
 {
 	struct pci_dev *pdev = mvdev->core_device.pdev;
 	int ret;
@@ -169,6 +170,8 @@ void mlx5vf_cmd_set_migratable(struct mlx5vf_pci_core_device *mvdev,
 		VFIO_MIGRATION_P2P;
 	mvdev->core_device.vdev.mig_ops = mig_ops;
 	init_completion(&mvdev->tracker_comp);
+	if (MLX5_CAP_GEN(mvdev->mdev, adv_virtualization))
+		mvdev->core_device.vdev.log_ops = log_ops;
 
 end:
 	mlx5_vf_put_core_dev(mvdev->mdev);
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index 8b0ae40c620c..921d5720a1e5 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -118,7 +118,8 @@ int mlx5vf_cmd_resume_vhca(struct mlx5vf_pci_core_device *mvdev, u16 op_mod);
 int mlx5vf_cmd_query_vhca_migration_state(struct mlx5vf_pci_core_device *mvdev,
 					  size_t *state_size);
 void mlx5vf_cmd_set_migratable(struct mlx5vf_pci_core_device *mvdev,
-			       const struct vfio_migration_ops *mig_ops);
+			       const struct vfio_migration_ops *mig_ops,
+			       const struct vfio_log_ops *log_ops);
 void mlx5vf_cmd_remove_migratable(struct mlx5vf_pci_core_device *mvdev);
 void mlx5vf_cmd_close_migratable(struct mlx5vf_pci_core_device *mvdev);
 int mlx5vf_cmd_save_vhca_state(struct mlx5vf_pci_core_device *mvdev,
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index a9b63d15c5d3..759a5f5f7b3f 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -579,6 +579,12 @@ static const struct vfio_migration_ops mlx5vf_pci_mig_ops = {
 	.migration_get_state = mlx5vf_pci_get_device_state,
 };
 
+static const struct vfio_log_ops mlx5vf_pci_log_ops = {
+	.log_start = mlx5vf_start_page_tracker,
+	.log_stop = mlx5vf_stop_page_tracker,
+	.log_read_and_clear = mlx5vf_tracker_read_and_clear,
+};
+
 static const struct vfio_device_ops mlx5vf_pci_ops = {
 	.name = "mlx5-vfio-pci",
 	.open_device = mlx5vf_pci_open_device,
@@ -602,7 +608,8 @@ static int mlx5vf_pci_probe(struct pci_dev *pdev,
 	if (!mvdev)
 		return -ENOMEM;
 	vfio_pci_core_init_device(&mvdev->core_device, pdev, &mlx5vf_pci_ops);
-	mlx5vf_cmd_set_migratable(mvdev, &mlx5vf_pci_mig_ops);
+	mlx5vf_cmd_set_migratable(mvdev, &mlx5vf_pci_mig_ops,
+				  &mlx5vf_pci_log_ops);
 	dev_set_drvdata(&pdev->dev, &mvdev->core_device);
 	ret = vfio_pci_core_register_device(&mvdev->core_device);
 	if (ret)
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH V2 vfio 03/11] vfio: Introduce DMA logging uAPIs
  2022-07-14  8:12 ` [PATCH V2 vfio 03/11] vfio: Introduce DMA logging uAPIs Yishai Hadas
@ 2022-07-18 22:29   ` Alex Williamson
  2022-07-19  1:39     ` Tian, Kevin
  2022-07-19  7:49     ` Yishai Hadas
  2022-07-21  8:45   ` Tian, Kevin
  1 sibling, 2 replies; 52+ messages in thread
From: Alex Williamson @ 2022-07-18 22:29 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: jgg, saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins,
	leonro, maorg, cohuck, Kirti Wankhede

On Thu, 14 Jul 2022 11:12:43 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:

> DMA logging allows a device to internally record what DMAs the device is
> initiating and report them back to userspace. It is part of the VFIO
> migration infrastructure that allows implementing dirty page tracking
> during the pre copy phase of live migration. Only DMA WRITEs are logged,
> and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
> 
> This patch introduces the DMA logging involved uAPIs.
> 
> It uses the FEATURE ioctl with its GET/SET/PROBE options as of below.
> 
> It exposes a PROBE option to detect if the device supports DMA logging.
> It exposes a SET option to start device DMA logging in given IOVAs
> ranges.
> It exposes a SET option to stop device DMA logging that was previously
> started.
> It exposes a GET option to read back and clear the device DMA log.
> 
> Extra details exist as part of vfio.h per a specific option.


Kevin, Kirti, others, any comments on this uAPI proposal?  Are there
potentially other devices that might make use of this or is everyone
else waiting for IOMMU based dirty tracking?

 
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  include/uapi/linux/vfio.h | 79 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 79 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 733a1cddde30..81475c3e7c92 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -986,6 +986,85 @@ enum vfio_device_mig_state {
>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>  };
>  
> +/*
> + * Upon VFIO_DEVICE_FEATURE_SET start device DMA logging.
> + * VFIO_DEVICE_FEATURE_PROBE can be used to detect if the device supports
> + * DMA logging.
> + *
> + * DMA logging allows a device to internally record what DMAs the device is
> + * initiating and report them back to userspace. It is part of the VFIO
> + * migration infrastructure that allows implementing dirty page tracking
> + * during the pre copy phase of live migration. Only DMA WRITEs are logged,
> + * and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
> + *
> + * When DMA logging is started a range of IOVAs to monitor is provided and the
> + * device can optimize its logging to cover only the IOVA range given. Each
> + * DMA that the device initiates inside the range will be logged by the device
> + * for later retrieval.
> + *
> + * page_size is an input that hints what tracking granularity the device
> + * should try to achieve. If the device cannot do the hinted page size then it
> + * should pick the next closest page size it supports. On output the device
> + * will return the page size it selected.
> + *
> + * ranges is a pointer to an array of
> + * struct vfio_device_feature_dma_logging_range.
> + */
> +struct vfio_device_feature_dma_logging_control {
> +	__aligned_u64 page_size;
> +	__u32 num_ranges;
> +	__u32 __reserved;
> +	__aligned_u64 ranges;
> +};
> +
> +struct vfio_device_feature_dma_logging_range {
> +	__aligned_u64 iova;
> +	__aligned_u64 length;
> +};
> +
> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_START 3
> +
> +/*
> + * Upon VFIO_DEVICE_FEATURE_SET stop device DMA logging that was started
> + * by VFIO_DEVICE_FEATURE_DMA_LOGGING_START
> + */
> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP 4
> +
> +/*
> + * Upon VFIO_DEVICE_FEATURE_GET read back and clear the device DMA log
> + *
> + * Query the device's DMA log for written pages within the given IOVA range.
> + * During querying the log is cleared for the IOVA range.
> + *
> + * bitmap is a pointer to an array of u64s that will hold the output bitmap
> + * with 1 bit reporting a page_size unit of IOVA. The mapping of IOVA to bits
> + * is given by:
> + *  bitmap[(addr - iova)/page_size] & (1ULL << (addr % 64))
> + *
> + * The input page_size can be any power of two value and does not have to
> + * match the value given to VFIO_DEVICE_FEATURE_DMA_LOGGING_START. The driver
> + * will format its internal logging to match the reporting page size, possibly
> + * by replicating bits if the internal page size is lower than requested.
> + *
> + * Bits will be updated in bitmap using atomic or to allow userspace to
> + * combine bitmaps from multiple trackers together. Therefore userspace must
> + * zero the bitmap before doing any reports.

Somewhat confusing, perhaps "between report sets"?

> + *
> + * If any error is returned userspace should assume that the dirty log is
> + * corrupted and restart.

Restart what?  The user can't just zero the bitmap and retry; the dirty
information at the device has been lost.  Are we suggesting they stop
DMA logging and restart it, which sounds a lot like failing a migration
and starting over?  Or could the user gratuitously mark the bitmap
fully dirty so that a subsequent logging report iteration might work?
Thanks,

Alex

> + *
> + * If DMA logging is not enabled, an error will be returned.
> + *
> + */
> +struct vfio_device_feature_dma_logging_report {
> +	__aligned_u64 iova;
> +	__aligned_u64 length;
> +	__aligned_u64 page_size;
> +	__aligned_u64 bitmap;
> +};
> +
> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT 5
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
>  
>  /**


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH V2 vfio 05/11] vfio: Add an IOVA bitmap support
  2022-07-14  8:12 ` [PATCH V2 vfio 05/11] vfio: Add an IOVA bitmap support Yishai Hadas
@ 2022-07-18 22:30   ` Alex Williamson
  2022-07-18 22:46     ` Jason Gunthorpe
  2022-07-19 19:01   ` Alex Williamson
  1 sibling, 1 reply; 52+ messages in thread
From: Alex Williamson @ 2022-07-18 22:30 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: jgg, saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins,
	leonro, maorg, cohuck

On Thu, 14 Jul 2022 11:12:45 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:

> From: Joao Martins <joao.m.martins@oracle.com>
> 
> The new facility adds a bunch of wrappers that abstract how an IOVA
> range is represented in a bitmap that is granulated by a given
> page_size. So it translates all the lifting of dealing with user
> pointers into its corresponding kernel addresses backing said user
> memory into doing finally the bitmap ops to change various bits.
> 
> The formula for the bitmap is:
> 
>    data[(iova / page_size) / 64] & (1ULL << (iova % 64))
> 
> Where 64 is the number of bits in a unsigned long (depending on arch)
> 
> An example usage of these helpers for a given @iova, @page_size, @length
> and __user @data:
> 
> 	iova_bitmap_init(&iter.dirty, iova, __ffs(page_size));
> 	ret = iova_bitmap_iter_init(&iter, iova, length, data);
> 	if (ret)
> 		return -ENOMEM;
> 
> 	for (; !iova_bitmap_iter_done(&iter);
> 	     iova_bitmap_iter_advance(&iter)) {
> 		ret = iova_bitmap_iter_get(&iter);
> 		if (ret)
> 			break;
> 		if (dirty)
> 			iova_bitmap_set(iova_bitmap_iova(&iter),
> 					iova_bitmap_iova_length(&iter),
> 					&iter.dirty);
> 
> 		iova_bitmap_iter_put(&iter);
> 
> 		if (ret)
> 			break;
> 	}
> 
> 	iova_bitmap_iter_free(&iter);
> 
> The facility is intended to be used for user bitmaps representing
> dirtied IOVAs by IOMMU (via IOMMUFD) and PCI Devices (via vfio-pci).
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> ---
>  drivers/vfio/Makefile       |   6 +-
>  drivers/vfio/iova_bitmap.c  | 164 ++++++++++++++++++++++++++++++++++++
>  include/linux/iova_bitmap.h |  46 ++++++++++

I'm still working my way through the guts of this, but why is it being
proposed within the vfio driver when this is not at all vfio specific,
proposes its own separate header, and doesn't conform with any of the
namespace conventions of being a sub-component of vfio?  Is this
ultimately meant for lib/ or perhaps an extension of iova.c within the
iommu subsystem?  Thanks,

Alex


>  3 files changed, 214 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/vfio/iova_bitmap.c
>  create mode 100644 include/linux/iova_bitmap.h
> 
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 1a32357592e3..1d6cad32d366 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -1,9 +1,11 @@
>  # SPDX-License-Identifier: GPL-2.0
>  vfio_virqfd-y := virqfd.o
>  
> -vfio-y += vfio_main.o
> -
>  obj-$(CONFIG_VFIO) += vfio.o
> +
> +vfio-y := vfio_main.o \
> +          iova_bitmap.o \
> +
>  obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o
>  obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
>  obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
> diff --git a/drivers/vfio/iova_bitmap.c b/drivers/vfio/iova_bitmap.c
> new file mode 100644
> index 000000000000..9ad1533a6aec
> --- /dev/null
> +++ b/drivers/vfio/iova_bitmap.c
> @@ -0,0 +1,164 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2022, Oracle and/or its affiliates.
> + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#include <linux/iova_bitmap.h>
> +
> +static unsigned long iova_bitmap_iova_to_index(struct iova_bitmap_iter *iter,
> +					       unsigned long iova_length)
> +{
> +	unsigned long pgsize = 1 << iter->dirty.pgshift;
> +
> +	return DIV_ROUND_UP(iova_length, BITS_PER_TYPE(*iter->data) * pgsize);
> +}
> +
> +static unsigned long iova_bitmap_index_to_iova(struct iova_bitmap_iter *iter,
> +					       unsigned long index)
> +{
> +	unsigned long pgshift = iter->dirty.pgshift;
> +
> +	return (index * sizeof(*iter->data) * BITS_PER_BYTE) << pgshift;
> +}
> +
> +static unsigned long iova_bitmap_iter_left(struct iova_bitmap_iter *iter)
> +{
> +	unsigned long left = iter->count - iter->offset;
> +
> +	left = min_t(unsigned long, left,
> +		     (iter->dirty.npages << PAGE_SHIFT) / sizeof(*iter->data));
> +
> +	return left;
> +}
> +
> +/*
> + * Input argument of number of bits to bitmap_set() is unsigned integer, which
> + * further casts to signed integer for unaligned multi-bit operation,
> + * __bitmap_set().
> + * Then maximum bitmap size supported is 2^31 bits divided by 2^3 bits/byte,
> + * that is 2^28 (256 MB) which maps to 2^31 * 2^12 = 2^43 (8TB) on 4K page
> + * system.
> + */
> +int iova_bitmap_iter_init(struct iova_bitmap_iter *iter,
> +			  unsigned long iova, unsigned long length,
> +			  u64 __user *data)
> +{
> +	struct iova_bitmap *dirty = &iter->dirty;
> +
> +	iter->data = data;
> +	iter->offset = 0;
> +	iter->count = iova_bitmap_iova_to_index(iter, length);
> +	iter->iova = iova;
> +	iter->length = length;
> +	dirty->pages = (struct page **)__get_free_page(GFP_KERNEL);
> +
> +	return !dirty->pages ? -ENOMEM : 0;
> +}
> +
> +void iova_bitmap_iter_free(struct iova_bitmap_iter *iter)
> +{
> +	struct iova_bitmap *dirty = &iter->dirty;
> +
> +	if (dirty->pages) {
> +		free_page((unsigned long)dirty->pages);
> +		dirty->pages = NULL;
> +	}
> +}
> +
> +bool iova_bitmap_iter_done(struct iova_bitmap_iter *iter)
> +{
> +	return iter->offset >= iter->count;
> +}
> +
> +unsigned long iova_bitmap_length(struct iova_bitmap_iter *iter)
> +{
> +	unsigned long max_iova = iter->dirty.iova + iter->length;
> +	unsigned long left = iova_bitmap_iter_left(iter);
> +	unsigned long iova = iova_bitmap_iova(iter);
> +
> +	left = iova_bitmap_index_to_iova(iter, left);
> +	if (iova + left > max_iova)
> +		left -= ((iova + left) - max_iova);
> +
> +	return left;
> +}
> +
> +unsigned long iova_bitmap_iova(struct iova_bitmap_iter *iter)
> +{
> +	unsigned long skip = iter->offset;
> +
> +	return iter->iova + iova_bitmap_index_to_iova(iter, skip);
> +}
> +
> +void iova_bitmap_iter_advance(struct iova_bitmap_iter *iter)
> +{
> +	unsigned long length = iova_bitmap_length(iter);
> +
> +	iter->offset += iova_bitmap_iova_to_index(iter, length);
> +}
> +
> +void iova_bitmap_iter_put(struct iova_bitmap_iter *iter)
> +{
> +	struct iova_bitmap *dirty = &iter->dirty;
> +
> +	if (dirty->npages)
> +		unpin_user_pages(dirty->pages, dirty->npages);
> +}
> +
> +int iova_bitmap_iter_get(struct iova_bitmap_iter *iter)
> +{
> +	struct iova_bitmap *dirty = &iter->dirty;
> +	unsigned long npages;
> +	u64 __user *addr;
> +	long ret;
> +
> +	npages = DIV_ROUND_UP((iter->count - iter->offset) *
> +			      sizeof(*iter->data), PAGE_SIZE);
> +	npages = min(npages,  PAGE_SIZE / sizeof(struct page *));
> +	addr = iter->data + iter->offset;
> +	ret = pin_user_pages_fast((unsigned long)addr, npages,
> +				  FOLL_WRITE, dirty->pages);
> +	if (ret <= 0)
> +		return ret;
> +
> +	dirty->npages = (unsigned long)ret;
> +	dirty->iova = iova_bitmap_iova(iter);
> +	dirty->start_offset = offset_in_page(addr);
> +	return 0;
> +}
> +
> +void iova_bitmap_init(struct iova_bitmap *bitmap,
> +		      unsigned long base, unsigned long pgshift)
> +{
> +	memset(bitmap, 0, sizeof(*bitmap));
> +	bitmap->iova = base;
> +	bitmap->pgshift = pgshift;
> +}
> +
> +unsigned int iova_bitmap_set(struct iova_bitmap *dirty,
> +			     unsigned long iova,
> +			     unsigned long length)
> +{
> +	unsigned long nbits, offset, start_offset, idx, size, *kaddr;
> +
> +	nbits = max(1UL, length >> dirty->pgshift);
> +	offset = (iova - dirty->iova) >> dirty->pgshift;
> +	idx = offset / (PAGE_SIZE * BITS_PER_BYTE);
> +	offset = offset % (PAGE_SIZE * BITS_PER_BYTE);
> +	start_offset = dirty->start_offset;
> +
> +	while (nbits > 0) {
> +		kaddr = kmap_local_page(dirty->pages[idx]) + start_offset;
> +		size = min(PAGE_SIZE * BITS_PER_BYTE - offset, nbits);
> +		bitmap_set(kaddr, offset, size);
> +		kunmap_local(kaddr - start_offset);
> +		start_offset = offset = 0;
> +		nbits -= size;
> +		idx++;
> +	}
> +
> +	return nbits;
> +}
> +EXPORT_SYMBOL_GPL(iova_bitmap_set);
> +
> diff --git a/include/linux/iova_bitmap.h b/include/linux/iova_bitmap.h
> new file mode 100644
> index 000000000000..c474c351634a
> --- /dev/null
> +++ b/include/linux/iova_bitmap.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (c) 2022, Oracle and/or its affiliates.
> + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#ifndef _IOVA_BITMAP_H_
> +#define _IOVA_BITMAP_H_
> +
> +#include <linux/highmem.h>
> +#include <linux/mm.h>
> +#include <linux/uio.h>
> +
> +struct iova_bitmap {
> +	unsigned long iova;
> +	unsigned long pgshift;
> +	unsigned long start_offset;
> +	unsigned long npages;
> +	struct page **pages;
> +};
> +
> +struct iova_bitmap_iter {
> +	struct iova_bitmap dirty;
> +	u64 __user *data;
> +	size_t offset;
> +	size_t count;
> +	unsigned long iova;
> +	unsigned long length;
> +};
> +
> +int iova_bitmap_iter_init(struct iova_bitmap_iter *iter, unsigned long iova,
> +			  unsigned long length, u64 __user *data);
> +void iova_bitmap_iter_free(struct iova_bitmap_iter *iter);
> +bool iova_bitmap_iter_done(struct iova_bitmap_iter *iter);
> +unsigned long iova_bitmap_length(struct iova_bitmap_iter *iter);
> +unsigned long iova_bitmap_iova(struct iova_bitmap_iter *iter);
> +void iova_bitmap_iter_advance(struct iova_bitmap_iter *iter);
> +int iova_bitmap_iter_get(struct iova_bitmap_iter *iter);
> +void iova_bitmap_iter_put(struct iova_bitmap_iter *iter);
> +void iova_bitmap_init(struct iova_bitmap *bitmap,
> +		      unsigned long base, unsigned long pgshift);
> +unsigned int iova_bitmap_set(struct iova_bitmap *dirty,
> +			     unsigned long iova,
> +			     unsigned long length);
> +
> +#endif


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH V2 vfio 06/11] vfio: Introduce the DMA logging feature support
  2022-07-14  8:12 ` [PATCH V2 vfio 06/11] vfio: Introduce the DMA logging feature support Yishai Hadas
@ 2022-07-18 22:30   ` Alex Williamson
  2022-07-19  9:19     ` Yishai Hadas
  0 siblings, 1 reply; 52+ messages in thread
From: Alex Williamson @ 2022-07-18 22:30 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: jgg, saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins,
	leonro, maorg, cohuck

On Thu, 14 Jul 2022 11:12:46 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:

> Introduce the DMA logging feature support in the vfio core layer.
> 
> It includes the processing of the device start/stop/report DMA logging
> UAPIs and calling the relevant driver 'op' to do the work.
> 
> Specifically,
> Upon start, the core translates the given input ranges into an interval
> tree, checks for unexpected overlapping, non aligned ranges and then
> pass the translated input to the driver for start tracking the given
> ranges.
> 
> Upon report, the core translates the given input user space bitmap and
> page size into an IOVA kernel bitmap iterator. Then it iterates it and
> call the driver to set the corresponding bits for the dirtied pages in a
> specific IOVA range.
> 
> Upon stop, the driver is called to stop the previous started tracking.
> 
> The next patches from the series will introduce the mlx5 driver
> implementation for the logging ops.
> 
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> ---
>  drivers/vfio/Kconfig             |   1 +
>  drivers/vfio/pci/vfio_pci_core.c |   5 +
>  drivers/vfio/vfio_main.c         | 161 +++++++++++++++++++++++++++++++
>  include/linux/vfio.h             |  21 +++-
>  4 files changed, 186 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index 6130d00252ed..86c381ceb9a1 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -3,6 +3,7 @@ menuconfig VFIO
>  	tristate "VFIO Non-Privileged userspace driver framework"
>  	select IOMMU_API
>  	select VFIO_IOMMU_TYPE1 if MMU && (X86 || S390 || ARM || ARM64)
> +	select INTERVAL_TREE
>  	help
>  	  VFIO provides a framework for secure userspace device drivers.
>  	  See Documentation/driver-api/vfio.rst for more details.
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 2efa06b1fafa..b6dabf398251 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1862,6 +1862,11 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>  			return -EINVAL;
>  	}
>  
> +	if (vdev->vdev.log_ops && !(vdev->vdev.log_ops->log_start &&
> +	    vdev->vdev.log_ops->log_stop &&
> +	    vdev->vdev.log_ops->log_read_and_clear))
> +		return -EINVAL;
> +
>  	/*
>  	 * Prevent binding to PFs with VFs enabled, the VFs might be in use
>  	 * by the host or other users.  We cannot capture the VFs if they
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index bd84ca7c5e35..2414d827e3c8 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -32,6 +32,8 @@
>  #include <linux/vfio.h>
>  #include <linux/wait.h>
>  #include <linux/sched/signal.h>
> +#include <linux/interval_tree.h>
> +#include <linux/iova_bitmap.h>
>  #include "vfio.h"
>  
>  #define DRIVER_VERSION	"0.3"
> @@ -1603,6 +1605,153 @@ static int vfio_ioctl_device_feature_migration(struct vfio_device *device,
>  	return 0;
>  }
>  
> +#define LOG_MAX_RANGES 1024
> +
> +static int
> +vfio_ioctl_device_feature_logging_start(struct vfio_device *device,
> +					u32 flags, void __user *arg,
> +					size_t argsz)
> +{
> +	size_t minsz =
> +		offsetofend(struct vfio_device_feature_dma_logging_control,
> +			    ranges);
> +	struct vfio_device_feature_dma_logging_range __user *ranges;
> +	struct vfio_device_feature_dma_logging_control control;
> +	struct vfio_device_feature_dma_logging_range range;
> +	struct rb_root_cached root = RB_ROOT_CACHED;
> +	struct interval_tree_node *nodes;
> +	u32 nnodes;
> +	int i, ret;
> +
> +	if (!device->log_ops)
> +		return -ENOTTY;
> +
> +	ret = vfio_check_feature(flags, argsz,
> +				 VFIO_DEVICE_FEATURE_SET,
> +				 sizeof(control));
> +	if (ret != 1)
> +		return ret;
> +
> +	if (copy_from_user(&control, arg, minsz))
> +		return -EFAULT;
> +
> +	nnodes = control.num_ranges;
> +	if (!nnodes || nnodes > LOG_MAX_RANGES)
> +		return -EINVAL;

The latter looks more like an -E2BIG errno.  This is a hard coded
limit, but what are the heuristics?  Can a user introspect the limit?
Thanks,

Alex

> +
> +	ranges = u64_to_user_ptr(control.ranges);
> +	nodes = kmalloc_array(nnodes, sizeof(struct interval_tree_node),
> +			      GFP_KERNEL);
> +	if (!nodes)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < nnodes; i++) {
> +		if (copy_from_user(&range, &ranges[i], sizeof(range))) {
> +			ret = -EFAULT;
> +			goto end;
> +		}
> +		if (!IS_ALIGNED(range.iova, control.page_size) ||
> +		    !IS_ALIGNED(range.length, control.page_size)) {
> +			ret = -EINVAL;
> +			goto end;
> +		}
> +		nodes[i].start = range.iova;
> +		nodes[i].last = range.iova + range.length - 1;
> +		if (interval_tree_iter_first(&root, nodes[i].start,
> +					     nodes[i].last)) {
> +			/* Range overlapping */
> +			ret = -EINVAL;
> +			goto end;
> +		}
> +		interval_tree_insert(nodes + i, &root);
> +	}
> +
> +	ret = device->log_ops->log_start(device, &root, nnodes,
> +					 &control.page_size);
> +	if (ret)
> +		goto end;
> +
> +	if (copy_to_user(arg, &control, sizeof(control))) {
> +		ret = -EFAULT;
> +		device->log_ops->log_stop(device);
> +	}
> +
> +end:
> +	kfree(nodes);
> +	return ret;
> +}
> +
> +static int
> +vfio_ioctl_device_feature_logging_stop(struct vfio_device *device,
> +				       u32 flags, void __user *arg,
> +				       size_t argsz)
> +{
> +	int ret;
> +
> +	if (!device->log_ops)
> +		return -ENOTTY;
> +
> +	ret = vfio_check_feature(flags, argsz,
> +				 VFIO_DEVICE_FEATURE_SET, 0);
> +	if (ret != 1)
> +		return ret;
> +
> +	return device->log_ops->log_stop(device);
> +}
> +
> +static int
> +vfio_ioctl_device_feature_logging_report(struct vfio_device *device,
> +					 u32 flags, void __user *arg,
> +					 size_t argsz)
> +{
> +	size_t minsz =
> +		offsetofend(struct vfio_device_feature_dma_logging_report,
> +			    bitmap);
> +	struct vfio_device_feature_dma_logging_report report;
> +	struct iova_bitmap_iter iter;
> +	int ret;
> +
> +	if (!device->log_ops)
> +		return -ENOTTY;
> +
> +	ret = vfio_check_feature(flags, argsz,
> +				 VFIO_DEVICE_FEATURE_GET,
> +				 sizeof(report));
> +	if (ret != 1)
> +		return ret;
> +
> +	if (copy_from_user(&report, arg, minsz))
> +		return -EFAULT;
> +
> +	if (report.page_size < PAGE_SIZE)
> +		return -EINVAL;
> +
> +	iova_bitmap_init(&iter.dirty, report.iova, ilog2(report.page_size));
> +	ret = iova_bitmap_iter_init(&iter, report.iova, report.length,
> +				    u64_to_user_ptr(report.bitmap));
> +	if (ret)
> +		return ret;
> +
> +	for (; !iova_bitmap_iter_done(&iter);
> +	     iova_bitmap_iter_advance(&iter)) {
> +		ret = iova_bitmap_iter_get(&iter);
> +		if (ret)
> +			break;
> +
> +		ret = device->log_ops->log_read_and_clear(device,
> +			iova_bitmap_iova(&iter),
> +			iova_bitmap_length(&iter), &iter.dirty);
> +
> +		iova_bitmap_iter_put(&iter);
> +
> +		if (ret)
> +			break;
> +	}
> +
> +	iova_bitmap_iter_free(&iter);
> +	return ret;
> +}
> +
>  static int vfio_ioctl_device_feature(struct vfio_device *device,
>  				     struct vfio_device_feature __user *arg)
>  {
> @@ -1636,6 +1785,18 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
>  		return vfio_ioctl_device_feature_mig_device_state(
>  			device, feature.flags, arg->data,
>  			feature.argsz - minsz);
> +	case VFIO_DEVICE_FEATURE_DMA_LOGGING_START:
> +		return vfio_ioctl_device_feature_logging_start(
> +			device, feature.flags, arg->data,
> +			feature.argsz - minsz);
> +	case VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP:
> +		return vfio_ioctl_device_feature_logging_stop(
> +			device, feature.flags, arg->data,
> +			feature.argsz - minsz);
> +	case VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT:
> +		return vfio_ioctl_device_feature_logging_report(
> +			device, feature.flags, arg->data,
> +			feature.argsz - minsz);
>  	default:
>  		if (unlikely(!device->ops->device_feature))
>  			return -EINVAL;
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 4d26e149db81..feed84d686ec 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -14,6 +14,7 @@
>  #include <linux/workqueue.h>
>  #include <linux/poll.h>
>  #include <uapi/linux/vfio.h>
> +#include <linux/iova_bitmap.h>
>  
>  struct kvm;
>  
> @@ -33,10 +34,11 @@ struct vfio_device {
>  	struct device *dev;
>  	const struct vfio_device_ops *ops;
>  	/*
> -	 * mig_ops is a static property of the vfio_device which must be set
> -	 * prior to registering the vfio_device.
> +	 * mig_ops/log_ops is a static property of the vfio_device which must
> +	 * be set prior to registering the vfio_device.
>  	 */
>  	const struct vfio_migration_ops *mig_ops;
> +	const struct vfio_log_ops *log_ops;
>  	struct vfio_group *group;
>  	struct vfio_device_set *dev_set;
>  	struct list_head dev_set_list;
> @@ -104,6 +106,21 @@ struct vfio_migration_ops {
>  				   enum vfio_device_mig_state *curr_state);
>  };
>  
> +/**
> + * @log_start: Optional callback to ask the device start DMA logging.
> + * @log_stop: Optional callback to ask the device stop DMA logging.
> + * @log_read_and_clear: Optional callback to ask the device read
> + *         and clear the dirty DMAs in some given range.
> + */
> +struct vfio_log_ops {
> +	int (*log_start)(struct vfio_device *device,
> +		struct rb_root_cached *ranges, u32 nnodes, u64 *page_size);
> +	int (*log_stop)(struct vfio_device *device);
> +	int (*log_read_and_clear)(struct vfio_device *device,
> +		unsigned long iova, unsigned long length,
> +		struct iova_bitmap *dirty);
> +};
> +
>  /**
>   * vfio_check_feature - Validate user input for the VFIO_DEVICE_FEATURE ioctl
>   * @flags: Arg from the device_feature op


^ permalink raw reply	[flat|nested] 52+ messages in thread
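
For orientation, a variant driver opts into this feature by filling in the
new log_ops before registration.  A minimal sketch is below; the foo_*
handler names and their stub bodies are hypothetical, while the callback
signatures and the vdev->vdev.log_ops hook follow the patch quoted above.

#include <linux/errno.h>
#include <linux/vfio.h>

/* Placeholder handlers; a real driver would program its device here. */
static int foo_log_start(struct vfio_device *device,
			 struct rb_root_cached *ranges, u32 nnodes,
			 u64 *page_size)
{
	return -EOPNOTSUPP;
}

static int foo_log_stop(struct vfio_device *device)
{
	return -EOPNOTSUPP;
}

static int foo_log_read_and_clear(struct vfio_device *device,
				  unsigned long iova, unsigned long length,
				  struct iova_bitmap *dirty)
{
	return -EOPNOTSUPP;
}

static const struct vfio_log_ops foo_log_ops = {
	.log_start = foo_log_start,
	.log_stop = foo_log_stop,
	.log_read_and_clear = foo_log_read_and_clear,
};

/* In the driver's probe path, before vfio registration:
 *	vdev->vdev.log_ops = &foo_log_ops;
 */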

* Re: [PATCH V2 vfio 05/11] vfio: Add an IOVA bitmap support
  2022-07-18 22:30   ` Alex Williamson
@ 2022-07-18 22:46     ` Jason Gunthorpe
  0 siblings, 0 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2022-07-18 22:46 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, saeedm, kvm, netdev, kuba, kevin.tian,
	joao.m.martins, leonro, maorg, cohuck

On Mon, Jul 18, 2022 at 04:30:10PM -0600, Alex Williamson wrote:

> > Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> > Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> > ---
> >  drivers/vfio/Makefile       |   6 +-
> >  drivers/vfio/iova_bitmap.c  | 164 ++++++++++++++++++++++++++++++++++++
> >  include/linux/iova_bitmap.h |  46 ++++++++++
> 
> I'm still working my way through the guts of this, but why is it being
> proposed within the vfio driver when this is not at all vfio specific,
> proposes it's own separate header, and doesn't conform with any of the
> namespace conventions of being a sub-component of vfio?  Is this
> ultimately meant for lib/ or perhaps an extension of iova.c within the
> iommu subsystem?  Thanks,

I am expecting that when iommufd dirty tracking comes we will move this
file into drivers/iommu/iommufd/ and it will provide it. So it was
written to make that a simple rename vs. changing everything.

Until we have that, this seems like the best place for it.

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH V2 vfio 03/11] vfio: Introduce DMA logging uAPIs
  2022-07-18 22:29   ` Alex Williamson
@ 2022-07-19  1:39     ` Tian, Kevin
  2022-07-19  5:40       ` Kirti Wankhede
  2022-07-19  7:49     ` Yishai Hadas
  1 sibling, 1 reply; 52+ messages in thread
From: Tian, Kevin @ 2022-07-19  1:39 UTC (permalink / raw)
  To: Alex Williamson, Yishai Hadas
  Cc: jgg, saeedm, kvm, netdev, kuba, Martins, Joao, leonro, maorg,
	cohuck, Kirti Wankhede

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Tuesday, July 19, 2022 6:30 AM
> 
> On Thu, 14 Jul 2022 11:12:43 +0300
> Yishai Hadas <yishaih@nvidia.com> wrote:
> 
> > DMA logging allows a device to internally record what DMAs the device is
> > initiating and report them back to userspace. It is part of the VFIO
> > migration infrastructure that allows implementing dirty page tracking
> > during the pre copy phase of live migration. Only DMA WRITEs are logged,
> > and this API is not connected to
> VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
> >
> > This patch introduces the DMA logging involved uAPIs.
> >
> > It uses the FEATURE ioctl with its GET/SET/PROBE options as of below.
> >
> > It exposes a PROBE option to detect if the device supports DMA logging.
> > It exposes a SET option to start device DMA logging in given IOVAs
> > ranges.
> > It exposes a SET option to stop device DMA logging that was previously
> > started.
> > It exposes a GET option to read back and clear the device DMA log.
> >
> > Extra details exist as part of vfio.h per a specific option.
> 
> 
> Kevin, Kirti, others, any comments on this uAPI proposal?  Are there
> potentially other devices that might make use of this or is everyone
> else waiting for IOMMU based dirty tracking?
> 

I plan to take a look later this week.

From the Intel side I'm not aware of such a device so far, and IOMMU-based
dirty tracking would be the standard way to go.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH V2 vfio 03/11] vfio: Introduce DMA logging uAPIs
  2022-07-19  1:39     ` Tian, Kevin
@ 2022-07-19  5:40       ` Kirti Wankhede
  0 siblings, 0 replies; 52+ messages in thread
From: Kirti Wankhede @ 2022-07-19  5:40 UTC (permalink / raw)
  To: Tian, Kevin, Alex Williamson, Yishai Hadas
  Cc: jgg, saeedm, kvm, netdev, kuba, Martins, Joao, leonro, maorg,
	cohuck, Neo Jia, Tarun Gupta, Shounak Deshpande



On 7/19/2022 7:09 AM, Tian, Kevin wrote:
>> From: Alex Williamson <alex.williamson@redhat.com>
>> Sent: Tuesday, July 19, 2022 6:30 AM
>>
>> On Thu, 14 Jul 2022 11:12:43 +0300
>> Yishai Hadas <yishaih@nvidia.com> wrote:
>>
>>> DMA logging allows a device to internally record what DMAs the device is
>>> initiating and report them back to userspace. It is part of the VFIO
>>> migration infrastructure that allows implementing dirty page tracking
>>> during the pre copy phase of live migration. Only DMA WRITEs are logged,
>>> and this API is not connected to
>> VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
>>>
>>> This patch introduces the DMA logging involved uAPIs.
>>>
>>> It uses the FEATURE ioctl with its GET/SET/PROBE options as of below.
>>>
>>> It exposes a PROBE option to detect if the device supports DMA logging.
>>> It exposes a SET option to start device DMA logging in given IOVAs
>>> ranges.
>>> It exposes a SET option to stop device DMA logging that was previously
>>> started.
>>> It exposes a GET option to read back and clear the device DMA log.
>>>
>>> Extra details exist as part of vfio.h per a specific option.
>>
>>
>> Kevin, Kirti, others, any comments on this uAPI proposal?  Are there
>> potentially other devices that might make use of this or is everyone
>> else waiting for IOMMU based dirty tracking?
>>
> 

I had briefly skimmed through it; I'm taking a closer look at it.

NVIDIA vGPU might use this API.

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH V2 vfio 03/11] vfio: Introduce DMA logging uAPIs
  2022-07-18 22:29   ` Alex Williamson
  2022-07-19  1:39     ` Tian, Kevin
@ 2022-07-19  7:49     ` Yishai Hadas
  2022-07-19 19:57       ` Alex Williamson
  1 sibling, 1 reply; 52+ messages in thread
From: Yishai Hadas @ 2022-07-19  7:49 UTC (permalink / raw)
  To: Alex Williamson
  Cc: jgg, saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins,
	leonro, maorg, cohuck, Kirti Wankhede

On 19/07/2022 1:29, Alex Williamson wrote:
> On Thu, 14 Jul 2022 11:12:43 +0300
> Yishai Hadas <yishaih@nvidia.com> wrote:
>
>> DMA logging allows a device to internally record what DMAs the device is
>> initiating and report them back to userspace. It is part of the VFIO
>> migration infrastructure that allows implementing dirty page tracking
>> during the pre copy phase of live migration. Only DMA WRITEs are logged,
>> and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
>>
>> This patch introduces the DMA logging involved uAPIs.
>>
>> It uses the FEATURE ioctl with its GET/SET/PROBE options as of below.
>>
>> It exposes a PROBE option to detect if the device supports DMA logging.
>> It exposes a SET option to start device DMA logging in given IOVAs
>> ranges.
>> It exposes a SET option to stop device DMA logging that was previously
>> started.
>> It exposes a GET option to read back and clear the device DMA log.
>>
>> Extra details exist as part of vfio.h per a specific option.
>
> Kevin, Kirti, others, any comments on this uAPI proposal?  Are there
> potentially other devices that might make use of this or is everyone
> else waiting for IOMMU based dirty tracking?
>
>   
>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>> ---
>>   include/uapi/linux/vfio.h | 79 +++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 79 insertions(+)
>>
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 733a1cddde30..81475c3e7c92 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -986,6 +986,85 @@ enum vfio_device_mig_state {
>>   	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>>   };
>>   
>> +/*
>> + * Upon VFIO_DEVICE_FEATURE_SET start device DMA logging.
>> + * VFIO_DEVICE_FEATURE_PROBE can be used to detect if the device supports
>> + * DMA logging.
>> + *
>> + * DMA logging allows a device to internally record what DMAs the device is
>> + * initiating and report them back to userspace. It is part of the VFIO
>> + * migration infrastructure that allows implementing dirty page tracking
>> + * during the pre copy phase of live migration. Only DMA WRITEs are logged,
>> + * and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
>> + *
>> + * When DMA logging is started a range of IOVAs to monitor is provided and the
>> + * device can optimize its logging to cover only the IOVA range given. Each
>> + * DMA that the device initiates inside the range will be logged by the device
>> + * for later retrieval.
>> + *
>> + * page_size is an input that hints what tracking granularity the device
>> + * should try to achieve. If the device cannot do the hinted page size then it
>> + * should pick the next closest page size it supports. On output the device
>> + * will return the page size it selected.
>> + *
>> + * ranges is a pointer to an array of
>> + * struct vfio_device_feature_dma_logging_range.
>> + */
>> +struct vfio_device_feature_dma_logging_control {
>> +	__aligned_u64 page_size;
>> +	__u32 num_ranges;
>> +	__u32 __reserved;
>> +	__aligned_u64 ranges;
>> +};
>> +
>> +struct vfio_device_feature_dma_logging_range {
>> +	__aligned_u64 iova;
>> +	__aligned_u64 length;
>> +};
>> +
>> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_START 3
>> +
>> +/*
>> + * Upon VFIO_DEVICE_FEATURE_SET stop device DMA logging that was started
>> + * by VFIO_DEVICE_FEATURE_DMA_LOGGING_START
>> + */
>> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP 4
>> +
>> +/*
>> + * Upon VFIO_DEVICE_FEATURE_GET read back and clear the device DMA log
>> + *
>> + * Query the device's DMA log for written pages within the given IOVA range.
>> + * During querying the log is cleared for the IOVA range.
>> + *
>> + * bitmap is a pointer to an array of u64s that will hold the output bitmap
>> + * with 1 bit reporting a page_size unit of IOVA. The mapping of IOVA to bits
>> + * is given by:
>> + *  bitmap[(addr - iova)/page_size] & (1ULL << (addr % 64))
>> + *
>> + * The input page_size can be any power of two value and does not have to
>> + * match the value given to VFIO_DEVICE_FEATURE_DMA_LOGGING_START. The driver
>> + * will format its internal logging to match the reporting page size, possibly
>> + * by replicating bits if the internal page size is lower than requested.
>> + *
>> + * Bits will be updated in bitmap using atomic or to allow userspace to
>> + * combine bitmaps from multiple trackers together. Therefore userspace must
>> + * zero the bitmap before doing any reports.
> Somewhat confusing, perhaps "between report sets"?

The idea was that the driver just turns on its own dirty bits and 
doesn't touch others.

Do you suggest the below?

"Therefore userspace must zero the bitmap between report sets".

>
>> + *
>> + * If any error is returned userspace should assume that the dirty log is
>> + * corrupted and restart.
> Restart what?  The user can't just zero the bitmap and retry, dirty
> information at the device has been lost.

Right

>   Are we suggesting they stop
> DMA logging and restart it, which sounds a lot like failing a migration
> and starting over.  Or could the user gratuitously mark the bitmap
> fully dirty and a subsequent logging report iteration might work?
> Thanks,
>
> Alex

An error at that step is not expected and might be fatal.

User space can consider marking all as dirty and continue with that 
approach for next iterations, maybe even without calling the driver.

Alternatively, user space can abort the migration and retry later on.

We can come up with some rephrasing along those lines.

What do you think?

Yishai

>> + *
>> + * If DMA logging is not enabled, an error will be returned.
>> + *
>> + */
>> +struct vfio_device_feature_dma_logging_report {
>> +	__aligned_u64 iova;
>> +	__aligned_u64 length;
>> +	__aligned_u64 page_size;
>> +	__aligned_u64 bitmap;
>> +};
>> +
>> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT 5
>> +
>>   /* -------- API for Type1 VFIO IOMMU -------- */
>>   
>>   /**



^ permalink raw reply	[flat|nested] 52+ messages in thread
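
For reference, a minimal userspace sketch of issuing the report discussed
above; the feature/report layout follows the proposed uAPI, while the file
descriptor handling and buffer sizing are illustrative and error handling is
kept minimal.  As noted in the thread, the bitmap buffer must be zeroed by
the caller because the kernel only ever ORs bits into it.

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Query dirty pages for [iova, iova + length) at page_size granularity. */
static int dma_logging_report(int device_fd, uint64_t iova, uint64_t length,
			      uint64_t page_size, uint64_t *bitmap)
{
	size_t argsz = sizeof(struct vfio_device_feature) +
		       sizeof(struct vfio_device_feature_dma_logging_report);
	struct vfio_device_feature *feature = calloc(1, argsz);
	struct vfio_device_feature_dma_logging_report *report;
	int ret;

	if (!feature)
		return -1;

	report = (void *)feature->data;
	feature->argsz = argsz;
	feature->flags = VFIO_DEVICE_FEATURE_GET |
			 VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT;
	report->iova = iova;
	report->length = length;
	report->page_size = page_size;
	report->bitmap = (uintptr_t)bitmap;	/* caller pre-zeroes this */

	ret = ioctl(device_fd, VFIO_DEVICE_FEATURE, feature);
	free(feature);
	return ret;
}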

* Re: [PATCH V2 vfio 06/11] vfio: Introduce the DMA logging feature support
  2022-07-18 22:30   ` Alex Williamson
@ 2022-07-19  9:19     ` Yishai Hadas
  2022-07-19 19:25       ` Alex Williamson
  0 siblings, 1 reply; 52+ messages in thread
From: Yishai Hadas @ 2022-07-19  9:19 UTC (permalink / raw)
  To: Alex Williamson
  Cc: jgg, saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins,
	leonro, maorg, cohuck

On 19/07/2022 1:30, Alex Williamson wrote:
> On Thu, 14 Jul 2022 11:12:46 +0300
> Yishai Hadas <yishaih@nvidia.com> wrote:
>
>> Introduce the DMA logging feature support in the vfio core layer.
>>
>> It includes the processing of the device start/stop/report DMA logging
>> UAPIs and calling the relevant driver 'op' to do the work.
>>
>> Specifically,
>> Upon start, the core translates the given input ranges into an interval
>> tree, checks for unexpected overlapping, non aligned ranges and then
>> pass the translated input to the driver for start tracking the given
>> ranges.
>>
>> Upon report, the core translates the given input user space bitmap and
>> page size into an IOVA kernel bitmap iterator. Then it iterates it and
>> call the driver to set the corresponding bits for the dirtied pages in a
>> specific IOVA range.
>>
>> Upon stop, the driver is called to stop the previous started tracking.
>>
>> The next patches from the series will introduce the mlx5 driver
>> implementation for the logging ops.
>>
>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>> ---
>>   drivers/vfio/Kconfig             |   1 +
>>   drivers/vfio/pci/vfio_pci_core.c |   5 +
>>   drivers/vfio/vfio_main.c         | 161 +++++++++++++++++++++++++++++++
>>   include/linux/vfio.h             |  21 +++-
>>   4 files changed, 186 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
>> index 6130d00252ed..86c381ceb9a1 100644
>> --- a/drivers/vfio/Kconfig
>> +++ b/drivers/vfio/Kconfig
>> @@ -3,6 +3,7 @@ menuconfig VFIO
>>   	tristate "VFIO Non-Privileged userspace driver framework"
>>   	select IOMMU_API
>>   	select VFIO_IOMMU_TYPE1 if MMU && (X86 || S390 || ARM || ARM64)
>> +	select INTERVAL_TREE
>>   	help
>>   	  VFIO provides a framework for secure userspace device drivers.
>>   	  See Documentation/driver-api/vfio.rst for more details.
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index 2efa06b1fafa..b6dabf398251 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -1862,6 +1862,11 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>>   			return -EINVAL;
>>   	}
>>   
>> +	if (vdev->vdev.log_ops && !(vdev->vdev.log_ops->log_start &&
>> +	    vdev->vdev.log_ops->log_stop &&
>> +	    vdev->vdev.log_ops->log_read_and_clear))
>> +		return -EINVAL;
>> +
>>   	/*
>>   	 * Prevent binding to PFs with VFs enabled, the VFs might be in use
>>   	 * by the host or other users.  We cannot capture the VFs if they
>> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
>> index bd84ca7c5e35..2414d827e3c8 100644
>> --- a/drivers/vfio/vfio_main.c
>> +++ b/drivers/vfio/vfio_main.c
>> @@ -32,6 +32,8 @@
>>   #include <linux/vfio.h>
>>   #include <linux/wait.h>
>>   #include <linux/sched/signal.h>
>> +#include <linux/interval_tree.h>
>> +#include <linux/iova_bitmap.h>
>>   #include "vfio.h"
>>   
>>   #define DRIVER_VERSION	"0.3"
>> @@ -1603,6 +1605,153 @@ static int vfio_ioctl_device_feature_migration(struct vfio_device *device,
>>   	return 0;
>>   }
>>   
>> +#define LOG_MAX_RANGES 1024
>> +
>> +static int
>> +vfio_ioctl_device_feature_logging_start(struct vfio_device *device,
>> +					u32 flags, void __user *arg,
>> +					size_t argsz)
>> +{
>> +	size_t minsz =
>> +		offsetofend(struct vfio_device_feature_dma_logging_control,
>> +			    ranges);
>> +	struct vfio_device_feature_dma_logging_range __user *ranges;
>> +	struct vfio_device_feature_dma_logging_control control;
>> +	struct vfio_device_feature_dma_logging_range range;
>> +	struct rb_root_cached root = RB_ROOT_CACHED;
>> +	struct interval_tree_node *nodes;
>> +	u32 nnodes;
>> +	int i, ret;
>> +
>> +	if (!device->log_ops)
>> +		return -ENOTTY;
>> +
>> +	ret = vfio_check_feature(flags, argsz,
>> +				 VFIO_DEVICE_FEATURE_SET,
>> +				 sizeof(control));
>> +	if (ret != 1)
>> +		return ret;
>> +
>> +	if (copy_from_user(&control, arg, minsz))
>> +		return -EFAULT;
>> +
>> +	nnodes = control.num_ranges;
>> +	if (!nnodes || nnodes > LOG_MAX_RANGES)
>> +		return -EINVAL;
> The latter looks more like an -E2BIG errno.

OK

> This is a hard coded
> limit, but what are the heuristics?  Can a user introspect the limit?
> Thanks,
>
> Alex

This hard-coded value is just there to prevent user space from triggering an
excessive kernel memory allocation.

We don't really expect user space to hit this limit; in our evaluation so far,
QEMU's RAM is divided into around ~12 ranges.

We may also expect user space to combine contiguous ranges into a single
range, or in the worst case even to combine non-contiguous ranges into a
single range.

We can consider moving this hard-coded value into the UAPI header, although
I'm not sure that this is really a must.

What do you think?

Yishai

>
>> +
>> +	ranges = u64_to_user_ptr(control.ranges);
>> +	nodes = kmalloc_array(nnodes, sizeof(struct interval_tree_node),
>> +			      GFP_KERNEL);
>> +	if (!nodes)
>> +		return -ENOMEM;
>> +
>> +	for (i = 0; i < nnodes; i++) {
>> +		if (copy_from_user(&range, &ranges[i], sizeof(range))) {
>> +			ret = -EFAULT;
>> +			goto end;
>> +		}
>> +		if (!IS_ALIGNED(range.iova, control.page_size) ||
>> +		    !IS_ALIGNED(range.length, control.page_size)) {
>> +			ret = -EINVAL;
>> +			goto end;
>> +		}
>> +		nodes[i].start = range.iova;
>> +		nodes[i].last = range.iova + range.length - 1;
>> +		if (interval_tree_iter_first(&root, nodes[i].start,
>> +					     nodes[i].last)) {
>> +			/* Range overlapping */
>> +			ret = -EINVAL;
>> +			goto end;
>> +		}
>> +		interval_tree_insert(nodes + i, &root);
>> +	}
>> +
>> +	ret = device->log_ops->log_start(device, &root, nnodes,
>> +					 &control.page_size);
>> +	if (ret)
>> +		goto end;
>> +
>> +	if (copy_to_user(arg, &control, sizeof(control))) {
>> +		ret = -EFAULT;
>> +		device->log_ops->log_stop(device);
>> +	}
>> +
>> +end:
>> +	kfree(nodes);
>> +	return ret;
>> +}
>> +
>> +static int
>> +vfio_ioctl_device_feature_logging_stop(struct vfio_device *device,
>> +				       u32 flags, void __user *arg,
>> +				       size_t argsz)
>> +{
>> +	int ret;
>> +
>> +	if (!device->log_ops)
>> +		return -ENOTTY;
>> +
>> +	ret = vfio_check_feature(flags, argsz,
>> +				 VFIO_DEVICE_FEATURE_SET, 0);
>> +	if (ret != 1)
>> +		return ret;
>> +
>> +	return device->log_ops->log_stop(device);
>> +}
>> +
>> +static int
>> +vfio_ioctl_device_feature_logging_report(struct vfio_device *device,
>> +					 u32 flags, void __user *arg,
>> +					 size_t argsz)
>> +{
>> +	size_t minsz =
>> +		offsetofend(struct vfio_device_feature_dma_logging_report,
>> +			    bitmap);
>> +	struct vfio_device_feature_dma_logging_report report;
>> +	struct iova_bitmap_iter iter;
>> +	int ret;
>> +
>> +	if (!device->log_ops)
>> +		return -ENOTTY;
>> +
>> +	ret = vfio_check_feature(flags, argsz,
>> +				 VFIO_DEVICE_FEATURE_GET,
>> +				 sizeof(report));
>> +	if (ret != 1)
>> +		return ret;
>> +
>> +	if (copy_from_user(&report, arg, minsz))
>> +		return -EFAULT;
>> +
>> +	if (report.page_size < PAGE_SIZE)
>> +		return -EINVAL;
>> +
>> +	iova_bitmap_init(&iter.dirty, report.iova, ilog2(report.page_size));
>> +	ret = iova_bitmap_iter_init(&iter, report.iova, report.length,
>> +				    u64_to_user_ptr(report.bitmap));
>> +	if (ret)
>> +		return ret;
>> +
>> +	for (; !iova_bitmap_iter_done(&iter);
>> +	     iova_bitmap_iter_advance(&iter)) {
>> +		ret = iova_bitmap_iter_get(&iter);
>> +		if (ret)
>> +			break;
>> +
>> +		ret = device->log_ops->log_read_and_clear(device,
>> +			iova_bitmap_iova(&iter),
>> +			iova_bitmap_length(&iter), &iter.dirty);
>> +
>> +		iova_bitmap_iter_put(&iter);
>> +
>> +		if (ret)
>> +			break;
>> +	}
>> +
>> +	iova_bitmap_iter_free(&iter);
>> +	return ret;
>> +}
>> +
>>   static int vfio_ioctl_device_feature(struct vfio_device *device,
>>   				     struct vfio_device_feature __user *arg)
>>   {
>> @@ -1636,6 +1785,18 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
>>   		return vfio_ioctl_device_feature_mig_device_state(
>>   			device, feature.flags, arg->data,
>>   			feature.argsz - minsz);
>> +	case VFIO_DEVICE_FEATURE_DMA_LOGGING_START:
>> +		return vfio_ioctl_device_feature_logging_start(
>> +			device, feature.flags, arg->data,
>> +			feature.argsz - minsz);
>> +	case VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP:
>> +		return vfio_ioctl_device_feature_logging_stop(
>> +			device, feature.flags, arg->data,
>> +			feature.argsz - minsz);
>> +	case VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT:
>> +		return vfio_ioctl_device_feature_logging_report(
>> +			device, feature.flags, arg->data,
>> +			feature.argsz - minsz);
>>   	default:
>>   		if (unlikely(!device->ops->device_feature))
>>   			return -EINVAL;
>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>> index 4d26e149db81..feed84d686ec 100644
>> --- a/include/linux/vfio.h
>> +++ b/include/linux/vfio.h
>> @@ -14,6 +14,7 @@
>>   #include <linux/workqueue.h>
>>   #include <linux/poll.h>
>>   #include <uapi/linux/vfio.h>
>> +#include <linux/iova_bitmap.h>
>>   
>>   struct kvm;
>>   
>> @@ -33,10 +34,11 @@ struct vfio_device {
>>   	struct device *dev;
>>   	const struct vfio_device_ops *ops;
>>   	/*
>> -	 * mig_ops is a static property of the vfio_device which must be set
>> -	 * prior to registering the vfio_device.
>> +	 * mig_ops/log_ops is a static property of the vfio_device which must
>> +	 * be set prior to registering the vfio_device.
>>   	 */
>>   	const struct vfio_migration_ops *mig_ops;
>> +	const struct vfio_log_ops *log_ops;
>>   	struct vfio_group *group;
>>   	struct vfio_device_set *dev_set;
>>   	struct list_head dev_set_list;
>> @@ -104,6 +106,21 @@ struct vfio_migration_ops {
>>   				   enum vfio_device_mig_state *curr_state);
>>   };
>>   
>> +/**
>> + * @log_start: Optional callback to ask the device start DMA logging.
>> + * @log_stop: Optional callback to ask the device stop DMA logging.
>> + * @log_read_and_clear: Optional callback to ask the device read
>> + *         and clear the dirty DMAs in some given range.
>> + */
>> +struct vfio_log_ops {
>> +	int (*log_start)(struct vfio_device *device,
>> +		struct rb_root_cached *ranges, u32 nnodes, u64 *page_size);
>> +	int (*log_stop)(struct vfio_device *device);
>> +	int (*log_read_and_clear)(struct vfio_device *device,
>> +		unsigned long iova, unsigned long length,
>> +		struct iova_bitmap *dirty);
>> +};
>> +
>>   /**
>>    * vfio_check_feature - Validate user input for the VFIO_DEVICE_FEATURE ioctl
>>    * @flags: Arg from the device_feature op



^ permalink raw reply	[flat|nested] 52+ messages in thread
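
For reference, a minimal userspace sketch of the start path handled above;
the control layout follows the proposed uAPI, while the helper name and the
error handling are illustrative.  Per the core code, every range must be
page_size aligned, the ranges must not overlap, and their count must stay
within the kernel's limit (LOG_MAX_RANGES in this version).

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Start device dirty tracking; on success *page_size is updated to the
 * granularity the driver actually selected.
 */
static int dma_logging_start(int device_fd,
			     struct vfio_device_feature_dma_logging_range *ranges,
			     uint32_t num_ranges, uint64_t *page_size)
{
	size_t argsz = sizeof(struct vfio_device_feature) +
		       sizeof(struct vfio_device_feature_dma_logging_control);
	struct vfio_device_feature *feature = calloc(1, argsz);
	struct vfio_device_feature_dma_logging_control *control;
	int ret;

	if (!feature)
		return -1;

	control = (void *)feature->data;
	feature->argsz = argsz;
	feature->flags = VFIO_DEVICE_FEATURE_SET |
			 VFIO_DEVICE_FEATURE_DMA_LOGGING_START;
	control->page_size = *page_size;	/* hint only */
	control->num_ranges = num_ranges;
	control->ranges = (uintptr_t)ranges;

	ret = ioctl(device_fd, VFIO_DEVICE_FEATURE, feature);
	if (!ret)
		*page_size = control->page_size;
	free(feature);
	return ret;
}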

* Re: [PATCH V2 vfio 05/11] vfio: Add an IOVA bitmap support
  2022-07-14  8:12 ` [PATCH V2 vfio 05/11] vfio: Add an IOVA bitmap support Yishai Hadas
  2022-07-18 22:30   ` Alex Williamson
@ 2022-07-19 19:01   ` Alex Williamson
  2022-07-20  1:57     ` Joao Martins
  1 sibling, 1 reply; 52+ messages in thread
From: Alex Williamson @ 2022-07-19 19:01 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: jgg, saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins,
	leonro, maorg, cohuck

On Thu, 14 Jul 2022 11:12:45 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:

> From: Joao Martins <joao.m.martins@oracle.com>
> 
> The new facility adds a bunch of wrappers that abstract how an IOVA
> range is represented in a bitmap that is granulated by a given
> page_size. So it translates all the lifting of dealing with user
> pointers into its corresponding kernel addresses backing said user
> memory into doing finally the bitmap ops to change various bits.
> 
> The formula for the bitmap is:
> 
>    data[(iova / page_size) / 64] & (1ULL << (iova % 64))
> 
> Where 64 is the number of bits in a unsigned long (depending on arch)
> 
> An example usage of these helpers for a given @iova, @page_size, @length
> and __user @data:
> 
> 	iova_bitmap_init(&iter.dirty, iova, __ffs(page_size));
> 	ret = iova_bitmap_iter_init(&iter, iova, length, data);

Why are these separate functions given this use case?

> 	if (ret)
> 		return -ENOMEM;
> 
> 	for (; !iova_bitmap_iter_done(&iter);
> 	     iova_bitmap_iter_advance(&iter)) {
> 		ret = iova_bitmap_iter_get(&iter);
> 		if (ret)
> 			break;
> 		if (dirty)
> 			iova_bitmap_set(iova_bitmap_iova(&iter),
> 					iova_bitmap_iova_length(&iter),
> 					&iter.dirty);
> 
> 		iova_bitmap_iter_put(&iter);
> 
> 		if (ret)
> 			break;

This break is unreachable.

> 	}
> 
> 	iova_bitmap_iter_free(&iter);
> 
> The facility is intended to be used for user bitmaps representing
> dirtied IOVAs by IOMMU (via IOMMUFD) and PCI Devices (via vfio-pci).
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> ---
>  drivers/vfio/Makefile       |   6 +-
>  drivers/vfio/iova_bitmap.c  | 164 ++++++++++++++++++++++++++++++++++++
>  include/linux/iova_bitmap.h |  46 ++++++++++
>  3 files changed, 214 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/vfio/iova_bitmap.c
>  create mode 100644 include/linux/iova_bitmap.h
> 
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 1a32357592e3..1d6cad32d366 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -1,9 +1,11 @@
>  # SPDX-License-Identifier: GPL-2.0
>  vfio_virqfd-y := virqfd.o
>  
> -vfio-y += vfio_main.o
> -
>  obj-$(CONFIG_VFIO) += vfio.o
> +
> +vfio-y := vfio_main.o \
> +          iova_bitmap.o \
> +
>  obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o
>  obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
>  obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
> diff --git a/drivers/vfio/iova_bitmap.c b/drivers/vfio/iova_bitmap.c
> new file mode 100644
> index 000000000000..9ad1533a6aec
> --- /dev/null
> +++ b/drivers/vfio/iova_bitmap.c
> @@ -0,0 +1,164 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2022, Oracle and/or its affiliates.
> + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#include <linux/iova_bitmap.h>
> +
> +static unsigned long iova_bitmap_iova_to_index(struct iova_bitmap_iter *iter,
> +					       unsigned long iova_length)

If we have an iova-to-index function, why do we pass it a length?  That
seems to be conflating the use cases where the caller is trying to
determine the last index for a range with the actual implementation of
this helper.

> +{
> +	unsigned long pgsize = 1 << iter->dirty.pgshift;
> +
> +	return DIV_ROUND_UP(iova_length, BITS_PER_TYPE(*iter->data) * pgsize);

ROUND_UP here doesn't make sense to me and is not symmetric with the
below index-to-iova function.  For example an iova of 0x1000 gives me an
index of 1, but an index of 1 gives me an iova of 0x40000.  Does this code
work??

> +}
> +
> +static unsigned long iova_bitmap_index_to_iova(struct iova_bitmap_iter *iter,
> +					       unsigned long index)
> +{
> +	unsigned long pgshift = iter->dirty.pgshift;
> +
> +	return (index * sizeof(*iter->data) * BITS_PER_BYTE) << pgshift;
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Isn't that BITS_PER_TYPE(*iter->data), just as in the previous function?

> +}
> +
> +static unsigned long iova_bitmap_iter_left(struct iova_bitmap_iter *iter)

I think this is trying to find "remaining" whereas "left" can be
confused with a direction.

> +{
> +	unsigned long left = iter->count - iter->offset;
> +
> +	left = min_t(unsigned long, left,
> +		     (iter->dirty.npages << PAGE_SHIFT) / sizeof(*iter->data));

Ugh, dirty.npages is zero'd on bitmap init, allocated on get and left
with stale data on put.  This really needs some documentation/theory of
operation.

> +
> +	return left;
> +}
> +
> +/*
> + * Input argument of number of bits to bitmap_set() is unsigned integer, which
> + * further casts to signed integer for unaligned multi-bit operation,
> + * __bitmap_set().
> + * Then maximum bitmap size supported is 2^31 bits divided by 2^3 bits/byte,
> + * that is 2^28 (256 MB) which maps to 2^31 * 2^12 = 2^43 (8TB) on 4K page
> + * system.
> + */

This is all true and familiar, but what's it doing here?  The type1
code this comes from uses this to justify some #defines that are used
to sanitize input.  I see no such enforcement in this code.  This is the
only comment in the whole patch and it seems irrelevant.

> +int iova_bitmap_iter_init(struct iova_bitmap_iter *iter,
> +			  unsigned long iova, unsigned long length,
> +			  u64 __user *data)
> +{
> +	struct iova_bitmap *dirty = &iter->dirty;
> +
> +	iter->data = data;
> +	iter->offset = 0;
> +	iter->count = iova_bitmap_iova_to_index(iter, length);

If this works, it's because the DIV_ROUND_UP above accounted for what
should have been an index-to-count fixup here, i.e. add one.

> +	iter->iova = iova;
> +	iter->length = length;
> +	dirty->pages = (struct page **)__get_free_page(GFP_KERNEL);
> +
> +	return !dirty->pages ? -ENOMEM : 0;
> +}
> +
> +void iova_bitmap_iter_free(struct iova_bitmap_iter *iter)
> +{
> +	struct iova_bitmap *dirty = &iter->dirty;
> +
> +	if (dirty->pages) {
> +		free_page((unsigned long)dirty->pages);
> +		dirty->pages = NULL;
> +	}
> +}
> +
> +bool iova_bitmap_iter_done(struct iova_bitmap_iter *iter)
> +{
> +	return iter->offset >= iter->count;
> +}
> +
> +unsigned long iova_bitmap_length(struct iova_bitmap_iter *iter)
> +{
> +	unsigned long max_iova = iter->dirty.iova + iter->length;
> +	unsigned long left = iova_bitmap_iter_left(iter);
> +	unsigned long iova = iova_bitmap_iova(iter);
> +
> +	left = iova_bitmap_index_to_iova(iter, left);

@left is first used for number of indexes and then for an iova range :(

> +	if (iova + left > max_iova)
> +		left -= ((iova + left) - max_iova);
> +
> +	return left;
> +}

IIUC, this is returning the iova free space in the bitmap, not the
length of the bitmap??

> +
> +unsigned long iova_bitmap_iova(struct iova_bitmap_iter *iter)
> +{
> +	unsigned long skip = iter->offset;
> +
> +	return iter->iova + iova_bitmap_index_to_iova(iter, skip);
> +}

It would help if this were defined above its usage.

> +
> +void iova_bitmap_iter_advance(struct iova_bitmap_iter *iter)
> +{
> +	unsigned long length = iova_bitmap_length(iter);
> +
> +	iter->offset += iova_bitmap_iova_to_index(iter, length);

Again, fudging an index count based on a bogus index value.

> +}
> +
> +void iova_bitmap_iter_put(struct iova_bitmap_iter *iter)
> +{
> +	struct iova_bitmap *dirty = &iter->dirty;
> +
> +	if (dirty->npages)
> +		unpin_user_pages(dirty->pages, dirty->npages);

dirty->npages = 0;?

> +}
> +
> +int iova_bitmap_iter_get(struct iova_bitmap_iter *iter)
> +{
> +	struct iova_bitmap *dirty = &iter->dirty;
> +	unsigned long npages;
> +	u64 __user *addr;
> +	long ret;
> +
> +	npages = DIV_ROUND_UP((iter->count - iter->offset) *
> +			      sizeof(*iter->data), PAGE_SIZE);
> +	npages = min(npages,  PAGE_SIZE / sizeof(struct page *));
> +	addr = iter->data + iter->offset;
> +	ret = pin_user_pages_fast((unsigned long)addr, npages,
> +				  FOLL_WRITE, dirty->pages);
> +	if (ret <= 0)
> +		return ret;
> +
> +	dirty->npages = (unsigned long)ret;
> +	dirty->iova = iova_bitmap_iova(iter);
> +	dirty->start_offset = offset_in_page(addr);
> +	return 0;
> +}
> +
> +void iova_bitmap_init(struct iova_bitmap *bitmap,
> +		      unsigned long base, unsigned long pgshift)
> +{
> +	memset(bitmap, 0, sizeof(*bitmap));
> +	bitmap->iova = base;
> +	bitmap->pgshift = pgshift;
> +}
> +
> +unsigned int iova_bitmap_set(struct iova_bitmap *dirty,
> +			     unsigned long iova,
> +			     unsigned long length)
> +{
> +	unsigned long nbits, offset, start_offset, idx, size, *kaddr;
> +
> +	nbits = max(1UL, length >> dirty->pgshift);
> +	offset = (iova - dirty->iova) >> dirty->pgshift;
> +	idx = offset / (PAGE_SIZE * BITS_PER_BYTE);
> +	offset = offset % (PAGE_SIZE * BITS_PER_BYTE);
> +	start_offset = dirty->start_offset;
> +
> +	while (nbits > 0) {
> +		kaddr = kmap_local_page(dirty->pages[idx]) + start_offset;
> +		size = min(PAGE_SIZE * BITS_PER_BYTE - offset, nbits);
> +		bitmap_set(kaddr, offset, size);
> +		kunmap_local(kaddr - start_offset);
> +		start_offset = offset = 0;
> +		nbits -= size;
> +		idx++;
> +	}
> +
> +	return nbits;
> +}
> +EXPORT_SYMBOL_GPL(iova_bitmap_set);
> +
> diff --git a/include/linux/iova_bitmap.h b/include/linux/iova_bitmap.h
> new file mode 100644
> index 000000000000..c474c351634a
> --- /dev/null
> +++ b/include/linux/iova_bitmap.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (c) 2022, Oracle and/or its affiliates.
> + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#ifndef _IOVA_BITMAP_H_
> +#define _IOVA_BITMAP_H_
> +
> +#include <linux/highmem.h>
> +#include <linux/mm.h>
> +#include <linux/uio.h>
> +
> +struct iova_bitmap {
> +	unsigned long iova;
> +	unsigned long pgshift;
> +	unsigned long start_offset;
> +	unsigned long npages;
> +	struct page **pages;
> +};
> +
> +struct iova_bitmap_iter {
> +	struct iova_bitmap dirty;
> +	u64 __user *data;
> +	size_t offset;
> +	size_t count;
> +	unsigned long iova;
> +	unsigned long length;
> +};
> +
> +int iova_bitmap_iter_init(struct iova_bitmap_iter *iter, unsigned long iova,
> +			  unsigned long length, u64 __user *data);
> +void iova_bitmap_iter_free(struct iova_bitmap_iter *iter);
> +bool iova_bitmap_iter_done(struct iova_bitmap_iter *iter);
> +unsigned long iova_bitmap_length(struct iova_bitmap_iter *iter);
> +unsigned long iova_bitmap_iova(struct iova_bitmap_iter *iter);
> +void iova_bitmap_iter_advance(struct iova_bitmap_iter *iter);
> +int iova_bitmap_iter_get(struct iova_bitmap_iter *iter);
> +void iova_bitmap_iter_put(struct iova_bitmap_iter *iter);
> +void iova_bitmap_init(struct iova_bitmap *bitmap,
> +		      unsigned long base, unsigned long pgshift);
> +unsigned int iova_bitmap_set(struct iova_bitmap *dirty,
> +			     unsigned long iova,
> +			     unsigned long length);
> +
> +#endif

No relevant comments, no theory of operation.  I found this really
difficult to review and the page handling is still not clear to me.
I'm not willing to take on maintainership of this code under
drivers/vfio/ as is.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 52+ messages in thread
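
The asymmetry called out in the review can be checked numerically.  The
standalone program below mirrors the patch's two conversion formulas under
the assumption of 64-bit bitmap words and a 4K page size (pgshift == 12);
it prints index 1 for an IOVA length of 0x1000 but IOVA 0x40000 for index 1,
matching the numbers in the review.

#include <stdio.h>

#define PGSHIFT		12UL
#define BITS_PER_U64	64UL

/* DIV_ROUND_UP(iova_length, BITS_PER_TYPE(u64) * pgsize) from the patch */
static unsigned long iova_to_index(unsigned long iova_length)
{
	unsigned long chunk = BITS_PER_U64 << PGSHIFT;

	return (iova_length + chunk - 1) / chunk;
}

/* (index * sizeof(u64) * BITS_PER_BYTE) << pgshift from the patch */
static unsigned long index_to_iova(unsigned long index)
{
	return (index * BITS_PER_U64) << PGSHIFT;
}

int main(void)
{
	printf("iova_to_index(0x1000) = %lu\n", iova_to_index(0x1000));
	printf("index_to_iova(1)      = 0x%lx\n", index_to_iova(1));
	return 0;
}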

* Re: [PATCH V2 vfio 06/11] vfio: Introduce the DMA logging feature support
  2022-07-19  9:19     ` Yishai Hadas
@ 2022-07-19 19:25       ` Alex Williamson
  2022-07-19 20:08         ` Jason Gunthorpe
  0 siblings, 1 reply; 52+ messages in thread
From: Alex Williamson @ 2022-07-19 19:25 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: jgg, saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins,
	leonro, maorg, cohuck

On Tue, 19 Jul 2022 12:19:25 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:

> On 19/07/2022 1:30, Alex Williamson wrote:
> > On Thu, 14 Jul 2022 11:12:46 +0300
> > Yishai Hadas <yishaih@nvidia.com> wrote:
> >  
> >> Introduce the DMA logging feature support in the vfio core layer.
> >>
> >> It includes the processing of the device start/stop/report DMA logging
> >> UAPIs and calling the relevant driver 'op' to do the work.
> >>
> >> Specifically,
> >> Upon start, the core translates the given input ranges into an interval
> >> tree, checks for unexpected overlapping, non aligned ranges and then
> >> pass the translated input to the driver for start tracking the given
> >> ranges.
> >>
> >> Upon report, the core translates the given input user space bitmap and
> >> page size into an IOVA kernel bitmap iterator. Then it iterates it and
> >> call the driver to set the corresponding bits for the dirtied pages in a
> >> specific IOVA range.
> >>
> >> Upon stop, the driver is called to stop the previous started tracking.
> >>
> >> The next patches from the series will introduce the mlx5 driver
> >> implementation for the logging ops.
> >>
> >> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> >> ---
> >>   drivers/vfio/Kconfig             |   1 +
> >>   drivers/vfio/pci/vfio_pci_core.c |   5 +
> >>   drivers/vfio/vfio_main.c         | 161 +++++++++++++++++++++++++++++++
> >>   include/linux/vfio.h             |  21 +++-
> >>   4 files changed, 186 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> >> index 6130d00252ed..86c381ceb9a1 100644
> >> --- a/drivers/vfio/Kconfig
> >> +++ b/drivers/vfio/Kconfig
> >> @@ -3,6 +3,7 @@ menuconfig VFIO
> >>   	tristate "VFIO Non-Privileged userspace driver framework"
> >>   	select IOMMU_API
> >>   	select VFIO_IOMMU_TYPE1 if MMU && (X86 || S390 || ARM || ARM64)
> >> +	select INTERVAL_TREE
> >>   	help
> >>   	  VFIO provides a framework for secure userspace device drivers.
> >>   	  See Documentation/driver-api/vfio.rst for more details.
> >> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> >> index 2efa06b1fafa..b6dabf398251 100644
> >> --- a/drivers/vfio/pci/vfio_pci_core.c
> >> +++ b/drivers/vfio/pci/vfio_pci_core.c
> >> @@ -1862,6 +1862,11 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
> >>   			return -EINVAL;
> >>   	}
> >>   
> >> +	if (vdev->vdev.log_ops && !(vdev->vdev.log_ops->log_start &&
> >> +	    vdev->vdev.log_ops->log_stop &&
> >> +	    vdev->vdev.log_ops->log_read_and_clear))
> >> +		return -EINVAL;
> >> +
> >>   	/*
> >>   	 * Prevent binding to PFs with VFs enabled, the VFs might be in use
> >>   	 * by the host or other users.  We cannot capture the VFs if they
> >> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> >> index bd84ca7c5e35..2414d827e3c8 100644
> >> --- a/drivers/vfio/vfio_main.c
> >> +++ b/drivers/vfio/vfio_main.c
> >> @@ -32,6 +32,8 @@
> >>   #include <linux/vfio.h>
> >>   #include <linux/wait.h>
> >>   #include <linux/sched/signal.h>
> >> +#include <linux/interval_tree.h>
> >> +#include <linux/iova_bitmap.h>
> >>   #include "vfio.h"
> >>   
> >>   #define DRIVER_VERSION	"0.3"
> >> @@ -1603,6 +1605,153 @@ static int vfio_ioctl_device_feature_migration(struct vfio_device *device,
> >>   	return 0;
> >>   }
> >>   
> >> +#define LOG_MAX_RANGES 1024
> >> +
> >> +static int
> >> +vfio_ioctl_device_feature_logging_start(struct vfio_device *device,
> >> +					u32 flags, void __user *arg,
> >> +					size_t argsz)
> >> +{
> >> +	size_t minsz =
> >> +		offsetofend(struct vfio_device_feature_dma_logging_control,
> >> +			    ranges);
> >> +	struct vfio_device_feature_dma_logging_range __user *ranges;
> >> +	struct vfio_device_feature_dma_logging_control control;
> >> +	struct vfio_device_feature_dma_logging_range range;
> >> +	struct rb_root_cached root = RB_ROOT_CACHED;
> >> +	struct interval_tree_node *nodes;
> >> +	u32 nnodes;
> >> +	int i, ret;
> >> +
> >> +	if (!device->log_ops)
> >> +		return -ENOTTY;
> >> +
> >> +	ret = vfio_check_feature(flags, argsz,
> >> +				 VFIO_DEVICE_FEATURE_SET,
> >> +				 sizeof(control));
> >> +	if (ret != 1)
> >> +		return ret;
> >> +
> >> +	if (copy_from_user(&control, arg, minsz))
> >> +		return -EFAULT;
> >> +
> >> +	nnodes = control.num_ranges;
> >> +	if (!nnodes || nnodes > LOG_MAX_RANGES)
> >> +		return -EINVAL;  
> > The latter looks more like an -E2BIG errno.  
> 
> OK
> 
> > This is a hard coded
> > limit, but what are the heuristics?  Can a user introspect the limit?
> > Thanks,
> >
> > Alex  
> 
> This hard coded value just comes to prevent user space from exploding 
> kernel memory allocation.

Of course.

> We don't really expect user space to hit this limit, the RAM in QEMU is 
> divided today to around ~12 ranges as we saw so far in our evaluation.

There can be far more for vIOMMU use cases or non-QEMU drivers.

> We may also expect user space to combine contiguous ranges to a single 
> range or in the worst case even to combine non contiguous ranges to a 
> single range.

Why do we expect that from users?
 
> We can consider moving this hard-coded value to be part of the UAPI 
> header, although, not sure that this is really a must.
> 
> What do you think ?

We're looking at a very narrow use case with implicit assumptions about
the behavior of the user driver.  Some of those assumptions need to be
exposed via the uAPI so that userspace can make reasonable choices.
Thanks,

Alex

> >> +
> >> +	ranges = u64_to_user_ptr(control.ranges);
> >> +	nodes = kmalloc_array(nnodes, sizeof(struct interval_tree_node),
> >> +			      GFP_KERNEL);
> >> +	if (!nodes)
> >> +		return -ENOMEM;
> >> +
> >> +	for (i = 0; i < nnodes; i++) {
> >> +		if (copy_from_user(&range, &ranges[i], sizeof(range))) {
> >> +			ret = -EFAULT;
> >> +			goto end;
> >> +		}
> >> +		if (!IS_ALIGNED(range.iova, control.page_size) ||
> >> +		    !IS_ALIGNED(range.length, control.page_size)) {
> >> +			ret = -EINVAL;
> >> +			goto end;
> >> +		}
> >> +		nodes[i].start = range.iova;
> >> +		nodes[i].last = range.iova + range.length - 1;
> >> +		if (interval_tree_iter_first(&root, nodes[i].start,
> >> +					     nodes[i].last)) {
> >> +			/* Range overlapping */
> >> +			ret = -EINVAL;
> >> +			goto end;
> >> +		}
> >> +		interval_tree_insert(nodes + i, &root);
> >> +	}
> >> +
> >> +	ret = device->log_ops->log_start(device, &root, nnodes,
> >> +					 &control.page_size);
> >> +	if (ret)
> >> +		goto end;
> >> +
> >> +	if (copy_to_user(arg, &control, sizeof(control))) {
> >> +		ret = -EFAULT;
> >> +		device->log_ops->log_stop(device);
> >> +	}
> >> +
> >> +end:
> >> +	kfree(nodes);
> >> +	return ret;
> >> +}
> >> +
> >> +static int
> >> +vfio_ioctl_device_feature_logging_stop(struct vfio_device *device,
> >> +				       u32 flags, void __user *arg,
> >> +				       size_t argsz)
> >> +{
> >> +	int ret;
> >> +
> >> +	if (!device->log_ops)
> >> +		return -ENOTTY;
> >> +
> >> +	ret = vfio_check_feature(flags, argsz,
> >> +				 VFIO_DEVICE_FEATURE_SET, 0);
> >> +	if (ret != 1)
> >> +		return ret;
> >> +
> >> +	return device->log_ops->log_stop(device);
> >> +}
> >> +
> >> +static int
> >> +vfio_ioctl_device_feature_logging_report(struct vfio_device *device,
> >> +					 u32 flags, void __user *arg,
> >> +					 size_t argsz)
> >> +{
> >> +	size_t minsz =
> >> +		offsetofend(struct vfio_device_feature_dma_logging_report,
> >> +			    bitmap);
> >> +	struct vfio_device_feature_dma_logging_report report;
> >> +	struct iova_bitmap_iter iter;
> >> +	int ret;
> >> +
> >> +	if (!device->log_ops)
> >> +		return -ENOTTY;
> >> +
> >> +	ret = vfio_check_feature(flags, argsz,
> >> +				 VFIO_DEVICE_FEATURE_GET,
> >> +				 sizeof(report));
> >> +	if (ret != 1)
> >> +		return ret;
> >> +
> >> +	if (copy_from_user(&report, arg, minsz))
> >> +		return -EFAULT;
> >> +
> >> +	if (report.page_size < PAGE_SIZE)
> >> +		return -EINVAL;
> >> +
> >> +	iova_bitmap_init(&iter.dirty, report.iova, ilog2(report.page_size));
> >> +	ret = iova_bitmap_iter_init(&iter, report.iova, report.length,
> >> +				    u64_to_user_ptr(report.bitmap));
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	for (; !iova_bitmap_iter_done(&iter);
> >> +	     iova_bitmap_iter_advance(&iter)) {
> >> +		ret = iova_bitmap_iter_get(&iter);
> >> +		if (ret)
> >> +			break;
> >> +
> >> +		ret = device->log_ops->log_read_and_clear(device,
> >> +			iova_bitmap_iova(&iter),
> >> +			iova_bitmap_length(&iter), &iter.dirty);
> >> +
> >> +		iova_bitmap_iter_put(&iter);
> >> +
> >> +		if (ret)
> >> +			break;
> >> +	}
> >> +
> >> +	iova_bitmap_iter_free(&iter);
> >> +	return ret;
> >> +}
> >> +
> >>   static int vfio_ioctl_device_feature(struct vfio_device *device,
> >>   				     struct vfio_device_feature __user *arg)
> >>   {
> >> @@ -1636,6 +1785,18 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
> >>   		return vfio_ioctl_device_feature_mig_device_state(
> >>   			device, feature.flags, arg->data,
> >>   			feature.argsz - minsz);
> >> +	case VFIO_DEVICE_FEATURE_DMA_LOGGING_START:
> >> +		return vfio_ioctl_device_feature_logging_start(
> >> +			device, feature.flags, arg->data,
> >> +			feature.argsz - minsz);
> >> +	case VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP:
> >> +		return vfio_ioctl_device_feature_logging_stop(
> >> +			device, feature.flags, arg->data,
> >> +			feature.argsz - minsz);
> >> +	case VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT:
> >> +		return vfio_ioctl_device_feature_logging_report(
> >> +			device, feature.flags, arg->data,
> >> +			feature.argsz - minsz);
> >>   	default:
> >>   		if (unlikely(!device->ops->device_feature))
> >>   			return -EINVAL;
> >> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> >> index 4d26e149db81..feed84d686ec 100644
> >> --- a/include/linux/vfio.h
> >> +++ b/include/linux/vfio.h
> >> @@ -14,6 +14,7 @@
> >>   #include <linux/workqueue.h>
> >>   #include <linux/poll.h>
> >>   #include <uapi/linux/vfio.h>
> >> +#include <linux/iova_bitmap.h>
> >>   
> >>   struct kvm;
> >>   
> >> @@ -33,10 +34,11 @@ struct vfio_device {
> >>   	struct device *dev;
> >>   	const struct vfio_device_ops *ops;
> >>   	/*
> >> -	 * mig_ops is a static property of the vfio_device which must be set
> >> -	 * prior to registering the vfio_device.
> >> +	 * mig_ops/log_ops is a static property of the vfio_device which must
> >> +	 * be set prior to registering the vfio_device.
> >>   	 */
> >>   	const struct vfio_migration_ops *mig_ops;
> >> +	const struct vfio_log_ops *log_ops;
> >>   	struct vfio_group *group;
> >>   	struct vfio_device_set *dev_set;
> >>   	struct list_head dev_set_list;
> >> @@ -104,6 +106,21 @@ struct vfio_migration_ops {
> >>   				   enum vfio_device_mig_state *curr_state);
> >>   };
> >>   
> >> +/**
> >> + * @log_start: Optional callback to ask the device start DMA logging.
> >> + * @log_stop: Optional callback to ask the device stop DMA logging.
> >> + * @log_read_and_clear: Optional callback to ask the device read
> >> + *         and clear the dirty DMAs in some given range.
> >> + */
> >> +struct vfio_log_ops {
> >> +	int (*log_start)(struct vfio_device *device,
> >> +		struct rb_root_cached *ranges, u32 nnodes, u64 *page_size);
> >> +	int (*log_stop)(struct vfio_device *device);
> >> +	int (*log_read_and_clear)(struct vfio_device *device,
> >> +		unsigned long iova, unsigned long length,
> >> +		struct iova_bitmap *dirty);
> >> +};
> >> +
> >>   /**
> >>    * vfio_check_feature - Validate user input for the VFIO_DEVICE_FEATURE ioctl
> >>    * @flags: Arg from the device_feature op  
> 
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread
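
The range-combining that the thread expects from userspace can be done with
a small helper like the sketch below; it assumes the ranges use the struct
from the proposed uAPI and are already sorted by iova, and it is only meant
to illustrate staying under a kernel-imposed range limit.

#include <stdint.h>
#include <linux/vfio.h>

/* Merge sorted ranges that touch or overlap in place; returns the new count. */
static uint32_t coalesce_ranges(struct vfio_device_feature_dma_logging_range *r,
				uint32_t n)
{
	uint32_t out = 0, i;

	for (i = 0; i < n; i++) {
		if (out && r[i].iova <= r[out - 1].iova + r[out - 1].length) {
			uint64_t end = r[i].iova + r[i].length;

			if (end > r[out - 1].iova + r[out - 1].length)
				r[out - 1].length = end - r[out - 1].iova;
		} else {
			r[out++] = r[i];
		}
	}

	return out;
}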

* Re: [PATCH V2 vfio 03/11] vfio: Introduce DMA logging uAPIs
  2022-07-19  7:49     ` Yishai Hadas
@ 2022-07-19 19:57       ` Alex Williamson
  2022-07-19 20:18         ` Jason Gunthorpe
  0 siblings, 1 reply; 52+ messages in thread
From: Alex Williamson @ 2022-07-19 19:57 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: jgg, saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins,
	leonro, maorg, cohuck, Kirti Wankhede

On Tue, 19 Jul 2022 10:49:42 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:

> On 19/07/2022 1:29, Alex Williamson wrote:
> > On Thu, 14 Jul 2022 11:12:43 +0300
> > Yishai Hadas <yishaih@nvidia.com> wrote:
> >  
> >> DMA logging allows a device to internally record what DMAs the device is
> >> initiating and report them back to userspace. It is part of the VFIO
> >> migration infrastructure that allows implementing dirty page tracking
> >> during the pre copy phase of live migration. Only DMA WRITEs are logged,
> >> and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
> >>
> >> This patch introduces the DMA logging involved uAPIs.
> >>
> >> It uses the FEATURE ioctl with its GET/SET/PROBE options as of below.
> >>
> >> It exposes a PROBE option to detect if the device supports DMA logging.
> >> It exposes a SET option to start device DMA logging in given IOVAs
> >> ranges.
> >> It exposes a SET option to stop device DMA logging that was previously
> >> started.
> >> It exposes a GET option to read back and clear the device DMA log.
> >>
> >> Extra details exist as part of vfio.h per a specific option.  
> >
> > Kevin, Kirti, others, any comments on this uAPI proposal?  Are there
> > potentially other devices that might make use of this or is everyone
> > else waiting for IOMMU based dirty tracking?
> >
> >     
> >> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> >> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> >> ---
> >>   include/uapi/linux/vfio.h | 79 +++++++++++++++++++++++++++++++++++++++
> >>   1 file changed, 79 insertions(+)
> >>
> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index 733a1cddde30..81475c3e7c92 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h
> >> @@ -986,6 +986,85 @@ enum vfio_device_mig_state {
> >>   	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
> >>   };
> >>   
> >> +/*
> >> + * Upon VFIO_DEVICE_FEATURE_SET start device DMA logging.
> >> + * VFIO_DEVICE_FEATURE_PROBE can be used to detect if the device supports
> >> + * DMA logging.
> >> + *
> >> + * DMA logging allows a device to internally record what DMAs the device is
> >> + * initiating and report them back to userspace. It is part of the VFIO
> >> + * migration infrastructure that allows implementing dirty page tracking
> >> + * during the pre copy phase of live migration. Only DMA WRITEs are logged,
> >> + * and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
> >> + *
> >> + * When DMA logging is started a range of IOVAs to monitor is provided and the
> >> + * device can optimize its logging to cover only the IOVA range given. Each
> >> + * DMA that the device initiates inside the range will be logged by the device
> >> + * for later retrieval.
> >> + *
> >> + * page_size is an input that hints what tracking granularity the device
> >> + * should try to achieve. If the device cannot do the hinted page size then it
> >> + * should pick the next closest page size it supports. On output the device
> >> + * will return the page size it selected.
> >> + *
> >> + * ranges is a pointer to an array of
> >> + * struct vfio_device_feature_dma_logging_range.
> >> + */
> >> +struct vfio_device_feature_dma_logging_control {
> >> +	__aligned_u64 page_size;
> >> +	__u32 num_ranges;
> >> +	__u32 __reserved;
> >> +	__aligned_u64 ranges;
> >> +};
> >> +
> >> +struct vfio_device_feature_dma_logging_range {
> >> +	__aligned_u64 iova;
> >> +	__aligned_u64 length;
> >> +};
> >> +
> >> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_START 3
> >> +
> >> +/*
> >> + * Upon VFIO_DEVICE_FEATURE_SET stop device DMA logging that was started
> >> + * by VFIO_DEVICE_FEATURE_DMA_LOGGING_START
> >> + */
> >> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP 4
> >> +
> >> +/*
> >> + * Upon VFIO_DEVICE_FEATURE_GET read back and clear the device DMA log
> >> + *
> >> + * Query the device's DMA log for written pages within the given IOVA range.
> >> + * During querying the log is cleared for the IOVA range.
> >> + *
> >> + * bitmap is a pointer to an array of u64s that will hold the output bitmap
> >> + * with 1 bit reporting a page_size unit of IOVA. The mapping of IOVA to bits
> >> + * is given by:
> >> + *  bitmap[(addr - iova)/page_size] & (1ULL << (addr % 64))
> >> + *
> >> + * The input page_size can be any power of two value and does not have to
> >> + * match the value given to VFIO_DEVICE_FEATURE_DMA_LOGGING_START. The driver
> >> + * will format its internal logging to match the reporting page size, possibly
> >> + * by replicating bits if the internal page size is lower than requested.
> >> + *
> >> + * Bits will be updated in bitmap using atomic or to allow userspace to

s/or/OR/

> >> + * combine bitmaps from multiple trackers together. Therefore userspace must
> >> + * zero the bitmap before doing any reports.  
> > Somewhat confusing, perhaps "between report sets"?  
> 
> The idea was that the driver just turns on its own dirty bits and 
> doesn't touch others.

Right, we can aggregate dirty bits from multiple devices into a single
bitmap.

> Do you suggest the below ?
> 
> "Therefore userspace must zero the bitmap between report sets".

It may be best to simply drop this guidance; we don't need to presume
the user algorithm, we only need to make it apparent that
LOGGING_REPORT will only set bits in the bitmap and never clear or
perform any initialization of the user-provided bitmap.

> >> + *
> >> + * If any error is returned userspace should assume that the dirty log is
> >> + * corrupted and restart.  
> > Restart what?  The user can't just zero the bitmap and retry, dirty
> > information at the device has been lost.  
> 
> Right
> 
> >   Are we suggesting they stop
> > DMA logging and restart it, which sounds a lot like failing a migration
> > and starting over.  Or could the user gratuitously mark the bitmap
> > fully dirty and a subsequent logging report iteration might work?
> > Thanks,
> >
> > Alex  
> 
> An error at that step is not expected and might be fatal.
> 
> User space can consider marking all as dirty and continue with that 
> approach for next iterations, maybe even without calling the driver.
> 
> Alternatively, user space can abort the migration and retry later on.
> 
> We can come with some rephrasing as of the above.
> 
> What do you think ?

If userspace needs to consider the bitmap undefined for any errno,
that's a pretty serious usage restriction that may negate the
practical utility of atomically OR'ing in dirty bits.  We can certainly
have EINVAL, ENOTTY, EFAULT, E2BIG, ENOMEM conditions that don't result
in a corrupted/undefined bitmap, right?  Maybe some of those result in
an incomplete bitmap, but how does the bitmap actually get corrupted?
It seems like such a condition should be pretty narrowly defined and
separate from errors resulting in an incomplete bitmap; maybe we'd
reserve -EIO for such a case.  The driver can also gratuitously
mark ranges dirty itself if it loses sync with the device, and can
probably do so at a much more accurate granularity than userspace.
Thanks,

Alex

> >> + *
> >> + * If DMA logging is not enabled, an error will be returned.
> >> + *
> >> + */
> >> +struct vfio_device_feature_dma_logging_report {
> >> +	__aligned_u64 iova;
> >> +	__aligned_u64 length;
> >> +	__aligned_u64 page_size;
> >> +	__aligned_u64 bitmap;
> >> +};
> >> +
> >> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT 5
> >> +
> >>   /* -------- API for Type1 VFIO IOMMU -------- */
> >>   
> >>   /**  
> 
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH V2 vfio 06/11] vfio: Introduce the DMA logging feature support
  2022-07-19 19:25       ` Alex Williamson
@ 2022-07-19 20:08         ` Jason Gunthorpe
  2022-07-21  8:54           ` Tian, Kevin
  0 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2022-07-19 20:08 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, saeedm, kvm, netdev, kuba, kevin.tian,
	joao.m.martins, leonro, maorg, cohuck

On Tue, Jul 19, 2022 at 01:25:14PM -0600, Alex Williamson wrote:

> > We don't really expect user space to hit this limit, the RAM in QEMU is 
> > divided today to around ~12 ranges as we saw so far in our evaluation.
> 
> There can be far more for vIOMMU use cases or non-QEMU drivers.

Not really, it isn't dynamic so vIOMMU has to decide what it wants to
track up front. It would never make sense to track based on what is
currently mapped. So it will be some small list, probably a big linear
chunk of the IOVA space.

No idea why non-QEMU cases would need to be so different.

> We're looking at a very narrow use case with implicit assumptions about
> the behavior of the user driver.  Some of those assumptions need to be
> exposed via the uAPI so that userspace can make reasonable choices.

I think we need to see a clear use case for more than a few tens of
ranges before we complicate things. I don't see one. If one does crop
up someday, it is easy to add a new query or some other behavior.

Remember, the devices can't handle huge numbers of ranges anyhow, so
any userspace should not be designed to have more than a few tens in
the first place.

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH V2 vfio 03/11] vfio: Introduce DMA logging uAPIs
  2022-07-19 19:57       ` Alex Williamson
@ 2022-07-19 20:18         ` Jason Gunthorpe
  0 siblings, 0 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2022-07-19 20:18 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, saeedm, kvm, netdev, kuba, kevin.tian,
	joao.m.martins, leonro, maorg, cohuck, Kirti Wankhede

On Tue, Jul 19, 2022 at 01:57:14PM -0600, Alex Williamson wrote:

> If userspace needs to consider the bitmap undefined for any errno,
> that's a pretty serious usage restriction that may negate the
> practical utility of atomically OR'ing in dirty bits.  

If any report fails it means qemu has lost the ability to dirty track,
so it doesn't really matter if one chunk has gotten lost - we can't
move forward with incomplete dirty data.

Error recovery is to consider all memory dirty and try to restart the
dirty tracking, or to abort/restart the whole migration.

Worrying about the integrity of a single bitmap chunk doesn't seem
worthwhile given that outcome.

Yes, there are cases where things are more deterministic, but it is
not useful information that userspace can do anything with. It can't
just call report again and expect it will work the 2nd time, for
instance.

We do not expect failures here, let's not overdesign things.

> reserve -EIO for such a case.  The driver itself can also gratuitously
> mark ranges dirty itself if it loses sync with the device, and can
> probably do so at a much more accurate granularity than userspace.

A driver that implements such a handling shouldn't return an error
code in the first place.

Jason


* Re: [PATCH V2 vfio 05/11] vfio: Add an IOVA bitmap support
  2022-07-19 19:01   ` Alex Williamson
@ 2022-07-20  1:57     ` Joao Martins
  2022-07-20 16:47       ` Alex Williamson
  0 siblings, 1 reply; 52+ messages in thread
From: Joao Martins @ 2022-07-20  1:57 UTC (permalink / raw)
  To: Alex Williamson, Yishai Hadas
  Cc: jgg, saeedm, kvm, netdev, kuba, kevin.tian, leonro, maorg, cohuck

On 7/19/22 20:01, Alex Williamson wrote:
> On Thu, 14 Jul 2022 11:12:45 +0300
> Yishai Hadas <yishaih@nvidia.com> wrote:
> 
>> From: Joao Martins <joao.m.martins@oracle.com>
>>
>> The new facility adds a bunch of wrappers that abstract how an IOVA
>> range is represented in a bitmap that is granulated by a given
>> page_size. So it translates all the lifting of dealing with user
>> pointers into its corresponding kernel addresses backing said user
>> memory into doing finally the bitmap ops to change various bits.
>>
>> The formula for the bitmap is:
>>
>>    data[(iova / page_size) / 64] & (1ULL << (iova % 64))
>>
>> Where 64 is the number of bits in a unsigned long (depending on arch)
>>
>> An example usage of these helpers for a given @iova, @page_size, @length
>> and __user @data:
>>
>> 	iova_bitmap_init(&iter.dirty, iova, __ffs(page_size));
>> 	ret = iova_bitmap_iter_init(&iter, iova, length, data);
> 
> Why are these separate functions given this use case?
> 
Because one structure (struct iova_bitmap) represents the user-facing
part i.e. the one setting dirty bits (e.g. the iommu driver or mlx5 vfio)
and the other represents the iterator of said IOVA bitmap. The iterator
does all the work while the bitmap user is the one marshalling dirty
bits from vendor structure into the iterator-prepared iova_bitmap
(using iova_bitmap_set).

It made sense to me to separate the two initializations, but in practice
both iterator cases (IOMMUFD and VFIO) initialize in the same way.
Maybe it's better to merge them for now, considering that it is redundant
to retain this added complexity.

>> 	if (ret)
>> 		return -ENOMEM;
>>
>> 	for (; !iova_bitmap_iter_done(&iter);
>> 	     iova_bitmap_iter_advance(&iter)) {
>> 		ret = iova_bitmap_iter_get(&iter);
>> 		if (ret)
>> 			break;
>> 		if (dirty)
>> 			iova_bitmap_set(iova_bitmap_iova(&iter),
>> 					iova_bitmap_iova_length(&iter),
>> 					&iter.dirty);
>>
>> 		iova_bitmap_iter_put(&iter);
>>
>> 		if (ret)
>> 			break;
> 
> This break is unreachable.
> 
I'll remove it.

>> 	}
>>
>> 	iova_bitmap_iter_free(&iter);
>>
>> The facility is intended to be used for user bitmaps representing
>> dirtied IOVAs by IOMMU (via IOMMUFD) and PCI Devices (via vfio-pci).
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>> ---
>>  drivers/vfio/Makefile       |   6 +-
>>  drivers/vfio/iova_bitmap.c  | 164 ++++++++++++++++++++++++++++++++++++
>>  include/linux/iova_bitmap.h |  46 ++++++++++
>>  3 files changed, 214 insertions(+), 2 deletions(-)
>>  create mode 100644 drivers/vfio/iova_bitmap.c
>>  create mode 100644 include/linux/iova_bitmap.h
>>
>> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
>> index 1a32357592e3..1d6cad32d366 100644
>> --- a/drivers/vfio/Makefile
>> +++ b/drivers/vfio/Makefile
>> @@ -1,9 +1,11 @@
>>  # SPDX-License-Identifier: GPL-2.0
>>  vfio_virqfd-y := virqfd.o
>>  
>> -vfio-y += vfio_main.o
>> -
>>  obj-$(CONFIG_VFIO) += vfio.o
>> +
>> +vfio-y := vfio_main.o \
>> +          iova_bitmap.o \
>> +
>>  obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o
>>  obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
>>  obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>> diff --git a/drivers/vfio/iova_bitmap.c b/drivers/vfio/iova_bitmap.c
>> new file mode 100644
>> index 000000000000..9ad1533a6aec
>> --- /dev/null
>> +++ b/drivers/vfio/iova_bitmap.c
>> @@ -0,0 +1,164 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (c) 2022, Oracle and/or its affiliates.
>> + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
>> + */
>> +
>> +#include <linux/iova_bitmap.h>
>> +
>> +static unsigned long iova_bitmap_iova_to_index(struct iova_bitmap_iter *iter,
>> +					       unsigned long iova_length)
> 
> If we have an iova-to-index function, why do we pass it a length?  That
> seems to be conflating the use cases where the caller is trying to
> determine the last index for a range with the actual implementation of
> this helper.
> 

see below

>> +{
>> +	unsigned long pgsize = 1 << iter->dirty.pgshift;
>> +
>> +	return DIV_ROUND_UP(iova_length, BITS_PER_TYPE(*iter->data) * pgsize);
> 
> ROUND_UP here doesn't make sense to me and is not symmetric with the
> below index-to-iova function.  For example an iova of 0x1000 give me an
> index of 1, but index of 1 gives me an iova of 0x40000.  Does this code
> work??
> 
It does work. The functions aren't actually symmetric: iova_to_index() returns the
number of elements, based on bits-per-u64/page_size, for an IOVA length. I was being
defensive to avoid having to fix up IOVAs, given that all computations can be done
with lengths/nr-elements.

I have been reworking the original IOMMUFD version this originated from, and these are
remnants of working over chunks of bitmaps/IOVA rather than treating the bitmap as an
array. But the latter is what I was aiming at in terms of structure. I should make these
symmetric, actually return an index, and fully adhere to that symmetry as a convention.

Thus I will remove the DIV_ROUND_UP here, switch it to work on an IOVA instead of a length,
and adjust the necessary off-by-one/+1 at its respective call sites. Sorry for the
confusion this has caused.

>> +}
>> +
>> +static unsigned long iova_bitmap_index_to_iova(struct iova_bitmap_iter *iter,
>> +					       unsigned long index)
>> +{
>> +	unsigned long pgshift = iter->dirty.pgshift;
>> +
>> +	return (index * sizeof(*iter->data) * BITS_PER_BYTE) << pgshift;
>                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> Isn't that BITS_PER_TYPE(*iter->data), just as in the previous function?
> 
Yeap, I'll switch to that.

>> +}
>> +
>> +static unsigned long iova_bitmap_iter_left(struct iova_bitmap_iter *iter)
> 
> I think this is trying to find "remaining" whereas "left" can be
> confused with a direction.
> 
Yes, it was bad English on my end. I'll replace it with @remaining.

>> +{
>> +	unsigned long left = iter->count - iter->offset;
>> +
>> +	left = min_t(unsigned long, left,
>> +		     (iter->dirty.npages << PAGE_SHIFT) / sizeof(*iter->data));
> 
> Ugh, dirty.npages is zero'd on bitmap init, allocated on get and left
> with stale data on put.  This really needs some documentation/theory of
> operation.
> 

So get and put are always paired, and their function is to pin a chunk
of the bitmap (up to 2M of it, which is what the struct page pointers
that fit in one base page can cover) and initialize the iova_bitmap with
the info on which part of the IOVA space those bitmap pages represent.

So while @npages is left stale after put(), its value is only ever useful
after get() (i.e. after pinning). Its purpose is to cap the max pages
we can access from the bitmap, e.g. for calculating
iova_bitmap_length()/iova_bitmap_iter_left()
or advancing the iterator.
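
To put rough numbers on that window (assuming 4K pages and 8-byte struct
page pointers):

	4K page of struct page pointers   -> 512 pinnable bitmap pages
	512 pages * 4K                    -> 2M of bitmap memory
	2M * 8 bits * 4K tracked per bit  -> 64G of IOVA per pinned window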

>> +
>> +	return left;
>> +}
>> +
>> +/*
>> + * Input argument of number of bits to bitmap_set() is unsigned integer, which
>> + * further casts to signed integer for unaligned multi-bit operation,
>> + * __bitmap_set().
>> + * Then maximum bitmap size supported is 2^31 bits divided by 2^3 bits/byte,
>> + * that is 2^28 (256 MB) which maps to 2^31 * 2^12 = 2^43 (8TB) on 4K page
>> + * system.
>> + */
> 
> This is all true and familiar, but what's it doing here?  The type1
> code this comes from uses this to justify some #defines that are used
> to sanitize input.  I see no such enforcement in this code.  The only
> comment in this whole patch and it seems irrelevant.
> 
This was previously related to macros I had here that serve the same purpose
as the ones in VFIO, but that validation ended up being done in some other way
and I accidentally left this comment stale. I'll remove it.

>> +int iova_bitmap_iter_init(struct iova_bitmap_iter *iter,
>> +			  unsigned long iova, unsigned long length,
>> +			  u64 __user *data)
>> +{
>> +	struct iova_bitmap *dirty = &iter->dirty;
>> +
>> +	iter->data = data;
>> +	iter->offset = 0;
>> +	iter->count = iova_bitmap_iova_to_index(iter, length);
> 
> If this works, it's because the DIV_ROUND_UP above accounted for what
> should have been and index-to-count fixup here, ie. add one.
> 
As mentioned earlier, I'll change that to the suggestion above.

>> +	iter->iova = iova;
>> +	iter->length = length;
>> +	dirty->pages = (struct page **)__get_free_page(GFP_KERNEL);
>> +
>> +	return !dirty->pages ? -ENOMEM : 0;
>> +}
>> +
>> +void iova_bitmap_iter_free(struct iova_bitmap_iter *iter)
>> +{
>> +	struct iova_bitmap *dirty = &iter->dirty;
>> +
>> +	if (dirty->pages) {
>> +		free_page((unsigned long)dirty->pages);
>> +		dirty->pages = NULL;
>> +	}
>> +}
>> +
>> +bool iova_bitmap_iter_done(struct iova_bitmap_iter *iter)
>> +{
>> +	return iter->offset >= iter->count;
>> +}
>> +
>> +unsigned long iova_bitmap_length(struct iova_bitmap_iter *iter)
>> +{
>> +	unsigned long max_iova = iter->dirty.iova + iter->length;
>> +	unsigned long left = iova_bitmap_iter_left(iter);
>> +	unsigned long iova = iova_bitmap_iova(iter);
>> +
>> +	left = iova_bitmap_index_to_iova(iter, left);
> 
> @left is first used for number of indexes and then for an iova range :(
> 
I was trying to avoid an extra variable and an extra long line.

>> +	if (iova + left > max_iova)
>> +		left -= ((iova + left) - max_iova);
>> +
>> +	return left;
>> +}
> 
> IIUC, this is returning the iova free space in the bitmap, not the
> length of the bitmap??
> 
This essentially represents your bitmap working set, IOW the length
of the *pinned* bitmap, not the size of the whole bitmap.

>> +
>> +unsigned long iova_bitmap_iova(struct iova_bitmap_iter *iter)
>> +{
>> +	unsigned long skip = iter->offset;
>> +
>> +	return iter->iova + iova_bitmap_index_to_iova(iter, skip);
>> +}
> 
> It would help if this were defined above it's usage above.
> 
I'll move it.

>> +
>> +void iova_bitmap_iter_advance(struct iova_bitmap_iter *iter)
>> +{
>> +	unsigned long length = iova_bitmap_length(iter);
>> +
>> +	iter->offset += iova_bitmap_iova_to_index(iter, length);
> 
> Again, fudging an index count based on bogus index value.
> 
As mentioned earlier, I'll change that iova_bitmap_iova_to_index()
to return an index instead of nr of elements.

>> +}
>> +
>> +void iova_bitmap_iter_put(struct iova_bitmap_iter *iter)
>> +{
>> +	struct iova_bitmap *dirty = &iter->dirty;
>> +
>> +	if (dirty->npages)
>> +		unpin_user_pages(dirty->pages, dirty->npages);
> 
> dirty->npages = 0;?
> 
Sadly no, because after iova_bitmap_iter_put() we will call
iova_bitmap_iter_advance() to go to the next chunk of the bitmap
(i.e. the next 64G of IOVA, or IOW the next 2M of bitmap memory).

I could remove the explicit calls to iova_bitmap_iter_{get,put}(),
making them internal only, and merge them into iova_bitmap_iter_advance()
and iova_bitmap_iter_init(). This should be a bit simpler for the API user
and I would be able to clear npages here. Let me see how this looks.

>> +}
>> +
>> +int iova_bitmap_iter_get(struct iova_bitmap_iter *iter)
>> +{
>> +	struct iova_bitmap *dirty = &iter->dirty;
>> +	unsigned long npages;
>> +	u64 __user *addr;
>> +	long ret;
>> +
>> +	npages = DIV_ROUND_UP((iter->count - iter->offset) *
>> +			      sizeof(*iter->data), PAGE_SIZE);
>> +	npages = min(npages,  PAGE_SIZE / sizeof(struct page *));
>> +	addr = iter->data + iter->offset;
>> +	ret = pin_user_pages_fast((unsigned long)addr, npages,
>> +				  FOLL_WRITE, dirty->pages);
>> +	if (ret <= 0)
>> +		return ret;
>> +
>> +	dirty->npages = (unsigned long)ret;
>> +	dirty->iova = iova_bitmap_iova(iter);
>> +	dirty->start_offset = offset_in_page(addr);
>> +	return 0;
>> +}
>> +
>> +void iova_bitmap_init(struct iova_bitmap *bitmap,
>> +		      unsigned long base, unsigned long pgshift)
>> +{
>> +	memset(bitmap, 0, sizeof(*bitmap));
>> +	bitmap->iova = base;
>> +	bitmap->pgshift = pgshift;
>> +}
>> +
>> +unsigned int iova_bitmap_set(struct iova_bitmap *dirty,
>> +			     unsigned long iova,
>> +			     unsigned long length)
>> +{
>> +	unsigned long nbits, offset, start_offset, idx, size, *kaddr;
>> +
>> +	nbits = max(1UL, length >> dirty->pgshift);
>> +	offset = (iova - dirty->iova) >> dirty->pgshift;
>> +	idx = offset / (PAGE_SIZE * BITS_PER_BYTE);
>> +	offset = offset % (PAGE_SIZE * BITS_PER_BYTE);
>> +	start_offset = dirty->start_offset;
>> +
>> +	while (nbits > 0) {
>> +		kaddr = kmap_local_page(dirty->pages[idx]) + start_offset;
>> +		size = min(PAGE_SIZE * BITS_PER_BYTE - offset, nbits);
>> +		bitmap_set(kaddr, offset, size);
>> +		kunmap_local(kaddr - start_offset);
>> +		start_offset = offset = 0;
>> +		nbits -= size;
>> +		idx++;
>> +	}
>> +
>> +	return nbits;
>> +}
>> +EXPORT_SYMBOL_GPL(iova_bitmap_set);
>> +
>> diff --git a/include/linux/iova_bitmap.h b/include/linux/iova_bitmap.h
>> new file mode 100644
>> index 000000000000..c474c351634a
>> --- /dev/null
>> +++ b/include/linux/iova_bitmap.h
>> @@ -0,0 +1,46 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/*
>> + * Copyright (c) 2022, Oracle and/or its affiliates.
>> + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
>> + */
>> +
>> +#ifndef _IOVA_BITMAP_H_
>> +#define _IOVA_BITMAP_H_
>> +
>> +#include <linux/highmem.h>
>> +#include <linux/mm.h>
>> +#include <linux/uio.h>
>> +
>> +struct iova_bitmap {
>> +	unsigned long iova;
>> +	unsigned long pgshift;
>> +	unsigned long start_offset;
>> +	unsigned long npages;
>> +	struct page **pages;
>> +};
>> +
>> +struct iova_bitmap_iter {
>> +	struct iova_bitmap dirty;
>> +	u64 __user *data;
>> +	size_t offset;
>> +	size_t count;
>> +	unsigned long iova;
>> +	unsigned long length;
>> +};
>> +
>> +int iova_bitmap_iter_init(struct iova_bitmap_iter *iter, unsigned long iova,
>> +			  unsigned long length, u64 __user *data);
>> +void iova_bitmap_iter_free(struct iova_bitmap_iter *iter);
>> +bool iova_bitmap_iter_done(struct iova_bitmap_iter *iter);
>> +unsigned long iova_bitmap_length(struct iova_bitmap_iter *iter);
>> +unsigned long iova_bitmap_iova(struct iova_bitmap_iter *iter);
>> +void iova_bitmap_iter_advance(struct iova_bitmap_iter *iter);
>> +int iova_bitmap_iter_get(struct iova_bitmap_iter *iter);
>> +void iova_bitmap_iter_put(struct iova_bitmap_iter *iter);
>> +void iova_bitmap_init(struct iova_bitmap *bitmap,
>> +		      unsigned long base, unsigned long pgshift);
>> +unsigned int iova_bitmap_set(struct iova_bitmap *dirty,
>> +			     unsigned long iova,
>> +			     unsigned long length);
>> +
>> +#endif
> 
> No relevant comments, no theory of operation.  I found this really
> difficult to review and the page handling is still not clear to me.
> I'm not willing to take on maintainership of this code under
> drivers/vfio/ as is. 

Sorry for the lack of comments/docs and lack of clarity in some of the
functions. I'll document all functions/fields and add a comment block at
the top explaining the theory of how it should be used/how it works,
alongside the improvements you suggested above.

Meanwhile, what is less clear to you on the page handling? We are essentially
calculating the number of pages based on @offset and @count and then prepping
the iova_bitmap (@dirty) with the base IOVA and page offset. iova_bitmap_set()
then computes where it should start setting bits, and then kmap()s each page
and sets said bits. So far I am not caching the kmap() kaddr,
so the majority of iova_bitmap_set()'s complexity comes from iterating over the
number of bits to kmap and accounting for the offset of the user bitmap address.
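
To make the indexing concrete, it is roughly the following (a sketch
assuming a 4K tracking granularity; the names just illustrate the fields
used above):

static void dirty_bit_position(unsigned long window_iova, unsigned long iova,
			       unsigned long *page_idx, unsigned long *bit_in_page)
{
	unsigned long nr = (iova - window_iova) >> 12;	/* bit number */

	*page_idx = nr / (4096 * 8);	/* which pinned bitmap page */
	*bit_in_page = nr % (4096 * 8);	/* which bit inside that page */
	/*
	 * The first page is additionally offset by ->start_offset bytes,
	 * since the user-supplied bitmap pointer does not have to be page
	 * aligned.
	 */
}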

	Joao


* Re: [PATCH V2 vfio 05/11] vfio: Add an IOVA bitmap support
  2022-07-20  1:57     ` Joao Martins
@ 2022-07-20 16:47       ` Alex Williamson
  2022-07-20 17:27         ` Jason Gunthorpe
  2022-07-20 18:16         ` Joao Martins
  0 siblings, 2 replies; 52+ messages in thread
From: Alex Williamson @ 2022-07-20 16:47 UTC (permalink / raw)
  To: Joao Martins
  Cc: Yishai Hadas, jgg, saeedm, kvm, netdev, kuba, kevin.tian, leonro,
	maorg, cohuck

On Wed, 20 Jul 2022 02:57:24 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 7/19/22 20:01, Alex Williamson wrote:
> > On Thu, 14 Jul 2022 11:12:45 +0300
> > Yishai Hadas <yishaih@nvidia.com> wrote:
> >   
> >> From: Joao Martins <joao.m.martins@oracle.com>
> >>
> >> The new facility adds a bunch of wrappers that abstract how an IOVA
> >> range is represented in a bitmap that is granulated by a given
> >> page_size. So it translates all the lifting of dealing with user
> >> pointers into its corresponding kernel addresses backing said user
> >> memory into doing finally the bitmap ops to change various bits.
> >>
> >> The formula for the bitmap is:
> >>
> >>    data[(iova / page_size) / 64] & (1ULL << (iova % 64))
> >>
> >> Where 64 is the number of bits in a unsigned long (depending on arch)
> >>
> >> An example usage of these helpers for a given @iova, @page_size, @length
> >> and __user @data:
> >>
> >> 	iova_bitmap_init(&iter.dirty, iova, __ffs(page_size));
> >> 	ret = iova_bitmap_iter_init(&iter, iova, length, data);  
> > 
> > Why are these separate functions given this use case?
> >   
> Because one structure (struct iova_bitmap) represents the user-facing
> part i.e. the one setting dirty bits (e.g. the iommu driver or mlx5 vfio)
> and the other represents the iterator of said IOVA bitmap. The iterator
> does all the work while the bitmap user is the one marshalling dirty
> bits from vendor structure into the iterator-prepared iova_bitmap
> (using iova_bitmap_set).
> 
> It made sense to me to separate the two initializations, but in pratice
> both iterator cases (IOMMUFD and VFIO) are initializing in the same way.
> Maybe better merge them for now, considering that it is redundant to retain
> this added complexity.
> 
> >> 	if (ret)
> >> 		return -ENOMEM;
> >>
> >> 	for (; !iova_bitmap_iter_done(&iter);
> >> 	     iova_bitmap_iter_advance(&iter)) {
> >> 		ret = iova_bitmap_iter_get(&iter);
> >> 		if (ret)
> >> 			break;
> >> 		if (dirty)
> >> 			iova_bitmap_set(iova_bitmap_iova(&iter),
> >> 					iova_bitmap_iova_length(&iter),
> >> 					&iter.dirty);
> >>
> >> 		iova_bitmap_iter_put(&iter);
> >>
> >> 		if (ret)
> >> 			break;  
> > 
> > This break is unreachable.
> >   
> I'll remove it.
> 
> >> 	}
> >>
> >> 	iova_bitmap_iter_free(&iter);
> >>
> >> The facility is intended to be used for user bitmaps representing
> >> dirtied IOVAs by IOMMU (via IOMMUFD) and PCI Devices (via vfio-pci).
> >>
> >> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> >> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> >> ---
> >>  drivers/vfio/Makefile       |   6 +-
> >>  drivers/vfio/iova_bitmap.c  | 164 ++++++++++++++++++++++++++++++++++++
> >>  include/linux/iova_bitmap.h |  46 ++++++++++
> >>  3 files changed, 214 insertions(+), 2 deletions(-)
> >>  create mode 100644 drivers/vfio/iova_bitmap.c
> >>  create mode 100644 include/linux/iova_bitmap.h
> >>
> >> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> >> index 1a32357592e3..1d6cad32d366 100644
> >> --- a/drivers/vfio/Makefile
> >> +++ b/drivers/vfio/Makefile
> >> @@ -1,9 +1,11 @@
> >>  # SPDX-License-Identifier: GPL-2.0
> >>  vfio_virqfd-y := virqfd.o
> >>  
> >> -vfio-y += vfio_main.o
> >> -
> >>  obj-$(CONFIG_VFIO) += vfio.o
> >> +
> >> +vfio-y := vfio_main.o \
> >> +          iova_bitmap.o \
> >> +
> >>  obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o
> >>  obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
> >>  obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
> >> diff --git a/drivers/vfio/iova_bitmap.c b/drivers/vfio/iova_bitmap.c
> >> new file mode 100644
> >> index 000000000000..9ad1533a6aec
> >> --- /dev/null
> >> +++ b/drivers/vfio/iova_bitmap.c
> >> @@ -0,0 +1,164 @@
> >> +// SPDX-License-Identifier: GPL-2.0
> >> +/*
> >> + * Copyright (c) 2022, Oracle and/or its affiliates.
> >> + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> >> + */
> >> +
> >> +#include <linux/iova_bitmap.h>
> >> +
> >> +static unsigned long iova_bitmap_iova_to_index(struct iova_bitmap_iter *iter,
> >> +					       unsigned long iova_length)  
> > 
> > If we have an iova-to-index function, why do we pass it a length?  That
> > seems to be conflating the use cases where the caller is trying to
> > determine the last index for a range with the actual implementation of
> > this helper.
> >   
> 
> see below
> 
> >> +{
> >> +	unsigned long pgsize = 1 << iter->dirty.pgshift;
> >> +
> >> +	return DIV_ROUND_UP(iova_length, BITS_PER_TYPE(*iter->data) * pgsize);  
> > 
> > ROUND_UP here doesn't make sense to me and is not symmetric with the
> > below index-to-iova function.  For example an iova of 0x1000 give me an
> > index of 1, but index of 1 gives me an iova of 0x40000.  Does this code
> > work??
> >   
> It does work. The functions aren't actually symmetric, and iova_to_index() is returning
> the number of elements based on bits-per-u64/page_size for a IOVA length. And it was me
> being defensive to avoid having to fixup to iovas given that all computations can be done
> with lengths/nr-elements.
> 
> I have been reworking IOMMUFD original version this originated and these are remnants of
> working over chunks of bitmaps/iova rather than treating the bitmap as an array. But the
> latter is where I was aiming at in terms of structure. I should make these symmetric and
> actually return an index and fully adhere to that symmetry as convention.
> 
> Thus will remove the DIV_ROUND_UP here, switch it to work under an IOVA instead of length
> and adjust the necessary off-by-one and +1 in its respective call sites. Sorry for the
> confusion this has caused.
> 
> >> +}
> >> +
> >> +static unsigned long iova_bitmap_index_to_iova(struct iova_bitmap_iter *iter,
> >> +					       unsigned long index)
> >> +{
> >> +	unsigned long pgshift = iter->dirty.pgshift;
> >> +
> >> +	return (index * sizeof(*iter->data) * BITS_PER_BYTE) << pgshift;  
> >                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > Isn't that BITS_PER_TYPE(*iter->data), just as in the previous function?
> >   
> Yeap, I'll switch to that.
> 
> >> +}
> >> +
> >> +static unsigned long iova_bitmap_iter_left(struct iova_bitmap_iter *iter)  
> > 
> > I think this is trying to find "remaining" whereas "left" can be
> > confused with a direction.
> >   
> Yes, it was bad english on my end. I'll replace it with @remaining.
> 
> >> +{
> >> +	unsigned long left = iter->count - iter->offset;
> >> +
> >> +	left = min_t(unsigned long, left,
> >> +		     (iter->dirty.npages << PAGE_SHIFT) / sizeof(*iter->data));  
> > 
> > Ugh, dirty.npages is zero'd on bitmap init, allocated on get and left
> > with stale data on put.  This really needs some documentation/theory of
> > operation.
> >   
> 
> So the get and put are always paired, and their function is to pin a chunk
> of the bitmap (up to 2M which is how many struct pages can fit in one
> base page) and initialize the iova_bitmap with the info on what the bitmap
> pages represent in terms of which IOVA space.
> 
> So while @npages is left stale after put(), its value is only ever useful
> after get() (i.e. pinning). And its purpose is to cap the max pages
> we can access from the bitmap for also e.g. calculating
> iova_bitmap_length()/iova_bitmap_iter_left()
> or advancing the iterator.
> 
> >> +
> >> +	return left;
> >> +}
> >> +
> >> +/*
> >> + * Input argument of number of bits to bitmap_set() is unsigned integer, which
> >> + * further casts to signed integer for unaligned multi-bit operation,
> >> + * __bitmap_set().
> >> + * Then maximum bitmap size supported is 2^31 bits divided by 2^3 bits/byte,
> >> + * that is 2^28 (256 MB) which maps to 2^31 * 2^12 = 2^43 (8TB) on 4K page
> >> + * system.
> >> + */  
> > 
> > This is all true and familiar, but what's it doing here?  The type1
> > code this comes from uses this to justify some #defines that are used
> > to sanitize input.  I see no such enforcement in this code.  The only
> > comment in this whole patch and it seems irrelevant.
> >   
> This was previously related to macros I had here that serve the same purpose
> as the ones in VFIO, but the same said validation was made in some other way
> and by distraction I left this comment stale. I'll remove it.
> 
> >> +int iova_bitmap_iter_init(struct iova_bitmap_iter *iter,
> >> +			  unsigned long iova, unsigned long length,
> >> +			  u64 __user *data)
> >> +{
> >> +	struct iova_bitmap *dirty = &iter->dirty;
> >> +
> >> +	iter->data = data;
> >> +	iter->offset = 0;
> >> +	iter->count = iova_bitmap_iova_to_index(iter, length);  
> > 
> > If this works, it's because the DIV_ROUND_UP above accounted for what
> > should have been and index-to-count fixup here, ie. add one.
> >   
> As mentioned earlier, I'll change that to the suggestion above.
> 
> >> +	iter->iova = iova;
> >> +	iter->length = length;
> >> +	dirty->pages = (struct page **)__get_free_page(GFP_KERNEL);
> >> +
> >> +	return !dirty->pages ? -ENOMEM : 0;
> >> +}
> >> +
> >> +void iova_bitmap_iter_free(struct iova_bitmap_iter *iter)
> >> +{
> >> +	struct iova_bitmap *dirty = &iter->dirty;
> >> +
> >> +	if (dirty->pages) {
> >> +		free_page((unsigned long)dirty->pages);
> >> +		dirty->pages = NULL;
> >> +	}
> >> +}
> >> +
> >> +bool iova_bitmap_iter_done(struct iova_bitmap_iter *iter)
> >> +{
> >> +	return iter->offset >= iter->count;
> >> +}
> >> +
> >> +unsigned long iova_bitmap_length(struct iova_bitmap_iter *iter)
> >> +{
> >> +	unsigned long max_iova = iter->dirty.iova + iter->length;
> >> +	unsigned long left = iova_bitmap_iter_left(iter);
> >> +	unsigned long iova = iova_bitmap_iova(iter);
> >> +
> >> +	left = iova_bitmap_index_to_iova(iter, left);  
> > 
> > @left is first used for number of indexes and then for an iova range :(
> >   
> I was trying to avoid an extra variable and an extra long line.
> 
> >> +	if (iova + left > max_iova)
> >> +		left -= ((iova + left) - max_iova);
> >> +
> >> +	return left;
> >> +}  
> > 
> > IIUC, this is returning the iova free space in the bitmap, not the
> > length of the bitmap??
> >   
> This is essentially representing your bitmap working set IOW the length
> of the *pinned* bitmap. Not the size of the whole bitmap.
> 
> >> +
> >> +unsigned long iova_bitmap_iova(struct iova_bitmap_iter *iter)
> >> +{
> >> +	unsigned long skip = iter->offset;
> >> +
> >> +	return iter->iova + iova_bitmap_index_to_iova(iter, skip);
> >> +}  
> > 
> > It would help if this were defined above it's usage above.
> >   
> I'll move it.
> 
> >> +
> >> +void iova_bitmap_iter_advance(struct iova_bitmap_iter *iter)
> >> +{
> >> +	unsigned long length = iova_bitmap_length(iter);
> >> +
> >> +	iter->offset += iova_bitmap_iova_to_index(iter, length);  
> > 
> > Again, fudging an index count based on bogus index value.
> >   
> As mentioned earlier, I'll change that iova_bitmap_iova_to_index()
> to return an index instead of nr of elements.
> 
> >> +}
> >> +
> >> +void iova_bitmap_iter_put(struct iova_bitmap_iter *iter)
> >> +{
> >> +	struct iova_bitmap *dirty = &iter->dirty;
> >> +
> >> +	if (dirty->npages)
> >> +		unpin_user_pages(dirty->pages, dirty->npages);  
> > 
> > dirty->npages = 0;?
> >   
> Sadly no, because after iova_bitmap_iter_put() we will call
> iova_bitmap_iter_advance() to go to the next chunk of the bitmap
> (i.e. the next 64G of IOVA, or IOW the next 2M of bitmap memory).
> 
> I could remove explicit calls to iova_bitmap_iter_{get,put}()
> while making them internal only and merge it in iova_bitmap_iter_advance()
> and iova_bimap_iter_init. This should a bit simpler for API user
> and I would be able to clear npages here. Let me see how this looks.
> 
> >> +}
> >> +
> >> +int iova_bitmap_iter_get(struct iova_bitmap_iter *iter)
> >> +{
> >> +	struct iova_bitmap *dirty = &iter->dirty;
> >> +	unsigned long npages;
> >> +	u64 __user *addr;
> >> +	long ret;
> >> +
> >> +	npages = DIV_ROUND_UP((iter->count - iter->offset) *
> >> +			      sizeof(*iter->data), PAGE_SIZE);
> >> +	npages = min(npages,  PAGE_SIZE / sizeof(struct page *));
> >> +	addr = iter->data + iter->offset;
> >> +	ret = pin_user_pages_fast((unsigned long)addr, npages,
> >> +				  FOLL_WRITE, dirty->pages);
> >> +	if (ret <= 0)
> >> +		return ret;
> >> +
> >> +	dirty->npages = (unsigned long)ret;
> >> +	dirty->iova = iova_bitmap_iova(iter);
> >> +	dirty->start_offset = offset_in_page(addr);
> >> +	return 0;
> >> +}
> >> +
> >> +void iova_bitmap_init(struct iova_bitmap *bitmap,
> >> +		      unsigned long base, unsigned long pgshift)
> >> +{
> >> +	memset(bitmap, 0, sizeof(*bitmap));
> >> +	bitmap->iova = base;
> >> +	bitmap->pgshift = pgshift;
> >> +}
> >> +
> >> +unsigned int iova_bitmap_set(struct iova_bitmap *dirty,
> >> +			     unsigned long iova,
> >> +			     unsigned long length)
> >> +{
> >> +	unsigned long nbits, offset, start_offset, idx, size, *kaddr;
> >> +
> >> +	nbits = max(1UL, length >> dirty->pgshift);
> >> +	offset = (iova - dirty->iova) >> dirty->pgshift;
> >> +	idx = offset / (PAGE_SIZE * BITS_PER_BYTE);
> >> +	offset = offset % (PAGE_SIZE * BITS_PER_BYTE);
> >> +	start_offset = dirty->start_offset;
> >> +
> >> +	while (nbits > 0) {
> >> +		kaddr = kmap_local_page(dirty->pages[idx]) + start_offset;
> >> +		size = min(PAGE_SIZE * BITS_PER_BYTE - offset, nbits);
> >> +		bitmap_set(kaddr, offset, size);
> >> +		kunmap_local(kaddr - start_offset);
> >> +		start_offset = offset = 0;
> >> +		nbits -= size;
> >> +		idx++;
> >> +	}
> >> +
> >> +	return nbits;
> >> +}
> >> +EXPORT_SYMBOL_GPL(iova_bitmap_set);
> >> +
> >> diff --git a/include/linux/iova_bitmap.h b/include/linux/iova_bitmap.h
> >> new file mode 100644
> >> index 000000000000..c474c351634a
> >> --- /dev/null
> >> +++ b/include/linux/iova_bitmap.h
> >> @@ -0,0 +1,46 @@
> >> +/* SPDX-License-Identifier: GPL-2.0 */
> >> +/*
> >> + * Copyright (c) 2022, Oracle and/or its affiliates.
> >> + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> >> + */
> >> +
> >> +#ifndef _IOVA_BITMAP_H_
> >> +#define _IOVA_BITMAP_H_
> >> +
> >> +#include <linux/highmem.h>
> >> +#include <linux/mm.h>
> >> +#include <linux/uio.h>
> >> +
> >> +struct iova_bitmap {
> >> +	unsigned long iova;
> >> +	unsigned long pgshift;
> >> +	unsigned long start_offset;
> >> +	unsigned long npages;
> >> +	struct page **pages;
> >> +};
> >> +
> >> +struct iova_bitmap_iter {
> >> +	struct iova_bitmap dirty;
> >> +	u64 __user *data;
> >> +	size_t offset;
> >> +	size_t count;
> >> +	unsigned long iova;
> >> +	unsigned long length;
> >> +};
> >> +
> >> +int iova_bitmap_iter_init(struct iova_bitmap_iter *iter, unsigned long iova,
> >> +			  unsigned long length, u64 __user *data);
> >> +void iova_bitmap_iter_free(struct iova_bitmap_iter *iter);
> >> +bool iova_bitmap_iter_done(struct iova_bitmap_iter *iter);
> >> +unsigned long iova_bitmap_length(struct iova_bitmap_iter *iter);
> >> +unsigned long iova_bitmap_iova(struct iova_bitmap_iter *iter);
> >> +void iova_bitmap_iter_advance(struct iova_bitmap_iter *iter);
> >> +int iova_bitmap_iter_get(struct iova_bitmap_iter *iter);
> >> +void iova_bitmap_iter_put(struct iova_bitmap_iter *iter);
> >> +void iova_bitmap_init(struct iova_bitmap *bitmap,
> >> +		      unsigned long base, unsigned long pgshift);
> >> +unsigned int iova_bitmap_set(struct iova_bitmap *dirty,
> >> +			     unsigned long iova,
> >> +			     unsigned long length);
> >> +
> >> +#endif  
> > 
> > No relevant comments, no theory of operation.  I found this really
> > difficult to review and the page handling is still not clear to me.
> > I'm not willing to take on maintainership of this code under
> > drivers/vfio/ as is.   
> 
> Sorry for the lack of comments/docs and lack of clearity in some of the
> functions. I'll document all functions/fields and add a comment bloc at
> the top explaining the theory on how it should be used/works, alongside
> the improvements you suggested above.
> 
> Meanwhile what is less clear for you on the page handling? We are essentially
> calculating the number of pages based of @offset and @count and then preping
> the iova_bitmap (@dirty) with the base IOVA and page offset. iova_bitmap_set()
> then computes where is the should start setting bits, and then it kmap() each page
> and sets the said bits. So far I am not caching kmap() kaddr,
> so the majority of iova_bitmap_set() complexity comes from iterating over number
> of bits to kmap and accounting to the offset that user bitmap address had.

It could have saved a lot of struggling through this code if it were
presented as a windowing scheme to iterate over a user bitmap.

As I understand it more though, does the API really fit the expected use
cases?  As presented here and used in the following patch, we map every
section of the user bitmap, present that section to the device driver
and ask them to mark dirty bits and atomically clear their internal
tracker for that sub-range.  This seems really inefficient.

Are we just counting on the fact that each 2MB window of dirty bitmap
is 64GB of guest RAM (assuming 4KB pages) and there's likely something
dirty there?

It seems like a more efficient API might be for us to call the device
driver with an iterator object, which the device driver uses to call
back into this bitmap helper to set specific iova+length ranges as
dirty.  The iterator could still cache the kmap'd page (or pages) to
optimize localized dirties, but we don't necessarily need to kmap and
present every section of the bitmap to the driver.  Thanks,

Alex



* Re: [PATCH V2 vfio 05/11] vfio: Add an IOVA bitmap support
  2022-07-20 16:47       ` Alex Williamson
@ 2022-07-20 17:27         ` Jason Gunthorpe
  2022-07-20 18:16         ` Joao Martins
  1 sibling, 0 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2022-07-20 17:27 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Joao Martins, Yishai Hadas, saeedm, kvm, netdev, kuba,
	kevin.tian, leonro, maorg, cohuck

On Wed, Jul 20, 2022 at 10:47:25AM -0600, Alex Williamson wrote:

> As I understand it more though, does the API really fit the expected use
> cases?  As presented here and used in the following patch, we map every
> section of the user bitmap, present that section to the device driver
> and ask them to mark dirty bits and atomically clear their internal
> tracker for that sub-range.  This seems really inefficient.

I think until someone sits down and benchmarks it, it will be hard to
really tell what the right trade-offs are.

pin_user_pages_fast() is fairly slow, so calling it once per 4k of
user VA is definitely worse than trying to call it once for 2M of user
VA.

On the other hand very very big guests are possibly likely to have
64GB regions where there are no dirties.

But, sweeping the 64GB in the first place is possibly going to be
slow, so saving a little bit of pin_user_pages time may not matter
much.

On the other hand, cases like vIOMMU will have huge swaths of IOVA
where there is just nothing mapped, so perhaps sweeping for the system
IOMMU will be fast and pin_user_pages overhead will be troublesome.

Still, another viewpoint is that returning a bitmap at all is really
inefficient if we expect high sparsity, and we should instead return dirty
pfns; a simple put_user may be sufficient. It may make sense to have a
2nd API that works like this, which userspace could call during stop_copy
on the assumption of high sparsity.

We just don't have enough of an ecosystem going right now to sit down and
do all this benchmarking work, so I was happy with the simplistic
implementation here; it is only 160 lines, and if we toss it later based
on benchmarks, no biggie. The important thing was that this abstraction
exists at all and that drivers don't do their own thing.

Jason


* Re: [PATCH V2 vfio 05/11] vfio: Add an IOVA bitmap support
  2022-07-20 16:47       ` Alex Williamson
  2022-07-20 17:27         ` Jason Gunthorpe
@ 2022-07-20 18:16         ` Joao Martins
  1 sibling, 0 replies; 52+ messages in thread
From: Joao Martins @ 2022-07-20 18:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, jgg, saeedm, kvm, netdev, kuba, kevin.tian, leonro,
	maorg, cohuck

On 7/20/22 17:47, Alex Williamson wrote:
> On Wed, 20 Jul 2022 02:57:24 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 7/19/22 20:01, Alex Williamson wrote:
>>> On Thu, 14 Jul 2022 11:12:45 +0300
>>> Yishai Hadas <yishaih@nvidia.com> wrote:
>>>> From: Joao Martins <joao.m.martins@oracle.com>

[snip]

>>>> diff --git a/include/linux/iova_bitmap.h b/include/linux/iova_bitmap.h
>>>> new file mode 100644
>>>> index 000000000000..c474c351634a
>>>> --- /dev/null
>>>> +++ b/include/linux/iova_bitmap.h
>>>> @@ -0,0 +1,46 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>> +/*
>>>> + * Copyright (c) 2022, Oracle and/or its affiliates.
>>>> + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
>>>> + */
>>>> +
>>>> +#ifndef _IOVA_BITMAP_H_
>>>> +#define _IOVA_BITMAP_H_
>>>> +
>>>> +#include <linux/highmem.h>
>>>> +#include <linux/mm.h>
>>>> +#include <linux/uio.h>
>>>> +
>>>> +struct iova_bitmap {
>>>> +	unsigned long iova;
>>>> +	unsigned long pgshift;
>>>> +	unsigned long start_offset;
>>>> +	unsigned long npages;
>>>> +	struct page **pages;
>>>> +};
>>>> +
>>>> +struct iova_bitmap_iter {
>>>> +	struct iova_bitmap dirty;
>>>> +	u64 __user *data;
>>>> +	size_t offset;
>>>> +	size_t count;
>>>> +	unsigned long iova;
>>>> +	unsigned long length;
>>>> +};
>>>> +
>>>> +int iova_bitmap_iter_init(struct iova_bitmap_iter *iter, unsigned long iova,
>>>> +			  unsigned long length, u64 __user *data);
>>>> +void iova_bitmap_iter_free(struct iova_bitmap_iter *iter);
>>>> +bool iova_bitmap_iter_done(struct iova_bitmap_iter *iter);
>>>> +unsigned long iova_bitmap_length(struct iova_bitmap_iter *iter);
>>>> +unsigned long iova_bitmap_iova(struct iova_bitmap_iter *iter);
>>>> +void iova_bitmap_iter_advance(struct iova_bitmap_iter *iter);
>>>> +int iova_bitmap_iter_get(struct iova_bitmap_iter *iter);
>>>> +void iova_bitmap_iter_put(struct iova_bitmap_iter *iter);
>>>> +void iova_bitmap_init(struct iova_bitmap *bitmap,
>>>> +		      unsigned long base, unsigned long pgshift);
>>>> +unsigned int iova_bitmap_set(struct iova_bitmap *dirty,
>>>> +			     unsigned long iova,
>>>> +			     unsigned long length);
>>>> +
>>>> +#endif  
>>>
>>> No relevant comments, no theory of operation.  I found this really
>>> difficult to review and the page handling is still not clear to me.
>>> I'm not willing to take on maintainership of this code under
>>> drivers/vfio/ as is.   
>>
>> Sorry for the lack of comments/docs and lack of clearity in some of the
>> functions. I'll document all functions/fields and add a comment bloc at
>> the top explaining the theory on how it should be used/works, alongside
>> the improvements you suggested above.
>>
>> Meanwhile what is less clear for you on the page handling? We are essentially
>> calculating the number of pages based of @offset and @count and then preping
>> the iova_bitmap (@dirty) with the base IOVA and page offset. iova_bitmap_set()
>> then computes where is the should start setting bits, and then it kmap() each page
>> and sets the said bits. So far I am not caching kmap() kaddr,
>> so the majority of iova_bitmap_set() complexity comes from iterating over number
>> of bits to kmap and accounting to the offset that user bitmap address had.
> 
> It could have saved a lot of struggling through this code if it were
> presented as a windowing scheme to iterate over a user bitmap.
> 
> As I understand it more though, does the API really fit the expected use
> cases?  As presented here and used in the following patch, we map every
> section of the user bitmap, present that section to the device driver
> and ask them to mark dirty bits and atomically clear their internal
> tracker for that sub-range.  This seems really inefficient.
> 
So with either the IOMMU or a VFIO vendor driver, the hardware may marshal its dirty
bits in entirely different manners. On IOMMUs it is unbounded and PTE formats vary,
so there's no way around walking all the domain pagetables from the beginning of the
(mapped) IOVA range and checking whether every PTE is dirty or not, and this is going
to be rather expensive. The next cost would be either 1) copying bitmaps back and forth
or 2) pinning. 2) is cheaper if it is done over 2M chunks (i.e. fewer atomics there),
unless we take the slow path. On VFIO there's no intermediate storage for the driver,
and even if we were going to preregister anything on the vendor side we would have to
copy MBs of bitmaps to user memory (worst case e.g. 32MiB per TiB). Although there's
some inefficiency in unnecessarily pinning potentially non-dirty IOVA ranges if the
user doesn't mark anything dirty.
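
(For scale, with 4K tracking pages:

	1TiB / 4KiB = 2^28 pages -> 2^28 bits = 32MiB of bitmap per TiB

which is where the worst-case number above comes from.)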

With iova_bitmap trying to avoid copies, the main cost is in the pinning, and
making it dependent on dirties (rather than windowing) means we could pin
individual pages of the bitmap more often (with efficiency being a bit more
tied to the VF workload or vIOMMU).

> Are we just counting on the fact that each 2MB window of dirty bitmap
> is 64GB of guest RAM (assuming 4KB pages) and there's likely something
> dirty there?
> 
Yes, and likely there's enough dirty data there, except when we get reports
for very big IOVA ranges, where usually there's more than one iteration. 4K
of user memory would represent a section (128M).

> It seems like a more efficient API might be for us to call the device
> driver with an iterator object, which the device driver uses to call
> back into this bitmap helper to set specific iova+length ranges as
> dirty.

I can explore another variant. With some control over how it advances
the bitmap, the driver could easily adjust the iova_bitmap as it sees fit
without necessarily having to walk the whole bitmap memory, while retaining
the same general iova_bitmap facility. The downside(?) would be that the end
drivers (iommu driver, or vfio vendor driver) need to work with (pin) user
buffers rather than kernel-managed pages.
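
Something along these lines, purely as a sketch of the shape such a
variant could take (none of these names exist in this series):

/* core-provided helper: resolves [iova, iova + length) into the user
 * bitmap (pinning/mapping whatever part of it is needed) and sets bits */
int iova_bitmap_set_range(struct iova_bitmap *bitmap,
			  unsigned long iova, unsigned long length);

/* driver/IOMMU side: walk the internal tracker once and only touch the
 * parts of the user bitmap that are actually dirty */
static int my_report_dirty(struct my_tracker *tracker,
			   struct iova_bitmap *bitmap)
{
	unsigned long iova, length;
	int ret;

	while (my_tracker_next_dirty(tracker, &iova, &length)) {
		ret = iova_bitmap_set_range(bitmap, iova, length);
		if (ret)
			return ret;
	}
	return 0;
}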

> The iterator could still cache the kmap'd page (or pages) to
> optimize localized dirties, but we don't necessarily need to kmap and
> present every section of the bitmap to the driver. 

kmap_local_page() is cheap (IOW page_address(page)), unless it's highmem (AIUI).

The expensive part of the zerocopy approach is having to pin pages prior to
iova_bitmap_set(). If the device is doing IOs scattered across RAM sections
I am not sure how efficient the caching will be.

> Thanks,
> 


* RE: [PATCH V2 vfio 00/11] Add device DMA logging support for mlx5 driver
  2022-07-14  8:12 [PATCH V2 vfio 00/11] Add device DMA logging support for mlx5 driver Yishai Hadas
                   ` (10 preceding siblings ...)
  2022-07-14  8:12 ` [PATCH V2 vfio 11/11] vfio/mlx5: Set the driver DMA logging callbacks Yishai Hadas
@ 2022-07-21  8:26 ` Tian, Kevin
  2022-07-21  8:55   ` Yishai Hadas
  11 siblings, 1 reply; 52+ messages in thread
From: Tian, Kevin @ 2022-07-21  8:26 UTC (permalink / raw)
  To: Yishai Hadas, alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, Martins, Joao, leonro, maorg, cohuck

> From: Yishai Hadas <yishaih@nvidia.com>
> Sent: Thursday, July 14, 2022 4:13 PM
> 
> It abstracts how an IOVA range is represented in a bitmap that is
> granulated by a given page_size. So it translates all the lifting of
> dealing with user pointers into its corresponding kernel addresses
> backing said user memory into doing finally the bitmap ops to change
> various bits.
> 

Don't quite understand the last sentence...


* RE: [PATCH V2 vfio 01/11] net/mlx5: Introduce ifc bits for page tracker
  2022-07-14  8:12 ` [PATCH V2 vfio 01/11] net/mlx5: Introduce ifc bits for page tracker Yishai Hadas
@ 2022-07-21  8:28   ` Tian, Kevin
  2022-07-21  8:43     ` Yishai Hadas
  0 siblings, 1 reply; 52+ messages in thread
From: Tian, Kevin @ 2022-07-21  8:28 UTC (permalink / raw)
  To: Yishai Hadas, alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, Martins, Joao, leonro, maorg, cohuck

> From: Yishai Hadas <yishaih@nvidia.com>
> Sent: Thursday, July 14, 2022 4:13 PM
> @@ -1711,7 +1712,9 @@ struct mlx5_ifc_cmd_hca_cap_bits {
>  	u8         max_geneve_tlv_options[0x8];
>  	u8         reserved_at_568[0x3];
>  	u8         max_geneve_tlv_option_data_len[0x5];
> -	u8         reserved_at_570[0x10];
> +	u8         reserved_at_570[0x9];
> +	u8         adv_virtualization[0x1];
> +	u8         reserved_at_57a[0x6];
> 
>  	u8	   reserved_at_580[0xb];

any reason why the two consecutive reserved fields cannot
be merged into one?

	u8	resered_at_57a[0x11];


* Re: [PATCH V2 vfio 01/11] net/mlx5: Introduce ifc bits for page tracker
  2022-07-21  8:28   ` Tian, Kevin
@ 2022-07-21  8:43     ` Yishai Hadas
  0 siblings, 0 replies; 52+ messages in thread
From: Yishai Hadas @ 2022-07-21  8:43 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, Martins, Joao, leonro, maorg, cohuck

On 21/07/2022 11:28, Tian, Kevin wrote:
>> From: Yishai Hadas <yishaih@nvidia.com>
>> Sent: Thursday, July 14, 2022 4:13 PM
>> @@ -1711,7 +1712,9 @@ struct mlx5_ifc_cmd_hca_cap_bits {
>>   	u8         max_geneve_tlv_options[0x8];
>>   	u8         reserved_at_568[0x3];
>>   	u8         max_geneve_tlv_option_data_len[0x5];
>> -	u8         reserved_at_570[0x10];
>> +	u8         reserved_at_570[0x9];
>> +	u8         adv_virtualization[0x1];
>> +	u8         reserved_at_57a[0x6];
>>
>>   	u8	   reserved_at_580[0xb];
> any reason why the two consecutive reserved fields cannot
> be merged into one?

This follows our convention in this file that each 32 bits are separated 
into a block to ease tracking.

Yishai

>
> 	u8	resered_at_57a[0x11];




* RE: [PATCH V2 vfio 03/11] vfio: Introduce DMA logging uAPIs
  2022-07-14  8:12 ` [PATCH V2 vfio 03/11] vfio: Introduce DMA logging uAPIs Yishai Hadas
  2022-07-18 22:29   ` Alex Williamson
@ 2022-07-21  8:45   ` Tian, Kevin
  2022-07-21 12:05     ` Jason Gunthorpe
       [not found]     ` <56bd06d3-944c-18da-86ed-ae14ce5940b7@nvidia.com>
  1 sibling, 2 replies; 52+ messages in thread
From: Tian, Kevin @ 2022-07-21  8:45 UTC (permalink / raw)
  To: Yishai Hadas, alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, Martins, Joao, leonro, maorg, cohuck

> From: Yishai Hadas <yishaih@nvidia.com>
> Sent: Thursday, July 14, 2022 4:13 PM
> 
> DMA logging allows a device to internally record what DMAs the device is
> initiating and report them back to userspace. It is part of the VFIO
> migration infrastructure that allows implementing dirty page tracking
> during the pre copy phase of live migration. Only DMA WRITEs are logged,
> and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
> 
> This patch introduces the DMA logging involved uAPIs.
> 
> It uses the FEATURE ioctl with its GET/SET/PROBE options as of below.
> 
> It exposes a PROBE option to detect if the device supports DMA logging.
> It exposes a SET option to start device DMA logging in given IOVAs
> ranges.
> It exposes a SET option to stop device DMA logging that was previously
> started.
> It exposes a GET option to read back and clear the device DMA log.
> 
> Extra details exist as part of vfio.h per a specific option.
> 
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  include/uapi/linux/vfio.h | 79
> +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 79 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 733a1cddde30..81475c3e7c92 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -986,6 +986,85 @@ enum vfio_device_mig_state {
>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>  };
> 
> +/*
> + * Upon VFIO_DEVICE_FEATURE_SET start device DMA logging.

both 'start'/'stop' are via VFIO_DEVICE_FEATURE_SET

> + * VFIO_DEVICE_FEATURE_PROBE can be used to detect if the device
> supports
> + * DMA logging.
> + *
> + * DMA logging allows a device to internally record what DMAs the device is
> + * initiating and report them back to userspace. It is part of the VFIO
> + * migration infrastructure that allows implementing dirty page tracking
> + * during the pre copy phase of live migration. Only DMA WRITEs are logged,

Then 'DMA dirty logging' might be a more accurate name throughout this
series.

> + * and this API is not connected to
> VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.

didn't get the point of this explanation.

> + *
> + * When DMA logging is started a range of IOVAs to monitor is provided and
> the
> + * device can optimize its logging to cover only the IOVA range given. Each
> + * DMA that the device initiates inside the range will be logged by the device
> + * for later retrieval.
> + *
> + * page_size is an input that hints what tracking granularity the device
> + * should try to achieve. If the device cannot do the hinted page size then it
> + * should pick the next closest page size it supports. On output the device

next closest 'smaller' page size?

> + * will return the page size it selected.
> + *
> + * ranges is a pointer to an array of
> + * struct vfio_device_feature_dma_logging_range.
> + */
> +struct vfio_device_feature_dma_logging_control {
> +	__aligned_u64 page_size;
> +	__u32 num_ranges;
> +	__u32 __reserved;
> +	__aligned_u64 ranges;
> +};

should we move the definition of LOG_MAX_RANGES to be here
so the user can know the max limits of tracked ranges?

> +
> +struct vfio_device_feature_dma_logging_range {
> +	__aligned_u64 iova;
> +	__aligned_u64 length;
> +};
> +
> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_START 3

Can the user update the range list by doing another START?

> +
> +/*
> + * Upon VFIO_DEVICE_FEATURE_SET stop device DMA logging that was
> started
> + * by VFIO_DEVICE_FEATURE_DMA_LOGGING_START
> + */
> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP 4

Is there value in allowing the user to stop tracking a specific range?

> +
> +/*
> + * Upon VFIO_DEVICE_FEATURE_GET read back and clear the device DMA
> log
> + *
> + * Query the device's DMA log for written pages within the given IOVA range.
> + * During querying the log is cleared for the IOVA range.
> + *
> + * bitmap is a pointer to an array of u64s that will hold the output bitmap
> + * with 1 bit reporting a page_size unit of IOVA. The mapping of IOVA to bits
> + * is given by:
> + *  bitmap[(addr - iova)/page_size] & (1ULL << (addr % 64))
> + *
> + * The input page_size can be any power of two value and does not have to
> + * match the value given to VFIO_DEVICE_FEATURE_DMA_LOGGING_START.
> The driver
> + * will format its internal logging to match the reporting page size, possibly
> + * by replicating bits if the internal page size is lower than requested.

what's the purpose of this? I didn't quite get why an user would want to
start tracking in one page size and then read back the dirty bitmap in
another page size...

> + *
> + * Bits will be updated in bitmap using atomic or to allow userspace to
> + * combine bitmaps from multiple trackers together. Therefore userspace
> must
> + * zero the bitmap before doing any reports.

I'm a bit lost here. Since we allow userspace to combine bitmaps from 
multiple trackers, then it's perfectly sane for userspace to leave the bitmap
with some 1's from one tracker when doing a report from another tracker.

> + *
> + * If any error is returned userspace should assume that the dirty log is
> + * corrupted and restart.
> + *
> + * If DMA logging is not enabled, an error will be returned.
> + *
> + */
> +struct vfio_device_feature_dma_logging_report {
> +	__aligned_u64 iova;
> +	__aligned_u64 length;
> +	__aligned_u64 page_size;
> +	__aligned_u64 bitmap;
> +};
> +
> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT 5
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
> 
>  /**
> --
> 2.18.1



* RE: [PATCH V2 vfio 06/11] vfio: Introduce the DMA logging feature support
  2022-07-19 20:08         ` Jason Gunthorpe
@ 2022-07-21  8:54           ` Tian, Kevin
  2022-07-21 11:50             ` Jason Gunthorpe
  0 siblings, 1 reply; 52+ messages in thread
From: Tian, Kevin @ 2022-07-21  8:54 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Yishai Hadas, saeedm, kvm, netdev, kuba, Martins, Joao, leonro,
	maorg, cohuck

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, July 20, 2022 4:08 AM
> 
> On Tue, Jul 19, 2022 at 01:25:14PM -0600, Alex Williamson wrote:
> 
> > > We don't really expect user space to hit this limit, the RAM in QEMU is
> > > divided today to around ~12 ranges as we saw so far in our evaluation.
> >
> > There can be far more for vIOMMU use cases or non-QEMU drivers.
> 
> Not really, it isn't dynamic so vIOMMU has to decide what it wants to
> track up front. It would never make sense to track based on what is
> currently mapped. So it will be some small list, probably a big linear
> chunk of the IOVA space.

How would the vIOMMU make such a decision when the address space
is managed by the guest? It is dynamic and could be sparse. I'm
curious about any example a vIOMMU could use to generate such a small
list. Would it be a single range based on the aperture reported from the
kernel?



* Re: [PATCH V2 vfio 00/11] Add device DMA logging support for mlx5 driver
  2022-07-21  8:26 ` [PATCH V2 vfio 00/11] Add device DMA logging support for mlx5 driver Tian, Kevin
@ 2022-07-21  8:55   ` Yishai Hadas
  0 siblings, 0 replies; 52+ messages in thread
From: Yishai Hadas @ 2022-07-21  8:55 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, Martins, Joao, leonro, maorg, cohuck

On 21/07/2022 11:26, Tian, Kevin wrote:
>> From: Yishai Hadas <yishaih@nvidia.com>
>> Sent: Thursday, July 14, 2022 4:13 PM
>>
>> It abstracts how an IOVA range is represented in a bitmap that is
>> granulated by a given page_size. So it translates all the lifting of
>> dealing with user pointers into its corresponding kernel addresses
>> backing said user memory into doing finally the bitmap ops to change
>> various bits.
>>
> Don't quite understand the last sentence...

The intent is to say that this new functionality abstracts the
complexity of dealing with user/kernel bitmap pointers and, in the end,
provides an easy way to set some bits.
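
i.e. from the tracker's point of view, marking a range dirty boils down to
a single call (a sketch; the function is from patch 05, and dirty_iova/
dirty_length just stand for whatever range the tracker found):

	iova_bitmap_set(&iter.dirty, dirty_iova, dirty_length);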

I may rephrase it as part of V3 to make it clearer.

Yishai



* Re: [PATCH V2 vfio 06/11] vfio: Introduce the DMA logging feature support
  2022-07-21  8:54           ` Tian, Kevin
@ 2022-07-21 11:50             ` Jason Gunthorpe
  2022-07-25  7:38               ` Tian, Kevin
  0 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2022-07-21 11:50 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Yishai Hadas, saeedm, kvm, netdev, kuba,
	Martins, Joao, leonro, maorg, cohuck

On Thu, Jul 21, 2022 at 08:54:39AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, July 20, 2022 4:08 AM
> > 
> > On Tue, Jul 19, 2022 at 01:25:14PM -0600, Alex Williamson wrote:
> > 
> > > > We don't really expect user space to hit this limit, the RAM in QEMU is
> > > > divided today to around ~12 ranges as we saw so far in our evaluation.
> > >
> > > There can be far more for vIOMMU use cases or non-QEMU drivers.
> > 
> > Not really, it isn't dynamic so vIOMMU has to decide what it wants to
> > track up front. It would never make sense to track based on what is
> > currently mapped. So it will be some small list, probably a big linear
> > chunk of the IOVA space.
> 
> How would vIOMMU make such decision when the address space 
> is managed by the guest? it is dynamic and could be sparse. I'm
> curious about any example a vIOMMU can use to generate such small
> list. Would it be a single range based on aperture reported from the
> kernel?

Yes. qemu has to select a static aperture at start.

 The entire aperture is best, if that fails

 A smaller aperture and hope the guest doesn't use the whole space, if
 that fails,

 The entire guest physical map and hope the guest is in PT mode

All of these options are small lists.

Any vIOMMU maps that are created outside of what was asked to be
tracked have to be made permanently dirty by qemu.
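
To make the ordering concrete, a rough sketch of that fallback (purely
illustrative, not QEMU code; try_start_tracking() is a hypothetical wrapper
around VFIO_DEVICE_FEATURE_DMA_LOGGING_START and all range values are invented):

#include <stdint.h>

struct iova_range { uint64_t iova, length; };

/* Hypothetical wrapper around the DMA_LOGGING_START feature ioctl,
 * returning 0 on success. */
int try_start_tracking(int device_fd, const struct iova_range *r,
		       unsigned int nranges);

static int start_dirty_tracking(int device_fd)
{
	/* 1) The full vIOMMU aperture (example: a 48-bit IOVA space) */
	struct iova_range full = { 0, 1ULL << 48 };
	/* 2) A smaller aperture, hoping the guest stays inside it */
	struct iova_range smaller = { 0, 1ULL << 39 };
	/* 3) The guest physical map, hoping the guest runs in PT mode */
	struct iova_range gpa[2] = {
		{ 0, 3ULL << 30 },		/* RAM below 3G, example */
		{ 1ULL << 32, 1ULL << 30 },	/* RAM above 4G, example */
	};

	if (!try_start_tracking(device_fd, &full, 1))
		return 0;
	if (!try_start_tracking(device_fd, &smaller, 1))
		return 0;
	return try_start_tracking(device_fd, gpa, 2);
}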

Jason 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH V2 vfio 03/11] vfio: Introduce DMA logging uAPIs
  2022-07-21  8:45   ` Tian, Kevin
@ 2022-07-21 12:05     ` Jason Gunthorpe
  2022-07-25  7:20       ` Tian, Kevin
       [not found]     ` <56bd06d3-944c-18da-86ed-ae14ce5940b7@nvidia.com>
  1 sibling, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2022-07-21 12:05 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Yishai Hadas, alex.williamson, saeedm, kvm, netdev, kuba,
	Martins, Joao, leonro, maorg, cohuck

On Thu, Jul 21, 2022 at 08:45:10AM +0000, Tian, Kevin wrote:

> > + * will format its internal logging to match the reporting page size, possibly
> > + * by replicating bits if the internal page size is lower than requested.
> 
> what's the purpose of this? I didn't quite get why an user would want to
> start tracking in one page size and then read back the dirty bitmap in
> another page size...

There may be multiple kernel trackers that are working with different
underlying block sizes, so the concept is userspace decides what block
size it wants to work in and the kernel side transparently adapts. The
math is simple so putting it in the kernel is convenient.

Effectively the general vision is that qemu would allocate one
reporting buffer and then invoke these IOCTLs in parallel on all the
trackers then process the single bitmap. Forcing qemu to allocate
bitmaps per tracker page size is just inefficient.
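
As a rough illustration of the conversion (hypothetical helper, not the actual
vfio/mlx5 code): a tracker whose internal granule is larger than the requested
page size just sets a run of bits for every dirty internal block.

#include <linux/bitmap.h>
#include <linux/types.h>

/* Hypothetical helper: mark one internally tracked dirty block in a
 * bitmap whose granularity is the (smaller) user-requested page size.
 * One coarse block expands into a run of reporting-granule bits. */
static void report_dirty_block(unsigned long *bitmap, u64 range_start_iova,
			       u64 report_page_size, u64 dirty_iova,
			       u64 internal_page_size)
{
	unsigned int first = (dirty_iova - range_start_iova) / report_page_size;
	unsigned int nbits = internal_page_size / report_page_size;

	bitmap_set(bitmap, first, nbits);
}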

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH V2 vfio 03/11] vfio: Introduce DMA logging uAPIs
  2022-07-21 12:05     ` Jason Gunthorpe
@ 2022-07-25  7:20       ` Tian, Kevin
  2022-07-25 14:33         ` Jason Gunthorpe
  0 siblings, 1 reply; 52+ messages in thread
From: Tian, Kevin @ 2022-07-25  7:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, alex.williamson, saeedm, kvm, netdev, kuba,
	Martins, Joao, leonro, maorg, cohuck

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, July 21, 2022 8:05 PM
> 
> On Thu, Jul 21, 2022 at 08:45:10AM +0000, Tian, Kevin wrote:
> 
> > > + * will format its internal logging to match the reporting page size,
> possibly
> > > + * by replicating bits if the internal page size is lower than requested.
> >
> > what's the purpose of this? I didn't quite get why an user would want to
> > start tracking in one page size and then read back the dirty bitmap in
> > another page size...
> 
> There may be multiple kernel trackers that are working with different
> underlying block sizes, so the concept is userspace decides what block
> size it wants to work in and the kernel side transparently adapts. The
> math is simple so putting it in the kernel is convenient.
> 
> Effectively the general vision is that qemu would allocate one
> reporting buffer and then invoke these IOCTLs in parallel on all the
> trackers then process the single bitmap. Forcing qemu to allocate
> bitmaps per tracker page size is just inefficient.
> 

I got that point. But my question is slightly different.

A practical flow would look like below:

1) Qemu first requests to start dirty tracking in 4KB page size.
   Underlying trackers may start tracking in 4KB, 256KB, 2MB,
   etc. based on their own constraints.

2) Qemu then reads back dirty reports in a shared bitmap in
   4KB page size. All trackers must update dirty bitmap in 4KB
   granular regardless of the actual size each tracker selects.
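
To put step 1) in uAPI terms, a rough sketch (illustrative only, error handling
trimmed, and assuming the structures from patch 03 are available in
linux/vfio.h):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Illustrative: start device dirty tracking on one IOVA range with a
 * hinted 4KB granularity; the driver may internally pick another size. */
static int start_dma_logging(int device_fd, uint64_t iova, uint64_t length)
{
	struct vfio_device_feature_dma_logging_range range = {
		.iova = iova,
		.length = length,
	};
	struct vfio_device_feature_dma_logging_control control = {
		.page_size = 4096,
		.num_ranges = 1,
		.ranges = (uintptr_t)&range,
	};
	size_t sz = sizeof(struct vfio_device_feature) + sizeof(control);
	struct vfio_device_feature *feature = calloc(1, sz);
	int ret;

	feature->argsz = sz;
	feature->flags = VFIO_DEVICE_FEATURE_SET |
			 VFIO_DEVICE_FEATURE_DMA_LOGGING_START;
	memcpy(feature->data, &control, sizeof(control));

	ret = ioctl(device_fd, VFIO_DEVICE_FEATURE, feature);
	free(feature);
	return ret;
}

Step 2) is then the matching DMA_LOGGING_REPORT read, with whatever page_size
the shared bitmap is laid out in.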

Is there a real usage where Qemu would want to attempt
different page sizes between above two steps?

If not then I wonder whether a simpler design is to just have 
page size specified in the first step and then inherited by the 
2nd step...

Thanks
Kevin

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH V2 vfio 03/11] vfio: Introduce DMA logging uAPIs
       [not found]     ` <56bd06d3-944c-18da-86ed-ae14ce5940b7@nvidia.com>
@ 2022-07-25  7:30       ` Tian, Kevin
  2022-07-26  8:37         ` Yishai Hadas
  0 siblings, 1 reply; 52+ messages in thread
From: Tian, Kevin @ 2022-07-25  7:30 UTC (permalink / raw)
  To: Yishai Hadas, alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, Martins, Joao, leonro, maorg, cohuck

<please use plain-text next time>

> From: Yishai Hadas <yishaih@nvidia.com> 
> Sent: Thursday, July 21, 2022 7:06 PM
> > > +/*
> > > + * Upon VFIO_DEVICE_FEATURE_SET start device DMA logging.
> >
> > both 'start'/'stop' are via VFIO_DEVICE_FEATURE_SET
>
> Right, we have a note for that near VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP.
> Here it refers to the start option.

let's make it accurate here.

> > > + * page_size is an input that hints what tracking granularity the device
> > > + * should try to achieve. If the device cannot do the hinted page size then it
> > > + * should pick the next closest page size it supports. On output the device
> >
> > next closest 'smaller' page size?
>
> Not only, it depends on the device capabilities/support and should be a driver choice.

'should pick next closest" is a guideline to the driver. If user requests 
8KB while the device supports 4KB and 16KB, which one is closest?

It's probably safer to just say that it's a driver choice when the hinted page
size cannot be set?

> > > +struct vfio_device_feature_dma_logging_control {
> > > +	__aligned_u64 page_size;
> > > +	__u32 num_ranges;
> > > +	__u32 __reserved;
> > > +	__aligned_u64 ranges;
> > > +};
> >
> > should we move the definition of LOG_MAX_RANGES to be here
> > so the user can know the max limits of tracked ranges?
> This was raised as an option as part of this mail thread.
> However, for now it seems redundant as we may not expect user space to hit this limit and it mainly comes to protect kernel from memory exploding by a malicious user.

No matter how realistic it is that a user might hit a limitation, it doesn't
sound good to not expose it if it exists.

> > > +
> > > +struct vfio_device_feature_dma_logging_range {
> > > +	__aligned_u64 iova;
> > > +	__aligned_u64 length;
> > > +};
> > > +
> > > +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_START 3
> >
> > Can the user update the range list by doing another START?
>
> No, single start to ask the device what to track and a matching single stop should follow at the end.

let's document it then.

Thanks
Kevin


^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH V2 vfio 06/11] vfio: Introduce the DMA logging feature support
  2022-07-21 11:50             ` Jason Gunthorpe
@ 2022-07-25  7:38               ` Tian, Kevin
  2022-07-25 14:37                 ` Jason Gunthorpe
  0 siblings, 1 reply; 52+ messages in thread
From: Tian, Kevin @ 2022-07-25  7:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Yishai Hadas, saeedm, kvm, netdev, kuba,
	Martins, Joao, leonro, maorg, cohuck

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, July 21, 2022 7:51 PM
> 
> On Thu, Jul 21, 2022 at 08:54:39AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, July 20, 2022 4:08 AM
> > >
> > > On Tue, Jul 19, 2022 at 01:25:14PM -0600, Alex Williamson wrote:
> > >
> > > > > We don't really expect user space to hit this limit, the RAM in QEMU is
> > > > > divided today to around ~12 ranges as we saw so far in our evaluation.
> > > >
> > > > There can be far more for vIOMMU use cases or non-QEMU drivers.
> > >
> > > Not really, it isn't dynamic so vIOMMU has to decide what it wants to
> > > track up front. It would never make sense to track based on what is
> > > currently mapped. So it will be some small list, probably a big linear
> > > chunk of the IOVA space.
> >
> > How would vIOMMU make such decision when the address space
> > is managed by the guest? it is dynamic and could be sparse. I'm
> > curious about any example a vIOMMU can use to generate such small
> > list. Would it be a single range based on aperture reported from the
> > kernel?
> 
> Yes. qemu has to select a static aperture at start.
> 
>  The entire aperture is best, if that fails
> 
>  A smaller aperture and hope the guest doesn't use the whole space, if
>  that fails,
> 
>  The entire guest physical map and hope the guest is in PT mode

That sounds a bit hacky... does it instead suggest that an interface
for reporting the supported ranges on a tracker could be helpful once
trying the entire aperture fails?

> 
> All of these options are small lists.
> 
> Any vIOMMU maps that are created outside of what was asked to be
> tracked have to be made permanently dirty by qemu.
> 
> Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH V2 vfio 03/11] vfio: Introduce DMA logging uAPIs
  2022-07-25  7:20       ` Tian, Kevin
@ 2022-07-25 14:33         ` Jason Gunthorpe
  2022-07-26  7:07           ` Tian, Kevin
  0 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2022-07-25 14:33 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Yishai Hadas, alex.williamson, saeedm, kvm, netdev, kuba,
	Martins, Joao, leonro, maorg, cohuck

On Mon, Jul 25, 2022 at 07:20:16AM +0000, Tian, Kevin wrote:

> I got that point. But my question is slightly different.
> 
> A practical flow would look like below:
> 
> 1) Qemu first requests to start dirty tracking in 4KB page size.
>    Underlying trackers may start tracking in 4KB, 256KB, 2MB,
>    etc. based on their own constraints.
> 
> 2) Qemu then reads back dirty reports in a shared bitmap in
>    4KB page size. All trackers must update dirty bitmap in 4KB
>    granular regardless of the actual size each tracker selects.
> 
> Is there a real usage where Qemu would want to attempt
> different page sizes between above two steps?

If you multi-thread the tracker reads it will be efficient to populate
a single bitmap and then copy that single bitmap to the dirty
transfer. In this case you want the page size conversion.

If qemu is just going to read sequentially then perhaps it doesn't.

But forcing a fixed page size just denies userspace this choice, and
it doesn't make the kernel any simpler because the kernel always must
have this code to adapt different page sizes to support the real iommu
with huge pages/etc.

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH V2 vfio 06/11] vfio: Introduce the DMA logging feature support
  2022-07-25  7:38               ` Tian, Kevin
@ 2022-07-25 14:37                 ` Jason Gunthorpe
  2022-07-26  7:34                   ` Tian, Kevin
  0 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2022-07-25 14:37 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Yishai Hadas, saeedm, kvm, netdev, kuba,
	Martins, Joao, leonro, maorg, cohuck

On Mon, Jul 25, 2022 at 07:38:52AM +0000, Tian, Kevin wrote:

> > Yes. qemu has to select a static aperture at start.
> > 
> >  The entire aperture is best, if that fails
> > 
> >  A smaller aperture and hope the guest doesn't use the whole space, if
> >  that fails,
> > 
> >  The entire guest physical map and hope the guest is in PT mode
> 
> That sounds a bit hacky... does it instead suggest that an interface
> for reporting the supported ranges on a tracker could be helpful once
> trying the entire aperture fails?

It is the "try and fail" approach. It gives the driver the most
flexibility in processing the ranges to try and make them work. If we
attempt to describe all the device constraints that might exist we
will be here forever.

Eg the driver might be able to do the entire aperture, but it has to
use 2M pages or something.

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH V2 vfio 03/11] vfio: Introduce DMA logging uAPIs
  2022-07-25 14:33         ` Jason Gunthorpe
@ 2022-07-26  7:07           ` Tian, Kevin
  0 siblings, 0 replies; 52+ messages in thread
From: Tian, Kevin @ 2022-07-26  7:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, alex.williamson, saeedm, kvm, netdev, kuba,
	Martins, Joao, leonro, maorg, cohuck

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, July 25, 2022 10:33 PM
> 
> On Mon, Jul 25, 2022 at 07:20:16AM +0000, Tian, Kevin wrote:
> 
> > I got that point. But my question is slightly different.
> >
> > A practical flow would look like below:
> >
> > 1) Qemu first requests to start dirty tracking in 4KB page size.
> >    Underlying trackers may start tracking in 4KB, 256KB, 2MB,
> >    etc. based on their own constraints.
> >
> > 2) Qemu then reads back dirty reports in a shared bitmap in
> >    4KB page size. All trackers must update dirty bitmap in 4KB
> >    granular regardless of the actual size each tracker selects.
> >
> > Is there a real usage where Qemu would want to attempt
> > different page sizes between above two steps?
> 
> If you multi-thread the tracker reads it will be efficient to populate
> a single bitmap and then copy that single bitmap to the dirty
> transfer. In this case you want the page size conversion.
> 
> If qemu is just going to read sequentially then perhaps it doesn't.
> 
> But forcing a fixed page size just denies userspace this choice, and
> it doesn't make the kernel any simpler because the kernel always must
> have this code to adapt different page sizes to support the real iommu
> with huge pages/etc.
> 

OK, makes sense.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH V2 vfio 06/11] vfio: Introduce the DMA logging feature support
  2022-07-25 14:37                 ` Jason Gunthorpe
@ 2022-07-26  7:34                   ` Tian, Kevin
  2022-07-26 15:12                     ` Jason Gunthorpe
  0 siblings, 1 reply; 52+ messages in thread
From: Tian, Kevin @ 2022-07-26  7:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Yishai Hadas, saeedm, kvm, netdev, kuba,
	Martins, Joao, leonro, maorg, cohuck

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, July 25, 2022 10:37 PM
> 
> On Mon, Jul 25, 2022 at 07:38:52AM +0000, Tian, Kevin wrote:
> 
> > > Yes. qemu has to select a static aperture at start.
> > >
> > >  The entire aperture is best, if that fails
> > >
> > >  A smaller aperture and hope the guest doesn't use the whole space, if
> > >  that fails,
> > >
> > >  The entire guest physical map and hope the guest is in PT mode
> >
> > That sounds a bit hacky... does it instead suggest that an interface
> > for reporting the supported ranges on a tracker could be helpful once
> > trying the entire aperture fails?
> 
> It is the "try and fail" approach. It gives the driver the most
> flexibility in processing the ranges to try and make them work. If we
> attempt to describe all the device constraints that might exist we
> will be here forever.

Usually the caller of a 'try and fail' interface knows exactly what is
to be tried and then calls the interface to see whether the callee can
meet its requirement.

Now the above turns out to be a 'guess and fail' approach, with which
the caller doesn't know exactly what should be tried. In this case,
even if the attempt succeeds it's questionable how helpful it is.

But I can see why a reporting mechanism doesn't fit well with
your example below. In the worst case probably the user has to
decide between using vIOMMU vs. vfio DMA logging if a simple
policy of using the entire aperture doesn't work...

> 
> Eg the driver might be able to do the entire aperture, but it has to
> use 2M pages or something.
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH V2 vfio 03/11] vfio: Introduce DMA logging uAPIs
  2022-07-25  7:30       ` Tian, Kevin
@ 2022-07-26  8:37         ` Yishai Hadas
  2022-07-26 14:03           ` Alex Williamson
  0 siblings, 1 reply; 52+ messages in thread
From: Yishai Hadas @ 2022-07-26  8:37 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, Martins, Joao, leonro, maorg, cohuck,
	Yishai Hadas

On 25/07/2022 10:30, Tian, Kevin wrote:
> <please use plain-text next time>
>
>> From: Yishai Hadas <yishaih@nvidia.com>
>> Sent: Thursday, July 21, 2022 7:06 PM
>>>> +/*
>>>> + * Upon VFIO_DEVICE_FEATURE_SET start device DMA logging.
>>> both 'start'/'stop' are via VFIO_DEVICE_FEATURE_SET
>> Right, we have a note for that near VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP.
>> Here it refers to the start option.
> let's make it accurate here.

OK

>
>>>> + * page_size is an input that hints what tracking granularity the device
>>>> + * should try to achieve. If the device cannot do the hinted page size then it
>>>> + * should pick the next closest page size it supports. On output the device
>>> next closest 'smaller' page size?
>> Not only, it depends on the device capabilities/support and should be a driver choice.
> 'should pick next closest" is a guideline to the driver. If user requests
> 8KB while the device supports 4KB and 16KB, which one is closest?
>
> It's probably safer to just say that it's a driver choice when the hinted page
> size cannot be set?

Yes, may rephrase in V3 accordingly.

>
>>>> +struct vfio_device_feature_dma_logging_control {
>>>> +	__aligned_u64 page_size;
>>>> +	__u32 num_ranges;
>>>> +	__u32 __reserved;
>>>> +	__aligned_u64 ranges;
>>>> +};
>>> should we move the definition of LOG_MAX_RANGES to be here
>>> so the user can know the max limits of tracked ranges?
>> This was raised as an option as part of this mail thread.
>> However, for now it seems redundant as we may not expect user space to hit this limit and it mainly comes to protect kernel from memory exploding by a malicious user.
> No matter how realistic it is that a user might hit a limitation, it doesn't
> sound good to not expose it if it exists.

As Jason replied at some point here, we need to see a clear use case for 
more than a few 10's of ranges before we complicate things.

For now we don't see one. If one does crop up someday it is easy to add 
a new query, or some other behavior.

Alex,

Can you please comment here so that we can converge and be ready for V3 ?

>>>> +
>>>> +struct vfio_device_feature_dma_logging_range {
>>>> +	__aligned_u64 iova;
>>>> +	__aligned_u64 length;
>>>> +};
>>>> +
>>>> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_START 3
>>> Can the user update the range list by doing another START?
>> No, single start to ask the device what to track and a matching single stop should follow at the end.
> let's document it then.

OK

>
> Thanks
> Kevin
>
Thanks,
Yishai


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH V2 vfio 03/11] vfio: Introduce DMA logging uAPIs
  2022-07-26  8:37         ` Yishai Hadas
@ 2022-07-26 14:03           ` Alex Williamson
  2022-07-26 15:04             ` Jason Gunthorpe
  0 siblings, 1 reply; 52+ messages in thread
From: Alex Williamson @ 2022-07-26 14:03 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: Tian, Kevin, jgg, saeedm, kvm, netdev, kuba, Martins, Joao,
	leonro, maorg, cohuck

On Tue, 26 Jul 2022 11:37:50 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:

> On 25/07/2022 10:30, Tian, Kevin wrote:
> > <please use plain-text next time>
> >  
> >> From: Yishai Hadas <yishaih@nvidia.com>
> >> Sent: Thursday, July 21, 2022 7:06 PM  
> >>>> +/*
> >>>> + * Upon VFIO_DEVICE_FEATURE_SET start device DMA logging.  
> >>> both 'start'/'stop' are via VFIO_DEVICE_FEATURE_SET  
> >> Right, we have a note for that near VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP.
> >> Here it refers to the start option.  
> > let's make it accurate here.  
> 
> OK
> 
> >  
> >>>> + * page_size is an input that hints what tracking granularity the device
> >>>> + * should try to achieve. If the device cannot do the hinted page size then it
> >>>> + * should pick the next closest page size it supports. On output the device  
> >>> next closest 'smaller' page size?  
> >> Not only, it depends on the device capabilities/support and should be a driver choice.  
> > 'should pick next closest" is a guideline to the driver. If user requests
> > 8KB while the device supports 4KB and 16KB, which one is closest?
> >
> > It's probably safer to just say that it's a driver choice when the hinted page
> > size cannot be set?  
> 
> Yes, may rephrase in V3 accordingly.
> 
> >  
> >>>> +struct vfio_device_feature_dma_logging_control {
> >>>> +	__aligned_u64 page_size;
> >>>> +	__u32 num_ranges;
> >>>> +	__u32 __reserved;
> >>>> +	__aligned_u64 ranges;
> >>>> +};  
> >>> should we move the definition of LOG_MAX_RANGES to be here
> >>> so the user can know the max limits of tracked ranges?  
> >> This was raised as an option as part of this mail thread.
> >> However, for now it seems redundant as we may not expect user space to hit this limit and it mainly comes to protect kernel from memory exploding by a malicious user.  
> > No matter how realistic it is that a user might hit a limitation, it doesn't
> > sound good to not expose it if it exists.  
> 
> As Jason replied at some point here, we need to see a clear use case for 
> more than a few 10's of ranges before we complicate things.
> 
> For now we don't see one. If one does crop up someday it is easy to add 
> a new query, or some other behavior.
> 
> Alex,
> 
> Can you please comment here so that we can converge and be ready for V3 ?

I raised the same concern myself, the reason for having a limit is
clear, but focusing on a single use case and creating an arbitrary
"good enough" limit that isn't exposed to userspace makes this an
implementation detail that can subtly break userspace.  For instance,
what if userspace comes to expect the limit is 1000 and we decide to be
even more strict?  If only a few 10s of entries are used, why isn't 100
more than sufficient?  We change it, we break userspace.  OTOH, if we
simply make use of that reserved field to expose the limit, now we have
a contract with userspace and we can change our implementation because
that detail of the implementation is visible to userspace.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH V2 vfio 03/11] vfio: Introduce DMA logging uAPIs
  2022-07-26 14:03           ` Alex Williamson
@ 2022-07-26 15:04             ` Jason Gunthorpe
  2022-07-28  4:05               ` Tian, Kevin
  0 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2022-07-26 15:04 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, Tian, Kevin, saeedm, kvm, netdev, kuba, Martins,
	Joao, leonro, maorg, cohuck

On Tue, Jul 26, 2022 at 08:03:20AM -0600, Alex Williamson wrote:

> I raised the same concern myself, the reason for having a limit is
> clear, but focusing on a single use case and creating an arbitrary
> "good enough" limit that isn't exposed to userspace makes this an
> implementation detail that can subtly break userspace.  For instance,
> what if userspace comes to expect the limit is 1000 and we decide to be
> even more strict?  If only a few 10s of entries are used, why isn't 100
> more than sufficient?  

So let's use the number of elements that will fit in PAGE_SIZE as the
guideline. It means the kernel can memdup the userspace array into a
single kernel page of memory to process it, which seems reasonably
future proof in that we won't need to make it lower. Thus we can
promise we won't make it smaller.
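
Something along these lines (a sketch of the idea only, not a final
implementation; the helper name is made up):

#include <linux/err.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/string.h>
#include <linux/vfio.h>

#define LOG_MAX_RANGES \
	(PAGE_SIZE / sizeof(struct vfio_device_feature_dma_logging_range))

/* Sketch: bound num_ranges so the whole user array fits in one page,
 * then memdup it for the driver to consume. */
static struct vfio_device_feature_dma_logging_range *
dup_log_ranges(const struct vfio_device_feature_dma_logging_control *control)
{
	if (!control->num_ranges || control->num_ranges > LOG_MAX_RANGES)
		return ERR_PTR(-E2BIG);

	return memdup_user(u64_to_user_ptr(control->ranges),
			   control->num_ranges *
			   sizeof(struct vfio_device_feature_dma_logging_range));
}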

However, remember, this isn't even the real device limit - this is
just the limit that the core kernel code will accept to marshal the
data to pass internally to the driver.

I fully expect that the driver will still refuse ranges in certain
configurations even if they can be marshaled.

This is primarily why I don't think it makes sense to expose some
internal limit that is not even the real "will the call succeed"
parameters.

The API is specifically designed as 'try and fail' to allow the
drivers flexibility in how they map the requested ranges to their
internal operations.

> We change it, we break userspace.  OTOH, if we simply make use of
> that reserved field to expose the limit, now we have a contract with
> userspace and we can change our implementation because that detail
> of the implementation is visible to userspace.  Thanks,

I think this is not correct, just because we made it discoverable does
not absolve the kernel of compatibility. If we change the limit, eg to
1, and a real userspace stops working then we still broke userspace.

Complaining that userspace does not check the discoverable limit
doesn't help matters - I seem to remember Linus has written about this
in recent times even.

So, it is ultimately not different from 'try and fail', unless we
implement some algorithm in qemu - an algorithm that would duplicate
the one we already have in the kernel :\

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH V2 vfio 06/11] vfio: Introduce the DMA logging feature support
  2022-07-26  7:34                   ` Tian, Kevin
@ 2022-07-26 15:12                     ` Jason Gunthorpe
  0 siblings, 0 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2022-07-26 15:12 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Yishai Hadas, saeedm, kvm, netdev, kuba,
	Martins, Joao, leonro, maorg, cohuck

On Tue, Jul 26, 2022 at 07:34:55AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Monday, July 25, 2022 10:37 PM
> > 
> > On Mon, Jul 25, 2022 at 07:38:52AM +0000, Tian, Kevin wrote:
> > 
> > > > Yes. qemu has to select a static aperture at start.
> > > >
> > > >  The entire aperture is best, if that fails
> > > >
> > > >  A smaller aperture and hope the guest doesn't use the whole space, if
> > > >  that fails,
> > > >
> > > >  The entire guest physical map and hope the guest is in PT mode
> > >
> > > That sounds a bit hacky... does it instead suggest that an interface
> > > for reporting the supported ranges on a tracker could be helpful once
> > > trying the entire aperture fails?
> > 
> > It is the "try and fail" approach. It gives the driver the most
> > flexibility in processing the ranges to try and make them work. If we
> > attempt to describe all the device constraints that might exist we
> > will be here forever.
> 
> Usually the caller of a 'try and fail' interface knows exactly what to
> be tried and then call the interface to see whether the callee can
> meet its requirement.

Which is exactly this case.

qemu has one thing to try that meets its full requirement - the entire
vIOMMU aperture.

The other two are possible options based on assumptions of how the
guest VM is operating that might work - but this guessing is entirely
between qemu and the VM, not something the kernel can help with.

So, from the kernel perspective qemu will try three things in order of
preference and the first to work will be the right one. Making the
kernel API more complicated is not going to help qemu guess what the
guest is doing any better.

In any case this is vIOMMU mode so if the VM establishes mappings
outside the tracked IOVA then qemu is aware of it and qemu can
perma-dirty those pages as part of its migration logic. It is not
broken, it just might not meet the SLA.

> But I can see why a reporting mechanism doesn't fit well with
> your example below. In the worst case probably the user has to
> decide between using vIOMMU vs. vfio DMA logging if a simple
> policy of using the entire aperture doesn't work...

Well, yes, this is exactly the situation unfortunately. Without
special HW support vIOMMU is not going to work perfectly, but there
are reasonable use cases where vIOMMU is on but the guest is in PT
mode that could work, or where the IOVA aperture is limited, and
so on.

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH V2 vfio 03/11] vfio: Introduce DMA logging uAPIs
  2022-07-26 15:04             ` Jason Gunthorpe
@ 2022-07-28  4:05               ` Tian, Kevin
  2022-07-28 12:06                 ` Jason Gunthorpe
  0 siblings, 1 reply; 52+ messages in thread
From: Tian, Kevin @ 2022-07-28  4:05 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Yishai Hadas, saeedm, kvm, netdev, kuba, Martins, Joao, leonro,
	maorg, cohuck

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, July 26, 2022 11:05 PM
> 
> On Tue, Jul 26, 2022 at 08:03:20AM -0600, Alex Williamson wrote:
> 
> > I raised the same concern myself, the reason for having a limit is
> > clear, but focusing on a single use case and creating an arbitrary
> > "good enough" limit that isn't exposed to userspace makes this an
> > implementation detail that can subtly break userspace.  For instance,
> > what if userspace comes to expect the limit is 1000 and we decide to be
> > even more strict?  If only a few 10s of entries are used, why isn't 100
> > more than sufficient?
> 
> So let's use the number of elements that will fit in PAGE_SIZE as the
> guideline. It means the kernel can memdup the userspace array into a
> single kernel page of memory to process it, which seems reasonably
> future proof in that we won't need to make it lower. Thus we can
> promise we won't make it smaller.
> 
> However, remember, this isn't even the real device limit - this is
> just the limit that the core kernel code will accept to marshal the
> data to pass internally to the driver.
> 
> I fully expect that the driver will still refuse ranges in certain
> configurations even if they can be marshaled.
> 
> This is primarily why I don't think it makes sense to expose some
> internal limit that is not even the real "will the call succeed"
> parameters.
> 
> The API is specifically designed as 'try and fail' to allow the
> drivers flexibility in how they map the requested ranges to their
> internal operations.
> 
> > We change it, we break userspace.  OTOH, if we simply make use of
> > that reserved field to expose the limit, now we have a contract with
> > userspace and we can change our implementation because that detail
> > of the implementation is visible to userspace.  Thanks,
> 
> I think this is not correct, just because we made it discoverable does
> not absolve the kernel of compatibility. If we change the limit, eg to
> 1, and a real userspace stops working then we still broke userspace.

iiuc Alex's suggestion doesn't conflict with the 'try and fail' model.
By using the reserved field of vfio_device_feature_dma_logging_control
to return the limit of the specified page_size from a given tracker, 
the user can quickly retry and adapt to that limit if workable.

Otherwise what would be an efficient policy for the user to retry after
a failure? Say initially the user requests 100 ranges with 4K page size
but the tracker can only support 10 ranges. Without a hint returned
from the tracker, does the user just blindly try 100, 90, 80, ... or
use a bisect algorithm?

> 
> Complaining that userspace does not check the discoverable limit
> doesn't help matters - I seem to remember Linus has written about this
> in recent times even.
> 
> So, it is ultimately not different from 'try and fail', unless we
> implement some algorithm in qemu - an algorithm that would duplicate
> the one we already have in the kernel :\
> 
> Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH V2 vfio 03/11] vfio: Introduce DMA logging uAPIs
  2022-07-28  4:05               ` Tian, Kevin
@ 2022-07-28 12:06                 ` Jason Gunthorpe
  2022-07-29  3:01                   ` Tian, Kevin
  0 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2022-07-28 12:06 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Yishai Hadas, saeedm, kvm, netdev, kuba,
	Martins, Joao, leonro, maorg, cohuck

On Thu, Jul 28, 2022 at 04:05:04AM +0000, Tian, Kevin wrote:

> > I think this is not correct, just because we made it discoverable does
> > not absolve the kernel of compatibility. If we change the limit, eg to
> > 1, and a real userspace stops working then we still broke userspace.
> 
> iiuc Alex's suggestion doesn't conflict with the 'try and fail' model.
> By using the reserved field of vfio_device_feature_dma_logging_control
> to return the limit of the specified page_size from a given tracker, 
> the user can quickly retry and adapt to that limit if workable.

Again, no it can't. The marshalling limit is not the device limit and
it will still potentially fail. Userspace does not gain much
additional API certainty by knowing this internal limit.

> Otherwise what would be an efficient policy for user to retry after
> a failure? Say initially user requests 100 ranges with 4K page size
> but the tracker can only support 10 ranges. w/o a hint returned
> from the tracker then the user just blindly try 100, 90, 80, ... or 
> using a bisect algorithm?

With what I just said the minimum is PAGE_SIZE, so if some userspace
is using a huge range list it should try the huge list first (assuming
the kernel has been updated because someone justified a use case
here), then try to narrow to PAGE_SIZE, then give up.

The main point is there is nothing for current qemu to do - we do not
want to duplicate the kernel narrowing algorithm into qemu - the idea
is to define the interface in a way that accommodates what qemu needs.

The only issue is the bounding the memory allocation.

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH V2 vfio 03/11] vfio: Introduce DMA logging uAPIs
  2022-07-28 12:06                 ` Jason Gunthorpe
@ 2022-07-29  3:01                   ` Tian, Kevin
  2022-07-29 14:11                     ` Jason Gunthorpe
  0 siblings, 1 reply; 52+ messages in thread
From: Tian, Kevin @ 2022-07-29  3:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Yishai Hadas, saeedm, kvm, netdev, kuba,
	Martins, Joao, leonro, maorg, cohuck

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, July 28, 2022 8:06 PM
> 
> On Thu, Jul 28, 2022 at 04:05:04AM +0000, Tian, Kevin wrote:
> 
> > > I think this is not correct, just because we made it discoverable does
> > > not absolve the kernel of compatibility. If we change the limit, eg to
> > > 1, and a real userspace stops working then we still broke userspace.
> >
> > iiuc Alex's suggestion doesn't conflict with the 'try and fail' model.
> > By using the reserved field of vfio_device_feature_dma_logging_control
> > to return the limit of the specified page_size from a given tracker,
> > the user can quickly retry and adapt to that limit if workable.
> 
> Again, no it can't. The marshalling limit is not the device limit and
> it will still potentially fail. Userspace does not gain much
> additional API certainty by knowing this internal limit.

Why cannot the tracker return device limit here?

> 
> > Otherwise what would be an efficient policy for user to retry after
> > a failure? Say initially user requests 100 ranges with 4K page size
> > but the tracker can only support 10 ranges. w/o a hint returned
> > from the tracker then the user just blindly try 100, 90, 80, ... or
> > using a bisect algorithm?
> 
> With what I just said the minimum is PAGE_SIZE, so if some userspace
> is using a huge range list it should try the huge list first (assuming
> the kernel has been updated because someone justified a use case
> here), then try to narrow to PAGE_SIZE, then give up.

Then probably it's good to document the above as guidance to
userspace.

> 
> The main point is there is nothing for current qemu to do - we do not
> want to duplicate the kernel narrowing algorithm into qemu - the idea
> is to define the interface in a way that accommodates what qemu needs.
> 
> The only issue is the bounding the memory allocation.
> 
> Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH V2 vfio 03/11] vfio: Introduce DMA logging uAPIs
  2022-07-29  3:01                   ` Tian, Kevin
@ 2022-07-29 14:11                     ` Jason Gunthorpe
  0 siblings, 0 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2022-07-29 14:11 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Yishai Hadas, saeedm, kvm, netdev, kuba,
	Martins, Joao, leonro, maorg, cohuck

On Fri, Jul 29, 2022 at 03:01:51AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, July 28, 2022 8:06 PM
> > 
> > On Thu, Jul 28, 2022 at 04:05:04AM +0000, Tian, Kevin wrote:
> > 
> > > > I think this is not correct, just because we made it discoverable does
> > > > not absolve the kernel of compatibility. If we change the limit, eg to
> > > > 1, and a real userspace stops working then we still broke userspace.
> > >
> > > iiuc Alex's suggestion doesn't conflict with the 'try and fail' model.
> > > By using the reserved field of vfio_device_feature_dma_logging_control
> > > to return the limit of the specified page_size from a given tracker,
> > > the user can quickly retry and adapt to that limit if workable.
> > 
> > Again, no it can't. The marshalling limit is not the device limit and
> > it will still potentially fail. Userspace does not gain much
> > additional API certainty by knowing this internal limit.
> 
> Why cannot the tracker return device limit here?

Because it is "complicated". mlx5 doesn't really have a limit in
simple terms of the number of ranges - it has other limits based on
total span and page size.

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2022-07-29 14:11 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-14  8:12 [PATCH V2 vfio 00/11] Add device DMA logging support for mlx5 driver Yishai Hadas
2022-07-14  8:12 ` [PATCH V2 vfio 01/11] net/mlx5: Introduce ifc bits for page tracker Yishai Hadas
2022-07-21  8:28   ` Tian, Kevin
2022-07-21  8:43     ` Yishai Hadas
2022-07-14  8:12 ` [PATCH V2 vfio 02/11] net/mlx5: Query ADV_VIRTUALIZATION capabilities Yishai Hadas
2022-07-14  8:12 ` [PATCH V2 vfio 03/11] vfio: Introduce DMA logging uAPIs Yishai Hadas
2022-07-18 22:29   ` Alex Williamson
2022-07-19  1:39     ` Tian, Kevin
2022-07-19  5:40       ` Kirti Wankhede
2022-07-19  7:49     ` Yishai Hadas
2022-07-19 19:57       ` Alex Williamson
2022-07-19 20:18         ` Jason Gunthorpe
2022-07-21  8:45   ` Tian, Kevin
2022-07-21 12:05     ` Jason Gunthorpe
2022-07-25  7:20       ` Tian, Kevin
2022-07-25 14:33         ` Jason Gunthorpe
2022-07-26  7:07           ` Tian, Kevin
     [not found]     ` <56bd06d3-944c-18da-86ed-ae14ce5940b7@nvidia.com>
2022-07-25  7:30       ` Tian, Kevin
2022-07-26  8:37         ` Yishai Hadas
2022-07-26 14:03           ` Alex Williamson
2022-07-26 15:04             ` Jason Gunthorpe
2022-07-28  4:05               ` Tian, Kevin
2022-07-28 12:06                 ` Jason Gunthorpe
2022-07-29  3:01                   ` Tian, Kevin
2022-07-29 14:11                     ` Jason Gunthorpe
2022-07-14  8:12 ` [PATCH V2 vfio 04/11] vfio: Move vfio.c to vfio_main.c Yishai Hadas
2022-07-14  8:12 ` [PATCH V2 vfio 05/11] vfio: Add an IOVA bitmap support Yishai Hadas
2022-07-18 22:30   ` Alex Williamson
2022-07-18 22:46     ` Jason Gunthorpe
2022-07-19 19:01   ` Alex Williamson
2022-07-20  1:57     ` Joao Martins
2022-07-20 16:47       ` Alex Williamson
2022-07-20 17:27         ` Jason Gunthorpe
2022-07-20 18:16         ` Joao Martins
2022-07-14  8:12 ` [PATCH V2 vfio 06/11] vfio: Introduce the DMA logging feature support Yishai Hadas
2022-07-18 22:30   ` Alex Williamson
2022-07-19  9:19     ` Yishai Hadas
2022-07-19 19:25       ` Alex Williamson
2022-07-19 20:08         ` Jason Gunthorpe
2022-07-21  8:54           ` Tian, Kevin
2022-07-21 11:50             ` Jason Gunthorpe
2022-07-25  7:38               ` Tian, Kevin
2022-07-25 14:37                 ` Jason Gunthorpe
2022-07-26  7:34                   ` Tian, Kevin
2022-07-26 15:12                     ` Jason Gunthorpe
2022-07-14  8:12 ` [PATCH V2 vfio 07/11] vfio/mlx5: Init QP based resources for dirty tracking Yishai Hadas
2022-07-14  8:12 ` [PATCH V2 vfio 08/11] vfio/mlx5: Create and destroy page tracker object Yishai Hadas
2022-07-14  8:12 ` [PATCH V2 vfio 09/11] vfio/mlx5: Report dirty pages from tracker Yishai Hadas
2022-07-14  8:12 ` [PATCH V2 vfio 10/11] vfio/mlx5: Manage error scenarios on tracker Yishai Hadas
2022-07-14  8:12 ` [PATCH V2 vfio 11/11] vfio/mlx5: Set the driver DMA logging callbacks Yishai Hadas
2022-07-21  8:26 ` [PATCH V2 vfio 00/11] Add device DMA logging support for mlx5 driver Tian, Kevin
2022-07-21  8:55   ` Yishai Hadas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).