linux-arm-kernel.lists.infradead.org archive mirror
* [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough
@ 2021-04-09  3:44 Shenming Lu
  2021-04-09  3:44 ` [RFC PATCH v3 1/8] iommu: Evolve the device fault reporting framework Shenming Lu
                   ` (9 more replies)
  0 siblings, 10 replies; 40+ messages in thread
From: Shenming Lu @ 2021-04-09  3:44 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Will Deacon, Robin Murphy,
	Joerg Roedel, Jean-Philippe Brucker, Eric Auger, kvm,
	linux-kernel, linux-arm-kernel, iommu, linux-api
  Cc: Kevin Tian, Lu Baolu, yi.l.liu, Christoph Hellwig,
	Jonathan Cameron, Barry Song, wanghaibin.wang, yuzenghui,
	lushenming

Hi,

Any comments and suggestions would be appreciated. :-)

The static pinning and mapping problem in VFIO and possible solutions
have been discussed a lot [1, 2]. One of the solutions is to add I/O
Page Fault support for VFIO devices. Unlike the relatively complicated
software approaches, such as presenting a vIOMMU that provides the DMA
buffer information (possibly with para-virtualized optimizations), IOPF
mainly relies on the hardware faulting capability, such as the PCIe PRI
extension or the Arm SMMU stall model. Moreover, IOPF support in the
IOMMU driver has already been implemented for SVA [3]. So this series
adds IOPF support for VFIO passthrough based on the IOPF part of SVA.

We have measured its performance with UADK [4] (passing through an
accelerator to a VM (1U16G)) on a HiSilicon Kunpeng 920 board (and
compared it with host SVA):

Run hisi_sec_test...
 - with varying sending times and message lengths
 - with/without IOPF enabled (speed slowdown)

when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1):
            slowdown (num of faults)
 times      VFIO IOPF      host SVA
 1          63.4% (518)    82.8% (512)
 100        22.9% (1058)   47.9% (1024)
 1000       2.6% (1071)    8.5% (1024)

when msg_len = 10MB (and PREMAP_LEN = 512):
            slowdown (num of faults)
 times      VFIO IOPF
 1          32.6% (13)
 100        3.5% (26)
 1000       1.6% (26)

History:

v2 -> v3
 - Nit fixes.
 - No reason to disable reporting the unrecoverable faults. (baolu)
 - Maintain a global IOPF enabled group list.
 - Split the pre-mapping optimization to be a separate patch.
 - Add selective faulting support (use vfio_pin_pages to indicate the
   non-faultable scope and add a new struct vfio_range to record it,
   untested). (Kevin)

v1 -> v2
 - Numerous improvements following the suggestions. Thanks a lot to all
   of you.

Note that PRI is not supported at the moment since there is no hardware.

Links:
[1] Lesokhin I, et al. Page Fault Support for Network Controllers. In ASPLOS,
    2016.
[2] Tian K, et al. coIOMMU: A Virtual IOMMU with Cooperative DMA Buffer Tracking
    for Efficient Memory Management in Direct I/O. In USENIX ATC, 2020.
[3] https://patchwork.kernel.org/project/linux-arm-kernel/cover/20210401154718.307519-1-jean-philippe@linaro.org/
[4] https://github.com/Linaro/uadk

Thanks,
Shenming


Shenming Lu (8):
  iommu: Evolve the device fault reporting framework
  vfio/type1: Add a page fault handler
  vfio/type1: Add an MMU notifier to avoid pinning
  vfio/type1: Pre-map more pages than requested in the IOPF handling
  vfio/type1: VFIO_IOMMU_ENABLE_IOPF
  vfio/type1: No need to statically pin and map if IOPF enabled
  vfio/type1: Add selective DMA faulting support
  vfio: Add nested IOPF support

 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |    3 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |   18 +-
 drivers/iommu/iommu.c                         |   56 +-
 drivers/vfio/vfio.c                           |   85 +-
 drivers/vfio/vfio_iommu_type1.c               | 1000 ++++++++++++++++-
 include/linux/iommu.h                         |   19 +-
 include/linux/vfio.h                          |   13 +
 include/uapi/linux/iommu.h                    |    4 +
 include/uapi/linux/vfio.h                     |    6 +
 9 files changed, 1181 insertions(+), 23 deletions(-)

-- 
2.19.1


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


* [RFC PATCH v3 1/8] iommu: Evolve the device fault reporting framework
  2021-04-09  3:44 [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough Shenming Lu
@ 2021-04-09  3:44 ` Shenming Lu
  2021-05-18 18:58   ` Alex Williamson
  2021-04-09  3:44 ` [RFC PATCH v3 2/8] vfio/type1: Add a page fault handler Shenming Lu
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Shenming Lu @ 2021-04-09  3:44 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Will Deacon, Robin Murphy,
	Joerg Roedel, Jean-Philippe Brucker, Eric Auger, kvm,
	linux-kernel, linux-arm-kernel, iommu, linux-api
  Cc: Kevin Tian, Lu Baolu, yi.l.liu, Christoph Hellwig,
	Jonathan Cameron, Barry Song, wanghaibin.wang, yuzenghui,
	lushenming

This patch follows the discussion here:

https://lore.kernel.org/linux-acpi/YAaxjmJW+ZMvrhac@myrica/

Besides SVA/vSVA, subsystems such as VFIO may also enable (2nd-level)
IOPF to remove the pinning restriction. To better support more scenarios
of using device faults, we extend iommu_register_device_fault_handler()
with flags and introduce the FAULT_REPORT_* flags to describe the device
fault reporting capability under a specific configuration.

Note that we don't further distinguish recoverable from unrecoverable
faults with flags in the fault reporting capability, since having both
PAGE_FAULT_REPORT_* and UNRECOV_FAULT_REPORT_* does not seem clean.

In addition, again taking VFIO as an example: in nested mode, the
1st-level and 2nd-level fault reporting may be configured separately,
and currently each device can only register one IOMMU device fault
handler, so we add a handler update interface for this.
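For illustration, the intended flag semantics can be modeled in plain
userspace C (the FAULT_REPORT_* values mirror this patch, while
fault_flags_valid() and fault_flags_update() are hypothetical helpers,
not kernel API): registration rejects FLAT combined with either NESTED
flag, and an update computes the new flags as (flags & mask) | set.

```c
#include <assert.h>
#include <stdint.h>

/* FAULT_REPORT_* values as introduced by this patch. */
#define FAULT_REPORT_FLAT	(1U << 0)
#define FAULT_REPORT_NESTED_L1	(1U << 1)
#define FAULT_REPORT_NESTED_L2	(1U << 2)

/*
 * Model of the check in iommu_register_device_fault_handler(): a handler
 * reports either flat or nested faults, never both. Returns 0 if the
 * combination is acceptable, -1 otherwise.
 */
static int fault_flags_valid(uint32_t flags)
{
	if ((flags & FAULT_REPORT_FLAT) &&
	    (flags & (FAULT_REPORT_NESTED_L1 | FAULT_REPORT_NESTED_L2)))
		return -1;
	return 0;
}

/*
 * Model of iommu_update_device_fault_handler(): keep the flag bits that
 * are set in @mask, then OR in @set.
 */
static uint32_t fault_flags_update(uint32_t flags, uint32_t mask, uint32_t set)
{
	return (flags & mask) | set;
}
```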

Signed-off-by: Shenming Lu <lushenming@huawei.com>
---
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |  3 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 18 ++++--
 drivers/iommu/iommu.c                         | 56 ++++++++++++++++++-
 include/linux/iommu.h                         | 19 ++++++-
 include/uapi/linux/iommu.h                    |  4 ++
 5 files changed, 90 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index ee66d1f4cb81..e6d766fb8f1a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -482,7 +482,8 @@ static int arm_smmu_master_sva_enable_iopf(struct arm_smmu_master *master)
 	if (ret)
 		return ret;
 
-	ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf, dev);
+	ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf,
+						  FAULT_REPORT_FLAT, dev);
 	if (ret) {
 		iopf_queue_remove_device(master->smmu->evtq.iopf, dev);
 		return ret;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 53abad8fdd91..51843f54a87f 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1448,10 +1448,6 @@ static int arm_smmu_handle_evt(struct arm_smmu_device *smmu, u64 *evt)
 		return -EOPNOTSUPP;
 	}
 
-	/* Stage-2 is always pinned at the moment */
-	if (evt[1] & EVTQ_1_S2)
-		return -EFAULT;
-
 	if (evt[1] & EVTQ_1_RnW)
 		perm |= IOMMU_FAULT_PERM_READ;
 	else
@@ -1469,26 +1465,36 @@ static int arm_smmu_handle_evt(struct arm_smmu_device *smmu, u64 *evt)
 			.flags = IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE,
 			.grpid = FIELD_GET(EVTQ_1_STAG, evt[1]),
 			.perm = perm,
-			.addr = FIELD_GET(EVTQ_2_ADDR, evt[2]),
 		};
 
 		if (ssid_valid) {
 			flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;
 			flt->prm.pasid = FIELD_GET(EVTQ_0_SSID, evt[0]);
 		}
+
+		if (evt[1] & EVTQ_1_S2) {
+			flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_L2;
+			flt->prm.addr = FIELD_GET(EVTQ_3_IPA, evt[3]);
+		} else
+			flt->prm.addr = FIELD_GET(EVTQ_2_ADDR, evt[2]);
 	} else {
 		flt->type = IOMMU_FAULT_DMA_UNRECOV;
 		flt->event = (struct iommu_fault_unrecoverable) {
 			.reason = reason,
 			.flags = IOMMU_FAULT_UNRECOV_ADDR_VALID,
 			.perm = perm,
-			.addr = FIELD_GET(EVTQ_2_ADDR, evt[2]),
 		};
 
 		if (ssid_valid) {
 			flt->event.flags |= IOMMU_FAULT_UNRECOV_PASID_VALID;
 			flt->event.pasid = FIELD_GET(EVTQ_0_SSID, evt[0]);
 		}
+
+		if (evt[1] & EVTQ_1_S2) {
+			flt->event.flags |= IOMMU_FAULT_UNRECOV_L2;
+			flt->event.addr = FIELD_GET(EVTQ_3_IPA, evt[3]);
+		} else
+			flt->event.addr = FIELD_GET(EVTQ_2_ADDR, evt[2]);
 	}
 
 	mutex_lock(&smmu->streams_mutex);
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d0b0a15dba84..b50b526b45ac 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1056,6 +1056,40 @@ int iommu_group_unregister_notifier(struct iommu_group *group,
 }
 EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
 
+/**
+ * iommu_update_device_fault_handler - Update the device fault handler via flags
+ * @dev: the device
+ * @mask: mask of the flag bits to keep (zero bits are cleared)
+ * @set: flag bits to set
+ *
+ * Update the device fault handler installed by
+ * iommu_register_device_fault_handler().
+ *
+ * Return 0 on success, or an error.
+ */
+int iommu_update_device_fault_handler(struct device *dev, u32 mask, u32 set)
+{
+	struct dev_iommu *param = dev->iommu;
+	int ret = 0;
+
+	if (!param)
+		return -EINVAL;
+
+	mutex_lock(&param->lock);
+
+	if (!param->fault_param) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	param->fault_param->flags = (param->fault_param->flags & mask) | set;
+
+out_unlock:
+	mutex_unlock(&param->lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_update_device_fault_handler);
+
 /**
  * iommu_register_device_fault_handler() - Register a device fault handler
  * @dev: the device
@@ -1076,11 +1110,16 @@ EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
  */
 int iommu_register_device_fault_handler(struct device *dev,
 					iommu_dev_fault_handler_t handler,
-					void *data)
+					u32 flags, void *data)
 {
 	struct dev_iommu *param = dev->iommu;
 	int ret = 0;
 
+	/* Flat and nested reporting are mutually exclusive. */
+	if (flags & FAULT_REPORT_FLAT &&
+	    flags & (FAULT_REPORT_NESTED_L1 | FAULT_REPORT_NESTED_L2))
+		return -EINVAL;
+
 	if (!param)
 		return -EINVAL;
 
@@ -1099,6 +1138,7 @@ int iommu_register_device_fault_handler(struct device *dev,
 		goto done_unlock;
 	}
 	param->fault_param->handler = handler;
+	param->fault_param->flags = flags;
 	param->fault_param->data = data;
 	mutex_init(&param->fault_param->lock);
 	INIT_LIST_HEAD(&param->fault_param->faults);
@@ -1177,6 +1217,20 @@ int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
 		goto done_unlock;
 	}
 
+	if (!(fparam->flags & FAULT_REPORT_FLAT)) {
+		bool l2 = false;
+
+		if (evt->fault.type == IOMMU_FAULT_PAGE_REQ)
+			l2 = evt->fault.prm.flags & IOMMU_FAULT_PAGE_REQUEST_L2;
+		if (evt->fault.type == IOMMU_FAULT_DMA_UNRECOV)
+			l2 = evt->fault.event.flags & IOMMU_FAULT_UNRECOV_L2;
+		if ((l2 && !(fparam->flags & FAULT_REPORT_NESTED_L2)) ||
+		    (!l2 && !(fparam->flags & FAULT_REPORT_NESTED_L1))) {
+			ret = -EOPNOTSUPP;
+			goto done_unlock;
+		}
+	}
+
 	if (evt->fault.type == IOMMU_FAULT_PAGE_REQ &&
 	    (evt->fault.prm.flags & IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE)) {
 		evt_pending = kmemdup(evt, sizeof(struct iommu_fault_event),
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 86d688c4418f..28dbca3c6d60 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -352,12 +352,19 @@ struct iommu_fault_event {
 /**
  * struct iommu_fault_param - per-device IOMMU fault data
  * @handler: Callback function to handle IOMMU faults at device level
+ * @flags: FAULT_REPORT_* flags indicating the fault reporting capability
+ *	   under a specific configuration (1st/2nd-level-only (FLAT), or
+ *	   nested; nested mode needs to specify which level/stage is concerned)
  * @data: handler private data
  * @faults: holds the pending faults which needs response
  * @lock: protect pending faults list
  */
 struct iommu_fault_param {
 	iommu_dev_fault_handler_t handler;
+#define FAULT_REPORT_FLAT			(1 << 0)
+#define FAULT_REPORT_NESTED_L1			(1 << 1)
+#define FAULT_REPORT_NESTED_L2			(1 << 2)
+	u32 flags;
 	void *data;
 	struct list_head faults;
 	struct mutex lock;
@@ -509,9 +516,11 @@ extern int iommu_group_register_notifier(struct iommu_group *group,
 					 struct notifier_block *nb);
 extern int iommu_group_unregister_notifier(struct iommu_group *group,
 					   struct notifier_block *nb);
+extern int iommu_update_device_fault_handler(struct device *dev,
+					     u32 mask, u32 set);
 extern int iommu_register_device_fault_handler(struct device *dev,
 					iommu_dev_fault_handler_t handler,
-					void *data);
+					u32 flags, void *data);
 
 extern int iommu_unregister_device_fault_handler(struct device *dev);
 
@@ -873,10 +882,16 @@ static inline int iommu_group_unregister_notifier(struct iommu_group *group,
 	return 0;
 }
 
+static inline int iommu_update_device_fault_handler(struct device *dev,
+						    u32 mask, u32 set)
+{
+	return -ENODEV;
+}
+
 static inline
 int iommu_register_device_fault_handler(struct device *dev,
 					iommu_dev_fault_handler_t handler,
-					void *data)
+					u32 flags, void *data)
 {
 	return -ENODEV;
 }
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index e1d9e75f2c94..f6db9c8c970c 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -71,6 +71,7 @@ struct iommu_fault_unrecoverable {
 #define IOMMU_FAULT_UNRECOV_PASID_VALID		(1 << 0)
 #define IOMMU_FAULT_UNRECOV_ADDR_VALID		(1 << 1)
 #define IOMMU_FAULT_UNRECOV_FETCH_ADDR_VALID	(1 << 2)
+#define IOMMU_FAULT_UNRECOV_L2			(1 << 3)
 	__u32	flags;
 	__u32	pasid;
 	__u32	perm;
@@ -85,6 +86,8 @@ struct iommu_fault_unrecoverable {
  *         When IOMMU_FAULT_PAGE_RESPONSE_NEEDS_PASID is set, the page response
  *         must have the same PASID value as the page request. When it is clear,
  *         the page response should not have a PASID.
+ *         If IOMMU_FAULT_PAGE_REQUEST_L2 is set, the fault occurred at the
+ *         second level/stage; otherwise, it occurred at the first level.
  * @pasid: Process Address Space ID
  * @grpid: Page Request Group Index
  * @perm: requested page permissions (IOMMU_FAULT_PERM_* values)
@@ -96,6 +99,7 @@ struct iommu_fault_page_request {
 #define IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE	(1 << 1)
 #define IOMMU_FAULT_PAGE_REQUEST_PRIV_DATA	(1 << 2)
 #define IOMMU_FAULT_PAGE_RESPONSE_NEEDS_PASID	(1 << 3)
+#define IOMMU_FAULT_PAGE_REQUEST_L2		(1 << 4)
 	__u32	flags;
 	__u32	pasid;
 	__u32	grpid;
-- 
2.19.1



* [RFC PATCH v3 2/8] vfio/type1: Add a page fault handler
  2021-04-09  3:44 [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough Shenming Lu
  2021-04-09  3:44 ` [RFC PATCH v3 1/8] iommu: Evolve the device fault reporting framework Shenming Lu
@ 2021-04-09  3:44 ` Shenming Lu
  2021-05-18 18:58   ` Alex Williamson
  2021-04-09  3:44 ` [RFC PATCH v3 3/8] vfio/type1: Add an MMU notifier to avoid pinning Shenming Lu
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Shenming Lu @ 2021-04-09  3:44 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Will Deacon, Robin Murphy,
	Joerg Roedel, Jean-Philippe Brucker, Eric Auger, kvm,
	linux-kernel, linux-arm-kernel, iommu, linux-api
  Cc: Kevin Tian, Lu Baolu, yi.l.liu, Christoph Hellwig,
	Jonathan Cameron, Barry Song, wanghaibin.wang, yuzenghui,
	lushenming

VFIO manages the DMA mapping itself. To support IOPF (on-demand paging)
for VFIO (IOMMU capable) devices, we add a VFIO page fault handler to
serve the page faults reported by the IOMMU driver.
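As a rough userspace sketch of the bookkeeping the handler relies on
(iopf_mapped_get() and iopf_mapped_set() are hypothetical names; the bit
arithmetic mirrors the IOPF_MAPPED_BITMAP_GET macro in this patch):
one bit per page of the vfio_dma range, set once the page has been
faulted in and mapped.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_ULONG	(sizeof(unsigned long) * CHAR_BIT)

/*
 * One bit per page of a DMA range: bit i is set once page i has been
 * faulted in and mapped. Mirrors IOPF_MAPPED_BITMAP_GET in this patch.
 */
static int iopf_mapped_get(const unsigned long *bitmap, unsigned long i)
{
	return (bitmap[i / BITS_PER_ULONG] >> (i % BITS_PER_ULONG)) & 0x1;
}

/* Userspace stand-in for bitmap_set(bitmap, i, 1). */
static void iopf_mapped_set(unsigned long *bitmap, unsigned long i)
{
	bitmap[i / BITS_PER_ULONG] |= 1UL << (i % BITS_PER_ULONG);
}
```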

Signed-off-by: Shenming Lu <lushenming@huawei.com>
---
 drivers/vfio/vfio_iommu_type1.c | 114 ++++++++++++++++++++++++++++++++
 1 file changed, 114 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 45cbfd4879a5..ab0ff60ee207 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -101,6 +101,7 @@ struct vfio_dma {
 	struct task_struct	*task;
 	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
 	unsigned long		*bitmap;
+	unsigned long		*iopf_mapped_bitmap;
 };
 
 struct vfio_batch {
@@ -141,6 +142,16 @@ struct vfio_regions {
 	size_t len;
 };
 
+/* A global IOPF enabled group list */
+static struct rb_root iopf_group_list = RB_ROOT;
+static DEFINE_MUTEX(iopf_group_list_lock);
+
+struct vfio_iopf_group {
+	struct rb_node		node;
+	struct iommu_group	*iommu_group;
+	struct vfio_iommu	*iommu;
+};
+
 #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
 					(!list_empty(&iommu->domain_list))
 
@@ -157,6 +168,10 @@ struct vfio_regions {
 #define DIRTY_BITMAP_PAGES_MAX	 ((u64)INT_MAX)
 #define DIRTY_BITMAP_SIZE_MAX	 DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
 
+#define IOPF_MAPPED_BITMAP_GET(dma, i)	\
+			      ((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG]	\
+			       >> ((i) % BITS_PER_LONG)) & 0x1)
+
 #define WAITED 1
 
 static int put_pfn(unsigned long pfn, int prot);
@@ -416,6 +431,34 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
 	return ret;
 }
 
+/*
+ * Helper functions for iopf_group_list
+ */
+static struct vfio_iopf_group *
+vfio_find_iopf_group(struct iommu_group *iommu_group)
+{
+	struct vfio_iopf_group *iopf_group;
+	struct rb_node *node;
+
+	mutex_lock(&iopf_group_list_lock);
+
+	node = iopf_group_list.rb_node;
+
+	while (node) {
+		iopf_group = rb_entry(node, struct vfio_iopf_group, node);
+
+		if (iommu_group < iopf_group->iommu_group)
+			node = node->rb_left;
+		else if (iommu_group > iopf_group->iommu_group)
+			node = node->rb_right;
+		else
+			break;
+	}
+
+	mutex_unlock(&iopf_group_list_lock);
+	return node ? iopf_group : NULL;
+}
+
 static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
 {
 	struct mm_struct *mm;
@@ -3106,6 +3149,77 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
 	return -EINVAL;
 }
 
+/* VFIO I/O Page Fault handler */
+static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
+{
+	struct device *dev = (struct device *)data;
+	struct iommu_group *iommu_group;
+	struct vfio_iopf_group *iopf_group;
+	struct vfio_iommu *iommu;
+	struct vfio_dma *dma;
+	dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
+	int access_flags = 0;
+	unsigned long bit_offset, vaddr, pfn;
+	int ret;
+	enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
+	struct iommu_page_response resp = {0};
+
+	if (fault->type != IOMMU_FAULT_PAGE_REQ)
+		return -EOPNOTSUPP;
+
+	iommu_group = iommu_group_get(dev);
+	if (!iommu_group)
+		return -ENODEV;
+
+	iopf_group = vfio_find_iopf_group(iommu_group);
+	iommu_group_put(iommu_group);
+	if (!iopf_group)
+		return -ENODEV;
+
+	iommu = iopf_group->iommu;
+
+	mutex_lock(&iommu->lock);
+
+	ret = vfio_find_dma_valid(iommu, iova, PAGE_SIZE, &dma);
+	if (ret < 0)
+		goto out_invalid;
+
+	if (fault->prm.perm & IOMMU_FAULT_PERM_READ)
+		access_flags |= IOMMU_READ;
+	if (fault->prm.perm & IOMMU_FAULT_PERM_WRITE)
+		access_flags |= IOMMU_WRITE;
+	if ((dma->prot & access_flags) != access_flags)
+		goto out_invalid;
+
+	bit_offset = (iova - dma->iova) >> PAGE_SHIFT;
+	if (IOPF_MAPPED_BITMAP_GET(dma, bit_offset))
+		goto out_success;
+
+	vaddr = iova - dma->iova + dma->vaddr;
+
+	if (vfio_pin_page_external(dma, vaddr, &pfn, true))
+		goto out_invalid;
+
+	if (vfio_iommu_map(iommu, iova, pfn, 1, dma->prot)) {
+		if (put_pfn(pfn, dma->prot))
+			vfio_lock_acct(dma, -1, true);
+		goto out_invalid;
+	}
+
+	bitmap_set(dma->iopf_mapped_bitmap, bit_offset, 1);
+
+out_success:
+	status = IOMMU_PAGE_RESP_SUCCESS;
+
+out_invalid:
+	mutex_unlock(&iommu->lock);
+	resp.version		= IOMMU_PAGE_RESP_VERSION_1;
+	resp.grpid		= fault->prm.grpid;
+	resp.code		= status;
+	iommu_page_response(dev, &resp);
+	return 0;
+}
+
 static long vfio_iommu_type1_ioctl(void *iommu_data,
 				   unsigned int cmd, unsigned long arg)
 {
-- 
2.19.1



* [RFC PATCH v3 3/8] vfio/type1: Add an MMU notifier to avoid pinning
  2021-04-09  3:44 [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough Shenming Lu
  2021-04-09  3:44 ` [RFC PATCH v3 1/8] iommu: Evolve the device fault reporting framework Shenming Lu
  2021-04-09  3:44 ` [RFC PATCH v3 2/8] vfio/type1: Add a page fault handler Shenming Lu
@ 2021-04-09  3:44 ` Shenming Lu
  2021-05-18 18:58   ` Alex Williamson
  2021-04-09  3:44 ` [RFC PATCH v3 4/8] vfio/type1: Pre-map more pages than requested in the IOPF handling Shenming Lu
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Shenming Lu @ 2021-04-09  3:44 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Will Deacon, Robin Murphy,
	Joerg Roedel, Jean-Philippe Brucker, Eric Auger, kvm,
	linux-kernel, linux-arm-kernel, iommu, linux-api
  Cc: Kevin Tian, Lu Baolu, yi.l.liu, Christoph Hellwig,
	Jonathan Cameron, Barry Song, wanghaibin.wang, yuzenghui,
	lushenming

To avoid pinning pages when they are mapped in the IOMMU page tables,
we add an MMU notifier which tells us the addresses that are no longer
valid, and try to unmap them.
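The notifier's range handling can be sketched in userspace C
(clamp_to_iova() is a hypothetical helper mirroring the arithmetic in
mn_invalidate_range() below: clamp the invalidated VA range to the
vfio_dma's range, align outward to page boundaries, and translate to
IOVA):

```c
#include <assert.h>
#include <stdint.h>

#define PG	4096UL	/* assume 4K pages for the sketch */

/*
 * Clamp the invalidated VA range [start, end) to the DMA mapping
 * [vaddr, vaddr + size), align outward to page boundaries, and
 * translate the result to the IOVA space. Returns 0 when the ranges
 * do not overlap, 1 otherwise. (Illustrative helper, not kernel code;
 * @vaddr is assumed page aligned as for a vfio_dma.)
 */
static int clamp_to_iova(uint64_t start, uint64_t end,
			 uint64_t vaddr, uint64_t size, uint64_t iova,
			 uint64_t *unmap_start, uint64_t *unmap_end)
{
	uint64_t s, e;

	if (end <= vaddr || start >= vaddr + size)
		return 0;

	s = start > vaddr ? start : vaddr;
	e = end < vaddr + size ? end : vaddr + size;
	s &= ~(PG - 1);			/* ALIGN_DOWN */
	e = (e + PG - 1) & ~(PG - 1);	/* ALIGN (up) */

	*unmap_start = s - vaddr + iova;
	*unmap_end = e - vaddr + iova;
	return 1;
}
```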

Signed-off-by: Shenming Lu <lushenming@huawei.com>
---
 drivers/vfio/vfio_iommu_type1.c | 112 +++++++++++++++++++++++++++++++-
 1 file changed, 109 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index ab0ff60ee207..1cb9d1f2717b 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -40,6 +40,7 @@
 #include <linux/notifier.h>
 #include <linux/dma-iommu.h>
 #include <linux/irqdomain.h>
+#include <linux/mmu_notifier.h>
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
@@ -69,6 +70,7 @@ struct vfio_iommu {
 	struct mutex		lock;
 	struct rb_root		dma_list;
 	struct blocking_notifier_head notifier;
+	struct mmu_notifier	mn;
 	unsigned int		dma_avail;
 	unsigned int		vaddr_invalid_count;
 	uint64_t		pgsize_bitmap;
@@ -1204,6 +1206,72 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
 	return unlocked;
 }
 
+/* Unmap the IOPF mapped pages in the specified range. */
+static void vfio_unmap_partial_iopf(struct vfio_iommu *iommu,
+				    struct vfio_dma *dma,
+				    dma_addr_t start, dma_addr_t end)
+{
+	struct iommu_iotlb_gather *gathers;
+	struct vfio_domain *d;
+	int i, num_domains = 0;
+
+	list_for_each_entry(d, &iommu->domain_list, next)
+		num_domains++;
+
+	gathers = kzalloc(sizeof(*gathers) * num_domains, GFP_KERNEL);
+	if (gathers) {
+		for (i = 0; i < num_domains; i++)
+			iommu_iotlb_gather_init(&gathers[i]);
+	}
+
+	while (start < end) {
+		unsigned long bit_offset;
+		size_t len;
+
+		bit_offset = (start - dma->iova) >> PAGE_SHIFT;
+
+		for (len = 0; start + len < end; len += PAGE_SIZE) {
+			if (!IOPF_MAPPED_BITMAP_GET(dma,
+					bit_offset + (len >> PAGE_SHIFT)))
+				break;
+		}
+
+		if (len) {
+			i = 0;
+			list_for_each_entry(d, &iommu->domain_list, next) {
+				size_t unmapped;
+
+				if (gathers)
+					unmapped = iommu_unmap_fast(d->domain,
+								    start, len,
+								    &gathers[i++]);
+				else
+					unmapped = iommu_unmap(d->domain,
+							       start, len);
+
+				if (WARN_ON(unmapped != len))
+					goto out;
+			}
+
+			bitmap_clear(dma->iopf_mapped_bitmap,
+				     bit_offset, len >> PAGE_SHIFT);
+
+			cond_resched();
+		}
+
+		start += (len + PAGE_SIZE);
+	}
+
+out:
+	if (gathers) {
+		i = 0;
+		list_for_each_entry(d, &iommu->domain_list, next)
+			iommu_iotlb_sync(d->domain, &gathers[i++]);
+
+		kfree(gathers);
+	}
+}
+
 static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
 	WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
@@ -3197,17 +3265,18 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
 
 	vaddr = iova - dma->iova + dma->vaddr;
 
-	if (vfio_pin_page_external(dma, vaddr, &pfn, true))
+	if (vfio_pin_page_external(dma, vaddr, &pfn, false))
 		goto out_invalid;
 
 	if (vfio_iommu_map(iommu, iova, pfn, 1, dma->prot)) {
-		if (put_pfn(pfn, dma->prot))
-			vfio_lock_acct(dma, -1, true);
+		put_pfn(pfn, dma->prot);
 		goto out_invalid;
 	}
 
 	bitmap_set(dma->iopf_mapped_bitmap, bit_offset, 1);
 
+	put_pfn(pfn, dma->prot);
+
 out_success:
 	status = IOMMU_PAGE_RESP_SUCCESS;
 
@@ -3220,6 +3289,43 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
 	return 0;
 }
 
+static void mn_invalidate_range(struct mmu_notifier *mn, struct mm_struct *mm,
+				unsigned long start, unsigned long end)
+{
+	struct vfio_iommu *iommu = container_of(mn, struct vfio_iommu, mn);
+	struct rb_node *n;
+
+	mutex_lock(&iommu->lock);
+
+	if (WARN_ON(vfio_wait_all_valid(iommu) < 0)) {
+		mutex_unlock(&iommu->lock);
+		return;
+	}
+
+	for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
+		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
+		unsigned long start_n, end_n;
+
+		if (end <= dma->vaddr || start >= dma->vaddr + dma->size)
+			continue;
+
+		start_n = ALIGN_DOWN(max_t(unsigned long, start, dma->vaddr),
+				     PAGE_SIZE);
+		end_n = ALIGN(min_t(unsigned long, end, dma->vaddr + dma->size),
+			      PAGE_SIZE);
+
+		vfio_unmap_partial_iopf(iommu, dma,
+					start_n - dma->vaddr + dma->iova,
+					end_n - dma->vaddr + dma->iova);
+	}
+
+	mutex_unlock(&iommu->lock);
+}
+
+static const struct mmu_notifier_ops vfio_iommu_type1_mn_ops = {
+	.invalidate_range	= mn_invalidate_range,
+};
+
 static long vfio_iommu_type1_ioctl(void *iommu_data,
 				   unsigned int cmd, unsigned long arg)
 {
-- 
2.19.1



* [RFC PATCH v3 4/8] vfio/type1: Pre-map more pages than requested in the IOPF handling
  2021-04-09  3:44 [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough Shenming Lu
                   ` (2 preceding siblings ...)
  2021-04-09  3:44 ` [RFC PATCH v3 3/8] vfio/type1: Add an MMU notifier to avoid pinning Shenming Lu
@ 2021-04-09  3:44 ` Shenming Lu
  2021-05-18 18:58   ` Alex Williamson
  2021-04-09  3:44 ` [RFC PATCH v3 5/8] vfio/type1: VFIO_IOMMU_ENABLE_IOPF Shenming Lu
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Shenming Lu @ 2021-04-09  3:44 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Will Deacon, Robin Murphy,
	Joerg Roedel, Jean-Philippe Brucker, Eric Auger, kvm,
	linux-kernel, linux-arm-kernel, iommu, linux-api
  Cc: Kevin Tian, Lu Baolu, yi.l.liu, Christoph Hellwig,
	Jonathan Cameron, Barry Song, wanghaibin.wang, yuzenghui,
	lushenming

To reduce the number of page faults to handle, we can pre-map more
pages than requested at once.

Note that IOPF_PREMAP_LEN is just an arbitrary value for now, which we
could tune further.
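The sizing logic can be modeled standalone (premap_pages() is a
hypothetical userspace mirror of the loop added in this patch): starting
at the faulting page, extend the mapping until IOPF_PREMAP_LEN pages,
the end of the vfio_dma, or the first already-mapped page.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_ULONG	(sizeof(unsigned long) * CHAR_BIT)
#define IOPF_PREMAP_LEN	512UL	/* pages, as in the patch */

/* Bit i set: page i is already faulted in and mapped. */
static int mapped(const unsigned long *bm, unsigned long i)
{
	return (bm[i / BITS_PER_ULONG] >> (i % BITS_PER_ULONG)) & 0x1;
}

/*
 * Count how many pages to map starting at the faulting page @bit_offset,
 * stopping at IOPF_PREMAP_LEN pages, the end of the DMA range (@npages
 * total pages), or the first page that is already mapped.
 */
static unsigned long premap_pages(const unsigned long *bm,
				  unsigned long bit_offset,
				  unsigned long npages)
{
	unsigned long len = 1, i;	/* the faulting page itself */

	for (i = bit_offset + 1; i < npages; i++) {
		if (len >= IOPF_PREMAP_LEN || mapped(bm, i))
			break;
		len++;
	}
	return len;
}
```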

Signed-off-by: Shenming Lu <lushenming@huawei.com>
---
 drivers/vfio/vfio_iommu_type1.c | 131 ++++++++++++++++++++++++++++++--
 1 file changed, 123 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 1cb9d1f2717b..01e296c6dc9e 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -3217,6 +3217,91 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
 	return -EINVAL;
 }
 
+/*
+ * To reduce the number of page faults to handle, try to
+ * pre-map more pages than requested.
+ */
+#define IOPF_PREMAP_LEN		512
+
+/*
+ * Return 0 on success or a negative error code; the
+ * number of pages contiguously pinned is in @pinned.
+ */
+static int pin_pages_iopf(struct vfio_dma *dma, unsigned long vaddr,
+			  unsigned long npages, unsigned long *pfn_base,
+			  unsigned long *pinned, struct vfio_batch *batch)
+{
+	struct mm_struct *mm;
+	unsigned long pfn;
+	int ret = 0;
+	*pinned = 0;
+
+	mm = get_task_mm(dma->task);
+	if (!mm)
+		return -ENODEV;
+
+	if (batch->size) {
+		*pfn_base = page_to_pfn(batch->pages[batch->offset]);
+		pfn = *pfn_base;
+	} else {
+		*pfn_base = 0;
+	}
+
+	while (npages) {
+		if (!batch->size) {
+			unsigned long req_pages = min_t(unsigned long, npages,
+							batch->capacity);
+
+			ret = vaddr_get_pfns(mm, vaddr, req_pages, dma->prot,
+					     &pfn, batch->pages);
+			if (ret < 0)
+				goto out;
+
+			batch->size = ret;
+			batch->offset = 0;
+			ret = 0;
+
+			if (!*pfn_base)
+				*pfn_base = pfn;
+		}
+
+		while (true) {
+			if (pfn != *pfn_base + *pinned)
+				goto out;
+
+			(*pinned)++;
+			npages--;
+			vaddr += PAGE_SIZE;
+			batch->offset++;
+			batch->size--;
+
+			if (!batch->size)
+				break;
+
+			pfn = page_to_pfn(batch->pages[batch->offset]);
+		}
+
+		if (unlikely(disable_hugepages))
+			break;
+	}
+
+out:
+	if (batch->size == 1 && !batch->offset) {
+		put_pfn(pfn, dma->prot);
+		batch->size = 0;
+	}
+
+	mmput(mm);
+	return ret;
+}
+
+static void unpin_pages_iopf(struct vfio_dma *dma,
+			     unsigned long pfn, unsigned long npages)
+{
+	while (npages--)
+		put_pfn(pfn++, dma->prot);
+}
+
 /* VFIO I/O Page Fault handler */
 static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
 {
@@ -3225,9 +3310,11 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
 	struct vfio_iopf_group *iopf_group;
 	struct vfio_iommu *iommu;
 	struct vfio_dma *dma;
+	struct vfio_batch batch;
 	dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
 	int access_flags = 0;
-	unsigned long bit_offset, vaddr, pfn;
+	size_t premap_len, map_len, mapped_len = 0;
+	unsigned long bit_offset, vaddr, pfn, i, npages;
 	int ret;
 	enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
 	struct iommu_page_response resp = {0};
@@ -3263,19 +3350,47 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
 	if (IOPF_MAPPED_BITMAP_GET(dma, bit_offset))
 		goto out_success;
 
+	premap_len = IOPF_PREMAP_LEN << PAGE_SHIFT;
+	npages = dma->size >> PAGE_SHIFT;
+	map_len = PAGE_SIZE;
+	for (i = bit_offset + 1; i < npages; i++) {
+		if (map_len >= premap_len || IOPF_MAPPED_BITMAP_GET(dma, i))
+			break;
+		map_len += PAGE_SIZE;
+	}
 	vaddr = iova - dma->iova + dma->vaddr;
+	vfio_batch_init(&batch);
 
-	if (vfio_pin_page_external(dma, vaddr, &pfn, false))
-		goto out_invalid;
+	while (map_len) {
+		ret = pin_pages_iopf(dma, vaddr + mapped_len,
+				     map_len >> PAGE_SHIFT, &pfn,
+				     &npages, &batch);
+		if (!npages)
+			break;
 
-	if (vfio_iommu_map(iommu, iova, pfn, 1, dma->prot)) {
-		put_pfn(pfn, dma->prot);
-		goto out_invalid;
+		if (vfio_iommu_map(iommu, iova + mapped_len, pfn,
+				   npages, dma->prot)) {
+			unpin_pages_iopf(dma, pfn, npages);
+			vfio_batch_unpin(&batch, dma);
+			break;
+		}
+
+		bitmap_set(dma->iopf_mapped_bitmap,
+			   bit_offset + (mapped_len >> PAGE_SHIFT), npages);
+
+		unpin_pages_iopf(dma, pfn, npages);
+
+		map_len -= npages << PAGE_SHIFT;
+		mapped_len += npages << PAGE_SHIFT;
+
+		if (ret)
+			break;
 	}
 
-	bitmap_set(dma->iopf_mapped_bitmap, bit_offset, 1);
+	vfio_batch_fini(&batch);
 
-	put_pfn(pfn, dma->prot);
+	if (!mapped_len)
+		goto out_invalid;
 
 out_success:
 	status = IOMMU_PAGE_RESP_SUCCESS;
-- 
2.19.1



* [RFC PATCH v3 5/8] vfio/type1: VFIO_IOMMU_ENABLE_IOPF
  2021-04-09  3:44 [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough Shenming Lu
                   ` (3 preceding siblings ...)
  2021-04-09  3:44 ` [RFC PATCH v3 4/8] vfio/type1: Pre-map more pages than requested in the IOPF handling Shenming Lu
@ 2021-04-09  3:44 ` Shenming Lu
  2021-05-18 18:58   ` Alex Williamson
  2021-04-09  3:44 ` [RFC PATCH v3 6/8] vfio/type1: No need to statically pin and map if IOPF enabled Shenming Lu
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Shenming Lu @ 2021-04-09  3:44 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Will Deacon, Robin Murphy,
	Joerg Roedel, Jean-Philippe Brucker, Eric Auger, kvm,
	linux-kernel, linux-arm-kernel, iommu, linux-api
  Cc: Kevin Tian, Lu Baolu, yi.l.liu, Christoph Hellwig,
	Jonathan Cameron, Barry Song, wanghaibin.wang, yuzenghui,
	lushenming

Since enabling IOPF for devices may lead to a slow ramp-up of performance,
we add an ioctl, VFIO_IOMMU_ENABLE_IOPF, to make it configurable. Enabling
IOPF for a VFIO device consists of setting IOMMU_DEV_FEAT_IOPF and
registering the VFIO IOPF handler.

Note that VFIO_IOMMU_DISABLE_IOPF is not supported, since there may be
in-flight page faults when disabling.

Signed-off-by: Shenming Lu <lushenming@huawei.com>
---
 drivers/vfio/vfio_iommu_type1.c | 223 +++++++++++++++++++++++++++++++-
 include/uapi/linux/vfio.h       |   6 +
 2 files changed, 226 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 01e296c6dc9e..7df5711e743a 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -71,6 +71,7 @@ struct vfio_iommu {
 	struct rb_root		dma_list;
 	struct blocking_notifier_head notifier;
 	struct mmu_notifier	mn;
+	struct mm_struct	*mm;
 	unsigned int		dma_avail;
 	unsigned int		vaddr_invalid_count;
 	uint64_t		pgsize_bitmap;
@@ -81,6 +82,7 @@ struct vfio_iommu {
 	bool			dirty_page_tracking;
 	bool			pinned_page_dirty_scope;
 	bool			container_open;
+	bool			iopf_enabled;
 };
 
 struct vfio_domain {
@@ -461,6 +463,38 @@ vfio_find_iopf_group(struct iommu_group *iommu_group)
 	return node ? iopf_group : NULL;
 }
 
+static void vfio_link_iopf_group(struct vfio_iopf_group *new)
+{
+	struct rb_node **link, *parent = NULL;
+	struct vfio_iopf_group *iopf_group;
+
+	mutex_lock(&iopf_group_list_lock);
+
+	link = &iopf_group_list.rb_node;
+
+	while (*link) {
+		parent = *link;
+		iopf_group = rb_entry(parent, struct vfio_iopf_group, node);
+
+		if (new->iommu_group < iopf_group->iommu_group)
+			link = &(*link)->rb_left;
+		else
+			link = &(*link)->rb_right;
+	}
+
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, &iopf_group_list);
+
+	mutex_unlock(&iopf_group_list_lock);
+}
+
+static void vfio_unlink_iopf_group(struct vfio_iopf_group *old)
+{
+	mutex_lock(&iopf_group_list_lock);
+	rb_erase(&old->node, &iopf_group_list);
+	mutex_unlock(&iopf_group_list_lock);
+}
+
 static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
 {
 	struct mm_struct *mm;
@@ -2363,6 +2397,68 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu,
 	list_splice_tail(iova_copy, iova);
 }
 
+static int vfio_dev_domian_nested(struct device *dev, int *nested)
+{
+	struct iommu_domain *domain;
+
+	domain = iommu_get_domain_for_dev(dev);
+	if (!domain)
+		return -ENODEV;
+
+	return iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, nested);
+}
+
+static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data);
+
+static int dev_enable_iopf(struct device *dev, void *data)
+{
+	int *enabled_dev_cnt = data;
+	int nested;
+	u32 flags;
+	int ret;
+
+	ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_IOPF);
+	if (ret)
+		return ret;
+
+	ret = vfio_dev_domian_nested(dev, &nested);
+	if (ret)
+		goto out_disable;
+
+	if (nested)
+		flags = FAULT_REPORT_NESTED_L2;
+	else
+		flags = FAULT_REPORT_FLAT;
+
+	ret = iommu_register_device_fault_handler(dev,
+				vfio_iommu_type1_dma_map_iopf, flags, dev);
+	if (ret)
+		goto out_disable;
+
+	(*enabled_dev_cnt)++;
+	return 0;
+
+out_disable:
+	iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF);
+	return ret;
+}
+
+static int dev_disable_iopf(struct device *dev, void *data)
+{
+	int *enabled_dev_cnt = data;
+
+	if (enabled_dev_cnt && *enabled_dev_cnt <= 0)
+		return -1;
+
+	WARN_ON(iommu_unregister_device_fault_handler(dev));
+	WARN_ON(iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF));
+
+	if (enabled_dev_cnt)
+		(*enabled_dev_cnt)--;
+
+	return 0;
+}
+
 static int vfio_iommu_type1_attach_group(void *iommu_data,
 					 struct iommu_group *iommu_group)
 {
@@ -2376,6 +2472,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	struct iommu_domain_geometry geo;
 	LIST_HEAD(iova_copy);
 	LIST_HEAD(group_resv_regions);
+	int iopf_enabled_dev_cnt = 0;
+	struct vfio_iopf_group *iopf_group = NULL;
 
 	mutex_lock(&iommu->lock);
 
@@ -2453,6 +2551,24 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	if (ret)
 		goto out_domain;
 
+	if (iommu->iopf_enabled) {
+		ret = iommu_group_for_each_dev(iommu_group, &iopf_enabled_dev_cnt,
+					       dev_enable_iopf);
+		if (ret)
+			goto out_detach;
+
+		iopf_group = kzalloc(sizeof(*iopf_group), GFP_KERNEL);
+		if (!iopf_group) {
+			ret = -ENOMEM;
+			goto out_detach;
+		}
+
+		iopf_group->iommu_group = iommu_group;
+		iopf_group->iommu = iommu;
+
+		vfio_link_iopf_group(iopf_group);
+	}
+
 	/* Get aperture info */
 	iommu_domain_get_attr(domain->domain, DOMAIN_ATTR_GEOMETRY, &geo);
 
@@ -2534,9 +2650,11 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	vfio_test_domain_fgsp(domain);
 
 	/* replay mappings on new domains */
-	ret = vfio_iommu_replay(iommu, domain);
-	if (ret)
-		goto out_detach;
+	if (!iommu->iopf_enabled) {
+		ret = vfio_iommu_replay(iommu, domain);
+		if (ret)
+			goto out_detach;
+	}
 
 	if (resv_msi) {
 		ret = iommu_get_msi_cookie(domain->domain, resv_msi_base);
@@ -2567,6 +2685,15 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	iommu_domain_free(domain->domain);
 	vfio_iommu_iova_free(&iova_copy);
 	vfio_iommu_resv_free(&group_resv_regions);
+	if (iommu->iopf_enabled) {
+		if (iopf_group) {
+			vfio_unlink_iopf_group(iopf_group);
+			kfree(iopf_group);
+		}
+
+		iommu_group_for_each_dev(iommu_group, &iopf_enabled_dev_cnt,
+					 dev_disable_iopf);
+	}
 out_free:
 	kfree(domain);
 	kfree(group);
@@ -2728,6 +2855,19 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 		if (!group)
 			continue;
 
+		if (iommu->iopf_enabled) {
+			struct vfio_iopf_group *iopf_group;
+
+			iopf_group = vfio_find_iopf_group(iommu_group);
+			if (!WARN_ON(!iopf_group)) {
+				vfio_unlink_iopf_group(iopf_group);
+				kfree(iopf_group);
+			}
+
+			iommu_group_for_each_dev(iommu_group, NULL,
+						 dev_disable_iopf);
+		}
+
 		vfio_iommu_detach_group(domain, group);
 		update_dirty_scope = !group->pinned_page_dirty_scope;
 		list_del(&group->next);
@@ -2846,6 +2986,11 @@ static void vfio_iommu_type1_release(void *iommu_data)
 
 	vfio_iommu_iova_free(&iommu->iova_list);
 
+	if (iommu->iopf_enabled) {
+		mmu_notifier_unregister(&iommu->mn, iommu->mm);
+		mmdrop(iommu->mm);
+	}
+
 	kfree(iommu);
 }
 
@@ -3441,6 +3586,76 @@ static const struct mmu_notifier_ops vfio_iommu_type1_mn_ops = {
 	.invalidate_range	= mn_invalidate_range,
 };
 
+static int vfio_iommu_type1_enable_iopf(struct vfio_iommu *iommu)
+{
+	struct vfio_domain *d;
+	struct vfio_group *g;
+	struct vfio_iopf_group *iopf_group;
+	int enabled_dev_cnt = 0;
+	int ret;
+
+	if (!current->mm)
+		return -ENODEV;
+
+	mutex_lock(&iommu->lock);
+
+	mmgrab(current->mm);
+	iommu->mm = current->mm;
+	iommu->mn.ops = &vfio_iommu_type1_mn_ops;
+	ret = mmu_notifier_register(&iommu->mn, current->mm);
+	if (ret)
+		goto out_drop;
+
+	list_for_each_entry(d, &iommu->domain_list, next) {
+		list_for_each_entry(g, &d->group_list, next) {
+			ret = iommu_group_for_each_dev(g->iommu_group,
+					&enabled_dev_cnt, dev_enable_iopf);
+			if (ret)
+				goto out_unwind;
+
+			iopf_group = kzalloc(sizeof(*iopf_group), GFP_KERNEL);
+			if (!iopf_group) {
+				ret = -ENOMEM;
+				goto out_unwind;
+			}
+
+			iopf_group->iommu_group = g->iommu_group;
+			iopf_group->iommu = iommu;
+
+			vfio_link_iopf_group(iopf_group);
+		}
+	}
+
+	iommu->iopf_enabled = true;
+	goto out_unlock;
+
+out_unwind:
+	list_for_each_entry(d, &iommu->domain_list, next) {
+		list_for_each_entry(g, &d->group_list, next) {
+			iopf_group = vfio_find_iopf_group(g->iommu_group);
+			if (iopf_group) {
+				vfio_unlink_iopf_group(iopf_group);
+				kfree(iopf_group);
+			}
+
+			if (iommu_group_for_each_dev(g->iommu_group,
+					&enabled_dev_cnt, dev_disable_iopf))
+				goto out_unregister;
+		}
+	}
+
+out_unregister:
+	mmu_notifier_unregister(&iommu->mn, current->mm);
+
+out_drop:
+	iommu->mm = NULL;
+	mmdrop(current->mm);
+
+out_unlock:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
 static long vfio_iommu_type1_ioctl(void *iommu_data,
 				   unsigned int cmd, unsigned long arg)
 {
@@ -3457,6 +3672,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 		return vfio_iommu_type1_unmap_dma(iommu, arg);
 	case VFIO_IOMMU_DIRTY_PAGES:
 		return vfio_iommu_type1_dirty_pages(iommu, arg);
+	case VFIO_IOMMU_ENABLE_IOPF:
+		return vfio_iommu_type1_enable_iopf(iommu);
 	default:
 		return -ENOTTY;
 	}
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 8ce36c1d53ca..5497036bebdc 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1208,6 +1208,12 @@ struct vfio_iommu_type1_dirty_bitmap_get {
 
 #define VFIO_IOMMU_DIRTY_PAGES             _IO(VFIO_TYPE, VFIO_BASE + 17)
 
+/*
+ * IOCTL to enable IOPF for the container.
+ * Called right after VFIO_SET_IOMMU.
+ */
+#define VFIO_IOMMU_ENABLE_IOPF             _IO(VFIO_TYPE, VFIO_BASE + 18)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.19.1



* [RFC PATCH v3 6/8] vfio/type1: No need to statically pin and map if IOPF enabled
  2021-04-09  3:44 [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough Shenming Lu
                   ` (4 preceding siblings ...)
  2021-04-09  3:44 ` [RFC PATCH v3 5/8] vfio/type1: VFIO_IOMMU_ENABLE_IOPF Shenming Lu
@ 2021-04-09  3:44 ` Shenming Lu
  2021-05-18 18:58   ` Alex Williamson
  2021-04-09  3:44 ` [RFC PATCH v3 7/8] vfio/type1: Add selective DMA faulting support Shenming Lu
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Shenming Lu @ 2021-04-09  3:44 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Will Deacon, Robin Murphy,
	Joerg Roedel, Jean-Philippe Brucker, Eric Auger, kvm,
	linux-kernel, linux-arm-kernel, iommu, linux-api
  Cc: Kevin Tian, Lu Baolu, yi.l.liu, Christoph Hellwig,
	Jonathan Cameron, Barry Song, wanghaibin.wang, yuzenghui,
	lushenming

If IOPF is enabled for the VFIO container, there is no need to statically
pin and map the entire DMA range; we can do it on demand. When removing a
vfio_dma, unmap according to the IOPF mapped bitmap.

Note that we still mark all pages dirty even if IOPF is enabled; we may
add IOPF-based fine-grained dirty tracking support in the future.

Signed-off-by: Shenming Lu <lushenming@huawei.com>
---
 drivers/vfio/vfio_iommu_type1.c | 38 +++++++++++++++++++++++++++------
 1 file changed, 32 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 7df5711e743a..dcc93c3b258c 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -175,6 +175,7 @@ struct vfio_iopf_group {
 #define IOPF_MAPPED_BITMAP_GET(dma, i)	\
 			      ((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG]	\
 			       >> ((i) % BITS_PER_LONG)) & 0x1)
+#define IOPF_MAPPED_BITMAP_BYTES(n)	DIRTY_BITMAP_BYTES(n)
 
 #define WAITED 1
 
@@ -959,7 +960,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 	 * already pinned and accounted. Accounting should be done if there is no
 	 * iommu capable domain in the container.
 	 */
-	do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
+	do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) ||
+			iommu->iopf_enabled;
 
 	for (i = 0; i < npage; i++) {
 		struct vfio_pfn *vpfn;
@@ -1048,7 +1050,8 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
 
 	mutex_lock(&iommu->lock);
 
-	do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
+	do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) ||
+			iommu->iopf_enabled;
 	for (i = 0; i < npage; i++) {
 		struct vfio_dma *dma;
 		dma_addr_t iova;
@@ -1169,7 +1172,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
 	if (!dma->size)
 		return 0;
 
-	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
+	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || iommu->iopf_enabled)
 		return 0;
 
 	/*
@@ -1306,11 +1309,20 @@ static void vfio_unmap_partial_iopf(struct vfio_iommu *iommu,
 	}
 }
 
+static void vfio_dma_clean_iopf(struct vfio_iommu *iommu, struct vfio_dma *dma)
+{
+	vfio_unmap_partial_iopf(iommu, dma, dma->iova, dma->iova + dma->size);
+
+	kfree(dma->iopf_mapped_bitmap);
+}
+
 static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
 	WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
 	vfio_unmap_unpin(iommu, dma, true);
 	vfio_unlink_dma(iommu, dma);
+	if (iommu->iopf_enabled)
+		vfio_dma_clean_iopf(iommu, dma);
 	put_task_struct(dma->task);
 	vfio_dma_bitmap_free(dma);
 	if (dma->vaddr_invalid) {
@@ -1359,7 +1371,8 @@ static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
 	 * mark all pages dirty if any IOMMU capable device is not able
 	 * to report dirty pages and all pages are pinned and mapped.
 	 */
-	if (iommu->num_non_pinned_groups && dma->iommu_mapped)
+	if (iommu->num_non_pinned_groups &&
+	    (dma->iommu_mapped || iommu->iopf_enabled))
 		bitmap_set(dma->bitmap, 0, nbits);
 
 	if (shift) {
@@ -1772,6 +1785,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 		goto out_unlock;
 	}
 
+	if (iommu->iopf_enabled) {
+		dma->iopf_mapped_bitmap = kvzalloc(IOPF_MAPPED_BITMAP_BYTES(
+						size >> PAGE_SHIFT), GFP_KERNEL);
+		if (!dma->iopf_mapped_bitmap) {
+			ret = -ENOMEM;
+			kfree(dma);
+			goto out_unlock;
+		}
+	}
+
 	iommu->dma_avail--;
 	dma->iova = iova;
 	dma->vaddr = vaddr;
@@ -1811,8 +1834,11 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	/* Insert zero-sized and grow as we map chunks of it */
 	vfio_link_dma(iommu, dma);
 
-	/* Don't pin and map if container doesn't contain IOMMU capable domain*/
-	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
+	/*
+	 * Don't pin and map if container doesn't contain IOMMU capable domain,
+	 * or IOPF enabled for the container.
+	 */
+	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || iommu->iopf_enabled)
 		dma->size = size;
 	else
 		ret = vfio_pin_map_dma(iommu, dma, size);
-- 
2.19.1



* [RFC PATCH v3 7/8] vfio/type1: Add selective DMA faulting support
  2021-04-09  3:44 [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough Shenming Lu
                   ` (5 preceding siblings ...)
  2021-04-09  3:44 ` [RFC PATCH v3 6/8] vfio/type1: No need to statically pin and map if IOPF enabled Shenming Lu
@ 2021-04-09  3:44 ` Shenming Lu
  2021-05-18 18:58   ` Alex Williamson
  2021-04-09  3:44 ` [RFC PATCH v3 8/8] vfio: Add nested IOPF support Shenming Lu
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Shenming Lu @ 2021-04-09  3:44 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Will Deacon, Robin Murphy,
	Joerg Roedel, Jean-Philippe Brucker, Eric Auger, kvm,
	linux-kernel, linux-arm-kernel, iommu, linux-api
  Cc: Kevin Tian, Lu Baolu, yi.l.liu, Christoph Hellwig,
	Jonathan Cameron, Barry Song, wanghaibin.wang, yuzenghui,
	lushenming

Some devices only allow selective DMA faulting. Similar to selective
dirty page tracking, the vendor driver can call vfio_pin_pages() to
indicate the non-faultable scope; we add a new struct vfio_range to
record it. Then, when the IOPF handler receives any page request
outside that scope, it can directly return an invalid response.

Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Shenming Lu <lushenming@huawei.com>
---
 drivers/vfio/vfio.c             |   4 +-
 drivers/vfio/vfio_iommu_type1.c | 357 +++++++++++++++++++++++++++++++-
 include/linux/vfio.h            |   1 +
 3 files changed, 358 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 38779e6fd80c..44c8dfabf7de 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -2013,7 +2013,8 @@ int vfio_unpin_pages(struct device *dev, unsigned long *user_pfn, int npage)
 	container = group->container;
 	driver = container->iommu_driver;
 	if (likely(driver && driver->ops->unpin_pages))
-		ret = driver->ops->unpin_pages(container->iommu_data, user_pfn,
+		ret = driver->ops->unpin_pages(container->iommu_data,
+					       group->iommu_group, user_pfn,
 					       npage);
 	else
 		ret = -ENOTTY;
@@ -2112,6 +2113,7 @@ int vfio_group_unpin_pages(struct vfio_group *group,
 	driver = container->iommu_driver;
 	if (likely(driver && driver->ops->unpin_pages))
 		ret = driver->ops->unpin_pages(container->iommu_data,
+					       group->iommu_group,
 					       user_iova_pfn, npage);
 	else
 		ret = -ENOTTY;
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index dcc93c3b258c..ba2b5a1cf6e9 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -150,10 +150,19 @@ struct vfio_regions {
 static struct rb_root iopf_group_list = RB_ROOT;
 static DEFINE_MUTEX(iopf_group_list_lock);
 
+struct vfio_range {
+	struct rb_node		node;
+	dma_addr_t		base_iova;
+	size_t			span;
+	unsigned int		ref_count;
+};
+
 struct vfio_iopf_group {
 	struct rb_node		node;
 	struct iommu_group	*iommu_group;
 	struct vfio_iommu	*iommu;
+	struct rb_root		pinned_range_list;
+	bool			selective_faulting;
 };
 
 #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
@@ -496,6 +505,255 @@ static void vfio_unlink_iopf_group(struct vfio_iopf_group *old)
 	mutex_unlock(&iopf_group_list_lock);
 }
 
+/*
+ * Helper functions for range list, handle one page at a time.
+ */
+static struct vfio_range *vfio_find_range(struct rb_root *range_list,
+					  dma_addr_t iova)
+{
+	struct rb_node *node = range_list->rb_node;
+	struct vfio_range *range;
+
+	while (node) {
+		range = rb_entry(node, struct vfio_range, node);
+
+		if (iova + PAGE_SIZE <= range->base_iova)
+			node = node->rb_left;
+		else if (iova >= range->base_iova + range->span)
+			node = node->rb_right;
+		else
+			return range;
+	}
+
+	return NULL;
+}
+
+/* Do the possible merge adjacent to the input range. */
+static void vfio_merge_range_list(struct rb_root *range_list,
+				  struct vfio_range *range)
+{
+	struct rb_node *node_prev = rb_prev(&range->node);
+	struct rb_node *node_next = rb_next(&range->node);
+
+	if (node_next) {
+		struct vfio_range *range_next = rb_entry(node_next,
+							 struct vfio_range,
+							 node);
+
+		if (range_next->base_iova == (range->base_iova + range->span) &&
+		    range_next->ref_count == range->ref_count) {
+			rb_erase(node_next, range_list);
+			range->span += range_next->span;
+			kfree(range_next);
+		}
+	}
+
+	if (node_prev) {
+		struct vfio_range *range_prev = rb_entry(node_prev,
+							 struct vfio_range,
+							 node);
+
+		if (range->base_iova == (range_prev->base_iova + range_prev->span)
+		    && range->ref_count == range_prev->ref_count) {
+			rb_erase(&range->node, range_list);
+			range_prev->span += range->span;
+			kfree(range);
+		}
+	}
+}
+
+static void vfio_link_range(struct rb_root *range_list, struct vfio_range *new)
+{
+	struct rb_node **link, *parent = NULL;
+	struct vfio_range *range;
+
+	link = &range_list->rb_node;
+
+	while (*link) {
+		parent = *link;
+		range = rb_entry(parent, struct vfio_range, node);
+
+		if (new->base_iova < range->base_iova)
+			link = &(*link)->rb_left;
+		else
+			link = &(*link)->rb_right;
+	}
+
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, range_list);
+
+	vfio_merge_range_list(range_list, new);
+}
+
+static int vfio_add_to_range_list(struct rb_root *range_list,
+				  dma_addr_t iova)
+{
+	struct vfio_range *range = vfio_find_range(range_list, iova);
+
+	if (range) {
+		struct vfio_range *new_prev, *new_next;
+		size_t span_prev, span_next;
+
+		/* May split the found range into three parts. */
+		span_prev = iova - range->base_iova;
+		span_next = range->span - span_prev - PAGE_SIZE;
+
+		if (span_prev) {
+			new_prev = kzalloc(sizeof(*new_prev), GFP_KERNEL);
+			if (!new_prev)
+				return -ENOMEM;
+
+			new_prev->base_iova = range->base_iova;
+			new_prev->span = span_prev;
+			new_prev->ref_count = range->ref_count;
+		}
+
+		if (span_next) {
+			new_next = kzalloc(sizeof(*new_next), GFP_KERNEL);
+			if (!new_next) {
+				if (span_prev)
+					kfree(new_prev);
+				return -ENOMEM;
+			}
+
+			new_next->base_iova = iova + PAGE_SIZE;
+			new_next->span = span_next;
+			new_next->ref_count = range->ref_count;
+		}
+
+		range->base_iova = iova;
+		range->span = PAGE_SIZE;
+		range->ref_count++;
+		vfio_merge_range_list(range_list, range);
+
+		if (span_prev)
+			vfio_link_range(range_list, new_prev);
+
+		if (span_next)
+			vfio_link_range(range_list, new_next);
+	} else {
+		struct vfio_range *new;
+
+		new = kzalloc(sizeof(*new), GFP_KERNEL);
+		if (!new)
+			return -ENOMEM;
+
+		new->base_iova = iova;
+		new->span = PAGE_SIZE;
+		new->ref_count = 1;
+
+		vfio_link_range(range_list, new);
+	}
+
+	return 0;
+}
+
+static int vfio_remove_from_range_list(struct rb_root *range_list,
+				       dma_addr_t iova)
+{
+	struct vfio_range *range = vfio_find_range(range_list, iova);
+	struct vfio_range *news[3];
+	size_t span_prev, span_in, span_next;
+	int i, num_news;
+
+	if (!range)
+		return 0;
+
+	span_prev = iova - range->base_iova;
+	span_in = range->ref_count > 1 ? PAGE_SIZE : 0;
+	span_next = range->span - span_prev - PAGE_SIZE;
+
+	num_news = (int)!!span_prev + (int)!!span_in + (int)!!span_next;
+	if (!num_news) {
+		rb_erase(&range->node, range_list);
+		kfree(range);
+		return 0;
+	}
+
+	for (i = 0; i < num_news - 1; i++) {
+		news[i] = kzalloc(sizeof(struct vfio_range), GFP_KERNEL);
+		if (!news[i]) {
+			if (i > 0)
+				kfree(news[0]);
+			return -ENOMEM;
+		}
+	}
+	/* Reuse the found range. */
+	news[i] = range;
+
+	i = 0;
+	if (span_prev) {
+		news[i]->base_iova = range->base_iova;
+		news[i]->span = span_prev;
+		news[i++]->ref_count = range->ref_count;
+	}
+	if (span_in) {
+		news[i]->base_iova = iova;
+		news[i]->span = span_in;
+		news[i++]->ref_count = range->ref_count - 1;
+	}
+	if (span_next) {
+		news[i]->base_iova = iova + PAGE_SIZE;
+		news[i]->span = span_next;
+		news[i]->ref_count = range->ref_count;
+	}
+
+	vfio_merge_range_list(range_list, range);
+
+	for (i = 0; i < num_news - 1; i++)
+		vfio_link_range(range_list, news[i]);
+
+	return 0;
+}
+
+static void vfio_range_list_free(struct rb_root *range_list)
+{
+	struct rb_node *n;
+
+	while ((n = rb_first(range_list))) {
+		struct vfio_range *range = rb_entry(n, struct vfio_range, node);
+
+		rb_erase(&range->node, range_list);
+		kfree(range);
+	}
+}
+
+static int vfio_range_list_get_copy(struct vfio_iopf_group *iopf_group,
+				    struct rb_root *range_list_copy)
+{
+	struct rb_root *range_list = &iopf_group->pinned_range_list;
+	struct rb_node *n, **link = &range_list_copy->rb_node, *parent = NULL;
+	int ret;
+
+	for (n = rb_first(range_list); n; n = rb_next(n)) {
+		struct vfio_range *range, *range_copy;
+
+		range = rb_entry(n, struct vfio_range, node);
+
+		range_copy = kzalloc(sizeof(*range_copy), GFP_KERNEL);
+		if (!range_copy) {
+			ret = -ENOMEM;
+			goto out_free;
+		}
+
+		range_copy->base_iova = range->base_iova;
+		range_copy->span = range->span;
+		range_copy->ref_count = range->ref_count;
+
+		rb_link_node(&range_copy->node, parent, link);
+		rb_insert_color(&range_copy->node, range_list_copy);
+
+		parent = *link;
+		link = &(*link)->rb_right;
+	}
+
+	return 0;
+
+out_free:
+	vfio_range_list_free(range_list_copy);
+	return ret;
+}
+
 static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
 {
 	struct mm_struct *mm;
@@ -910,6 +1168,9 @@ static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
 	return unlocked;
 }
 
+static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
+					   struct iommu_group *iommu_group);
+
 static int vfio_iommu_type1_pin_pages(void *iommu_data,
 				      struct iommu_group *iommu_group,
 				      unsigned long *user_pfn,
@@ -923,6 +1184,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 	struct vfio_dma *dma;
 	bool do_accounting;
 	dma_addr_t iova;
+	struct vfio_iopf_group *iopf_group = NULL;
+	struct rb_root range_list_copy = RB_ROOT;
 
 	if (!iommu || !user_pfn || !phys_pfn)
 		return -EINVAL;
@@ -955,6 +1218,31 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 		goto pin_done;
 	}
 
+	/*
+	 * Some devices only allow selective DMA faulting. Similar to the
+	 * selective dirty tracking, the vendor driver can call vfio_pin_pages()
+	 * to indicate the non-faultable scope, and we record it to filter
+	 * out the invalid page requests in the IOPF handler.
+	 */
+	if (iommu->iopf_enabled) {
+		iopf_group = vfio_find_iopf_group(iommu_group);
+		if (iopf_group) {
+			/*
+			 * We don't want to work on the original range
+			 * list as the list gets modified and in case
+			 * of failure we have to retain the original
+			 * list. Get a copy here.
+			 */
+			ret = vfio_range_list_get_copy(iopf_group,
+						       &range_list_copy);
+			if (ret)
+				goto pin_done;
+		} else {
+			WARN_ON(!find_iommu_group(iommu->external_domain,
+						  iommu_group));
+		}
+	}
+
 	/*
 	 * If iommu capable domain exist in the container then all pages are
 	 * already pinned and accounted. Accounting should be done if there is no
@@ -981,6 +1269,15 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 		vpfn = vfio_iova_get_vfio_pfn(dma, iova);
 		if (vpfn) {
 			phys_pfn[i] = vpfn->pfn;
+			if (iopf_group) {
+				ret = vfio_add_to_range_list(&range_list_copy,
+							     iova);
+				if (ret) {
+					vfio_unpin_page_external(dma, iova,
+								 do_accounting);
+					goto pin_unwind;
+				}
+			}
 			continue;
 		}
 
@@ -997,6 +1294,15 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 			goto pin_unwind;
 		}
 
+		if (iopf_group) {
+			ret = vfio_add_to_range_list(&range_list_copy, iova);
+			if (ret) {
+				vfio_unpin_page_external(dma, iova,
+							 do_accounting);
+				goto pin_unwind;
+			}
+		}
+
 		if (iommu->dirty_page_tracking) {
 			unsigned long pgshift = __ffs(iommu->pgsize_bitmap);
 
@@ -1010,6 +1316,13 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 	}
 	ret = i;
 
+	if (iopf_group) {
+		vfio_range_list_free(&iopf_group->pinned_range_list);
+		iopf_group->pinned_range_list.rb_node = range_list_copy.rb_node;
+		if (!iopf_group->selective_faulting)
+			iopf_group->selective_faulting = true;
+	}
+
 	group = vfio_iommu_find_iommu_group(iommu, iommu_group);
 	if (!group->pinned_page_dirty_scope) {
 		group->pinned_page_dirty_scope = true;
@@ -1019,6 +1332,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 	goto pin_done;
 
 pin_unwind:
+	if (iopf_group)
+		vfio_range_list_free(&range_list_copy);
 	phys_pfn[i] = 0;
 	for (j = 0; j < i; j++) {
 		dma_addr_t iova;
@@ -1034,12 +1349,14 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 }
 
 static int vfio_iommu_type1_unpin_pages(void *iommu_data,
+					struct iommu_group *iommu_group,
 					unsigned long *user_pfn,
 					int npage)
 {
 	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_iopf_group *iopf_group = NULL;
 	bool do_accounting;
-	int i;
+	int i, ret;
 
 	if (!iommu || !user_pfn)
 		return -EINVAL;
@@ -1050,6 +1367,13 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
 
 	mutex_lock(&iommu->lock);
 
+	if (iommu->iopf_enabled) {
+		iopf_group = vfio_find_iopf_group(iommu_group);
+		if (!iopf_group)
+			WARN_ON(!find_iommu_group(iommu->external_domain,
+						  iommu_group));
+	}
+
 	do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) ||
 			iommu->iopf_enabled;
 	for (i = 0; i < npage; i++) {
@@ -1058,14 +1382,24 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
 
 		iova = user_pfn[i] << PAGE_SHIFT;
 		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
-		if (!dma)
+		if (!dma) {
+			ret = -EINVAL;
 			goto unpin_exit;
+		}
+
+		if (iopf_group) {
+			ret = vfio_remove_from_range_list(
+					&iopf_group->pinned_range_list, iova);
+			if (ret)
+				goto unpin_exit;
+		}
+
 		vfio_unpin_page_external(dma, iova, do_accounting);
 	}
 
 unpin_exit:
 	mutex_unlock(&iommu->lock);
-	return i > npage ? npage : (i > 0 ? i : -EINVAL);
+	return i > npage ? npage : (i > 0 ? i : ret);
 }
 
 static long vfio_sync_unpin(struct vfio_dma *dma, struct vfio_domain *domain,
@@ -2591,6 +2925,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 
 		iopf_group->iommu_group = iommu_group;
 		iopf_group->iommu = iommu;
+		iopf_group->pinned_range_list = RB_ROOT;
 
 		vfio_link_iopf_group(iopf_group);
 	}
@@ -2886,6 +3221,8 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 
 			iopf_group = vfio_find_iopf_group(iommu_group);
 			if (!WARN_ON(!iopf_group)) {
+				WARN_ON(!RB_EMPTY_ROOT(
+						&iopf_group->pinned_range_list));
 				vfio_unlink_iopf_group(iopf_group);
 				kfree(iopf_group);
 			}
@@ -3482,6 +3819,7 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
 	struct vfio_iommu *iommu;
 	struct vfio_dma *dma;
 	struct vfio_batch batch;
+	struct vfio_range *range;
 	dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
 	int access_flags = 0;
 	size_t premap_len, map_len, mapped_len = 0;
@@ -3506,6 +3844,12 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
 
 	mutex_lock(&iommu->lock);
 
+	if (iopf_group->selective_faulting) {
+		range = vfio_find_range(&iopf_group->pinned_range_list, iova);
+		if (!range)
+			goto out_invalid;
+	}
+
 	ret = vfio_find_dma_valid(iommu, iova, PAGE_SIZE, &dma);
 	if (ret < 0)
 		goto out_invalid;
@@ -3523,6 +3867,12 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
 
 	premap_len = IOPF_PREMAP_LEN << PAGE_SHIFT;
 	npages = dma->size >> PAGE_SHIFT;
+	if (iopf_group->selective_faulting) {
+		dma_addr_t range_end = range->base_iova + range->span;
+
+		if (range_end < dma->iova + dma->size)
+			npages = (range_end - dma->iova) >> PAGE_SHIFT;
+	}
 	map_len = PAGE_SIZE;
 	for (i = bit_offset + 1; i < npages; i++) {
 		if (map_len >= premap_len || IOPF_MAPPED_BITMAP_GET(dma, i))
@@ -3647,6 +3997,7 @@ static int vfio_iommu_type1_enable_iopf(struct vfio_iommu *iommu)
 
 			iopf_group->iommu_group = g->iommu_group;
 			iopf_group->iommu = iommu;
+			iopf_group->pinned_range_list = RB_ROOT;
 
 			vfio_link_iopf_group(iopf_group);
 		}
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index b7e18bde5aa8..a7b426d579df 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -87,6 +87,7 @@ struct vfio_iommu_driver_ops {
 				     int npage, int prot,
 				     unsigned long *phys_pfn);
 	int		(*unpin_pages)(void *iommu_data,
+				       struct iommu_group *group,
 				       unsigned long *user_pfn, int npage);
 	int		(*register_notifier)(void *iommu_data,
 					     unsigned long *events,
-- 
2.19.1


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [RFC PATCH v3 8/8] vfio: Add nested IOPF support
  2021-04-09  3:44 [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough Shenming Lu
                   ` (6 preceding siblings ...)
  2021-04-09  3:44 ` [RFC PATCH v3 7/8] vfio/type1: Add selective DMA faulting support Shenming Lu
@ 2021-04-09  3:44 ` Shenming Lu
  2021-05-18 18:58   ` Alex Williamson
  2021-04-26  1:41 ` [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough Shenming Lu
  2021-05-18 18:57 ` Alex Williamson
  9 siblings, 1 reply; 40+ messages in thread
From: Shenming Lu @ 2021-04-09  3:44 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Will Deacon, Robin Murphy,
	Joerg Roedel, Jean-Philippe Brucker, Eric Auger, kvm,
	linux-kernel, linux-arm-kernel, iommu, linux-api
  Cc: Kevin Tian, Lu Baolu, yi.l.liu, Christoph Hellwig,
	Jonathan Cameron, Barry Song, wanghaibin.wang, yuzenghui,
	lushenming

To set up nested mode, drivers such as vfio_pci need to register a
handler to receive stage/level 1 faults from the IOMMU. However, each
device can currently have only one iommu dev fault handler, and stage 2
IOPF may already be enabled (VFIO_IOMMU_ENABLE_IOPF), so we choose to
update the registered (consolidated) handler via flags (setting
FAULT_REPORT_NESTED_L1), and further deliver the received stage 1
faults in the handler to the guest through a newly added
vfio_device_ops callback.

Signed-off-by: Shenming Lu <lushenming@huawei.com>
---
 drivers/vfio/vfio.c             | 81 +++++++++++++++++++++++++++++++++
 drivers/vfio/vfio_iommu_type1.c | 49 +++++++++++++++++++-
 include/linux/vfio.h            | 12 +++++
 3 files changed, 141 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 44c8dfabf7de..4245f15914bf 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -2356,6 +2356,87 @@ struct iommu_domain *vfio_group_iommu_domain(struct vfio_group *group)
 }
 EXPORT_SYMBOL_GPL(vfio_group_iommu_domain);
 
+/*
+ * Register/Update the VFIO IOPF handler to receive
+ * nested stage/level 1 faults.
+ */
+int vfio_iommu_dev_fault_handler_register_nested(struct device *dev)
+{
+	struct vfio_container *container;
+	struct vfio_group *group;
+	struct vfio_iommu_driver *driver;
+	int ret;
+
+	if (!dev)
+		return -EINVAL;
+
+	group = vfio_group_get_from_dev(dev);
+	if (!group)
+		return -ENODEV;
+
+	ret = vfio_group_add_container_user(group);
+	if (ret)
+		goto out;
+
+	container = group->container;
+	driver = container->iommu_driver;
+	if (likely(driver && driver->ops->register_handler))
+		ret = driver->ops->register_handler(container->iommu_data, dev);
+	else
+		ret = -ENOTTY;
+
+	vfio_group_try_dissolve_container(group);
+
+out:
+	vfio_group_put(group);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_register_nested);
+
+int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev)
+{
+	struct vfio_container *container;
+	struct vfio_group *group;
+	struct vfio_iommu_driver *driver;
+	int ret;
+
+	if (!dev)
+		return -EINVAL;
+
+	group = vfio_group_get_from_dev(dev);
+	if (!group)
+		return -ENODEV;
+
+	ret = vfio_group_add_container_user(group);
+	if (ret)
+		goto out;
+
+	container = group->container;
+	driver = container->iommu_driver;
+	if (likely(driver && driver->ops->unregister_handler))
+		ret = driver->ops->unregister_handler(container->iommu_data, dev);
+	else
+		ret = -ENOTTY;
+
+	vfio_group_try_dissolve_container(group);
+
+out:
+	vfio_group_put(group);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_unregister_nested);
+
+int vfio_transfer_iommu_fault(struct device *dev, struct iommu_fault *fault)
+{
+	struct vfio_device *device = dev_get_drvdata(dev);
+
+	if (unlikely(!device->ops->transfer))
+		return -EOPNOTSUPP;
+
+	return device->ops->transfer(device->device_data, fault);
+}
+EXPORT_SYMBOL_GPL(vfio_transfer_iommu_fault);
+
 /**
  * Module/class support
  */
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index ba2b5a1cf6e9..9d1adeddb303 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -3821,13 +3821,32 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
 	struct vfio_batch batch;
 	struct vfio_range *range;
 	dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
-	int access_flags = 0;
+	int access_flags = 0, nested;
 	size_t premap_len, map_len, mapped_len = 0;
 	unsigned long bit_offset, vaddr, pfn, i, npages;
 	int ret;
 	enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
 	struct iommu_page_response resp = {0};
 
+	if (vfio_dev_domian_nested(dev, &nested))
+		return -ENODEV;
+
+	/*
+	 * When configured in nested mode, further deliver the
+	 * stage/level 1 faults to the guest.
+	 */
+	if (nested) {
+		bool l2 = true;
+
+		if (fault->type == IOMMU_FAULT_PAGE_REQ)
+			l2 = fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_L2;
+		else if (fault->type == IOMMU_FAULT_DMA_UNRECOV)
+			l2 = fault->event.flags & IOMMU_FAULT_UNRECOV_L2;
+
+		if (!l2)
+			return vfio_transfer_iommu_fault(dev, fault);
+	}
+
 	if (fault->type != IOMMU_FAULT_PAGE_REQ)
 		return -EOPNOTSUPP;
 
@@ -4201,6 +4220,32 @@ static void vfio_iommu_type1_notify(void *iommu_data,
 	wake_up_all(&iommu->vaddr_wait);
 }
 
+static int vfio_iommu_type1_register_handler(void *iommu_data,
+					     struct device *dev)
+{
+	struct vfio_iommu *iommu = iommu_data;
+
+	if (iommu->iopf_enabled)
+		return iommu_update_device_fault_handler(dev, ~0,
+						FAULT_REPORT_NESTED_L1);
+	else
+		return iommu_register_device_fault_handler(dev,
+						vfio_iommu_type1_dma_map_iopf,
+						FAULT_REPORT_NESTED_L1, dev);
+}
+
+static int vfio_iommu_type1_unregister_handler(void *iommu_data,
+					       struct device *dev)
+{
+	struct vfio_iommu *iommu = iommu_data;
+
+	if (iommu->iopf_enabled)
+		return iommu_update_device_fault_handler(dev,
+						~FAULT_REPORT_NESTED_L1, 0);
+	else
+		return iommu_unregister_device_fault_handler(dev);
+}
+
 static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
 	.name			= "vfio-iommu-type1",
 	.owner			= THIS_MODULE,
@@ -4216,6 +4261,8 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
 	.dma_rw			= vfio_iommu_type1_dma_rw,
 	.group_iommu_domain	= vfio_iommu_type1_group_iommu_domain,
 	.notify			= vfio_iommu_type1_notify,
+	.register_handler	= vfio_iommu_type1_register_handler,
+	.unregister_handler	= vfio_iommu_type1_unregister_handler,
 };
 
 static int __init vfio_iommu_type1_init(void)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index a7b426d579df..4621d8f0395d 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -29,6 +29,8 @@
  * @match: Optional device name match callback (return: 0 for no-match, >0 for
  *         match, -errno for abort (ex. match with insufficient or incorrect
  *         additional args)
+ * @transfer: Optional. Transfer the received stage/level 1 faults to the guest
+ *            for nested mode.
  */
 struct vfio_device_ops {
 	char	*name;
@@ -43,6 +45,7 @@ struct vfio_device_ops {
 	int	(*mmap)(void *device_data, struct vm_area_struct *vma);
 	void	(*request)(void *device_data, unsigned int count);
 	int	(*match)(void *device_data, char *buf);
+	int	(*transfer)(void *device_data, struct iommu_fault *fault);
 };
 
 extern struct iommu_group *vfio_iommu_group_get(struct device *dev);
@@ -100,6 +103,10 @@ struct vfio_iommu_driver_ops {
 						   struct iommu_group *group);
 	void		(*notify)(void *iommu_data,
 				  enum vfio_iommu_notify_type event);
+	int		(*register_handler)(void *iommu_data,
+					    struct device *dev);
+	int		(*unregister_handler)(void *iommu_data,
+					      struct device *dev);
 };
 
 extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
@@ -161,6 +168,11 @@ extern int vfio_unregister_notifier(struct device *dev,
 struct kvm;
 extern void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm);
 
+extern int vfio_iommu_dev_fault_handler_register_nested(struct device *dev);
+extern int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev);
+extern int vfio_transfer_iommu_fault(struct device *dev,
+				     struct iommu_fault *fault);
+
 /*
  * Sub-module helpers
  */
-- 
2.19.1



* Re: [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough
  2021-04-09  3:44 [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough Shenming Lu
                   ` (7 preceding siblings ...)
  2021-04-09  3:44 ` [RFC PATCH v3 8/8] vfio: Add nested IOPF support Shenming Lu
@ 2021-04-26  1:41 ` Shenming Lu
  2021-05-11 11:30   ` Shenming Lu
  2021-05-18 18:57 ` Alex Williamson
  9 siblings, 1 reply; 40+ messages in thread
From: Shenming Lu @ 2021-04-26  1:41 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Will Deacon, Robin Murphy,
	Joerg Roedel, Jean-Philippe Brucker, Eric Auger, kvm,
	linux-kernel, linux-arm-kernel, iommu, linux-api
  Cc: Kevin Tian, Lu Baolu, yi.l.liu, Christoph Hellwig,
	Jonathan Cameron, Barry Song, wanghaibin.wang, yuzenghui

On 2021/4/9 11:44, Shenming Lu wrote:
> Hi,
> 
> Requesting for your comments and suggestions. :-)

Kind ping...

> 
> The static pinning and mapping problem in VFIO and possible solutions
> have been discussed a lot [1, 2]. One of the solutions is to add I/O
> Page Fault support for VFIO devices. Different from those relatively
> complicated software approaches such as presenting a vIOMMU that provides
> the DMA buffer information (might include para-virtualized optimizations),
> IOPF mainly depends on the hardware faulting capability, such as the PCIe
> PRI extension or Arm SMMU stall model. What's more, the IOPF support in
> the IOMMU driver has already been implemented in SVA [3]. So we add IOPF
> support for VFIO passthrough based on the IOPF part of SVA in this series.
> 
> We have measured its performance with UADK [4] (passthrough an accelerator
> to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA):
> 
> Run hisi_sec_test...
>  - with varying sending times and message lengths
>  - with/without IOPF enabled (speed slowdown)
> 
> when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1):
>             slowdown (num of faults)
>  times      VFIO IOPF      host SVA
>  1          63.4% (518)    82.8% (512)
>  100        22.9% (1058)   47.9% (1024)
>  1000       2.6% (1071)    8.5% (1024)
> 
> when msg_len = 10MB (and PREMAP_LEN = 512):
>             slowdown (num of faults)
>  times      VFIO IOPF
>  1          32.6% (13)
>  100        3.5% (26)
>  1000       1.6% (26)
> 
> History:
> 
> v2 -> v3
>  - Nit fixes.
>  - No reason to disable reporting the unrecoverable faults. (baolu)
>  - Maintain a global IOPF enabled group list.
>  - Split the pre-mapping optimization to be a separate patch.
>  - Add selective faulting support (use vfio_pin_pages to indicate the
>    non-faultable scope and add a new struct vfio_range to record it,
>    untested). (Kevin)
> 
> v1 -> v2
>  - Numerous improvements following the suggestions. Thanks a lot to all
>    of you.
> 
> Note that PRI is not supported at the moment since there is no hardware.
> 
> Links:
> [1] Lesokhin I, et al. Page Fault Support for Network Controllers. In ASPLOS,
>     2016.
> [2] Tian K, et al. coIOMMU: A Virtual IOMMU with Cooperative DMA Buffer Tracking
>     for Efficient Memory Management in Direct I/O. In USENIX ATC, 2020.
> [3] https://patchwork.kernel.org/project/linux-arm-kernel/cover/20210401154718.307519-1-jean-philippe@linaro.org/
> [4] https://github.com/Linaro/uadk
> 
> Thanks,
> Shenming
> 
> 
> Shenming Lu (8):
>   iommu: Evolve the device fault reporting framework
>   vfio/type1: Add a page fault handler
>   vfio/type1: Add an MMU notifier to avoid pinning
>   vfio/type1: Pre-map more pages than requested in the IOPF handling
>   vfio/type1: VFIO_IOMMU_ENABLE_IOPF
>   vfio/type1: No need to statically pin and map if IOPF enabled
>   vfio/type1: Add selective DMA faulting support
>   vfio: Add nested IOPF support
> 
>  .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |    3 +-
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |   18 +-
>  drivers/iommu/iommu.c                         |   56 +-
>  drivers/vfio/vfio.c                           |   85 +-
>  drivers/vfio/vfio_iommu_type1.c               | 1000 ++++++++++++++++-
>  include/linux/iommu.h                         |   19 +-
>  include/linux/vfio.h                          |   13 +
>  include/uapi/linux/iommu.h                    |    4 +
>  include/uapi/linux/vfio.h                     |    6 +
>  9 files changed, 1181 insertions(+), 23 deletions(-)
> 


* Re: [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough
  2021-04-26  1:41 ` [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough Shenming Lu
@ 2021-05-11 11:30   ` Shenming Lu
  0 siblings, 0 replies; 40+ messages in thread
From: Shenming Lu @ 2021-05-11 11:30 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, Eric Auger, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, Lu Baolu,
	yi.l.liu, Christoph Hellwig, Jonathan Cameron, Barry Song,
	wanghaibin.wang, yuzenghui

Hi Alex,

Hoping for some suggestions or comments from you, since there seem to be
many unresolved points in this series. :-)

Thanks,
Shenming


On 2021/4/26 9:41, Shenming Lu wrote:
> On 2021/4/9 11:44, Shenming Lu wrote:
>> Hi,
>>
>> Requesting for your comments and suggestions. :-)
> 
> Kind ping...
> 
>>
>> The static pinning and mapping problem in VFIO and possible solutions
>> have been discussed a lot [1, 2]. One of the solutions is to add I/O
>> Page Fault support for VFIO devices. Different from those relatively
>> complicated software approaches such as presenting a vIOMMU that provides
>> the DMA buffer information (might include para-virtualized optimizations),
>> IOPF mainly depends on the hardware faulting capability, such as the PCIe
>> PRI extension or Arm SMMU stall model. What's more, the IOPF support in
>> the IOMMU driver has already been implemented in SVA [3]. So we add IOPF
>> support for VFIO passthrough based on the IOPF part of SVA in this series.
>>
>> We have measured its performance with UADK [4] (passthrough an accelerator
>> to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA):
>>
>> Run hisi_sec_test...
>>  - with varying sending times and message lengths
>>  - with/without IOPF enabled (speed slowdown)
>>
>> when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1):
>>             slowdown (num of faults)
>>  times      VFIO IOPF      host SVA
>>  1          63.4% (518)    82.8% (512)
>>  100        22.9% (1058)   47.9% (1024)
>>  1000       2.6% (1071)    8.5% (1024)
>>
>> when msg_len = 10MB (and PREMAP_LEN = 512):
>>             slowdown (num of faults)
>>  times      VFIO IOPF
>>  1          32.6% (13)
>>  100        3.5% (26)
>>  1000       1.6% (26)
>>
>> History:
>>
>> v2 -> v3
>>  - Nit fixes.
>>  - No reason to disable reporting the unrecoverable faults. (baolu)
>>  - Maintain a global IOPF enabled group list.
>>  - Split the pre-mapping optimization to be a separate patch.
>>  - Add selective faulting support (use vfio_pin_pages to indicate the
>>    non-faultable scope and add a new struct vfio_range to record it,
>>    untested). (Kevin)
>>
>> v1 -> v2
>>  - Numerous improvements following the suggestions. Thanks a lot to all
>>    of you.
>>
>> Note that PRI is not supported at the moment since there is no hardware.
>>
>> Links:
>> [1] Lesokhin I, et al. Page Fault Support for Network Controllers. In ASPLOS,
>>     2016.
>> [2] Tian K, et al. coIOMMU: A Virtual IOMMU with Cooperative DMA Buffer Tracking
>>     for Efficient Memory Management in Direct I/O. In USENIX ATC, 2020.
>> [3] https://patchwork.kernel.org/project/linux-arm-kernel/cover/20210401154718.307519-1-jean-philippe@linaro.org/
>> [4] https://github.com/Linaro/uadk
>>
>> Thanks,
>> Shenming
>>
>>
>> Shenming Lu (8):
>>   iommu: Evolve the device fault reporting framework
>>   vfio/type1: Add a page fault handler
>>   vfio/type1: Add an MMU notifier to avoid pinning
>>   vfio/type1: Pre-map more pages than requested in the IOPF handling
>>   vfio/type1: VFIO_IOMMU_ENABLE_IOPF
>>   vfio/type1: No need to statically pin and map if IOPF enabled
>>   vfio/type1: Add selective DMA faulting support
>>   vfio: Add nested IOPF support
>>
>>  .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |    3 +-
>>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |   18 +-
>>  drivers/iommu/iommu.c                         |   56 +-
>>  drivers/vfio/vfio.c                           |   85 +-
>>  drivers/vfio/vfio_iommu_type1.c               | 1000 ++++++++++++++++-
>>  include/linux/iommu.h                         |   19 +-
>>  include/linux/vfio.h                          |   13 +
>>  include/uapi/linux/iommu.h                    |    4 +
>>  include/uapi/linux/vfio.h                     |    6 +
>>  9 files changed, 1181 insertions(+), 23 deletions(-)
>>


* Re: [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough
  2021-04-09  3:44 [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough Shenming Lu
                   ` (8 preceding siblings ...)
  2021-04-26  1:41 ` [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough Shenming Lu
@ 2021-05-18 18:57 ` Alex Williamson
  2021-05-21  6:37   ` Shenming Lu
  9 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2021-05-18 18:57 UTC (permalink / raw)
  To: Shenming Lu
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, Eric Auger, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, Lu Baolu,
	yi.l.liu, Christoph Hellwig, Jonathan Cameron, Barry Song,
	wanghaibin.wang, yuzenghui

On Fri, 9 Apr 2021 11:44:12 +0800
Shenming Lu <lushenming@huawei.com> wrote:

> Hi,
> 
> Requesting for your comments and suggestions. :-)
> 
> The static pinning and mapping problem in VFIO and possible solutions
> have been discussed a lot [1, 2]. One of the solutions is to add I/O
> Page Fault support for VFIO devices. Different from those relatively
> complicated software approaches such as presenting a vIOMMU that provides
> the DMA buffer information (might include para-virtualized optimizations),
> IOPF mainly depends on the hardware faulting capability, such as the PCIe
> PRI extension or Arm SMMU stall model. What's more, the IOPF support in
> the IOMMU driver has already been implemented in SVA [3]. So we add IOPF
> support for VFIO passthrough based on the IOPF part of SVA in this series.

The SVA proposals are being reworked to make use of a new IOASID
object, it's not clear to me that this shouldn't also make use of that
work as it does a significant expansion of the type1 IOMMU with fault
handlers that would duplicate new work using that new model.

> We have measured its performance with UADK [4] (passthrough an accelerator
> to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA):
> 
> Run hisi_sec_test...
>  - with varying sending times and message lengths
>  - with/without IOPF enabled (speed slowdown)
> 
> when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1):
>             slowdown (num of faults)
>  times      VFIO IOPF      host SVA
>  1          63.4% (518)    82.8% (512)
>  100        22.9% (1058)   47.9% (1024)
>  1000       2.6% (1071)    8.5% (1024)
> 
> when msg_len = 10MB (and PREMAP_LEN = 512):
>             slowdown (num of faults)
>  times      VFIO IOPF
>  1          32.6% (13)
>  100        3.5% (26)
>  1000       1.6% (26)

It seems like this is only an example that you can make a benchmark
show anything you want.  The best results would be to pre-map
everything, which is what we have without this series.  What is an
acceptable overhead to incur to avoid page pinning?  What if userspace
had more fine grained control over which mappings were available for
faulting and which were statically mapped?  I don't really see what
sense the pre-mapping range makes.  If I assume the user is QEMU in a
non-vIOMMU configuration, pre-mapping the beginning of each RAM section
doesn't make any logical sense relative to device DMA.

Comments per patch to follow.  Thanks,

Alex


> History:
> 
> v2 -> v3
>  - Nit fixes.
>  - No reason to disable reporting the unrecoverable faults. (baolu)
>  - Maintain a global IOPF enabled group list.
>  - Split the pre-mapping optimization to be a separate patch.
>  - Add selective faulting support (use vfio_pin_pages to indicate the
>    non-faultable scope and add a new struct vfio_range to record it,
>    untested). (Kevin)
> 
> v1 -> v2
>  - Numerous improvements following the suggestions. Thanks a lot to all
>    of you.
> 
> Note that PRI is not supported at the moment since there is no hardware.
> 
> Links:
> [1] Lesokhin I, et al. Page Fault Support for Network Controllers. In ASPLOS,
>     2016.
> [2] Tian K, et al. coIOMMU: A Virtual IOMMU with Cooperative DMA Buffer Tracking
>     for Efficient Memory Management in Direct I/O. In USENIX ATC, 2020.
> [3] https://patchwork.kernel.org/project/linux-arm-kernel/cover/20210401154718.307519-1-jean-philippe@linaro.org/
> [4] https://github.com/Linaro/uadk
> 
> Thanks,
> Shenming
> 
> 
> Shenming Lu (8):
>   iommu: Evolve the device fault reporting framework
>   vfio/type1: Add a page fault handler
>   vfio/type1: Add an MMU notifier to avoid pinning
>   vfio/type1: Pre-map more pages than requested in the IOPF handling
>   vfio/type1: VFIO_IOMMU_ENABLE_IOPF
>   vfio/type1: No need to statically pin and map if IOPF enabled
>   vfio/type1: Add selective DMA faulting support
>   vfio: Add nested IOPF support
> 
>  .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |    3 +-
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |   18 +-
>  drivers/iommu/iommu.c                         |   56 +-
>  drivers/vfio/vfio.c                           |   85 +-
>  drivers/vfio/vfio_iommu_type1.c               | 1000 ++++++++++++++++-
>  include/linux/iommu.h                         |   19 +-
>  include/linux/vfio.h                          |   13 +
>  include/uapi/linux/iommu.h                    |    4 +
>  include/uapi/linux/vfio.h                     |    6 +
>  9 files changed, 1181 insertions(+), 23 deletions(-)
> 



* Re: [RFC PATCH v3 8/8] vfio: Add nested IOPF support
  2021-04-09  3:44 ` [RFC PATCH v3 8/8] vfio: Add nested IOPF support Shenming Lu
@ 2021-05-18 18:58   ` Alex Williamson
  2021-05-21  7:59     ` Shenming Lu
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2021-05-18 18:58 UTC (permalink / raw)
  To: Shenming Lu
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, Eric Auger, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, Lu Baolu,
	yi.l.liu, Christoph Hellwig, Jonathan Cameron, Barry Song,
	wanghaibin.wang, yuzenghui

On Fri, 9 Apr 2021 11:44:20 +0800
Shenming Lu <lushenming@huawei.com> wrote:

> To set up nested mode, drivers such as vfio_pci need to register a
> handler to receive stage/level 1 faults from the IOMMU. However, each
> device can currently have only one iommu dev fault handler, and stage 2
> IOPF may already be enabled (VFIO_IOMMU_ENABLE_IOPF), so we choose to
> update the registered (consolidated) handler via flags (setting
> FAULT_REPORT_NESTED_L1), and further deliver the received stage 1
> faults in the handler to the guest through a newly added
> vfio_device_ops callback.

Are there proposed in-kernel drivers that would use any of these
symbols?

> Signed-off-by: Shenming Lu <lushenming@huawei.com>
> ---
>  drivers/vfio/vfio.c             | 81 +++++++++++++++++++++++++++++++++
>  drivers/vfio/vfio_iommu_type1.c | 49 +++++++++++++++++++-
>  include/linux/vfio.h            | 12 +++++
>  3 files changed, 141 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 44c8dfabf7de..4245f15914bf 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -2356,6 +2356,87 @@ struct iommu_domain *vfio_group_iommu_domain(struct vfio_group *group)
>  }
>  EXPORT_SYMBOL_GPL(vfio_group_iommu_domain);
>  
> +/*
> + * Register/Update the VFIO IOPF handler to receive
> + * nested stage/level 1 faults.
> + */
> +int vfio_iommu_dev_fault_handler_register_nested(struct device *dev)
> +{
> +	struct vfio_container *container;
> +	struct vfio_group *group;
> +	struct vfio_iommu_driver *driver;
> +	int ret;
> +
> +	if (!dev)
> +		return -EINVAL;
> +
> +	group = vfio_group_get_from_dev(dev);
> +	if (!group)
> +		return -ENODEV;
> +
> +	ret = vfio_group_add_container_user(group);
> +	if (ret)
> +		goto out;
> +
> +	container = group->container;
> +	driver = container->iommu_driver;
> +	if (likely(driver && driver->ops->register_handler))
> +		ret = driver->ops->register_handler(container->iommu_data, dev);
> +	else
> +		ret = -ENOTTY;
> +
> +	vfio_group_try_dissolve_container(group);
> +
> +out:
> +	vfio_group_put(group);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_register_nested);
> +
> +int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev)
> +{
> +	struct vfio_container *container;
> +	struct vfio_group *group;
> +	struct vfio_iommu_driver *driver;
> +	int ret;
> +
> +	if (!dev)
> +		return -EINVAL;
> +
> +	group = vfio_group_get_from_dev(dev);
> +	if (!group)
> +		return -ENODEV;
> +
> +	ret = vfio_group_add_container_user(group);
> +	if (ret)
> +		goto out;
> +
> +	container = group->container;
> +	driver = container->iommu_driver;
> +	if (likely(driver && driver->ops->unregister_handler))
> +		ret = driver->ops->unregister_handler(container->iommu_data, dev);
> +	else
> +		ret = -ENOTTY;
> +
> +	vfio_group_try_dissolve_container(group);
> +
> +out:
> +	vfio_group_put(group);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_unregister_nested);
> +
> +int vfio_transfer_iommu_fault(struct device *dev, struct iommu_fault *fault)
> +{
> +	struct vfio_device *device = dev_get_drvdata(dev);
> +
> +	if (unlikely(!device->ops->transfer))
> +		return -EOPNOTSUPP;
> +
> +	return device->ops->transfer(device->device_data, fault);
> +}
> +EXPORT_SYMBOL_GPL(vfio_transfer_iommu_fault);
> +
>  /**
>   * Module/class support
>   */
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index ba2b5a1cf6e9..9d1adeddb303 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -3821,13 +3821,32 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
>  	struct vfio_batch batch;
>  	struct vfio_range *range;
>  	dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
> -	int access_flags = 0;
> +	int access_flags = 0, nested;
>  	size_t premap_len, map_len, mapped_len = 0;
>  	unsigned long bit_offset, vaddr, pfn, i, npages;
>  	int ret;
>  	enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
>  	struct iommu_page_response resp = {0};
>  
> +	if (vfio_dev_domian_nested(dev, &nested))
> +		return -ENODEV;
> +
> +	/*
> +	 * When configured in nested mode, further deliver the
> +	 * stage/level 1 faults to the guest.
> +	 */
> +	if (nested) {
> +		bool l2 = true;
> +
> +		if (fault->type == IOMMU_FAULT_PAGE_REQ)
> +			l2 = fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_L2;
> +		else if (fault->type == IOMMU_FAULT_DMA_UNRECOV)
> +			l2 = fault->event.flags & IOMMU_FAULT_UNRECOV_L2;
> +
> +		if (!l2)
> +			return vfio_transfer_iommu_fault(dev, fault);
> +	}
> +
>  	if (fault->type != IOMMU_FAULT_PAGE_REQ)
>  		return -EOPNOTSUPP;
>  
> @@ -4201,6 +4220,32 @@ static void vfio_iommu_type1_notify(void *iommu_data,
>  	wake_up_all(&iommu->vaddr_wait);
>  }
>  
> +static int vfio_iommu_type1_register_handler(void *iommu_data,
> +					     struct device *dev)
> +{
> +	struct vfio_iommu *iommu = iommu_data;
> +
> +	if (iommu->iopf_enabled)
> +		return iommu_update_device_fault_handler(dev, ~0,
> +						FAULT_REPORT_NESTED_L1);
> +	else
> +		return iommu_register_device_fault_handler(dev,
> +						vfio_iommu_type1_dma_map_iopf,
> +						FAULT_REPORT_NESTED_L1, dev);
> +}
> +
> +static int vfio_iommu_type1_unregister_handler(void *iommu_data,
> +					       struct device *dev)
> +{
> +	struct vfio_iommu *iommu = iommu_data;
> +
> +	if (iommu->iopf_enabled)
> +		return iommu_update_device_fault_handler(dev,
> +						~FAULT_REPORT_NESTED_L1, 0);
> +	else
> +		return iommu_unregister_device_fault_handler(dev);
> +}


The path through vfio to register this is pretty ugly, but I don't see
any reason for the update interfaces here; the previously registered
handler just changes its behavior.


> +
>  static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
>  	.name			= "vfio-iommu-type1",
>  	.owner			= THIS_MODULE,
> @@ -4216,6 +4261,8 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
>  	.dma_rw			= vfio_iommu_type1_dma_rw,
>  	.group_iommu_domain	= vfio_iommu_type1_group_iommu_domain,
>  	.notify			= vfio_iommu_type1_notify,
> +	.register_handler	= vfio_iommu_type1_register_handler,
> +	.unregister_handler	= vfio_iommu_type1_unregister_handler,
>  };
>  
>  static int __init vfio_iommu_type1_init(void)
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index a7b426d579df..4621d8f0395d 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -29,6 +29,8 @@
>   * @match: Optional device name match callback (return: 0 for no-match, >0 for
>   *         match, -errno for abort (ex. match with insufficient or incorrect
>   *         additional args)
> + * @transfer: Optional. Transfer the received stage/level 1 faults to the guest
> + *            for nested mode.
>   */
>  struct vfio_device_ops {
>  	char	*name;
> @@ -43,6 +45,7 @@ struct vfio_device_ops {
>  	int	(*mmap)(void *device_data, struct vm_area_struct *vma);
>  	void	(*request)(void *device_data, unsigned int count);
>  	int	(*match)(void *device_data, char *buf);
> +	int	(*transfer)(void *device_data, struct iommu_fault *fault);
>  };
>  
>  extern struct iommu_group *vfio_iommu_group_get(struct device *dev);
> @@ -100,6 +103,10 @@ struct vfio_iommu_driver_ops {
>  						   struct iommu_group *group);
>  	void		(*notify)(void *iommu_data,
>  				  enum vfio_iommu_notify_type event);
> +	int		(*register_handler)(void *iommu_data,
> +					    struct device *dev);
> +	int		(*unregister_handler)(void *iommu_data,
> +					      struct device *dev);
>  };
>  
>  extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
> @@ -161,6 +168,11 @@ extern int vfio_unregister_notifier(struct device *dev,
>  struct kvm;
>  extern void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm);
>  
> +extern int vfio_iommu_dev_fault_handler_register_nested(struct device *dev);
> +extern int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev);
> +extern int vfio_transfer_iommu_fault(struct device *dev,
> +				     struct iommu_fault *fault);
> +
>  /*
>   * Sub-module helpers
>   */


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v3 7/8] vfio/type1: Add selective DMA faulting support
  2021-04-09  3:44 ` [RFC PATCH v3 7/8] vfio/type1: Add selective DMA faulting support Shenming Lu
@ 2021-05-18 18:58   ` Alex Williamson
  2021-05-21  6:39     ` Shenming Lu
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2021-05-18 18:58 UTC (permalink / raw)
  To: Shenming Lu
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, Eric Auger, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, Lu Baolu,
	yi.l.liu, Christoph Hellwig, Jonathan Cameron, Barry Song,
	wanghaibin.wang, yuzenghui

On Fri, 9 Apr 2021 11:44:19 +0800
Shenming Lu <lushenming@huawei.com> wrote:

> Some devices only allow selective DMA faulting. Similar to the selective
> dirty page tracking, the vendor driver can call vfio_pin_pages() to
> indicate the non-faultable scope. We add a new struct vfio_range to
> record it; then, when the IOPF handler receives any page request outside
> the scope, we can directly return an invalid response.

Seems like this highlights a deficiency in the design, that the user
can't specify mappings as iopf enabled or disabled.  Also, if the
vendor driver has pinned pages within the range, shouldn't that prevent
them from faulting in the first place?  Why do we need yet more
tracking structures?  Pages pinned by the vendor driver need to count
against the user's locked memory limits regardless of iopf.  Thanks,

Alex



* Re: [RFC PATCH v3 6/8] vfio/type1: No need to statically pin and map if IOPF enabled
  2021-04-09  3:44 ` [RFC PATCH v3 6/8] vfio/type1: No need to statically pin and map if IOPF enabled Shenming Lu
@ 2021-05-18 18:58   ` Alex Williamson
  2021-05-21  6:39     ` Shenming Lu
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2021-05-18 18:58 UTC (permalink / raw)
  To: Shenming Lu
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, Eric Auger, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, Lu Baolu,
	yi.l.liu, Christoph Hellwig, Jonathan Cameron, Barry Song,
	wanghaibin.wang, yuzenghui

On Fri, 9 Apr 2021 11:44:18 +0800
Shenming Lu <lushenming@huawei.com> wrote:

> If IOPF is enabled for the VFIO container, there is no need to statically
> pin and map the entire DMA range; we can do it on demand, and unmap
> according to the IOPF mapped bitmap when removing a vfio_dma.
> 
> Note that we still mark all pages dirty even if IOPF is enabled; we may
> add IOPF-based fine-grained dirty tracking support in the future.
> 
> Signed-off-by: Shenming Lu <lushenming@huawei.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 38 +++++++++++++++++++++++++++------
>  1 file changed, 32 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 7df5711e743a..dcc93c3b258c 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -175,6 +175,7 @@ struct vfio_iopf_group {
>  #define IOPF_MAPPED_BITMAP_GET(dma, i)	\
>  			      ((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG]	\
>  			       >> ((i) % BITS_PER_LONG)) & 0x1)  
> +#define IOPF_MAPPED_BITMAP_BYTES(n)	DIRTY_BITMAP_BYTES(n)
>  
>  #define WAITED 1
>  
> @@ -959,7 +960,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  	 * already pinned and accounted. Accouting should be done if there is no
>  	 * iommu capable domain in the container.
>  	 */
> -	do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
> +	do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) ||
> +			iommu->iopf_enabled;
>  
>  	for (i = 0; i < npage; i++) {
>  		struct vfio_pfn *vpfn;
> @@ -1048,7 +1050,8 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
>  
>  	mutex_lock(&iommu->lock);
>  
> -	do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
> +	do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) ||
> +			iommu->iopf_enabled;

pin/unpin are actually still pinning pages, why does iopf exempt them
from accounting?


>  	for (i = 0; i < npage; i++) {
>  		struct vfio_dma *dma;
>  		dma_addr_t iova;
> @@ -1169,7 +1172,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
>  	if (!dma->size)
>  		return 0;
>  
> -	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
> +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || iommu->iopf_enabled)
>  		return 0;
>  
>  	/*
> @@ -1306,11 +1309,20 @@ static void vfio_unmap_partial_iopf(struct vfio_iommu *iommu,
>  	}
>  }
>  
> +static void vfio_dma_clean_iopf(struct vfio_iommu *iommu, struct vfio_dma *dma)
> +{
> +	vfio_unmap_partial_iopf(iommu, dma, dma->iova, dma->iova + dma->size);
> +
> +	kfree(dma->iopf_mapped_bitmap);
> +}
> +
>  static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  {
>  	WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
>  	vfio_unmap_unpin(iommu, dma, true);
>  	vfio_unlink_dma(iommu, dma);
> +	if (iommu->iopf_enabled)
> +		vfio_dma_clean_iopf(iommu, dma);
>  	put_task_struct(dma->task);
>  	vfio_dma_bitmap_free(dma);
>  	if (dma->vaddr_invalid) {
> @@ -1359,7 +1371,8 @@ static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
>  	 * mark all pages dirty if any IOMMU capable device is not able
>  	 * to report dirty pages and all pages are pinned and mapped.
>  	 */
> -	if (iommu->num_non_pinned_groups && dma->iommu_mapped)
> +	if (iommu->num_non_pinned_groups &&
> +	    (dma->iommu_mapped || iommu->iopf_enabled))
>  		bitmap_set(dma->bitmap, 0, nbits);

This seems like really poor integration of iopf into dirty page
tracking.  I'd expect dirty logging to flush the mapped pages and
write faults to mark pages dirty.  Shouldn't the fault handler also
provide only the access faulted, so for example a read fault wouldn't
mark the page dirty?

>  
>  	if (shift) {
> @@ -1772,6 +1785,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  		goto out_unlock;
>  	}
>  
> +	if (iommu->iopf_enabled) {
> +		dma->iopf_mapped_bitmap = kvzalloc(IOPF_MAPPED_BITMAP_BYTES(
> +						size >> PAGE_SHIFT), GFP_KERNEL);
> +		if (!dma->iopf_mapped_bitmap) {
> +			ret = -ENOMEM;
> +			kfree(dma);
> +			goto out_unlock;
> +		}


So we're assuming nothing can fault and therefore nothing can reference
the iopf_mapped_bitmap until this point in the series?


> +	}
> +
>  	iommu->dma_avail--;
>  	dma->iova = iova;
>  	dma->vaddr = vaddr;
> @@ -1811,8 +1834,11 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	/* Insert zero-sized and grow as we map chunks of it */
>  	vfio_link_dma(iommu, dma);
>  
> -	/* Don't pin and map if container doesn't contain IOMMU capable domain*/
> -	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
> +	/*
> +	 * Don't pin and map if container doesn't contain IOMMU capable domain,
> +	 * or IOPF enabled for the container.
> +	 */
> +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || iommu->iopf_enabled)
>  		dma->size = size;
>  	else
>  		ret = vfio_pin_map_dma(iommu, dma, size);



* Re: [RFC PATCH v3 5/8] vfio/type1: VFIO_IOMMU_ENABLE_IOPF
  2021-04-09  3:44 ` [RFC PATCH v3 5/8] vfio/type1: VFIO_IOMMU_ENABLE_IOPF Shenming Lu
@ 2021-05-18 18:58   ` Alex Williamson
  2021-05-21  6:38     ` Shenming Lu
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2021-05-18 18:58 UTC (permalink / raw)
  To: Shenming Lu
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, Eric Auger, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, Lu Baolu,
	yi.l.liu, Christoph Hellwig, Jonathan Cameron, Barry Song,
	wanghaibin.wang, yuzenghui

On Fri, 9 Apr 2021 11:44:17 +0800
Shenming Lu <lushenming@huawei.com> wrote:

> Since enabling IOPF for devices may lead to a slow ramp up of performance,
> we add an ioctl VFIO_IOMMU_ENABLE_IOPF to make it configurable. And the
> IOPF enabling of a VFIO device includes setting IOMMU_DEV_FEAT_IOPF and
> registering the VFIO IOPF handler.
> 
> Note that VFIO_IOMMU_DISABLE_IOPF is not supported since there may be
> inflight page faults when disabling.
> 
> Signed-off-by: Shenming Lu <lushenming@huawei.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 223 +++++++++++++++++++++++++++++++-
>  include/uapi/linux/vfio.h       |   6 +
>  2 files changed, 226 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 01e296c6dc9e..7df5711e743a 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -71,6 +71,7 @@ struct vfio_iommu {
>  	struct rb_root		dma_list;
>  	struct blocking_notifier_head notifier;
>  	struct mmu_notifier	mn;
> +	struct mm_struct	*mm;

We currently have no requirement that a single mm is used for all
container mappings.  Does enabling IOPF impose such a requirement?
Shouldn't MAP/UNMAP enforce that?

>  	unsigned int		dma_avail;
>  	unsigned int		vaddr_invalid_count;
>  	uint64_t		pgsize_bitmap;
> @@ -81,6 +82,7 @@ struct vfio_iommu {
>  	bool			dirty_page_tracking;
>  	bool			pinned_page_dirty_scope;
>  	bool			container_open;
> +	bool			iopf_enabled;
>  };
>  
>  struct vfio_domain {
> @@ -461,6 +463,38 @@ vfio_find_iopf_group(struct iommu_group *iommu_group)
>  	return node ? iopf_group : NULL;
>  }
>  
> +static void vfio_link_iopf_group(struct vfio_iopf_group *new)
> +{
> +	struct rb_node **link, *parent = NULL;
> +	struct vfio_iopf_group *iopf_group;
> +
> +	mutex_lock(&iopf_group_list_lock);
> +
> +	link = &iopf_group_list.rb_node;
> +
> +	while (*link) {
> +		parent = *link;
> +		iopf_group = rb_entry(parent, struct vfio_iopf_group, node);
> +
> +		if (new->iommu_group < iopf_group->iommu_group)
> +			link = &(*link)->rb_left;
> +		else
> +			link = &(*link)->rb_right;
> +	}
> +
> +	rb_link_node(&new->node, parent, link);
> +	rb_insert_color(&new->node, &iopf_group_list);
> +
> +	mutex_unlock(&iopf_group_list_lock);
> +}
> +
> +static void vfio_unlink_iopf_group(struct vfio_iopf_group *old)
> +{
> +	mutex_lock(&iopf_group_list_lock);
> +	rb_erase(&old->node, &iopf_group_list);
> +	mutex_unlock(&iopf_group_list_lock);
> +}
> +
>  static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>  {
>  	struct mm_struct *mm;
> @@ -2363,6 +2397,68 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu,
>  	list_splice_tail(iova_copy, iova);
>  }
>  
> +static int vfio_dev_domian_nested(struct device *dev, int *nested)


s/domian/domain/


> +{
> +	struct iommu_domain *domain;
> +
> +	domain = iommu_get_domain_for_dev(dev);
> +	if (!domain)
> +		return -ENODEV;
> +
> +	return iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, nested);


Wouldn't this be easier to use if it returned bool, false on -errno?


> +}
> +
> +static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data);
> +
> +static int dev_enable_iopf(struct device *dev, void *data)
> +{
> +	int *enabled_dev_cnt = data;
> +	int nested;
> +	u32 flags;
> +	int ret;
> +
> +	ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_IOPF);
> +	if (ret)
> +		return ret;
> +
> +	ret = vfio_dev_domian_nested(dev, &nested);
> +	if (ret)
> +		goto out_disable;
> +
> +	if (nested)
> +		flags = FAULT_REPORT_NESTED_L2;
> +	else
> +		flags = FAULT_REPORT_FLAT;
> +
> +	ret = iommu_register_device_fault_handler(dev,
> +				vfio_iommu_type1_dma_map_iopf, flags, dev);
> +	if (ret)
> +		goto out_disable;
> +
> +	(*enabled_dev_cnt)++;
> +	return 0;
> +
> +out_disable:
> +	iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF);
> +	return ret;
> +}
> +
> +static int dev_disable_iopf(struct device *dev, void *data)
> +{
> +	int *enabled_dev_cnt = data;
> +
> +	if (enabled_dev_cnt && *enabled_dev_cnt <= 0)
> +		return -1;


Use an errno.


> +
> +	WARN_ON(iommu_unregister_device_fault_handler(dev));
> +	WARN_ON(iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF));
> +
> +	if (enabled_dev_cnt)
> +		(*enabled_dev_cnt)--;


Why do we need these counters?

> +
> +	return 0;
> +}
> +
>  static int vfio_iommu_type1_attach_group(void *iommu_data,
>  					 struct iommu_group *iommu_group)
>  {
> @@ -2376,6 +2472,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	struct iommu_domain_geometry geo;
>  	LIST_HEAD(iova_copy);
>  	LIST_HEAD(group_resv_regions);
> +	int iopf_enabled_dev_cnt = 0;
> +	struct vfio_iopf_group *iopf_group = NULL;
>  
>  	mutex_lock(&iommu->lock);
>  
> @@ -2453,6 +2551,24 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	if (ret)
>  		goto out_domain;
>  
> +	if (iommu->iopf_enabled) {
> +		ret = iommu_group_for_each_dev(iommu_group, &iopf_enabled_dev_cnt,
> +					       dev_enable_iopf);
> +		if (ret)
> +			goto out_detach;
> +
> +		iopf_group = kzalloc(sizeof(*iopf_group), GFP_KERNEL);
> +		if (!iopf_group) {
> +			ret = -ENOMEM;
> +			goto out_detach;
> +		}
> +
> +		iopf_group->iommu_group = iommu_group;
> +		iopf_group->iommu = iommu;
> +
> +		vfio_link_iopf_group(iopf_group);

This seems backwards, once we enable iopf we can take a fault, so the
structure to lookup the data for the device should be setup first.  I'm
still not convinced this iopf_group rb tree is a good solution to get
from the device to the iommu either.  vfio-core could traverse dev ->
iommu_group -> vfio_group -> container -> iommu_data.


> +	}
> +
>  	/* Get aperture info */
>  	iommu_domain_get_attr(domain->domain, DOMAIN_ATTR_GEOMETRY, &geo);
>  
> @@ -2534,9 +2650,11 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	vfio_test_domain_fgsp(domain);
>  
>  	/* replay mappings on new domains */
> -	ret = vfio_iommu_replay(iommu, domain);
> -	if (ret)
> -		goto out_detach;
> +	if (!iommu->iopf_enabled) {
> +		ret = vfio_iommu_replay(iommu, domain);
> +		if (ret)
> +			goto out_detach;
> +	}

Test in vfio_iommu_replay()?

>  
>  	if (resv_msi) {
>  		ret = iommu_get_msi_cookie(domain->domain, resv_msi_base);
> @@ -2567,6 +2685,15 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	iommu_domain_free(domain->domain);
>  	vfio_iommu_iova_free(&iova_copy);
>  	vfio_iommu_resv_free(&group_resv_regions);
> +	if (iommu->iopf_enabled) {
> +		if (iopf_group) {
> +			vfio_unlink_iopf_group(iopf_group);
> +			kfree(iopf_group);
> +		}
> +
> +		iommu_group_for_each_dev(iommu_group, &iopf_enabled_dev_cnt,
> +					 dev_disable_iopf);

Backwards, disable fault reporting before unlinking lookup.

> +	}
>  out_free:
>  	kfree(domain);
>  	kfree(group);
> @@ -2728,6 +2855,19 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
>  		if (!group)
>  			continue;
>  
> +		if (iommu->iopf_enabled) {
> +			struct vfio_iopf_group *iopf_group;
> +
> +			iopf_group = vfio_find_iopf_group(iommu_group);
> +			if (!WARN_ON(!iopf_group)) {
> +				vfio_unlink_iopf_group(iopf_group);
> +				kfree(iopf_group);
> +			}
> +
> +			iommu_group_for_each_dev(iommu_group, NULL,
> +						 dev_disable_iopf);


Backwards.  Also unregistering the fault handler can fail if there are
pending faults.  It appears the fault handler and iopf lookup could also
race with this function, ex. fault handler gets an iopf object before
it's removed from the rb-tree, blocks trying to acquire iommu->lock,
disable_iopf fails, detach proceeds, fault handler has use after free
error.


> +		}
> +
>  		vfio_iommu_detach_group(domain, group);
>  		update_dirty_scope = !group->pinned_page_dirty_scope;
>  		list_del(&group->next);
> @@ -2846,6 +2986,11 @@ static void vfio_iommu_type1_release(void *iommu_data)
>  
>  	vfio_iommu_iova_free(&iommu->iova_list);
>  
> +	if (iommu->iopf_enabled) {
> +		mmu_notifier_unregister(&iommu->mn, iommu->mm);
> +		mmdrop(iommu->mm);
> +	}
> +
>  	kfree(iommu);
>  }
>  
> @@ -3441,6 +3586,76 @@ static const struct mmu_notifier_ops vfio_iommu_type1_mn_ops = {
>  	.invalidate_range	= mn_invalidate_range,
>  };
>  
> +static int vfio_iommu_type1_enable_iopf(struct vfio_iommu *iommu)
> +{
> +	struct vfio_domain *d;
> +	struct vfio_group *g;
> +	struct vfio_iopf_group *iopf_group;
> +	int enabled_dev_cnt = 0;
> +	int ret;
> +
> +	if (!current->mm)
> +		return -ENODEV;
> +
> +	mutex_lock(&iommu->lock);
> +
> +	mmgrab(current->mm);


As above, this is a new and undocumented requirement.


> +	iommu->mm = current->mm;
> +	iommu->mn.ops = &vfio_iommu_type1_mn_ops;
> +	ret = mmu_notifier_register(&iommu->mn, current->mm);
> +	if (ret)
> +		goto out_drop;
> +
> +	list_for_each_entry(d, &iommu->domain_list, next) {
> +		list_for_each_entry(g, &d->group_list, next) {
> +			ret = iommu_group_for_each_dev(g->iommu_group,
> +					&enabled_dev_cnt, dev_enable_iopf);
> +			if (ret)
> +				goto out_unwind;
> +
> +			iopf_group = kzalloc(sizeof(*iopf_group), GFP_KERNEL);
> +			if (!iopf_group) {
> +				ret = -ENOMEM;
> +				goto out_unwind;
> +			}
> +
> +			iopf_group->iommu_group = g->iommu_group;
> +			iopf_group->iommu = iommu;
> +
> +			vfio_link_iopf_group(iopf_group);
> +		}
> +	}
> +
> +	iommu->iopf_enabled = true;
> +	goto out_unlock;
> +
> +out_unwind:
> +	list_for_each_entry(d, &iommu->domain_list, next) {
> +		list_for_each_entry(g, &d->group_list, next) {
> +			iopf_group = vfio_find_iopf_group(g->iommu_group);
> +			if (iopf_group) {
> +				vfio_unlink_iopf_group(iopf_group);
> +				kfree(iopf_group);
> +			}
> +
> +			if (iommu_group_for_each_dev(g->iommu_group,
> +					&enabled_dev_cnt, dev_disable_iopf))
> +				goto out_unregister;


This seems broken, we break out of the unwind if any group_for_each
fails??

> +		}
> +	}
> +
> +out_unregister:
> +	mmu_notifier_unregister(&iommu->mn, current->mm);
> +
> +out_drop:
> +	iommu->mm = NULL;
> +	mmdrop(current->mm);
> +
> +out_unlock:
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}


Is there an assumption that the user does this before performing any
mapping?  I don't see where current mappings would be converted which
means we'll have pinned pages leaked and vfio_dma objects without a
pinned page bitmap.

> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  				   unsigned int cmd, unsigned long arg)
>  {
> @@ -3457,6 +3672,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  		return vfio_iommu_type1_unmap_dma(iommu, arg);
>  	case VFIO_IOMMU_DIRTY_PAGES:
>  		return vfio_iommu_type1_dirty_pages(iommu, arg);
> +	case VFIO_IOMMU_ENABLE_IOPF:
> +		return vfio_iommu_type1_enable_iopf(iommu);
>  	default:
>  		return -ENOTTY;
>  	}
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 8ce36c1d53ca..5497036bebdc 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1208,6 +1208,12 @@ struct vfio_iommu_type1_dirty_bitmap_get {
>  
>  #define VFIO_IOMMU_DIRTY_PAGES             _IO(VFIO_TYPE, VFIO_BASE + 17)
>  
> +/*
> + * IOCTL to enable IOPF for the container.
> + * Called right after VFIO_SET_IOMMU.
> + */
> +#define VFIO_IOMMU_ENABLE_IOPF             _IO(VFIO_TYPE, VFIO_BASE + 18)


We have extensions and flags and capabilities, and nothing tells the
user whether this feature is available to them without trying it?


> +
>  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>  
>  /*



* Re: [RFC PATCH v3 4/8] vfio/type1: Pre-map more pages than requested in the IOPF handling
  2021-04-09  3:44 ` [RFC PATCH v3 4/8] vfio/type1: Pre-map more pages than requested in the IOPF handling Shenming Lu
@ 2021-05-18 18:58   ` Alex Williamson
  2021-05-21  6:37     ` Shenming Lu
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2021-05-18 18:58 UTC (permalink / raw)
  To: Shenming Lu
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, Eric Auger, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, Lu Baolu,
	yi.l.liu, Christoph Hellwig, Jonathan Cameron, Barry Song,
	wanghaibin.wang, yuzenghui

On Fri, 9 Apr 2021 11:44:16 +0800
Shenming Lu <lushenming@huawei.com> wrote:

> To reduce the number of page faults that need handling, we can pre-map
> more pages than requested at once.
> 
> Note that IOPF_PREMAP_LEN is just an arbitrary value for now, which we
> could try further tuning.

I'd prefer that the series introduced full end-to-end functionality
before trying to improve performance.  The pre-map value seems
arbitrary here and as noted in the previous patch, the IOMMU API does
not guarantee unmaps of ranges smaller than the original mapping.  This
would need to map with single page granularity in order to guarantee
page granularity at the mmu notifier when the IOMMU supports
superpages.  Thanks,

Alex



* Re: [RFC PATCH v3 3/8] vfio/type1: Add an MMU notifier to avoid pinning
  2021-04-09  3:44 ` [RFC PATCH v3 3/8] vfio/type1: Add an MMU notifier to avoid pinning Shenming Lu
@ 2021-05-18 18:58   ` Alex Williamson
  2021-05-21  6:37     ` Shenming Lu
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2021-05-18 18:58 UTC (permalink / raw)
  To: Shenming Lu
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, Eric Auger, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, Lu Baolu,
	yi.l.liu, Christoph Hellwig, Jonathan Cameron, Barry Song,
	wanghaibin.wang, yuzenghui

On Fri, 9 Apr 2021 11:44:15 +0800
Shenming Lu <lushenming@huawei.com> wrote:

> To avoid pinning pages while they are mapped in the IOMMU page tables,
> we add an MMU notifier to report the addresses that are no longer valid
> and try to unmap them.
> 
> Signed-off-by: Shenming Lu <lushenming@huawei.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 112 +++++++++++++++++++++++++++++++-
>  1 file changed, 109 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index ab0ff60ee207..1cb9d1f2717b 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -40,6 +40,7 @@
>  #include <linux/notifier.h>
>  #include <linux/dma-iommu.h>
>  #include <linux/irqdomain.h>
> +#include <linux/mmu_notifier.h>
>  
>  #define DRIVER_VERSION  "0.2"
>  #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
> @@ -69,6 +70,7 @@ struct vfio_iommu {
>  	struct mutex		lock;
>  	struct rb_root		dma_list;
>  	struct blocking_notifier_head notifier;
> +	struct mmu_notifier	mn;
>  	unsigned int		dma_avail;
>  	unsigned int		vaddr_invalid_count;
>  	uint64_t		pgsize_bitmap;
> @@ -1204,6 +1206,72 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
>  	return unlocked;
>  }
>  
> +/* Unmap the IOPF mapped pages in the specified range. */
> +static void vfio_unmap_partial_iopf(struct vfio_iommu *iommu,
> +				    struct vfio_dma *dma,
> +				    dma_addr_t start, dma_addr_t end)
> +{
> +	struct iommu_iotlb_gather *gathers;
> +	struct vfio_domain *d;
> +	int i, num_domains = 0;
> +
> +	list_for_each_entry(d, &iommu->domain_list, next)
> +		num_domains++;
> +
> +	gathers = kzalloc(sizeof(*gathers) * num_domains, GFP_KERNEL);
> +	if (gathers) {
> +		for (i = 0; i < num_domains; i++)
> +			iommu_iotlb_gather_init(&gathers[i]);
> +	}


If we're always serialized on iommu->lock, would it make sense to have
gathers pre-allocated on the vfio_iommu object?

> +
> +	while (start < end) {
> +		unsigned long bit_offset;
> +		size_t len;
> +
> +		bit_offset = (start - dma->iova) >> PAGE_SHIFT;
> +
> +		for (len = 0; start + len < end; len += PAGE_SIZE) {
> +			if (!IOPF_MAPPED_BITMAP_GET(dma,
> +					bit_offset + (len >> PAGE_SHIFT)))
> +				break;


There are bitmap helpers for this, find_first_bit(),
find_next_zero_bit().


> +		}
> +
> +		if (len) {
> +			i = 0;
> +			list_for_each_entry(d, &iommu->domain_list, next) {
> +				size_t unmapped;
> +
> +				if (gathers)
> +					unmapped = iommu_unmap_fast(d->domain,
> +								    start, len,
> +								    &gathers[i++]);
> +				else
> +					unmapped = iommu_unmap(d->domain,
> +							       start, len);
> +
> +				if (WARN_ON(unmapped != len))

The IOMMU API does not guarantee arbitrary unmap sizes, this will
trigger and this exit path is wrong.  If we've already unmapped the
IOMMU, shouldn't we proceed with @unmapped rather than @len so the
device can re-fault the extra mappings?  Otherwise the IOMMU state
doesn't match the iopf bitmap state.

> +					goto out;
> +			}
> +
> +			bitmap_clear(dma->iopf_mapped_bitmap,
> +				     bit_offset, len >> PAGE_SHIFT);
> +
> +			cond_resched();
> +		}
> +
> +		start += (len + PAGE_SIZE);
> +	}
> +
> +out:
> +	if (gathers) {
> +		i = 0;
> +		list_for_each_entry(d, &iommu->domain_list, next)
> +			iommu_iotlb_sync(d->domain, &gathers[i++]);
> +
> +		kfree(gathers);
> +	}
> +}
> +
>  static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  {
>  	WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
> @@ -3197,17 +3265,18 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
>  
>  	vaddr = iova - dma->iova + dma->vaddr;
>  
> -	if (vfio_pin_page_external(dma, vaddr, &pfn, true))
> +	if (vfio_pin_page_external(dma, vaddr, &pfn, false))
>  		goto out_invalid;
>  
>  	if (vfio_iommu_map(iommu, iova, pfn, 1, dma->prot)) {
> -		if (put_pfn(pfn, dma->prot))
> -			vfio_lock_acct(dma, -1, true);
> +		put_pfn(pfn, dma->prot);
>  		goto out_invalid;
>  	}
>  
>  	bitmap_set(dma->iopf_mapped_bitmap, bit_offset, 1);
>  
> +	put_pfn(pfn, dma->prot);
> +
>  out_success:
>  	status = IOMMU_PAGE_RESP_SUCCESS;
>  
> @@ -3220,6 +3289,43 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
>  	return 0;
>  }
>  
> +static void mn_invalidate_range(struct mmu_notifier *mn, struct mm_struct *mm,
> +				unsigned long start, unsigned long end)
> +{
> +	struct vfio_iommu *iommu = container_of(mn, struct vfio_iommu, mn);
> +	struct rb_node *n;
> +	int ret;
> +
> +	mutex_lock(&iommu->lock);
> +
> +	ret = vfio_wait_all_valid(iommu);
> +	if (WARN_ON(ret < 0))
> +		return;

Is WARN_ON sufficient for this error condition?  We've been told to
evacuate a range of mm, the device still has DMA access, we've removed
page pinning.  This seems like a BUG_ON condition to me, we can't allow
the system to continue in any way with pages getting unmapped from
under the device.

> +
> +	for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> +		unsigned long start_n, end_n;
> +
> +		if (end <= dma->vaddr || start >= dma->vaddr + dma->size)
> +			continue;
> +
> +		start_n = ALIGN_DOWN(max_t(unsigned long, start, dma->vaddr),
> +				     PAGE_SIZE);
> +		end_n = ALIGN(min_t(unsigned long, end, dma->vaddr + dma->size),
> +			      PAGE_SIZE);
> +
> +		vfio_unmap_partial_iopf(iommu, dma,
> +					start_n - dma->vaddr + dma->iova,
> +					end_n - dma->vaddr + dma->iova);
> +	}
> +
> +	mutex_unlock(&iommu->lock);
> +}
> +
> +static const struct mmu_notifier_ops vfio_iommu_type1_mn_ops = {
> +	.invalidate_range	= mn_invalidate_range,
> +};
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  				   unsigned int cmd, unsigned long arg)
>  {

Again, this patch series is difficult to follow because we're
introducing dead code until the mmu notifier actually has a path to be
registered.  We shouldn't be taking any faults until iopf is enabled,
so it seems like we can add more of the core support alongside this
code.



* Re: [RFC PATCH v3 2/8] vfio/type1: Add a page fault handler
  2021-04-09  3:44 ` [RFC PATCH v3 2/8] vfio/type1: Add a page fault handler Shenming Lu
@ 2021-05-18 18:58   ` Alex Williamson
  2021-05-21  6:38     ` Shenming Lu
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2021-05-18 18:58 UTC (permalink / raw)
  To: Shenming Lu
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, Eric Auger, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, Lu Baolu,
	yi.l.liu, Christoph Hellwig, Jonathan Cameron, Barry Song,
	wanghaibin.wang, yuzenghui

On Fri, 9 Apr 2021 11:44:14 +0800
Shenming Lu <lushenming@huawei.com> wrote:

> VFIO manages the DMA mapping itself. To support IOPF (on-demand paging)
> for VFIO (IOMMU capable) devices, we add a VFIO page fault handler to
> serve the page faults reported by the IOMMU driver.
> 
> Signed-off-by: Shenming Lu <lushenming@huawei.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 114 ++++++++++++++++++++++++++++++++
>  1 file changed, 114 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 45cbfd4879a5..ab0ff60ee207 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -101,6 +101,7 @@ struct vfio_dma {
>  	struct task_struct	*task;
>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
>  	unsigned long		*bitmap;
> +	unsigned long		*iopf_mapped_bitmap;
>  };
>  
>  struct vfio_batch {
> @@ -141,6 +142,16 @@ struct vfio_regions {
>  	size_t len;
>  };
>  
> +/* A global IOPF enabled group list */
> +static struct rb_root iopf_group_list = RB_ROOT;
> +static DEFINE_MUTEX(iopf_group_list_lock);
> +
> +struct vfio_iopf_group {
> +	struct rb_node		node;
> +	struct iommu_group	*iommu_group;
> +	struct vfio_iommu	*iommu;
> +};
> +
>  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
>  					(!list_empty(&iommu->domain_list))
>  
> @@ -157,6 +168,10 @@ struct vfio_regions {
>  #define DIRTY_BITMAP_PAGES_MAX	 ((u64)INT_MAX)
>  #define DIRTY_BITMAP_SIZE_MAX	 DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
>  
> +#define IOPF_MAPPED_BITMAP_GET(dma, i)	\
> +			      ((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG]	\
> +			       >> ((i) % BITS_PER_LONG)) & 0x1)


Can't we just use test_bit()?


> +
>  #define WAITED 1
>  
>  static int put_pfn(unsigned long pfn, int prot);
> @@ -416,6 +431,34 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
>  	return ret;
>  }
>  
> +/*
> + * Helper functions for iopf_group_list
> + */
> +static struct vfio_iopf_group *
> +vfio_find_iopf_group(struct iommu_group *iommu_group)
> +{
> +	struct vfio_iopf_group *iopf_group;
> +	struct rb_node *node;
> +
> +	mutex_lock(&iopf_group_list_lock);
> +
> +	node = iopf_group_list.rb_node;
> +
> +	while (node) {
> +		iopf_group = rb_entry(node, struct vfio_iopf_group, node);
> +
> +		if (iommu_group < iopf_group->iommu_group)
> +			node = node->rb_left;
> +		else if (iommu_group > iopf_group->iommu_group)
> +			node = node->rb_right;
> +		else
> +			break;
> +	}
> +
> +	mutex_unlock(&iopf_group_list_lock);
> +	return node ? iopf_group : NULL;
> +}

This looks like a pretty heavyweight operation per DMA fault.

I'm also suspicious of the validity of this iopf_group after we've
dropped the lock; the ordering of patches makes this very confusing.

> +
>  static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>  {
>  	struct mm_struct *mm;
> @@ -3106,6 +3149,77 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
>  	return -EINVAL;
>  }
>  
> +/* VFIO I/O Page Fault handler */
> +static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)

From the comment, this seems like the IOMMU fault handler (the
construction of this series makes this difficult to follow) and
eventually it handles more than DMA mapping, for example transferring
faults to the device driver.  "dma_map_iopf" seems like a poorly scoped
name.

> +{
> +	struct device *dev = (struct device *)data;
> +	struct iommu_group *iommu_group;
> +	struct vfio_iopf_group *iopf_group;
> +	struct vfio_iommu *iommu;
> +	struct vfio_dma *dma;
> +	dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
> +	int access_flags = 0;
> +	unsigned long bit_offset, vaddr, pfn;
> +	int ret;
> +	enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
> +	struct iommu_page_response resp = {0};
> +
> +	if (fault->type != IOMMU_FAULT_PAGE_REQ)
> +		return -EOPNOTSUPP;
> +
> +	iommu_group = iommu_group_get(dev);
> +	if (!iommu_group)
> +		return -ENODEV;
> +
> +	iopf_group = vfio_find_iopf_group(iommu_group);
> +	iommu_group_put(iommu_group);
> +	if (!iopf_group)
> +		return -ENODEV;
> +
> +	iommu = iopf_group->iommu;
> +
> +	mutex_lock(&iommu->lock);

Again, I'm dubious of our ability to grab this lock from an object with
an unknown lifecycle and races we might have with that group being
detached or DMA unmapped.  Also, how effective is enabling IOMMU page
faulting if we're serializing all faults within a container context?

> +
> +	ret = vfio_find_dma_valid(iommu, iova, PAGE_SIZE, &dma);
> +	if (ret < 0)
> +		goto out_invalid;
> +
> +	if (fault->prm.perm & IOMMU_FAULT_PERM_READ)
> +		access_flags |= IOMMU_READ;
> +	if (fault->prm.perm & IOMMU_FAULT_PERM_WRITE)
> +		access_flags |= IOMMU_WRITE;
> +	if ((dma->prot & access_flags) != access_flags)
> +		goto out_invalid;
> +
> +	bit_offset = (iova - dma->iova) >> PAGE_SHIFT;
> +	if (IOPF_MAPPED_BITMAP_GET(dma, bit_offset))
> +		goto out_success;

If the page is mapped, why did we get a fault?  Should we be returning
success for a fault we shouldn't have received and did nothing to
resolve?  We're also referencing a bitmap that has only been declared
and never allocated at this point in the patch series.

> +
> +	vaddr = iova - dma->iova + dma->vaddr;
> +
> +	if (vfio_pin_page_external(dma, vaddr, &pfn, true))
> +		goto out_invalid;
> +
> +	if (vfio_iommu_map(iommu, iova, pfn, 1, dma->prot)) {
> +		if (put_pfn(pfn, dma->prot))
> +			vfio_lock_acct(dma, -1, true);
> +		goto out_invalid;
> +	}
> +
> +	bitmap_set(dma->iopf_mapped_bitmap, bit_offset, 1);
> +
> +out_success:
> +	status = IOMMU_PAGE_RESP_SUCCESS;
> +
> +out_invalid:
> +	mutex_unlock(&iommu->lock);
> +	resp.version		= IOMMU_PAGE_RESP_VERSION_1;
> +	resp.grpid		= fault->prm.grpid;
> +	resp.code		= status;
> +	iommu_page_response(dev, &resp);
> +	return 0;
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  				   unsigned int cmd, unsigned long arg)
>  {


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v3 1/8] iommu: Evolve the device fault reporting framework
  2021-04-09  3:44 ` [RFC PATCH v3 1/8] iommu: Evolve the device fault reporting framework Shenming Lu
@ 2021-05-18 18:58   ` Alex Williamson
  2021-05-21  6:37     ` Shenming Lu
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2021-05-18 18:58 UTC (permalink / raw)
  To: Shenming Lu
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, Eric Auger, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, Lu Baolu,
	yi.l.liu, Christoph Hellwig, Jonathan Cameron, Barry Song,
	wanghaibin.wang, yuzenghui

On Fri, 9 Apr 2021 11:44:13 +0800
Shenming Lu <lushenming@huawei.com> wrote:

> This patch follows the discussion here:
> 
> https://lore.kernel.org/linux-acpi/YAaxjmJW+ZMvrhac@myrica/
> 
> Besides SVA/vSVA, subsystems such as VFIO may also enable (2nd level) IOPF to
> remove the pinning restriction. In order to better support more scenarios of using
> device faults, we extend iommu_register_fault_handler() with flags and
> introduce FAULT_REPORT_ to describe the device fault reporting capability
> under a specific configuration.
> 
> Note that we don't further distinguish recoverable and unrecoverable faults
> by flags in the fault reporting cap, since having PAGE_FAULT_REPORT_ +
> UNRECOV_FAULT_REPORT_ does not seem like a clean way.
> 
> In addition, again taking VFIO as an example, in nested mode, the 1st level
> and 2nd level fault reporting may be configured separately and currently
> each device can only register one iommu dev fault handler, so we add a
> handler update interface for this.


IIUC, you're introducing flags for the fault handler callout, which
essentially allows the IOMMU layer to filter which types of faults the
handler can receive.  You then need an update function to modify those
flags.  Why can't the handler itself perform this filtering?  For
instance in your vfio example, the handler registered by the type1
> backend could itself reject faults until the fault transfer path to the
device driver is established.  Thanks,

Alex
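A rough userspace model of the handler-side filtering Alex suggests here; the fault levels, flag, and handler name are hypothetical stand-ins, not names from the series:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical fault levels, standing in for the proposed
 * FAULT_REPORT_ flags; not names from the series. */
enum fault_level { FAULT_L1, FAULT_L2 };

/* Flipped once the path for forwarding L1 faults to the guest is
 * established; until then the handler filters such faults itself,
 * removing the need for registration-time flags or a handler-update
 * interface. */
static atomic_bool l1_path_ready;

static int type1_fault_handler(enum fault_level level)
{
	if (level == FAULT_L1 && !atomic_load(&l1_path_ready))
		return -1;	/* reject: cannot forward to the guest yet */
	return 0;		/* L2 fixed in VFIO, or L1 forwarded */
}
```

The filtering state lives with the handler, so one registration covers both configurations.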



* Re: [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough
  2021-05-18 18:57 ` Alex Williamson
@ 2021-05-21  6:37   ` Shenming Lu
  2021-05-24 22:11     ` Alex Williamson
  0 siblings, 1 reply; 40+ messages in thread
From: Shenming Lu @ 2021-05-21  6:37 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, Eric Auger, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, Lu Baolu,
	yi.l.liu, Christoph Hellwig, Jonathan Cameron, Barry Song,
	wanghaibin.wang, yuzenghui

Hi Alex,

On 2021/5/19 2:57, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:12 +0800
> Shenming Lu <lushenming@huawei.com> wrote:
> 
>> Hi,
>>
>> Requesting your comments and suggestions. :-)
>>
>> The static pinning and mapping problem in VFIO and possible solutions
>> have been discussed a lot [1, 2]. One of the solutions is to add I/O
>> Page Fault support for VFIO devices. Different from those relatively
>> complicated software approaches such as presenting a vIOMMU that provides
>> the DMA buffer information (might include para-virtualized optimizations),
>> IOPF mainly depends on the hardware faulting capability, such as the PCIe
>> PRI extension or Arm SMMU stall model. What's more, the IOPF support in
>> the IOMMU driver has already been implemented in SVA [3]. So we add IOPF
>> support for VFIO passthrough based on the IOPF part of SVA in this series.
> 
> The SVA proposals are being reworked to make use of a new IOASID
> object, it's not clear to me that this shouldn't also make use of that
> work as it does a significant expansion of the type1 IOMMU with fault
> handlers that would duplicate new work using that new model.

It seems that the IOASID extension for guest SVA would not affect this series;
would we do any host-guest IOASID translation in the VFIO fault handler?

> 
>> We have measured its performance with UADK [4] (passthrough an accelerator
>> to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA):
>>
>> Run hisi_sec_test...
>>  - with varying sending times and message lengths
>>  - with/without IOPF enabled (speed slowdown)
>>
>> when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1):
>>             slowdown (num of faults)
>>  times      VFIO IOPF      host SVA
>>  1          63.4% (518)    82.8% (512)
>>  100        22.9% (1058)   47.9% (1024)
>>  1000       2.6% (1071)    8.5% (1024)
>>
>> when msg_len = 10MB (and PREMAP_LEN = 512):
>>             slowdown (num of faults)
>>  times      VFIO IOPF
>>  1          32.6% (13)
>>  100        3.5% (26)
>>  1000       1.6% (26)
> 
> It seems like this is only an example that you can make a benchmark
> show anything you want.  The best results would be to pre-map
> everything, which is what we have without this series.  What is an
> acceptable overhead to incur to avoid page pinning?  What if userspace
> had more fine grained control over which mappings were available for
> faulting and which were statically mapped?  I don't really see what
> sense the pre-mapping range makes.  If I assume the user is QEMU in a
> non-vIOMMU configuration, pre-mapping the beginning of each RAM section
> doesn't make any logical sense relative to device DMA.

As you said for Patch 4, we should introduce full end-to-end functionality
before trying to improve performance, so I will drop the pre-mapping patch
for now...

Is there a need for userspace to have more fine-grained control over which
mappings are available for faulting? If so, we could evolve the MAP ioctl
to support specifying the faultable range.

As for the overhead of IOPF, it is unavoidable when enabling on-demand paging
(page faults occur almost only on first access), and there still seems to be
plenty of room for optimization compared to CPU page faulting.

Thanks,
Shenming

> 
> Comments per patch to follow.  Thanks,
> 
> Alex
> 
> 
>> History:
>>
>> v2 -> v3
>>  - Nit fixes.
>>  - No reason to disable reporting the unrecoverable faults. (baolu)
>>  - Maintain a global IOPF enabled group list.
>>  - Split the pre-mapping optimization to be a separate patch.
>>  - Add selective faulting support (use vfio_pin_pages to indicate the
>>    non-faultable scope and add a new struct vfio_range to record it,
>>    untested). (Kevin)
>>
>> v1 -> v2
>>  - Numerous improvements following the suggestions. Thanks a lot to all
>>    of you.
>>
>> Note that PRI is not supported at the moment since there is no hardware.
>>
>> Links:
>> [1] Lesokhin I, et al. Page Fault Support for Network Controllers. In ASPLOS,
>>     2016.
>> [2] Tian K, et al. coIOMMU: A Virtual IOMMU with Cooperative DMA Buffer Tracking
>>     for Efficient Memory Management in Direct I/O. In USENIX ATC, 2020.
>> [3] https://patchwork.kernel.org/project/linux-arm-kernel/cover/20210401154718.307519-1-jean-philippe@linaro.org/
>> [4] https://github.com/Linaro/uadk
>>
>> Thanks,
>> Shenming
>>
>>
>> Shenming Lu (8):
>>   iommu: Evolve the device fault reporting framework
>>   vfio/type1: Add a page fault handler
>>   vfio/type1: Add an MMU notifier to avoid pinning
>>   vfio/type1: Pre-map more pages than requested in the IOPF handling
>>   vfio/type1: VFIO_IOMMU_ENABLE_IOPF
>>   vfio/type1: No need to statically pin and map if IOPF enabled
>>   vfio/type1: Add selective DMA faulting support
>>   vfio: Add nested IOPF support
>>
>>  .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |    3 +-
>>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |   18 +-
>>  drivers/iommu/iommu.c                         |   56 +-
>>  drivers/vfio/vfio.c                           |   85 +-
>>  drivers/vfio/vfio_iommu_type1.c               | 1000 ++++++++++++++++-
>>  include/linux/iommu.h                         |   19 +-
>>  include/linux/vfio.h                          |   13 +
>>  include/uapi/linux/iommu.h                    |    4 +
>>  include/uapi/linux/vfio.h                     |    6 +
>>  9 files changed, 1181 insertions(+), 23 deletions(-)
>>
> 
> .
> 


* Re: [RFC PATCH v3 1/8] iommu: Evolve the device fault reporting framework
  2021-05-18 18:58   ` Alex Williamson
@ 2021-05-21  6:37     ` Shenming Lu
  0 siblings, 0 replies; 40+ messages in thread
From: Shenming Lu @ 2021-05-21  6:37 UTC (permalink / raw)
  To: Alex Williamson, Lu Baolu, Kevin Tian
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, Eric Auger, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, yi.l.liu, Christoph Hellwig,
	Jonathan Cameron, Barry Song, wanghaibin.wang, yuzenghui

On 2021/5/19 2:58, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:13 +0800
> Shenming Lu <lushenming@huawei.com> wrote:
> 
>> This patch follows the discussion here:
>>
>> https://lore.kernel.org/linux-acpi/YAaxjmJW+ZMvrhac@myrica/
>>
>> Besides SVA/vSVA, subsystems such as VFIO may also enable (2nd level) IOPF to
>> remove the pinning restriction. In order to better support more scenarios of using
>> device faults, we extend iommu_register_fault_handler() with flags and
>> introduce FAULT_REPORT_ to describe the device fault reporting capability
>> under a specific configuration.
>>
>> Note that we don't further distinguish recoverable and unrecoverable faults
>> by flags in the fault reporting cap, since having PAGE_FAULT_REPORT_ +
>> UNRECOV_FAULT_REPORT_ does not seem like a clean way.
>>
>> In addition, again taking VFIO as an example, in nested mode, the 1st level
>> and 2nd level fault reporting may be configured separately and currently
>> each device can only register one iommu dev fault handler, so we add a
>> handler update interface for this.
> 
> 
> IIUC, you're introducing flags for the fault handler callout, which
> essentially allows the IOMMU layer to filter which types of faults the
> handler can receive.  You then need an update function to modify those
> flags.  Why can't the handler itself perform this filtering?  For
> instance in your vfio example, the handler registered by the type1
> backend could itself reject faults until the fault transfer path to the
> device driver is established.  Thanks,

As discussed in [1]:

In nested IOPF, we have to figure out whether a fault comes from L1 or L2.
An SMMU stall event comes with this information, but a PRI page request doesn't.
The IOMMU driver can walk the page tables to find out the level information.
If the walk terminates at L1, further inject to the guest. Otherwise fix the
fault at L2 in VFIO. It's not efficient compared to hardware-provided info.

And in the pinned case, if VFIO can tell the IOMMU driver that no L2 faults
are expected, would there be no need to walk the page tables in the IOMMU
driver?

[1] https://patchwork.kernel.org/project/linux-arm-kernel/patch/20210108145217.2254447-4-jean-philippe@linaro.org/

Thanks,
Shenming

> 
> Alex
> 
> .
> 


* Re: [RFC PATCH v3 3/8] vfio/type1: Add an MMU notifier to avoid pinning
  2021-05-18 18:58   ` Alex Williamson
@ 2021-05-21  6:37     ` Shenming Lu
  0 siblings, 0 replies; 40+ messages in thread
From: Shenming Lu @ 2021-05-21  6:37 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, Eric Auger, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, Lu Baolu,
	yi.l.liu, Christoph Hellwig, Jonathan Cameron, Barry Song,
	wanghaibin.wang, yuzenghui

On 2021/5/19 2:58, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:15 +0800
> Shenming Lu <lushenming@huawei.com> wrote:
> 
>> To avoid pinning pages when they are mapped in IOMMU page tables, we
>> add an MMU notifier to tell the addresses which are no longer valid
>> and try to unmap them.
>>
>> Signed-off-by: Shenming Lu <lushenming@huawei.com>
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 112 +++++++++++++++++++++++++++++++-
>>  1 file changed, 109 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index ab0ff60ee207..1cb9d1f2717b 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -40,6 +40,7 @@
>>  #include <linux/notifier.h>
>>  #include <linux/dma-iommu.h>
>>  #include <linux/irqdomain.h>
>> +#include <linux/mmu_notifier.h>
>>  
>>  #define DRIVER_VERSION  "0.2"
>>  #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
>> @@ -69,6 +70,7 @@ struct vfio_iommu {
>>  	struct mutex		lock;
>>  	struct rb_root		dma_list;
>>  	struct blocking_notifier_head notifier;
>> +	struct mmu_notifier	mn;
>>  	unsigned int		dma_avail;
>>  	unsigned int		vaddr_invalid_count;
>>  	uint64_t		pgsize_bitmap;
>> @@ -1204,6 +1206,72 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
>>  	return unlocked;
>>  }
>>  
>> +/* Unmap the IOPF mapped pages in the specified range. */
>> +static void vfio_unmap_partial_iopf(struct vfio_iommu *iommu,
>> +				    struct vfio_dma *dma,
>> +				    dma_addr_t start, dma_addr_t end)
>> +{
>> +	struct iommu_iotlb_gather *gathers;
>> +	struct vfio_domain *d;
>> +	int i, num_domains = 0;
>> +
>> +	list_for_each_entry(d, &iommu->domain_list, next)
>> +		num_domains++;
>> +
>> +	gathers = kzalloc(sizeof(*gathers) * num_domains, GFP_KERNEL);
>> +	if (gathers) {
>> +		for (i = 0; i < num_domains; i++)
>> +			iommu_iotlb_gather_init(&gathers[i]);
>> +	}
> 
> 
> If we're always serialized on iommu->lock, would it make sense to have
> gathers pre-allocated on the vfio_iommu object?

Yeah, we can do it.

> 
>> +
>> +	while (start < end) {
>> +		unsigned long bit_offset;
>> +		size_t len;
>> +
>> +		bit_offset = (start - dma->iova) >> PAGE_SHIFT;
>> +
>> +		for (len = 0; start + len < end; len += PAGE_SIZE) {
>> +			if (!IOPF_MAPPED_BITMAP_GET(dma,
>> +					bit_offset + (len >> PAGE_SHIFT)))
>> +				break;
> 
> 
> There are bitmap helpers for this, find_first_bit(),
> find_next_zero_bit().

Thanks for the hint. :-)
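A userspace sketch of what the per-page loop could become with the suggested helper; `find_next_zero_bit_model()` only mimics the kernel helper's semantics and is not the kernel implementation:

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG (sizeof(long) * CHAR_BIT)

/* Userspace model of the kernel's find_next_zero_bit(): index of the
 * first clear bit at or after @start, or @size if none. */
static unsigned long find_next_zero_bit_model(const unsigned long *map,
					      unsigned long size,
					      unsigned long start)
{
	for (unsigned long i = start; i < size; i++)
		if (!((map[i / BITS_PER_LONG] >> (i % BITS_PER_LONG)) & 1))
			return i;
	return size;
}

/* Length (in pages) of the contiguous mapped run starting at
 * @bit_offset, replacing the open-coded per-page loop in the patch. */
static unsigned long mapped_run_len(const unsigned long *map,
				    unsigned long nbits,
				    unsigned long bit_offset)
{
	return find_next_zero_bit_model(map, nbits, bit_offset) - bit_offset;
}
```

One helper call replaces the inner `for` loop over `IOPF_MAPPED_BITMAP_GET()`.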

> 
> 
>> +		}
>> +
>> +		if (len) {
>> +			i = 0;
>> +			list_for_each_entry(d, &iommu->domain_list, next) {
>> +				size_t unmapped;
>> +
>> +				if (gathers)
>> +					unmapped = iommu_unmap_fast(d->domain,
>> +								    start, len,
>> +								    &gathers[i++]);
>> +				else
>> +					unmapped = iommu_unmap(d->domain,
>> +							       start, len);
>> +
>> +				if (WARN_ON(unmapped != len))
> 
> The IOMMU API does not guarantee arbitrary unmap sizes, this will
> trigger and this exit path is wrong.  If we've already unmapped the
> IOMMU, shouldn't we proceed with @unmapped rather than @len so the
> device can re-fault the extra mappings?  Otherwise the IOMMU state
> doesn't match the iopf bitmap state.

OK, I will correct it.

And can we assume that the @unmapped values (returned from iommu_unmap)
of all domains are the same (since all domains share the same iopf_mapped_bitmap)?
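A minimal userspace model of proceeding with @unmapped rather than @len, as Alex suggests; `fake_iommu_unmap()` is an illustrative stand-in for `iommu_unmap()`, not the real API behavior of any particular driver:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12

/* Stand-in for iommu_unmap(): the IOMMU API does not guarantee
 * arbitrary unmap sizes, so model a driver that stops at a 2MB
 * superpage boundary. */
static size_t fake_iommu_unmap(unsigned long iova, size_t len)
{
	size_t block = 1UL << 21;
	size_t to_boundary = block - (iova & (block - 1));

	return len < to_boundary ? len : to_boundary;
}

/* Proceed with whatever was actually unmapped instead of warning and
 * bailing out, so the iopf bitmap (modelled as a page counter here)
 * stays consistent with the IOMMU state and the device can re-fault
 * the remainder. */
static unsigned long unmap_range(unsigned long start, unsigned long end,
				 unsigned long *cleared_pages)
{
	while (start < end) {
		size_t unmapped = fake_iommu_unmap(start, end - start);

		if (!unmapped)
			break;
		*cleared_pages += unmapped >> PAGE_SHIFT;
		start += unmapped;
	}
	return start;
}
```

Only the bits covered by the actual unmapped length get cleared, keeping bitmap and IOMMU state in sync.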

> 
>> +					goto out;
>> +			}
>> +
>> +			bitmap_clear(dma->iopf_mapped_bitmap,
>> +				     bit_offset, len >> PAGE_SHIFT);
>> +
>> +			cond_resched();
>> +		}
>> +
>> +		start += (len + PAGE_SIZE);
>> +	}
>> +
>> +out:
>> +	if (gathers) {
>> +		i = 0;
>> +		list_for_each_entry(d, &iommu->domain_list, next)
>> +			iommu_iotlb_sync(d->domain, &gathers[i++]);
>> +
>> +		kfree(gathers);
>> +	}
>> +}
>> +
>>  static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>  {
>>  	WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
>> @@ -3197,17 +3265,18 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
>>  
>>  	vaddr = iova - dma->iova + dma->vaddr;
>>  
>> -	if (vfio_pin_page_external(dma, vaddr, &pfn, true))
>> +	if (vfio_pin_page_external(dma, vaddr, &pfn, false))
>>  		goto out_invalid;
>>  
>>  	if (vfio_iommu_map(iommu, iova, pfn, 1, dma->prot)) {
>> -		if (put_pfn(pfn, dma->prot))
>> -			vfio_lock_acct(dma, -1, true);
>> +		put_pfn(pfn, dma->prot);
>>  		goto out_invalid;
>>  	}
>>  
>>  	bitmap_set(dma->iopf_mapped_bitmap, bit_offset, 1);
>>  
>> +	put_pfn(pfn, dma->prot);
>> +
>>  out_success:
>>  	status = IOMMU_PAGE_RESP_SUCCESS;
>>  
>> @@ -3220,6 +3289,43 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
>>  	return 0;
>>  }
>>  
>> +static void mn_invalidate_range(struct mmu_notifier *mn, struct mm_struct *mm,
>> +				unsigned long start, unsigned long end)
>> +{
>> +	struct vfio_iommu *iommu = container_of(mn, struct vfio_iommu, mn);
>> +	struct rb_node *n;
>> +	int ret;
>> +
>> +	mutex_lock(&iommu->lock);
>> +
>> +	ret = vfio_wait_all_valid(iommu);
>> +	if (WARN_ON(ret < 0))
>> +		return;
> 
> Is WARN_ON sufficient for this error condition?  We've been told to
> evacuate a range of mm, the device still has DMA access, we've removed
> page pinning.  This seems like a BUG_ON condition to me, we can't allow
> the system to continue in any way with pages getting unmapped from
> under the device.

Make sense.

> 
>> +
>> +	for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
>> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
>> +		unsigned long start_n, end_n;
>> +
>> +		if (end <= dma->vaddr || start >= dma->vaddr + dma->size)
>> +			continue;
>> +
>> +		start_n = ALIGN_DOWN(max_t(unsigned long, start, dma->vaddr),
>> +				     PAGE_SIZE);
>> +		end_n = ALIGN(min_t(unsigned long, end, dma->vaddr + dma->size),
>> +			      PAGE_SIZE);
>> +
>> +		vfio_unmap_partial_iopf(iommu, dma,
>> +					start_n - dma->vaddr + dma->iova,
>> +					end_n - dma->vaddr + dma->iova);
>> +	}
>> +
>> +	mutex_unlock(&iommu->lock);
>> +}
>> +
>> +static const struct mmu_notifier_ops vfio_iommu_type1_mn_ops = {
>> +	.invalidate_range	= mn_invalidate_range,
>> +};
>> +
>>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>>  				   unsigned int cmd, unsigned long arg)
>>  {
> 
> Again, this patch series is difficult to follow because we're
> introducing dead code until the mmu notifier actually has a path to be
> registered.

Sorry again for this; I will be more careful about the ordering of the series
next time.

> We shouldn't be taking any faults until iopf is enabled,
> so it seems like we can add more of the core support alongside this
> code.

Could you be more specific about this? :-)

We can add a requirement that the VFIO_IOMMU_ENABLE_IOPF ioctl (in Patch 5)
must be called right after VFIO_SET_IOMMU, so that IOPF is enabled before
setting up any DMA mapping. Is this sufficient?

Thanks,
Shenming

> 
> .
> 


* Re: [RFC PATCH v3 4/8] vfio/type1: Pre-map more pages than requested in the IOPF handling
  2021-05-18 18:58   ` Alex Williamson
@ 2021-05-21  6:37     ` Shenming Lu
  0 siblings, 0 replies; 40+ messages in thread
From: Shenming Lu @ 2021-05-21  6:37 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, Eric Auger, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, Lu Baolu,
	yi.l.liu, Christoph Hellwig, Jonathan Cameron, Barry Song,
	wanghaibin.wang, yuzenghui

On 2021/5/19 2:58, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:16 +0800
> Shenming Lu <lushenming@huawei.com> wrote:
> 
>> To optimize for fewer page fault handlings, we can pre-map more pages
>> than requested at once.
>>
>> Note that IOPF_PREMAP_LEN is just an arbitrary value for now, which we
>> could try further tuning.
> 
> I'd prefer that the series introduced full end-to-end functionality
> before trying to improve performance.  The pre-map value seems
> arbitrary here and as noted in the previous patch, the IOMMU API does
> not guarantee unmaps of ranges smaller than the original mapping.  This
> would need to map with single page granularity in order to guarantee
> page granularity at the mmu notifier when the IOMMU supports
> superpages.  Thanks,

Yeah, I will drop this patch in the current stage.

Thanks,
Shenming

> 
> Alex
> 
> .
> 


* Re: [RFC PATCH v3 5/8] vfio/type1: VFIO_IOMMU_ENABLE_IOPF
  2021-05-18 18:58   ` Alex Williamson
@ 2021-05-21  6:38     ` Shenming Lu
  2021-05-24 22:11       ` Alex Williamson
  0 siblings, 1 reply; 40+ messages in thread
From: Shenming Lu @ 2021-05-21  6:38 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, Eric Auger, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, Lu Baolu,
	yi.l.liu, Christoph Hellwig, Jonathan Cameron, Barry Song,
	wanghaibin.wang, yuzenghui

On 2021/5/19 2:58, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:17 +0800
> Shenming Lu <lushenming@huawei.com> wrote:
> 
>> Since enabling IOPF for devices may lead to a slow ramp up of performance,
>> we add an ioctl VFIO_IOMMU_ENABLE_IOPF to make it configurable. And the
>> IOPF enabling of a VFIO device includes setting IOMMU_DEV_FEAT_IOPF and
>> registering the VFIO IOPF handler.
>>
>> Note that VFIO_IOMMU_DISABLE_IOPF is not supported since there may be
>> inflight page faults when disabling.
>>
>> Signed-off-by: Shenming Lu <lushenming@huawei.com>
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 223 +++++++++++++++++++++++++++++++-
>>  include/uapi/linux/vfio.h       |   6 +
>>  2 files changed, 226 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index 01e296c6dc9e..7df5711e743a 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -71,6 +71,7 @@ struct vfio_iommu {
>>  	struct rb_root		dma_list;
>>  	struct blocking_notifier_head notifier;
>>  	struct mmu_notifier	mn;
>> +	struct mm_struct	*mm;
> 
> We currently have no requirement that a single mm is used for all
> container mappings.  Does enabling IOPF impose such a requirement?
> Shouldn't MAP/UNMAP enforce that?

Did you mean that each vfio_dma in a container could have a different
mm? If so, could we register one MMU notifier per vfio_dma in the MAP
ioctl?

> 
>>  	unsigned int		dma_avail;
>>  	unsigned int		vaddr_invalid_count;
>>  	uint64_t		pgsize_bitmap;
>> @@ -81,6 +82,7 @@ struct vfio_iommu {
>>  	bool			dirty_page_tracking;
>>  	bool			pinned_page_dirty_scope;
>>  	bool			container_open;
>> +	bool			iopf_enabled;
>>  };
>>  
>>  struct vfio_domain {
>> @@ -461,6 +463,38 @@ vfio_find_iopf_group(struct iommu_group *iommu_group)
>>  	return node ? iopf_group : NULL;
>>  }
>>  
>> +static void vfio_link_iopf_group(struct vfio_iopf_group *new)
>> +{
>> +	struct rb_node **link, *parent = NULL;
>> +	struct vfio_iopf_group *iopf_group;
>> +
>> +	mutex_lock(&iopf_group_list_lock);
>> +
>> +	link = &iopf_group_list.rb_node;
>> +
>> +	while (*link) {
>> +		parent = *link;
>> +		iopf_group = rb_entry(parent, struct vfio_iopf_group, node);
>> +
>> +		if (new->iommu_group < iopf_group->iommu_group)
>> +			link = &(*link)->rb_left;
>> +		else
>> +			link = &(*link)->rb_right;
>> +	}
>> +
>> +	rb_link_node(&new->node, parent, link);
>> +	rb_insert_color(&new->node, &iopf_group_list);
>> +
>> +	mutex_unlock(&iopf_group_list_lock);
>> +}
>> +
>> +static void vfio_unlink_iopf_group(struct vfio_iopf_group *old)
>> +{
>> +	mutex_lock(&iopf_group_list_lock);
>> +	rb_erase(&old->node, &iopf_group_list);
>> +	mutex_unlock(&iopf_group_list_lock);
>> +}
>> +
>>  static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>>  {
>>  	struct mm_struct *mm;
>> @@ -2363,6 +2397,68 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu,
>>  	list_splice_tail(iova_copy, iova);
>>  }
>>  
>> +static int vfio_dev_domian_nested(struct device *dev, int *nested)
> 
> 
> s/domian/domain/
> 
> 
>> +{
>> +	struct iommu_domain *domain;
>> +
>> +	domain = iommu_get_domain_for_dev(dev);
>> +	if (!domain)
>> +		return -ENODEV;
>> +
>> +	return iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, nested);
> 
> 
> Wouldn't this be easier to use if it returned bool, false on -errno?

Make sense.

> 
> 
>> +}
>> +
>> +static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data);
>> +
>> +static int dev_enable_iopf(struct device *dev, void *data)
>> +{
>> +	int *enabled_dev_cnt = data;
>> +	int nested;
>> +	u32 flags;
>> +	int ret;
>> +
>> +	ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_IOPF);
>> +	if (ret)
>> +		return ret;
>> +
>> +	ret = vfio_dev_domian_nested(dev, &nested);
>> +	if (ret)
>> +		goto out_disable;
>> +
>> +	if (nested)
>> +		flags = FAULT_REPORT_NESTED_L2;
>> +	else
>> +		flags = FAULT_REPORT_FLAT;
>> +
>> +	ret = iommu_register_device_fault_handler(dev,
>> +				vfio_iommu_type1_dma_map_iopf, flags, dev);
>> +	if (ret)
>> +		goto out_disable;
>> +
>> +	(*enabled_dev_cnt)++;
>> +	return 0;
>> +
>> +out_disable:
>> +	iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF);
>> +	return ret;
>> +}
>> +
>> +static int dev_disable_iopf(struct device *dev, void *data)
>> +{
>> +	int *enabled_dev_cnt = data;
>> +
>> +	if (enabled_dev_cnt && *enabled_dev_cnt <= 0)
>> +		return -1;
> 
> 
> Use an errno.
> 
>> +
>> +	WARN_ON(iommu_unregister_device_fault_handler(dev));
>> +	WARN_ON(iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF));
>> +
>> +	if (enabled_dev_cnt)
>> +		(*enabled_dev_cnt)--;
> 
> 
> Why do we need these counters?

We use this counter to record the number of IOPF-enabled devices: if any
device fails to be enabled, we exit the loop (three levels of nesting) and
go to the unwind path, which disables the first @enabled_dev_cnt devices.
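A userspace sketch of that enable/unwind pattern under illustrative names (the real code iterates an iommu_group with iommu_group_for_each_dev()):

```c
#include <assert.h>

#define NDEV 4

static int enabled[NDEV];

/* Hypothetical per-device enable that fails for device @fail_at
 * (pass -1 for no failure). */
static int dev_enable(int i, int fail_at)
{
	if (i == fail_at)
		return -1;
	enabled[i] = 1;
	return 0;
}

static void dev_disable(int i)
{
	enabled[i] = 0;
}

/* Enable every device; on failure, unwind exactly the first @cnt
 * devices that were enabled -- the role of @enabled_dev_cnt in the
 * patch. */
static int enable_all(int fail_at)
{
	int cnt = 0, i;

	for (i = 0; i < NDEV; i++) {
		if (dev_enable(i, fail_at))
			goto unwind;
		cnt++;
	}
	return 0;
unwind:
	while (cnt--)
		dev_disable(cnt);
	return -1;
}
```

The counter guarantees the unwind touches only devices that were actually enabled.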

> 
>> +
>> +	return 0;
>> +}
>> +
>>  static int vfio_iommu_type1_attach_group(void *iommu_data,
>>  					 struct iommu_group *iommu_group)
>>  {
>> @@ -2376,6 +2472,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>  	struct iommu_domain_geometry geo;
>>  	LIST_HEAD(iova_copy);
>>  	LIST_HEAD(group_resv_regions);
>> +	int iopf_enabled_dev_cnt = 0;
>> +	struct vfio_iopf_group *iopf_group = NULL;
>>  
>>  	mutex_lock(&iommu->lock);
>>  
>> @@ -2453,6 +2551,24 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>  	if (ret)
>>  		goto out_domain;
>>  
>> +	if (iommu->iopf_enabled) {
>> +		ret = iommu_group_for_each_dev(iommu_group, &iopf_enabled_dev_cnt,
>> +					       dev_enable_iopf);
>> +		if (ret)
>> +			goto out_detach;
>> +
>> +		iopf_group = kzalloc(sizeof(*iopf_group), GFP_KERNEL);
>> +		if (!iopf_group) {
>> +			ret = -ENOMEM;
>> +			goto out_detach;
>> +		}
>> +
>> +		iopf_group->iommu_group = iommu_group;
>> +		iopf_group->iommu = iommu;
>> +
>> +		vfio_link_iopf_group(iopf_group);
> 
> This seems backwards, once we enable iopf we can take a fault, so the
> structure to lookup the data for the device should be setup first.  I'm
> still not convinced this iopf_group rb tree is a good solution to get
> from the device to the iommu either.  vfio-core could traverse dev ->
> iommu_group -> vfio_group -> container -> iommu_data.

Make sense.

> 
> 
>> +	}
>> +
>>  	/* Get aperture info */
>>  	iommu_domain_get_attr(domain->domain, DOMAIN_ATTR_GEOMETRY, &geo);
>>  
>> @@ -2534,9 +2650,11 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>  	vfio_test_domain_fgsp(domain);
>>  
>>  	/* replay mappings on new domains */
>> -	ret = vfio_iommu_replay(iommu, domain);
>> -	if (ret)
>> -		goto out_detach;
>> +	if (!iommu->iopf_enabled) {
>> +		ret = vfio_iommu_replay(iommu, domain);
>> +		if (ret)
>> +			goto out_detach;
>> +	}
> 
> Test in vfio_iommu_replay()?

Not yet, I will do more tests later.

> 
>>  
>>  	if (resv_msi) {
>>  		ret = iommu_get_msi_cookie(domain->domain, resv_msi_base);
>> @@ -2567,6 +2685,15 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>  	iommu_domain_free(domain->domain);
>>  	vfio_iommu_iova_free(&iova_copy);
>>  	vfio_iommu_resv_free(&group_resv_regions);
>> +	if (iommu->iopf_enabled) {
>> +		if (iopf_group) {
>> +			vfio_unlink_iopf_group(iopf_group);
>> +			kfree(iopf_group);
>> +		}
>> +
>> +		iommu_group_for_each_dev(iommu_group, &iopf_enabled_dev_cnt,
>> +					 dev_disable_iopf);
> 
> Backwards, disable fault reporting before unlinking lookup.

Makes sense.

> 
>> +	}
>>  out_free:
>>  	kfree(domain);
>>  	kfree(group);
>> @@ -2728,6 +2855,19 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
>>  		if (!group)
>>  			continue;
>>  
>> +		if (iommu->iopf_enabled) {
>> +			struct vfio_iopf_group *iopf_group;
>> +
>> +			iopf_group = vfio_find_iopf_group(iommu_group);
>> +			if (!WARN_ON(!iopf_group)) {
>> +				vfio_unlink_iopf_group(iopf_group);
>> +				kfree(iopf_group);
>> +			}
>> +
>> +			iommu_group_for_each_dev(iommu_group, NULL,
>> +						 dev_disable_iopf);
> 
> 
> Backwards.  Also unregistering the fault handler can fail if there are
> pending faults.  It appears the fault handler and iopf lookup could also
> race with this function, ex. fault handler gets an iopf object before
> it's removed from the rb-tree, blocks trying to acquire iommu->lock,
> disable_iopf fails, detach proceeds, fault handler has use after free
> error.

We have a WARN_ON on the failure of the unregistering.

Yeah, it seems that using vfio_iopf_group is not a good choice...

> 
> 
>> +		}
>> +
>>  		vfio_iommu_detach_group(domain, group);
>>  		update_dirty_scope = !group->pinned_page_dirty_scope;
>>  		list_del(&group->next);
>> @@ -2846,6 +2986,11 @@ static void vfio_iommu_type1_release(void *iommu_data)
>>  
>>  	vfio_iommu_iova_free(&iommu->iova_list);
>>  
>> +	if (iommu->iopf_enabled) {
>> +		mmu_notifier_unregister(&iommu->mn, iommu->mm);
>> +		mmdrop(iommu->mm);
>> +	}
>> +
>>  	kfree(iommu);
>>  }
>>  
>> @@ -3441,6 +3586,76 @@ static const struct mmu_notifier_ops vfio_iommu_type1_mn_ops = {
>>  	.invalidate_range	= mn_invalidate_range,
>>  };
>>  
>> +static int vfio_iommu_type1_enable_iopf(struct vfio_iommu *iommu)
>> +{
>> +	struct vfio_domain *d;
>> +	struct vfio_group *g;
>> +	struct vfio_iopf_group *iopf_group;
>> +	int enabled_dev_cnt = 0;
>> +	int ret;
>> +
>> +	if (!current->mm)
>> +		return -ENODEV;
>> +
>> +	mutex_lock(&iommu->lock);
>> +
>> +	mmgrab(current->mm);
> 
> 
> As above, this is a new and undocumented requirement.
> 
> 
>> +	iommu->mm = current->mm;
>> +	iommu->mn.ops = &vfio_iommu_type1_mn_ops;
>> +	ret = mmu_notifier_register(&iommu->mn, current->mm);
>> +	if (ret)
>> +		goto out_drop;
>> +
>> +	list_for_each_entry(d, &iommu->domain_list, next) {
>> +		list_for_each_entry(g, &d->group_list, next) {
>> +			ret = iommu_group_for_each_dev(g->iommu_group,
>> +					&enabled_dev_cnt, dev_enable_iopf);
>> +			if (ret)
>> +				goto out_unwind;
>> +
>> +			iopf_group = kzalloc(sizeof(*iopf_group), GFP_KERNEL);
>> +			if (!iopf_group) {
>> +				ret = -ENOMEM;
>> +				goto out_unwind;
>> +			}
>> +
>> +			iopf_group->iommu_group = g->iommu_group;
>> +			iopf_group->iommu = iommu;
>> +
>> +			vfio_link_iopf_group(iopf_group);
>> +		}
>> +	}
>> +
>> +	iommu->iopf_enabled = true;
>> +	goto out_unlock;
>> +
>> +out_unwind:
>> +	list_for_each_entry(d, &iommu->domain_list, next) {
>> +		list_for_each_entry(g, &d->group_list, next) {
>> +			iopf_group = vfio_find_iopf_group(g->iommu_group);
>> +			if (iopf_group) {
>> +				vfio_unlink_iopf_group(iopf_group);
>> +				kfree(iopf_group);
>> +			}
>> +
>> +			if (iommu_group_for_each_dev(g->iommu_group,
>> +					&enabled_dev_cnt, dev_disable_iopf))
>> +				goto out_unregister;
> 
> 
> This seems broken, we break out of the unwind if any group_for_each
> fails??

The dev_disable_iopf callback makes iommu_group_for_each_dev return a
negative value once all previously enabled devices (@enabled_dev_cnt)
have been disabled, which terminates the unwind loop.

> 
>> +		}
>> +	}
>> +
>> +out_unregister:
>> +	mmu_notifier_unregister(&iommu->mn, current->mm);
>> +
>> +out_drop:
>> +	iommu->mm = NULL;
>> +	mmdrop(current->mm);
>> +
>> +out_unlock:
>> +	mutex_unlock(&iommu->lock);
>> +	return ret;
>> +}
> 
> 
> Is there an assumption that the user does this before performing any
> mapping?  I don't see where current mappings would be converted which
> means we'll have pinned pages leaked and vfio_dma objects without a
> pinned page bitmap.

Yeah, we have an assumption that this ioctl (ENABLE_IOPF) must be called
before any MAP/UNMAP ioctl...

I will document these later. :-)

> 
>> +
>>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>>  				   unsigned int cmd, unsigned long arg)
>>  {
>> @@ -3457,6 +3672,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>>  		return vfio_iommu_type1_unmap_dma(iommu, arg);
>>  	case VFIO_IOMMU_DIRTY_PAGES:
>>  		return vfio_iommu_type1_dirty_pages(iommu, arg);
>> +	case VFIO_IOMMU_ENABLE_IOPF:
>> +		return vfio_iommu_type1_enable_iopf(iommu);
>>  	default:
>>  		return -ENOTTY;
>>  	}
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 8ce36c1d53ca..5497036bebdc 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -1208,6 +1208,12 @@ struct vfio_iommu_type1_dirty_bitmap_get {
>>  
>>  #define VFIO_IOMMU_DIRTY_PAGES             _IO(VFIO_TYPE, VFIO_BASE + 17)
>>  
>> +/*
>> + * IOCTL to enable IOPF for the container.
>> + * Called right after VFIO_SET_IOMMU.
>> + */
>> +#define VFIO_IOMMU_ENABLE_IOPF             _IO(VFIO_TYPE, VFIO_BASE + 18)
> 
> 
> We have extensions and flags and capabilities, and nothing tells the
> user whether this feature is available to them without trying it?

Yeah, we can have a capability for IOPF.
The interface is rough now... I will try to extend it later. :-)

Thanks,
Shenming

> 
> 
>> +
>>  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>>  
>>  /*
> 
> .
> 

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v3 2/8] vfio/type1: Add a page fault handler
  2021-05-18 18:58   ` Alex Williamson
@ 2021-05-21  6:38     ` Shenming Lu
  2021-05-24 22:11       ` Alex Williamson
  0 siblings, 1 reply; 40+ messages in thread
From: Shenming Lu @ 2021-05-21  6:38 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, Eric Auger, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, Lu Baolu,
	yi.l.liu, Christoph Hellwig, Jonathan Cameron, Barry Song,
	wanghaibin.wang, yuzenghui

On 2021/5/19 2:58, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:14 +0800
> Shenming Lu <lushenming@huawei.com> wrote:
> 
>> VFIO manages the DMA mapping itself. To support IOPF (on-demand paging)
>> for VFIO (IOMMU capable) devices, we add a VFIO page fault handler to
>> serve the reported page faults from the IOMMU driver.
>>
>> Signed-off-by: Shenming Lu <lushenming@huawei.com>
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 114 ++++++++++++++++++++++++++++++++
>>  1 file changed, 114 insertions(+)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index 45cbfd4879a5..ab0ff60ee207 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -101,6 +101,7 @@ struct vfio_dma {
>>  	struct task_struct	*task;
>>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
>>  	unsigned long		*bitmap;
>> +	unsigned long		*iopf_mapped_bitmap;
>>  };
>>  
>>  struct vfio_batch {
>> @@ -141,6 +142,16 @@ struct vfio_regions {
>>  	size_t len;
>>  };
>>  
>> +/* A global IOPF enabled group list */
>> +static struct rb_root iopf_group_list = RB_ROOT;
>> +static DEFINE_MUTEX(iopf_group_list_lock);
>> +
>> +struct vfio_iopf_group {
>> +	struct rb_node		node;
>> +	struct iommu_group	*iommu_group;
>> +	struct vfio_iommu	*iommu;
>> +};
>> +
>>  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
>>  					(!list_empty(&iommu->domain_list))
>>  
>> @@ -157,6 +168,10 @@ struct vfio_regions {
>>  #define DIRTY_BITMAP_PAGES_MAX	 ((u64)INT_MAX)
>>  #define DIRTY_BITMAP_SIZE_MAX	 DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
>>  
>> +#define IOPF_MAPPED_BITMAP_GET(dma, i)	\
>> +			      ((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG]	\
>> +			       >> ((i) % BITS_PER_LONG)) & 0x1)
> 
> 
> Can't we just use test_bit()?

Yeah, we can use it.

> 
> 
>> +
>>  #define WAITED 1
>>  
>>  static int put_pfn(unsigned long pfn, int prot);
>> @@ -416,6 +431,34 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
>>  	return ret;
>>  }
>>  
>> +/*
>> + * Helper functions for iopf_group_list
>> + */
>> +static struct vfio_iopf_group *
>> +vfio_find_iopf_group(struct iommu_group *iommu_group)
>> +{
>> +	struct vfio_iopf_group *iopf_group;
>> +	struct rb_node *node;
>> +
>> +	mutex_lock(&iopf_group_list_lock);
>> +
>> +	node = iopf_group_list.rb_node;
>> +
>> +	while (node) {
>> +		iopf_group = rb_entry(node, struct vfio_iopf_group, node);
>> +
>> +		if (iommu_group < iopf_group->iommu_group)
>> +			node = node->rb_left;
>> +		else if (iommu_group > iopf_group->iommu_group)
>> +			node = node->rb_right;
>> +		else
>> +			break;
>> +	}
>> +
>> +	mutex_unlock(&iopf_group_list_lock);
>> +	return node ? iopf_group : NULL;
>> +}
> 
> This looks like a pretty heavy weight operation per DMA fault.
> 
> I'm also suspicious of this validity of this iopf_group after we've
> dropped the locking, the ordering of patches makes this very confusing.

My thought was to handle DMA faults entirely within the type1 backend by
introducing the vfio_iopf_group struct. But it seems that introducing a
struct with an unknown lifecycle causes more problems...
I will use the path from vfio-core as in v2 for simplicity and validity.

Sorry for the confusion, I will restructure the series later. :-)

> 
>> +
>>  static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>>  {
>>  	struct mm_struct *mm;
>> @@ -3106,6 +3149,77 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
>>  	return -EINVAL;
>>  }
>>  
>> +/* VFIO I/O Page Fault handler */
>> +static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
> 
> From the comment, this seems like the IOMMU fault handler (the
> construction of this series makes this difficult to follow) and
> eventually it handles more than DMA mapping, for example transferring
> faults to the device driver.  "dma_map_iopf" seems like a poorly scoped
> name.

Maybe just call it dev_fault_handler?

> 
>> +{
>> +	struct device *dev = (struct device *)data;
>> +	struct iommu_group *iommu_group;
>> +	struct vfio_iopf_group *iopf_group;
>> +	struct vfio_iommu *iommu;
>> +	struct vfio_dma *dma;
>> +	dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
>> +	int access_flags = 0;
>> +	unsigned long bit_offset, vaddr, pfn;
>> +	int ret;
>> +	enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
>> +	struct iommu_page_response resp = {0};
>> +
>> +	if (fault->type != IOMMU_FAULT_PAGE_REQ)
>> +		return -EOPNOTSUPP;
>> +
>> +	iommu_group = iommu_group_get(dev);
>> +	if (!iommu_group)
>> +		return -ENODEV;
>> +
>> +	iopf_group = vfio_find_iopf_group(iommu_group);
>> +	iommu_group_put(iommu_group);
>> +	if (!iopf_group)
>> +		return -ENODEV;
>> +
>> +	iommu = iopf_group->iommu;
>> +
>> +	mutex_lock(&iommu->lock);
> 
> Again, I'm dubious of our ability to grab this lock from an object with
> an unknown lifecycle and races we might have with that group being
> detached or DMA unmapped.  Also, how effective is enabling IOMMU page
> faulting if we're serializing all faults within a container context?

Did you mean "efficient"?
I also worry about this, as the mapping and unmapping of the faulting
pages are all serialized under the same lock...
Is there a way to parallelize them? Or could we have finer-grained
locking?

> 
>> +
>> +	ret = vfio_find_dma_valid(iommu, iova, PAGE_SIZE, &dma);
>> +	if (ret < 0)
>> +		goto out_invalid;
>> +
>> +	if (fault->prm.perm & IOMMU_FAULT_PERM_READ)
>> +		access_flags |= IOMMU_READ;
>> +	if (fault->prm.perm & IOMMU_FAULT_PERM_WRITE)
>> +		access_flags |= IOMMU_WRITE;
>> +	if ((dma->prot & access_flags) != access_flags)
>> +		goto out_invalid;
>> +
>> +	bit_offset = (iova - dma->iova) >> PAGE_SHIFT;
>> +	if (IOPF_MAPPED_BITMAP_GET(dma, bit_offset))
>> +		goto out_success;
> 
> If the page is mapped, why did we get a fault?  Should we be returning
> success for a fault we shouldn't have received and did nothing to
> resolve?  We're also referencing a bitmap that has only been declared
> and never allocated at this point in the patch series.

Imagine that we have two in-flight page faults which target the same
iova: shouldn't the later one just return SUCCESS, since the previous
one has already been handled and the mapping has been established?

I will allocate the bitmap first.

Thanks,
Shenming

> 
>> +
>> +	vaddr = iova - dma->iova + dma->vaddr;
>> +
>> +	if (vfio_pin_page_external(dma, vaddr, &pfn, true))
>> +		goto out_invalid;
>> +
>> +	if (vfio_iommu_map(iommu, iova, pfn, 1, dma->prot)) {
>> +		if (put_pfn(pfn, dma->prot))
>> +			vfio_lock_acct(dma, -1, true);
>> +		goto out_invalid;
>> +	}
>> +
>> +	bitmap_set(dma->iopf_mapped_bitmap, bit_offset, 1);
>> +
>> +out_success:
>> +	status = IOMMU_PAGE_RESP_SUCCESS;
>> +
>> +out_invalid:
>> +	mutex_unlock(&iommu->lock);
>> +	resp.version		= IOMMU_PAGE_RESP_VERSION_1;
>> +	resp.grpid		= fault->prm.grpid;
>> +	resp.code		= status;
>> +	iommu_page_response(dev, &resp);
>> +	return 0;
>> +}
>> +
>>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>>  				   unsigned int cmd, unsigned long arg)
>>  {
> 
> .
> 


* Re: [RFC PATCH v3 6/8] vfio/type1: No need to statically pin and map if IOPF enabled
  2021-05-18 18:58   ` Alex Williamson
@ 2021-05-21  6:39     ` Shenming Lu
  0 siblings, 0 replies; 40+ messages in thread
From: Shenming Lu @ 2021-05-21  6:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, Eric Auger, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, Lu Baolu,
	yi.l.liu, Christoph Hellwig, Jonathan Cameron, Barry Song,
	wanghaibin.wang, yuzenghui

On 2021/5/19 2:58, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:18 +0800
> Shenming Lu <lushenming@huawei.com> wrote:
> 
>> If IOPF enabled for the VFIO container, there is no need to statically
>> pin and map the entire DMA range, we can do it on demand. And unmap
>> according to the IOPF mapped bitmap when removing vfio_dma.
>>
>> Note that we still mark all pages dirty even if IOPF enabled, we may
>> add IOPF-based fine grained dirty tracking support in the future.
>>
>> Signed-off-by: Shenming Lu <lushenming@huawei.com>
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 38 +++++++++++++++++++++++++++------
>>  1 file changed, 32 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index 7df5711e743a..dcc93c3b258c 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -175,6 +175,7 @@ struct vfio_iopf_group {
>>  #define IOPF_MAPPED_BITMAP_GET(dma, i)	\
>>  			      ((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG]	\
>>  			       >> ((i) % BITS_PER_LONG)) & 0x1)  
>> +#define IOPF_MAPPED_BITMAP_BYTES(n)	DIRTY_BITMAP_BYTES(n)
>>  
>>  #define WAITED 1
>>  
>> @@ -959,7 +960,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>  	 * already pinned and accounted. Accouting should be done if there is no
>>  	 * iommu capable domain in the container.
>>  	 */
>> -	do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
>> +	do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) ||
>> +			iommu->iopf_enabled;
>>  
>>  	for (i = 0; i < npage; i++) {
>>  		struct vfio_pfn *vpfn;
>> @@ -1048,7 +1050,8 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
>>  
>>  	mutex_lock(&iommu->lock);
>>  
>> -	do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
>> +	do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) ||
>> +			iommu->iopf_enabled;
> 
> pin/unpin are actually still pinning pages, why does iopf exempt them
> from accounting?

If iopf_enabled is true, do_accounting will be true too, so we will
still account the externally pinned pages?

> 
> 
>>  	for (i = 0; i < npage; i++) {
>>  		struct vfio_dma *dma;
>>  		dma_addr_t iova;
>> @@ -1169,7 +1172,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
>>  	if (!dma->size)
>>  		return 0;
>>  
>> -	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
>> +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || iommu->iopf_enabled)
>>  		return 0;
>>  
>>  	/*
>> @@ -1306,11 +1309,20 @@ static void vfio_unmap_partial_iopf(struct vfio_iommu *iommu,
>>  	}
>>  }
>>  
>> +static void vfio_dma_clean_iopf(struct vfio_iommu *iommu, struct vfio_dma *dma)
>> +{
>> +	vfio_unmap_partial_iopf(iommu, dma, dma->iova, dma->iova + dma->size);
>> +
>> +	kfree(dma->iopf_mapped_bitmap);
>> +}
>> +
>>  static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>  {
>>  	WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
>>  	vfio_unmap_unpin(iommu, dma, true);
>>  	vfio_unlink_dma(iommu, dma);
>> +	if (iommu->iopf_enabled)
>> +		vfio_dma_clean_iopf(iommu, dma);
>>  	put_task_struct(dma->task);
>>  	vfio_dma_bitmap_free(dma);
>>  	if (dma->vaddr_invalid) {
>> @@ -1359,7 +1371,8 @@ static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
>>  	 * mark all pages dirty if any IOMMU capable device is not able
>>  	 * to report dirty pages and all pages are pinned and mapped.
>>  	 */
>> -	if (iommu->num_non_pinned_groups && dma->iommu_mapped)
>> +	if (iommu->num_non_pinned_groups &&
>> +	    (dma->iommu_mapped || iommu->iopf_enabled))
>>  		bitmap_set(dma->bitmap, 0, nbits);
> 
> This seems like really poor integration of iopf into dirty page
> tracking.  I'd expect dirty logging to flush the mapped pages and
> write faults to mark pages dirty.  Shouldn't the fault handler also
> provide only the access faulted, so for example a read fault wouldn't
> mark the page dirty?
I just want to keep the behavior here as before: if IOPF is enabled, we
still mark all pages dirty.

We can distinguish between write and read faults in the fault handler,
so there is a way to add IOPF-based fine-grained dirty tracking support...
But I am not sure whether there is a need to implement this; could we
consider it in the future?

> 
>>  
>>  	if (shift) {
>> @@ -1772,6 +1785,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  		goto out_unlock;
>>  	}
>>  
>> +	if (iommu->iopf_enabled) {
>> +		dma->iopf_mapped_bitmap = kvzalloc(IOPF_MAPPED_BITMAP_BYTES(
>> +						size >> PAGE_SHIFT), GFP_KERNEL);
>> +		if (!dma->iopf_mapped_bitmap) {
>> +			ret = -ENOMEM;
>> +			kfree(dma);
>> +			goto out_unlock;
>> +		}
> 
> 
> So we're assuming nothing can fault and therefore nothing can reference
> the iopf_mapped_bitmap until this point in the series?

I will move this to the front of this series.

Thanks,
Shenming

> 
> 
>> +	}
>> +
>>  	iommu->dma_avail--;
>>  	dma->iova = iova;
>>  	dma->vaddr = vaddr;
>> @@ -1811,8 +1834,11 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  	/* Insert zero-sized and grow as we map chunks of it */
>>  	vfio_link_dma(iommu, dma);
>>  
>> -	/* Don't pin and map if container doesn't contain IOMMU capable domain*/
>> -	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
>> +	/*
>> +	 * Don't pin and map if container doesn't contain IOMMU capable domain,
>> +	 * or IOPF enabled for the container.
>> +	 */
>> +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || iommu->iopf_enabled)
>>  		dma->size = size;
>>  	else
>>  		ret = vfio_pin_map_dma(iommu, dma, size);
> 
> .
> 


* Re: [RFC PATCH v3 7/8] vfio/type1: Add selective DMA faulting support
  2021-05-18 18:58   ` Alex Williamson
@ 2021-05-21  6:39     ` Shenming Lu
  0 siblings, 0 replies; 40+ messages in thread
From: Shenming Lu @ 2021-05-21  6:39 UTC (permalink / raw)
  To: Alex Williamson, Kevin Tian
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, Eric Auger, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Lu Baolu, yi.l.liu,
	Christoph Hellwig, Jonathan Cameron, Barry Song, wanghaibin.wang,
	yuzenghui

On 2021/5/19 2:58, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:19 +0800
> Shenming Lu <lushenming@huawei.com> wrote:
> 
>> Some devices only allow selective DMA faulting. Similar to the selective
>> dirty page tracking, the vendor driver can call vfio_pin_pages() to
>> indicate the non-faultable scope, we add a new struct vfio_range to
>> record it, then when the IOPF handler receives any page request out
>> of the scope, we can directly return with an invalid response.
> 
> Seems like this highlights a deficiency in the design, that the user
> can't specify mappings as iopf enabled or disabled.  Also, if the
> vendor driver has pinned pages within the range, shouldn't that prevent
> them from faulting in the first place?  Why do we need yet more
> tracking structures?  Pages pinned by the vendor driver need to count
> against the user's locked memory limits regardless of iopf.  Thanks,

Currently we only have the vfio_pfn struct to track the externally
pinned pages (at single-page granularity), so I added a vfio_range
struct for efficient lookup.

Yeah, with this patch, for the non-pinned scope we can directly return
INVALID, but for the pinned (non-faultable) scope, tracking the pinned
range doesn't seem to help much...

Thanks,
Shenming

> 
> Alex
> 
> .
> 


* Re: [RFC PATCH v3 8/8] vfio: Add nested IOPF support
  2021-05-18 18:58   ` Alex Williamson
@ 2021-05-21  7:59     ` Shenming Lu
  2021-05-24 13:11       ` Shenming Lu
  0 siblings, 1 reply; 40+ messages in thread
From: Shenming Lu @ 2021-05-21  7:59 UTC (permalink / raw)
  To: Alex Williamson, Eric Auger
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, kvm, linux-kernel, linux-arm-kernel,
	iommu, linux-api, Kevin Tian, Lu Baolu, yi.l.liu,
	Christoph Hellwig, Jonathan Cameron, Barry Song, wanghaibin.wang,
	yuzenghui

On 2021/5/19 2:58, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:20 +0800
> Shenming Lu <lushenming@huawei.com> wrote:
> 
>> To set up nested mode, drivers such as vfio_pci need to register a
>> handler to receive stage/level 1 faults from the IOMMU, but since
>> currently each device can only have one iommu dev fault handler,
>> and if stage 2 IOPF is already enabled (VFIO_IOMMU_ENABLE_IOPF),
>> we choose to update the registered handler (a consolidated one) via
>> flags (set FAULT_REPORT_NESTED_L1), and further deliver the received
>> stage 1 faults in the handler to the guest through a newly added
>> vfio_device_ops callback.
> 
> Are there proposed in-kernel drivers that would use any of these
> symbols?

I hope that such as Eric's SMMUv3 Nested Stage Setup series [1] can
use these symbols to consolidate the two page fault handlers into one.

[1] https://patchwork.kernel.org/project/kvm/cover/20210411114659.15051-1-eric.auger@redhat.com/

> 
>> Signed-off-by: Shenming Lu <lushenming@huawei.com>
>> ---
>>  drivers/vfio/vfio.c             | 81 +++++++++++++++++++++++++++++++++
>>  drivers/vfio/vfio_iommu_type1.c | 49 +++++++++++++++++++-
>>  include/linux/vfio.h            | 12 +++++
>>  3 files changed, 141 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>> index 44c8dfabf7de..4245f15914bf 100644
>> --- a/drivers/vfio/vfio.c
>> +++ b/drivers/vfio/vfio.c
>> @@ -2356,6 +2356,87 @@ struct iommu_domain *vfio_group_iommu_domain(struct vfio_group *group)
>>  }
>>  EXPORT_SYMBOL_GPL(vfio_group_iommu_domain);
>>  
>> +/*
>> + * Register/Update the VFIO IOPF handler to receive
>> + * nested stage/level 1 faults.
>> + */
>> +int vfio_iommu_dev_fault_handler_register_nested(struct device *dev)
>> +{
>> +	struct vfio_container *container;
>> +	struct vfio_group *group;
>> +	struct vfio_iommu_driver *driver;
>> +	int ret;
>> +
>> +	if (!dev)
>> +		return -EINVAL;
>> +
>> +	group = vfio_group_get_from_dev(dev);
>> +	if (!group)
>> +		return -ENODEV;
>> +
>> +	ret = vfio_group_add_container_user(group);
>> +	if (ret)
>> +		goto out;
>> +
>> +	container = group->container;
>> +	driver = container->iommu_driver;
>> +	if (likely(driver && driver->ops->register_handler))
>> +		ret = driver->ops->register_handler(container->iommu_data, dev);
>> +	else
>> +		ret = -ENOTTY;
>> +
>> +	vfio_group_try_dissolve_container(group);
>> +
>> +out:
>> +	vfio_group_put(group);
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_register_nested);
>> +
>> +int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev)
>> +{
>> +	struct vfio_container *container;
>> +	struct vfio_group *group;
>> +	struct vfio_iommu_driver *driver;
>> +	int ret;
>> +
>> +	if (!dev)
>> +		return -EINVAL;
>> +
>> +	group = vfio_group_get_from_dev(dev);
>> +	if (!group)
>> +		return -ENODEV;
>> +
>> +	ret = vfio_group_add_container_user(group);
>> +	if (ret)
>> +		goto out;
>> +
>> +	container = group->container;
>> +	driver = container->iommu_driver;
>> +	if (likely(driver && driver->ops->unregister_handler))
>> +		ret = driver->ops->unregister_handler(container->iommu_data, dev);
>> +	else
>> +		ret = -ENOTTY;
>> +
>> +	vfio_group_try_dissolve_container(group);
>> +
>> +out:
>> +	vfio_group_put(group);
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_unregister_nested);
>> +
>> +int vfio_transfer_iommu_fault(struct device *dev, struct iommu_fault *fault)
>> +{
>> +	struct vfio_device *device = dev_get_drvdata(dev);
>> +
>> +	if (unlikely(!device->ops->transfer))
>> +		return -EOPNOTSUPP;
>> +
>> +	return device->ops->transfer(device->device_data, fault);
>> +}
>> +EXPORT_SYMBOL_GPL(vfio_transfer_iommu_fault);
>> +
>>  /**
>>   * Module/class support
>>   */
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index ba2b5a1cf6e9..9d1adeddb303 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -3821,13 +3821,32 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
>>  	struct vfio_batch batch;
>>  	struct vfio_range *range;
>>  	dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
>> -	int access_flags = 0;
>> +	int access_flags = 0, nested;
>>  	size_t premap_len, map_len, mapped_len = 0;
>>  	unsigned long bit_offset, vaddr, pfn, i, npages;
>>  	int ret;
>>  	enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
>>  	struct iommu_page_response resp = {0};
>>  
>> +	if (vfio_dev_domian_nested(dev, &nested))
>> +		return -ENODEV;
>> +
>> +	/*
>> +	 * When configured in nested mode, further deliver the
>> +	 * stage/level 1 faults to the guest.
>> +	 */
>> +	if (nested) {
>> +		bool l2;
>> +
>> +		if (fault->type == IOMMU_FAULT_PAGE_REQ)
>> +			l2 = fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_L2;
>> +		if (fault->type == IOMMU_FAULT_DMA_UNRECOV)
>> +			l2 = fault->event.flags & IOMMU_FAULT_UNRECOV_L2;
>> +
>> +		if (!l2)
>> +			return vfio_transfer_iommu_fault(dev, fault);
>> +	}
>> +
>>  	if (fault->type != IOMMU_FAULT_PAGE_REQ)
>>  		return -EOPNOTSUPP;
>>  
>> @@ -4201,6 +4220,32 @@ static void vfio_iommu_type1_notify(void *iommu_data,
>>  	wake_up_all(&iommu->vaddr_wait);
>>  }
>>  
>> +static int vfio_iommu_type1_register_handler(void *iommu_data,
>> +					     struct device *dev)
>> +{
>> +	struct vfio_iommu *iommu = iommu_data;
>> +
>> +	if (iommu->iopf_enabled)
>> +		return iommu_update_device_fault_handler(dev, ~0,
>> +						FAULT_REPORT_NESTED_L1);
>> +	else
>> +		return iommu_register_device_fault_handler(dev,
>> +						vfio_iommu_type1_dma_map_iopf,
>> +						FAULT_REPORT_NESTED_L1, dev);
>> +}
>> +
>> +static int vfio_iommu_type1_unregister_handler(void *iommu_data,
>> +					       struct device *dev)
>> +{
>> +	struct vfio_iommu *iommu = iommu_data;
>> +
>> +	if (iommu->iopf_enabled)
>> +		return iommu_update_device_fault_handler(dev,
>> +						~FAULT_REPORT_NESTED_L1, 0);
>> +	else
>> +		return iommu_unregister_device_fault_handler(dev);
>> +}
> 
> 
> The path through vfio to register this is pretty ugly, but I don't see
> any reason for the the update interfaces here, the previously
> registered handler just changes its behavior.

Yeah, this doesn't seem an elegant way...

If IOPF (L2) is enabled, the fault handler has already been registered,
so for nested mode setup we only need to change the flags of the handler
in the IOMMU driver so that it also receives L1 faults.
(This assumes that L1 IOPF is configured after L2 IOPF.)

Currently each device can only have one iommu dev fault handler, and L1
and L2 IOPF are configured separately in nested mode; I am also
wondering whether there is a better solution for this.

Thanks,
Shenming

> 
> 
>> +
>>  static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
>>  	.name			= "vfio-iommu-type1",
>>  	.owner			= THIS_MODULE,
>> @@ -4216,6 +4261,8 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
>>  	.dma_rw			= vfio_iommu_type1_dma_rw,
>>  	.group_iommu_domain	= vfio_iommu_type1_group_iommu_domain,
>>  	.notify			= vfio_iommu_type1_notify,
>> +	.register_handler	= vfio_iommu_type1_register_handler,
>> +	.unregister_handler	= vfio_iommu_type1_unregister_handler,
>>  };
>>  
>>  static int __init vfio_iommu_type1_init(void)
>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>> index a7b426d579df..4621d8f0395d 100644
>> --- a/include/linux/vfio.h
>> +++ b/include/linux/vfio.h
>> @@ -29,6 +29,8 @@
>>   * @match: Optional device name match callback (return: 0 for no-match, >0 for
>>   *         match, -errno for abort (ex. match with insufficient or incorrect
>>   *         additional args)
>> + * @transfer: Optional. Transfer the received stage/level 1 faults to the guest
>> + *            for nested mode.
>>   */
>>  struct vfio_device_ops {
>>  	char	*name;
>> @@ -43,6 +45,7 @@ struct vfio_device_ops {
>>  	int	(*mmap)(void *device_data, struct vm_area_struct *vma);
>>  	void	(*request)(void *device_data, unsigned int count);
>>  	int	(*match)(void *device_data, char *buf);
>> +	int	(*transfer)(void *device_data, struct iommu_fault *fault);
>>  };
>>  
>>  extern struct iommu_group *vfio_iommu_group_get(struct device *dev);
>> @@ -100,6 +103,10 @@ struct vfio_iommu_driver_ops {
>>  						   struct iommu_group *group);
>>  	void		(*notify)(void *iommu_data,
>>  				  enum vfio_iommu_notify_type event);
>> +	int		(*register_handler)(void *iommu_data,
>> +					    struct device *dev);
>> +	int		(*unregister_handler)(void *iommu_data,
>> +					      struct device *dev);
>>  };
>>  
>>  extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
>> @@ -161,6 +168,11 @@ extern int vfio_unregister_notifier(struct device *dev,
>>  struct kvm;
>>  extern void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm);
>>  
>> +extern int vfio_iommu_dev_fault_handler_register_nested(struct device *dev);
>> +extern int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev);
>> +extern int vfio_transfer_iommu_fault(struct device *dev,
>> +				     struct iommu_fault *fault);
>> +
>>  /*
>>   * Sub-module helpers
>>   */
> 
> .
> 

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v3 8/8] vfio: Add nested IOPF support
  2021-05-21  7:59     ` Shenming Lu
@ 2021-05-24 13:11       ` Shenming Lu
  2021-05-24 22:11         ` Alex Williamson
  0 siblings, 1 reply; 40+ messages in thread
From: Shenming Lu @ 2021-05-24 13:11 UTC (permalink / raw)
  To: Alex Williamson, Eric Auger
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, kvm, linux-kernel, linux-arm-kernel,
	iommu, linux-api, Kevin Tian, Lu Baolu, yi.l.liu,
	Christoph Hellwig, Jonathan Cameron, Barry Song, wanghaibin.wang,
	yuzenghui

On 2021/5/21 15:59, Shenming Lu wrote:
> On 2021/5/19 2:58, Alex Williamson wrote:
>> On Fri, 9 Apr 2021 11:44:20 +0800
>> Shenming Lu <lushenming@huawei.com> wrote:
>>
>>> To set up nested mode, drivers such as vfio_pci need to register a
>>> handler to receive stage/level 1 faults from the IOMMU, but since
>>> currently each device can only have one iommu dev fault handler,
>>> and if stage 2 IOPF is already enabled (VFIO_IOMMU_ENABLE_IOPF),
>>> we choose to update the registered handler (a consolidated one) via
>>> flags (set FAULT_REPORT_NESTED_L1), and further deliver the received
>>> stage 1 faults in the handler to the guest through a newly added
>>> vfio_device_ops callback.
>>
>> Are there proposed in-kernel drivers that would use any of these
>> symbols?
> 
> I hope that such as Eric's SMMUv3 Nested Stage Setup series [1] can
> use these symbols to consolidate the two page fault handlers into one.
> 
> [1] https://patchwork.kernel.org/project/kvm/cover/20210411114659.15051-1-eric.auger@redhat.com/
> 
>>
>>> Signed-off-by: Shenming Lu <lushenming@huawei.com>
>>> ---
>>>  drivers/vfio/vfio.c             | 81 +++++++++++++++++++++++++++++++++
>>>  drivers/vfio/vfio_iommu_type1.c | 49 +++++++++++++++++++-
>>>  include/linux/vfio.h            | 12 +++++
>>>  3 files changed, 141 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>> index 44c8dfabf7de..4245f15914bf 100644
>>> --- a/drivers/vfio/vfio.c
>>> +++ b/drivers/vfio/vfio.c
>>> @@ -2356,6 +2356,87 @@ struct iommu_domain *vfio_group_iommu_domain(struct vfio_group *group)
>>>  }
>>>  EXPORT_SYMBOL_GPL(vfio_group_iommu_domain);
>>>  
>>> +/*
>>> + * Register/Update the VFIO IOPF handler to receive
>>> + * nested stage/level 1 faults.
>>> + */
>>> +int vfio_iommu_dev_fault_handler_register_nested(struct device *dev)
>>> +{
>>> +	struct vfio_container *container;
>>> +	struct vfio_group *group;
>>> +	struct vfio_iommu_driver *driver;
>>> +	int ret;
>>> +
>>> +	if (!dev)
>>> +		return -EINVAL;
>>> +
>>> +	group = vfio_group_get_from_dev(dev);
>>> +	if (!group)
>>> +		return -ENODEV;
>>> +
>>> +	ret = vfio_group_add_container_user(group);
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +	container = group->container;
>>> +	driver = container->iommu_driver;
>>> +	if (likely(driver && driver->ops->register_handler))
>>> +		ret = driver->ops->register_handler(container->iommu_data, dev);
>>> +	else
>>> +		ret = -ENOTTY;
>>> +
>>> +	vfio_group_try_dissolve_container(group);
>>> +
>>> +out:
>>> +	vfio_group_put(group);
>>> +	return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_register_nested);
>>> +
>>> +int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev)
>>> +{
>>> +	struct vfio_container *container;
>>> +	struct vfio_group *group;
>>> +	struct vfio_iommu_driver *driver;
>>> +	int ret;
>>> +
>>> +	if (!dev)
>>> +		return -EINVAL;
>>> +
>>> +	group = vfio_group_get_from_dev(dev);
>>> +	if (!group)
>>> +		return -ENODEV;
>>> +
>>> +	ret = vfio_group_add_container_user(group);
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +	container = group->container;
>>> +	driver = container->iommu_driver;
>>> +	if (likely(driver && driver->ops->unregister_handler))
>>> +		ret = driver->ops->unregister_handler(container->iommu_data, dev);
>>> +	else
>>> +		ret = -ENOTTY;
>>> +
>>> +	vfio_group_try_dissolve_container(group);
>>> +
>>> +out:
>>> +	vfio_group_put(group);
>>> +	return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_unregister_nested);
>>> +
>>> +int vfio_transfer_iommu_fault(struct device *dev, struct iommu_fault *fault)
>>> +{
>>> +	struct vfio_device *device = dev_get_drvdata(dev);
>>> +
>>> +	if (unlikely(!device->ops->transfer))
>>> +		return -EOPNOTSUPP;
>>> +
>>> +	return device->ops->transfer(device->device_data, fault);
>>> +}
>>> +EXPORT_SYMBOL_GPL(vfio_transfer_iommu_fault);
>>> +
>>>  /**
>>>   * Module/class support
>>>   */
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index ba2b5a1cf6e9..9d1adeddb303 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -3821,13 +3821,32 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
>>>  	struct vfio_batch batch;
>>>  	struct vfio_range *range;
>>>  	dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
>>> -	int access_flags = 0;
>>> +	int access_flags = 0, nested;
>>>  	size_t premap_len, map_len, mapped_len = 0;
>>>  	unsigned long bit_offset, vaddr, pfn, i, npages;
>>>  	int ret;
>>>  	enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
>>>  	struct iommu_page_response resp = {0};
>>>  
>>> +	if (vfio_dev_domian_nested(dev, &nested))
>>> +		return -ENODEV;
>>> +
>>> +	/*
>>> +	 * When configured in nested mode, further deliver the
>>> +	 * stage/level 1 faults to the guest.
>>> +	 */
>>> +	if (nested) {
>>> +		bool l2;
>>> +
>>> +		if (fault->type == IOMMU_FAULT_PAGE_REQ)
>>> +			l2 = fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_L2;
>>> +		if (fault->type == IOMMU_FAULT_DMA_UNRECOV)
>>> +			l2 = fault->event.flags & IOMMU_FAULT_UNRECOV_L2;
>>> +
>>> +		if (!l2)
>>> +			return vfio_transfer_iommu_fault(dev, fault);
>>> +	}
>>> +
>>>  	if (fault->type != IOMMU_FAULT_PAGE_REQ)
>>>  		return -EOPNOTSUPP;
>>>  
>>> @@ -4201,6 +4220,32 @@ static void vfio_iommu_type1_notify(void *iommu_data,
>>>  	wake_up_all(&iommu->vaddr_wait);
>>>  }
>>>  
>>> +static int vfio_iommu_type1_register_handler(void *iommu_data,
>>> +					     struct device *dev)
>>> +{
>>> +	struct vfio_iommu *iommu = iommu_data;
>>> +
>>> +	if (iommu->iopf_enabled)
>>> +		return iommu_update_device_fault_handler(dev, ~0,
>>> +						FAULT_REPORT_NESTED_L1);
>>> +	else
>>> +		return iommu_register_device_fault_handler(dev,
>>> +						vfio_iommu_type1_dma_map_iopf,
>>> +						FAULT_REPORT_NESTED_L1, dev);
>>> +}
>>> +
>>> +static int vfio_iommu_type1_unregister_handler(void *iommu_data,
>>> +					       struct device *dev)
>>> +{
>>> +	struct vfio_iommu *iommu = iommu_data;
>>> +
>>> +	if (iommu->iopf_enabled)
>>> +		return iommu_update_device_fault_handler(dev,
>>> +						~FAULT_REPORT_NESTED_L1, 0);
>>> +	else
>>> +		return iommu_unregister_device_fault_handler(dev);
>>> +}
>>
>>
>> The path through vfio to register this is pretty ugly, but I don't see
>> any reason for the update interfaces here; the previously
>> registered handler just changes its behavior.
> 
> Yeah, this seems not an elegant way...
> 
> If IOPF(L2) enabled, the fault handler has already been registered, so for
> nested mode setup, we only need to change the flags of the handler in the
> IOMMU driver to receive L1 faults.
> (assume that L1 IOPF is configured after L2 IOPF)
> 
> Currently each device can only have one iommu dev fault handler, and L1
> and L2 IOPF are configured separately in nested mode. I am also wondering
> whether there is a better solution for this.

Hi Alex, Eric,

Let me simply add that maybe there is another way for this:
Would it be better to enable host IOPF (L2 faults) in the VFIO_IOMMU_MAP_DMA
ioctl (no new ioctl needed, and we can specify whether each mapping is IOPF
capable or statically pinned), and enable guest IOPF (L1 faults) in the
VFIO_IOMMU_SET_PASID_TABLE ioctl (from Eric's series)?
Then there is no ordering requirement between the two ioctls: whichever is
called first registers the handler, and the later one just updates it...
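A rough sketch of that ordering-independent setup (hypothetical names,
not the proposed uapi): whichever path runs first installs the single
handler, and any later caller only widens its flags.

```c
#define REPORT_L1 (1u << 0)  /* guest faults: SET_PASID_TABLE path */
#define REPORT_L2 (1u << 1)  /* host faults:  MAP_DMA path */

struct dev_state {
	int registered;        /* has the (single) handler been installed? */
	unsigned int flags;    /* fault levels it currently reports */
	int register_count;    /* how many real registrations happened */
};

/* Callable from either ioctl, in any order. */
static int iopf_setup(struct dev_state *d, unsigned int want)
{
	if (!d->registered) {
		/* first caller: iommu_register_device_fault_handler() */
		d->registered = 1;
		d->register_count++;
	}
	/* later caller(s): only iommu_update_device_fault_handler() */
	d->flags |= want;
	return 0;
}
```

Either calling order ends with one registration and both levels enabled,
which is the whole point of removing the ordering requirement.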

Thanks,
Shenming

> 
> Thanks,
> Shenming
> 
>>
>>
>>> +
>>>  static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
>>>  	.name			= "vfio-iommu-type1",
>>>  	.owner			= THIS_MODULE,
>>> @@ -4216,6 +4261,8 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
>>>  	.dma_rw			= vfio_iommu_type1_dma_rw,
>>>  	.group_iommu_domain	= vfio_iommu_type1_group_iommu_domain,
>>>  	.notify			= vfio_iommu_type1_notify,
>>> +	.register_handler	= vfio_iommu_type1_register_handler,
>>> +	.unregister_handler	= vfio_iommu_type1_unregister_handler,
>>>  };
>>>  
>>>  static int __init vfio_iommu_type1_init(void)
>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>>> index a7b426d579df..4621d8f0395d 100644
>>> --- a/include/linux/vfio.h
>>> +++ b/include/linux/vfio.h
>>> @@ -29,6 +29,8 @@
>>>   * @match: Optional device name match callback (return: 0 for no-match, >0 for
>>>   *         match, -errno for abort (ex. match with insufficient or incorrect
>>>   *         additional args)
>>> + * @transfer: Optional. Transfer the received stage/level 1 faults to the guest
>>> + *            for nested mode.
>>>   */
>>>  struct vfio_device_ops {
>>>  	char	*name;
>>> @@ -43,6 +45,7 @@ struct vfio_device_ops {
>>>  	int	(*mmap)(void *device_data, struct vm_area_struct *vma);
>>>  	void	(*request)(void *device_data, unsigned int count);
>>>  	int	(*match)(void *device_data, char *buf);
>>> +	int	(*transfer)(void *device_data, struct iommu_fault *fault);
>>>  };
>>>  
>>>  extern struct iommu_group *vfio_iommu_group_get(struct device *dev);
>>> @@ -100,6 +103,10 @@ struct vfio_iommu_driver_ops {
>>>  						   struct iommu_group *group);
>>>  	void		(*notify)(void *iommu_data,
>>>  				  enum vfio_iommu_notify_type event);
>>> +	int		(*register_handler)(void *iommu_data,
>>> +					    struct device *dev);
>>> +	int		(*unregister_handler)(void *iommu_data,
>>> +					      struct device *dev);
>>>  };
>>>  
>>>  extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
>>> @@ -161,6 +168,11 @@ extern int vfio_unregister_notifier(struct device *dev,
>>>  struct kvm;
>>>  extern void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm);
>>>  
>>> +extern int vfio_iommu_dev_fault_handler_register_nested(struct device *dev);
>>> +extern int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev);
>>> +extern int vfio_transfer_iommu_fault(struct device *dev,
>>> +				     struct iommu_fault *fault);
>>> +
>>>  /*
>>>   * Sub-module helpers
>>>   */
>>
>> .
>>


* Re: [RFC PATCH v3 8/8] vfio: Add nested IOPF support
  2021-05-24 13:11       ` Shenming Lu
@ 2021-05-24 22:11         ` Alex Williamson
  2021-05-27 11:03           ` Shenming Lu
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2021-05-24 22:11 UTC (permalink / raw)
  To: Shenming Lu
  Cc: Eric Auger, Cornelia Huck, Will Deacon, Robin Murphy,
	Joerg Roedel, Jean-Philippe Brucker, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, Lu Baolu,
	yi.l.liu, Christoph Hellwig, Jonathan Cameron, Barry Song,
	wanghaibin.wang, yuzenghui

On Mon, 24 May 2021 21:11:11 +0800
Shenming Lu <lushenming@huawei.com> wrote:

> On 2021/5/21 15:59, Shenming Lu wrote:
> > On 2021/5/19 2:58, Alex Williamson wrote:  
> >> On Fri, 9 Apr 2021 11:44:20 +0800
> >> Shenming Lu <lushenming@huawei.com> wrote:
> >>  
> >>> To set up nested mode, drivers such as vfio_pci need to register a
> >>> handler to receive stage/level 1 faults from the IOMMU, but since
> >>> currently each device can only have one iommu dev fault handler,
> >>> and if stage 2 IOPF is already enabled (VFIO_IOMMU_ENABLE_IOPF),
> >>> we choose to update the registered handler (a consolidated one) via
> >>> flags (set FAULT_REPORT_NESTED_L1), and further deliver the received
> >>> stage 1 faults in the handler to the guest through a newly added
> >>> vfio_device_ops callback.  
> >>
> >> Are there proposed in-kernel drivers that would use any of these
> >> symbols?  
> > 
> > I hope that such as Eric's SMMUv3 Nested Stage Setup series [1] can
> > use these symbols to consolidate the two page fault handlers into one.
> > 
> > [1] https://patchwork.kernel.org/project/kvm/cover/20210411114659.15051-1-eric.auger@redhat.com/
> >   
> >>  
> >>> Signed-off-by: Shenming Lu <lushenming@huawei.com>
> >>> ---
> >>>  drivers/vfio/vfio.c             | 81 +++++++++++++++++++++++++++++++++
> >>>  drivers/vfio/vfio_iommu_type1.c | 49 +++++++++++++++++++-
> >>>  include/linux/vfio.h            | 12 +++++
> >>>  3 files changed, 141 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> >>> index 44c8dfabf7de..4245f15914bf 100644
> >>> --- a/drivers/vfio/vfio.c
> >>> +++ b/drivers/vfio/vfio.c
> >>> @@ -2356,6 +2356,87 @@ struct iommu_domain *vfio_group_iommu_domain(struct vfio_group *group)
> >>>  }
> >>>  EXPORT_SYMBOL_GPL(vfio_group_iommu_domain);
> >>>  
> >>> +/*
> >>> + * Register/Update the VFIO IOPF handler to receive
> >>> + * nested stage/level 1 faults.
> >>> + */
> >>> +int vfio_iommu_dev_fault_handler_register_nested(struct device *dev)
> >>> +{
> >>> +	struct vfio_container *container;
> >>> +	struct vfio_group *group;
> >>> +	struct vfio_iommu_driver *driver;
> >>> +	int ret;
> >>> +
> >>> +	if (!dev)
> >>> +		return -EINVAL;
> >>> +
> >>> +	group = vfio_group_get_from_dev(dev);
> >>> +	if (!group)
> >>> +		return -ENODEV;
> >>> +
> >>> +	ret = vfio_group_add_container_user(group);
> >>> +	if (ret)
> >>> +		goto out;
> >>> +
> >>> +	container = group->container;
> >>> +	driver = container->iommu_driver;
> >>> +	if (likely(driver && driver->ops->register_handler))
> >>> +		ret = driver->ops->register_handler(container->iommu_data, dev);
> >>> +	else
> >>> +		ret = -ENOTTY;
> >>> +
> >>> +	vfio_group_try_dissolve_container(group);
> >>> +
> >>> +out:
> >>> +	vfio_group_put(group);
> >>> +	return ret;
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_register_nested);
> >>> +
> >>> +int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev)
> >>> +{
> >>> +	struct vfio_container *container;
> >>> +	struct vfio_group *group;
> >>> +	struct vfio_iommu_driver *driver;
> >>> +	int ret;
> >>> +
> >>> +	if (!dev)
> >>> +		return -EINVAL;
> >>> +
> >>> +	group = vfio_group_get_from_dev(dev);
> >>> +	if (!group)
> >>> +		return -ENODEV;
> >>> +
> >>> +	ret = vfio_group_add_container_user(group);
> >>> +	if (ret)
> >>> +		goto out;
> >>> +
> >>> +	container = group->container;
> >>> +	driver = container->iommu_driver;
> >>> +	if (likely(driver && driver->ops->unregister_handler))
> >>> +		ret = driver->ops->unregister_handler(container->iommu_data, dev);
> >>> +	else
> >>> +		ret = -ENOTTY;
> >>> +
> >>> +	vfio_group_try_dissolve_container(group);
> >>> +
> >>> +out:
> >>> +	vfio_group_put(group);
> >>> +	return ret;
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_unregister_nested);
> >>> +
> >>> +int vfio_transfer_iommu_fault(struct device *dev, struct iommu_fault *fault)
> >>> +{
> >>> +	struct vfio_device *device = dev_get_drvdata(dev);
> >>> +
> >>> +	if (unlikely(!device->ops->transfer))
> >>> +		return -EOPNOTSUPP;
> >>> +
> >>> +	return device->ops->transfer(device->device_data, fault);
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(vfio_transfer_iommu_fault);
> >>> +
> >>>  /**
> >>>   * Module/class support
> >>>   */
> >>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >>> index ba2b5a1cf6e9..9d1adeddb303 100644
> >>> --- a/drivers/vfio/vfio_iommu_type1.c
> >>> +++ b/drivers/vfio/vfio_iommu_type1.c
> >>> @@ -3821,13 +3821,32 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
> >>>  	struct vfio_batch batch;
> >>>  	struct vfio_range *range;
> >>>  	dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
> >>> -	int access_flags = 0;
> >>> +	int access_flags = 0, nested;
> >>>  	size_t premap_len, map_len, mapped_len = 0;
> >>>  	unsigned long bit_offset, vaddr, pfn, i, npages;
> >>>  	int ret;
> >>>  	enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
> >>>  	struct iommu_page_response resp = {0};
> >>>  
> >>> +	if (vfio_dev_domian_nested(dev, &nested))
> >>> +		return -ENODEV;
> >>> +
> >>> +	/*
> >>> +	 * When configured in nested mode, further deliver the
> >>> +	 * stage/level 1 faults to the guest.
> >>> +	 */
> >>> +	if (nested) {
> >>> +		bool l2;
> >>> +
> >>> +		if (fault->type == IOMMU_FAULT_PAGE_REQ)
> >>> +			l2 = fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_L2;
> >>> +		if (fault->type == IOMMU_FAULT_DMA_UNRECOV)
> >>> +			l2 = fault->event.flags & IOMMU_FAULT_UNRECOV_L2;
> >>> +
> >>> +		if (!l2)
> >>> +			return vfio_transfer_iommu_fault(dev, fault);
> >>> +	}
> >>> +
> >>>  	if (fault->type != IOMMU_FAULT_PAGE_REQ)
> >>>  		return -EOPNOTSUPP;
> >>>  
> >>> @@ -4201,6 +4220,32 @@ static void vfio_iommu_type1_notify(void *iommu_data,
> >>>  	wake_up_all(&iommu->vaddr_wait);
> >>>  }
> >>>  
> >>> +static int vfio_iommu_type1_register_handler(void *iommu_data,
> >>> +					     struct device *dev)
> >>> +{
> >>> +	struct vfio_iommu *iommu = iommu_data;
> >>> +
> >>> +	if (iommu->iopf_enabled)
> >>> +		return iommu_update_device_fault_handler(dev, ~0,
> >>> +						FAULT_REPORT_NESTED_L1);
> >>> +	else
> >>> +		return iommu_register_device_fault_handler(dev,
> >>> +						vfio_iommu_type1_dma_map_iopf,
> >>> +						FAULT_REPORT_NESTED_L1, dev);
> >>> +}
> >>> +
> >>> +static int vfio_iommu_type1_unregister_handler(void *iommu_data,
> >>> +					       struct device *dev)
> >>> +{
> >>> +	struct vfio_iommu *iommu = iommu_data;
> >>> +
> >>> +	if (iommu->iopf_enabled)
> >>> +		return iommu_update_device_fault_handler(dev,
> >>> +						~FAULT_REPORT_NESTED_L1, 0);
> >>> +	else
> >>> +		return iommu_unregister_device_fault_handler(dev);
> >>> +}  
> >>
> >>
> >> The path through vfio to register this is pretty ugly, but I don't see
> >> any reason for the update interfaces here; the previously
> >> registered handler just changes its behavior.
> > 
> > Yeah, this seems not an elegant way...
> > 
> > If IOPF(L2) enabled, the fault handler has already been registered, so for
> > nested mode setup, we only need to change the flags of the handler in the
> > IOMMU driver to receive L1 faults.
> > (assume that L1 IOPF is configured after L2 IOPF)
> > 
> > Currently each device can only have one iommu dev fault handler, and L1
> > and L2 IOPF are configured separately in nested mode. I am also wondering
> > whether there is a better solution for this.

I haven't fully read all the references, but who imposes the fact that
there's only one fault handler per device?  If type1 could register one
handler and the vfio-pci bus driver another for the other level, would
we need this path through vfio-core?
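Roughly what that alternative would look like, as a toy userspace sketch
(purely illustrative, not an existing API): if the IOMMU layer allowed one
handler per fault level, type1 and the vfio-pci bus driver could each
register independently, with no path through vfio-core.

```c
#include <stddef.h>

enum fault_level { LEVEL1, LEVEL2, NR_LEVELS };

typedef int (*fault_fn)(void *data);

/* Per-device table: one handler slot per fault level. */
struct per_level_handlers {
	fault_fn fn[NR_LEVELS];
	void *data[NR_LEVELS];
};

static int register_level_handler(struct per_level_handlers *p,
				  enum fault_level l, fault_fn fn, void *data)
{
	if (p->fn[l])
		return -1;     /* like -EBUSY: level already claimed */
	p->fn[l] = fn;
	p->data[l] = data;
	return 0;
}

static int report_fault(struct per_level_handlers *p, enum fault_level l)
{
	if (!p->fn[l])
		return -1;     /* nobody claimed this level */
	return p->fn[l](p->data[l]);
}

/* Sample consumers, standing in for type1 (L2) and vfio-pci (L1). */
static int count_fault(void *data)
{
	++*(int *)data;
	return 0;
}
```

Each consumer owns its level outright, so no flag-update interface is
needed at all; the question is whether the IOMMU core wants to carry the
per-level dispatch.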

> Let me simply add that maybe there is another way for this:
> Would it be better to enable host IOPF (L2 faults) in the VFIO_IOMMU_MAP_DMA
> ioctl (no new ioctl needed, and we can specify whether each mapping is IOPF
> capable or statically pinned), and enable guest IOPF (L1 faults) in the
> VFIO_IOMMU_SET_PASID_TABLE ioctl (from Eric's series)?
> Then there is no ordering requirement between the two ioctls: whichever is
> called first registers the handler, and the later one just updates it...

This is looking more and more like it belongs with the IOASID work.  I
think Eric has shifted his focus there too.  Thanks,

Alex



* Re: [RFC PATCH v3 5/8] vfio/type1: VFIO_IOMMU_ENABLE_IOPF
  2021-05-21  6:38     ` Shenming Lu
@ 2021-05-24 22:11       ` Alex Williamson
  2021-05-27 11:15         ` Shenming Lu
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2021-05-24 22:11 UTC (permalink / raw)
  To: Shenming Lu
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, Eric Auger, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, Lu Baolu,
	yi.l.liu, Christoph Hellwig, Jonathan Cameron, Barry Song,
	wanghaibin.wang, yuzenghui

On Fri, 21 May 2021 14:38:25 +0800
Shenming Lu <lushenming@huawei.com> wrote:

> On 2021/5/19 2:58, Alex Williamson wrote:
> > On Fri, 9 Apr 2021 11:44:17 +0800
> > Shenming Lu <lushenming@huawei.com> wrote:
> >   
> >> Since enabling IOPF for devices may lead to a slow ramp up of performance,
> >> we add an ioctl VFIO_IOMMU_ENABLE_IOPF to make it configurable. And the
> >> IOPF enabling of a VFIO device includes setting IOMMU_DEV_FEAT_IOPF and
> >> registering the VFIO IOPF handler.
> >>
> >> Note that VFIO_IOMMU_DISABLE_IOPF is not supported since there may be
> >> inflight page faults when disabling.
> >>
> >> Signed-off-by: Shenming Lu <lushenming@huawei.com>
> >> ---
> >>  drivers/vfio/vfio_iommu_type1.c | 223 +++++++++++++++++++++++++++++++-
> >>  include/uapi/linux/vfio.h       |   6 +
> >>  2 files changed, 226 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >> index 01e296c6dc9e..7df5711e743a 100644
> >> --- a/drivers/vfio/vfio_iommu_type1.c
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >> @@ -71,6 +71,7 @@ struct vfio_iommu {
> >>  	struct rb_root		dma_list;
> >>  	struct blocking_notifier_head notifier;
> >>  	struct mmu_notifier	mn;
> >> +	struct mm_struct	*mm;  
> > 
> > We currently have no requirement that a single mm is used for all
> > container mappings.  Does enabling IOPF impose such a requirement?
> > Shouldn't MAP/UNMAP enforce that?  
> 
> Did you mean that there is a possibility that each vfio_dma in a
> container has a different mm? If so, could we register one MMU
> notifier for each vfio_dma in the MAP ioctl?

We don't prevent it currently.  There needs to be some balance;
typically we'll have one mm, so would it make sense to have potentially
thousands of mmu notifiers registered against the same mm?
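One shape that balance could take, as a hypothetical userspace sketch
(made-up names, fixed-size table just for illustration): key the notifier
on the mm and refcount it, so any number of mappings sharing one mm cost a
single registration.

```c
#include <stddef.h>

#define MAX_MMS 4

/* One entry per distinct mm, shared by every mapping that uses it. */
struct mm_note {
	void *mm;   /* stand-in for a struct mm_struct pointer */
	int refs;
};

static int notifier_get(struct mm_note *tbl, void *mm)
{
	int i, free_slot = -1;

	for (i = 0; i < MAX_MMS; i++) {
		if (tbl[i].mm == mm) {
			tbl[i].refs++;   /* already registered: share it */
			return 0;
		}
		if (!tbl[i].mm && free_slot < 0)
			free_slot = i;
	}
	if (free_slot < 0)
		return -1;               /* table full */
	tbl[free_slot].mm = mm;          /* mmu_notifier_register() would go here */
	tbl[free_slot].refs = 1;
	return 0;
}

static int live_notifiers(const struct mm_note *tbl)
{
	int i, n = 0;

	for (i = 0; i < MAX_MMS; i++)
		if (tbl[i].mm)
			n++;
	return n;
}
```

Mappings from the same mm then add a reference rather than a notifier, so
the common one-mm case never registers more than once.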
 
> >>  	unsigned int		dma_avail;
> >>  	unsigned int		vaddr_invalid_count;
> >>  	uint64_t		pgsize_bitmap;
> >> @@ -81,6 +82,7 @@ struct vfio_iommu {
> >>  	bool			dirty_page_tracking;
> >>  	bool			pinned_page_dirty_scope;
> >>  	bool			container_open;
> >> +	bool			iopf_enabled;
> >>  };
> >>  
> >>  struct vfio_domain {
> >> @@ -461,6 +463,38 @@ vfio_find_iopf_group(struct iommu_group *iommu_group)
> >>  	return node ? iopf_group : NULL;
> >>  }
> >>  
> >> +static void vfio_link_iopf_group(struct vfio_iopf_group *new)
> >> +{
> >> +	struct rb_node **link, *parent = NULL;
> >> +	struct vfio_iopf_group *iopf_group;
> >> +
> >> +	mutex_lock(&iopf_group_list_lock);
> >> +
> >> +	link = &iopf_group_list.rb_node;
> >> +
> >> +	while (*link) {
> >> +		parent = *link;
> >> +		iopf_group = rb_entry(parent, struct vfio_iopf_group, node);
> >> +
> >> +		if (new->iommu_group < iopf_group->iommu_group)
> >> +			link = &(*link)->rb_left;
> >> +		else
> >> +			link = &(*link)->rb_right;
> >> +	}
> >> +
> >> +	rb_link_node(&new->node, parent, link);
> >> +	rb_insert_color(&new->node, &iopf_group_list);
> >> +
> >> +	mutex_unlock(&iopf_group_list_lock);
> >> +}
> >> +
> >> +static void vfio_unlink_iopf_group(struct vfio_iopf_group *old)
> >> +{
> >> +	mutex_lock(&iopf_group_list_lock);
> >> +	rb_erase(&old->node, &iopf_group_list);
> >> +	mutex_unlock(&iopf_group_list_lock);
> >> +}
> >> +
> >>  static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
> >>  {
> >>  	struct mm_struct *mm;
> >> @@ -2363,6 +2397,68 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu,
> >>  	list_splice_tail(iova_copy, iova);
> >>  }
> >>  
> >> +static int vfio_dev_domian_nested(struct device *dev, int *nested)  
> > 
> > 
> > s/domian/domain/
> > 
> >   
> >> +{
> >> +	struct iommu_domain *domain;
> >> +
> >> +	domain = iommu_get_domain_for_dev(dev);
> >> +	if (!domain)
> >> +		return -ENODEV;
> >> +
> >> +	return iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, nested);  
> > 
> > 
> > Wouldn't this be easier to use if it returned bool, false on -errno?  
> 
> Make sense.
> 
> > 
> >   
> >> +}
> >> +
> >> +static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data);
> >> +
> >> +static int dev_enable_iopf(struct device *dev, void *data)
> >> +{
> >> +	int *enabled_dev_cnt = data;
> >> +	int nested;
> >> +	u32 flags;
> >> +	int ret;
> >> +
> >> +	ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_IOPF);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	ret = vfio_dev_domian_nested(dev, &nested);
> >> +	if (ret)
> >> +		goto out_disable;
> >> +
> >> +	if (nested)
> >> +		flags = FAULT_REPORT_NESTED_L2;
> >> +	else
> >> +		flags = FAULT_REPORT_FLAT;
> >> +
> >> +	ret = iommu_register_device_fault_handler(dev,
> >> +				vfio_iommu_type1_dma_map_iopf, flags, dev);
> >> +	if (ret)
> >> +		goto out_disable;
> >> +
> >> +	(*enabled_dev_cnt)++;
> >> +	return 0;
> >> +
> >> +out_disable:
> >> +	iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF);
> >> +	return ret;
> >> +}
> >> +
> >> +static int dev_disable_iopf(struct device *dev, void *data)
> >> +{
> >> +	int *enabled_dev_cnt = data;
> >> +
> >> +	if (enabled_dev_cnt && *enabled_dev_cnt <= 0)
> >> +		return -1;  
> > 
> > 
> > Use an errno.
> >   
> >> +
> >> +	WARN_ON(iommu_unregister_device_fault_handler(dev));
> >> +	WARN_ON(iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF));
> >> +
> >> +	if (enabled_dev_cnt)
> >> +		(*enabled_dev_cnt)--;  
> > 
> > 
> > Why do we need these counters?  
> 
> We use this counter to record the number of IOPF-enabled devices, and if
> any device fails to be enabled, we will exit the loop (three nested loops)
> and go to the unwind process, which will disable the first @enabled_dev_cnt
> devices.
> 
> >   
> >> +
> >> +	return 0;
> >> +}
> >> +
> >>  static int vfio_iommu_type1_attach_group(void *iommu_data,
> >>  					 struct iommu_group *iommu_group)
> >>  {
> >> @@ -2376,6 +2472,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >>  	struct iommu_domain_geometry geo;
> >>  	LIST_HEAD(iova_copy);
> >>  	LIST_HEAD(group_resv_regions);
> >> +	int iopf_enabled_dev_cnt = 0;
> >> +	struct vfio_iopf_group *iopf_group = NULL;
> >>  
> >>  	mutex_lock(&iommu->lock);
> >>  
> >> @@ -2453,6 +2551,24 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >>  	if (ret)
> >>  		goto out_domain;
> >>  
> >> +	if (iommu->iopf_enabled) {
> >> +		ret = iommu_group_for_each_dev(iommu_group, &iopf_enabled_dev_cnt,
> >> +					       dev_enable_iopf);
> >> +		if (ret)
> >> +			goto out_detach;
> >> +
> >> +		iopf_group = kzalloc(sizeof(*iopf_group), GFP_KERNEL);
> >> +		if (!iopf_group) {
> >> +			ret = -ENOMEM;
> >> +			goto out_detach;
> >> +		}
> >> +
> >> +		iopf_group->iommu_group = iommu_group;
> >> +		iopf_group->iommu = iommu;
> >> +
> >> +		vfio_link_iopf_group(iopf_group);  
> > 
> > This seems backwards, once we enable iopf we can take a fault, so the
> > structure to lookup the data for the device should be setup first.  I'm
> > still not convinced this iopf_group rb tree is a good solution to get
> > from the device to the iommu either.  vfio-core could traverse dev ->
> > iommu_group -> vfio_group -> container -> iommu_data.  
> 
> Make sense.
> 
> > 
> >   
> >> +	}
> >> +
> >>  	/* Get aperture info */
> >>  	iommu_domain_get_attr(domain->domain, DOMAIN_ATTR_GEOMETRY, &geo);
> >>  
> >> @@ -2534,9 +2650,11 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >>  	vfio_test_domain_fgsp(domain);
> >>  
> >>  	/* replay mappings on new domains */
> >> -	ret = vfio_iommu_replay(iommu, domain);
> >> -	if (ret)
> >> -		goto out_detach;
> >> +	if (!iommu->iopf_enabled) {
> >> +		ret = vfio_iommu_replay(iommu, domain);
> >> +		if (ret)
> >> +			goto out_detach;
> >> +	}  
> > 
> > Test in vfio_iommu_replay()?  
> 
> Not yet, I will do more tests later.
> 
> >   
> >>  
> >>  	if (resv_msi) {
> >>  		ret = iommu_get_msi_cookie(domain->domain, resv_msi_base);
> >> @@ -2567,6 +2685,15 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >>  	iommu_domain_free(domain->domain);
> >>  	vfio_iommu_iova_free(&iova_copy);
> >>  	vfio_iommu_resv_free(&group_resv_regions);
> >> +	if (iommu->iopf_enabled) {
> >> +		if (iopf_group) {
> >> +			vfio_unlink_iopf_group(iopf_group);
> >> +			kfree(iopf_group);
> >> +		}
> >> +
> >> +		iommu_group_for_each_dev(iommu_group, &iopf_enabled_dev_cnt,
> >> +					 dev_disable_iopf);  
> > 
> > Backwards, disable fault reporting before unlinking lookup.  
> 
> Make sense.
> 
> >   
> >> +	}
> >>  out_free:
> >>  	kfree(domain);
> >>  	kfree(group);
> >> @@ -2728,6 +2855,19 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
> >>  		if (!group)
> >>  			continue;
> >>  
> >> +		if (iommu->iopf_enabled) {
> >> +			struct vfio_iopf_group *iopf_group;
> >> +
> >> +			iopf_group = vfio_find_iopf_group(iommu_group);
> >> +			if (!WARN_ON(!iopf_group)) {
> >> +				vfio_unlink_iopf_group(iopf_group);
> >> +				kfree(iopf_group);
> >> +			}
> >> +
> >> +			iommu_group_for_each_dev(iommu_group, NULL,
> >> +						 dev_disable_iopf);  
> > 
> > 
> > Backwards.  Also unregistering the fault handler can fail if there are
> > pending faults.  It appears the fault handler and iopf lookup could also
> > race with this function, ex. fault handler gets an iopf object before
> > it's removed from the rb-tree, blocks trying to acquire iommu->lock,
> > disable_iopf fails, detach proceeds, fault handler has use after free
> > error.  
> 
> We have WARN_ON for the failure of the unregistering.
> 
> Yeah, it seems that using vfio_iopf_group is not a good choice...
> 
> > 
> >   
> >> +		}
> >> +
> >>  		vfio_iommu_detach_group(domain, group);
> >>  		update_dirty_scope = !group->pinned_page_dirty_scope;
> >>  		list_del(&group->next);
> >> @@ -2846,6 +2986,11 @@ static void vfio_iommu_type1_release(void *iommu_data)
> >>  
> >>  	vfio_iommu_iova_free(&iommu->iova_list);
> >>  
> >> +	if (iommu->iopf_enabled) {
> >> +		mmu_notifier_unregister(&iommu->mn, iommu->mm);
> >> +		mmdrop(iommu->mm);
> >> +	}
> >> +
> >>  	kfree(iommu);
> >>  }
> >>  
> >> @@ -3441,6 +3586,76 @@ static const struct mmu_notifier_ops vfio_iommu_type1_mn_ops = {
> >>  	.invalidate_range	= mn_invalidate_range,
> >>  };
> >>  
> >> +static int vfio_iommu_type1_enable_iopf(struct vfio_iommu *iommu)
> >> +{
> >> +	struct vfio_domain *d;
> >> +	struct vfio_group *g;
> >> +	struct vfio_iopf_group *iopf_group;
> >> +	int enabled_dev_cnt = 0;
> >> +	int ret;
> >> +
> >> +	if (!current->mm)
> >> +		return -ENODEV;
> >> +
> >> +	mutex_lock(&iommu->lock);
> >> +
> >> +	mmgrab(current->mm);  
> > 
> > 
> > As above, this is a new and undocumented requirement.
> > 
> >   
> >> +	iommu->mm = current->mm;
> >> +	iommu->mn.ops = &vfio_iommu_type1_mn_ops;
> >> +	ret = mmu_notifier_register(&iommu->mn, current->mm);
> >> +	if (ret)
> >> +		goto out_drop;
> >> +
> >> +	list_for_each_entry(d, &iommu->domain_list, next) {
> >> +		list_for_each_entry(g, &d->group_list, next) {
> >> +			ret = iommu_group_for_each_dev(g->iommu_group,
> >> +					&enabled_dev_cnt, dev_enable_iopf);
> >> +			if (ret)
> >> +				goto out_unwind;
> >> +
> >> +			iopf_group = kzalloc(sizeof(*iopf_group), GFP_KERNEL);
> >> +			if (!iopf_group) {
> >> +				ret = -ENOMEM;
> >> +				goto out_unwind;
> >> +			}
> >> +
> >> +			iopf_group->iommu_group = g->iommu_group;
> >> +			iopf_group->iommu = iommu;
> >> +
> >> +			vfio_link_iopf_group(iopf_group);
> >> +		}
> >> +	}
> >> +
> >> +	iommu->iopf_enabled = true;
> >> +	goto out_unlock;
> >> +
> >> +out_unwind:
> >> +	list_for_each_entry(d, &iommu->domain_list, next) {
> >> +		list_for_each_entry(g, &d->group_list, next) {
> >> +			iopf_group = vfio_find_iopf_group(g->iommu_group);
> >> +			if (iopf_group) {
> >> +				vfio_unlink_iopf_group(iopf_group);
> >> +				kfree(iopf_group);
> >> +			}
> >> +
> >> +			if (iommu_group_for_each_dev(g->iommu_group,
> >> +					&enabled_dev_cnt, dev_disable_iopf))
> >> +				goto out_unregister;  
> > 
> > 
> > This seems broken, we break out of the unwind if any group_for_each
> > fails??  
> 
> The iommu_group_for_each_dev function will return a negative value if
> we have disabled all previously enabled devices (@enabled_dev_cnt).

What's the harm in calling disable IOPF on a device where it's already
disabled?  This is why I'm not sure of the value of keeping a counter.  It
also presumes that for_each_dev iterates in the same order every time.
 
> >> +		}
> >> +	}
> >> +
> >> +out_unregister:
> >> +	mmu_notifier_unregister(&iommu->mn, current->mm);
> >> +
> >> +out_drop:
> >> +	iommu->mm = NULL;
> >> +	mmdrop(current->mm);
> >> +
> >> +out_unlock:
> >> +	mutex_unlock(&iommu->lock);
> >> +	return ret;
> >> +}  
> > 
> > 
> > Is there an assumption that the user does this before performing any
> > mapping?  I don't see where current mappings would be converted which
> > means we'll have pinned pages leaked and vfio_dma objects without a
> > pinned page bitmap.  
> 
> Yeah, we have an assumption that this ioctl (ENABLE_IOPF) must be called
> before any MAP/UNMAP ioctl...
> 
> I will document these later. :-)


That can't be fixed with documentation, if it's a requirement, the code
needs to enforce it.  But also why is it a requirement?  Theoretically
we could unpin and de-account existing mappings and allow the mmu
notifier to handle these as well, right?  Thanks,

Alex


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v3 2/8] vfio/type1: Add a page fault handler
  2021-05-21  6:38     ` Shenming Lu
@ 2021-05-24 22:11       ` Alex Williamson
  2021-05-27 11:16         ` Shenming Lu
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2021-05-24 22:11 UTC (permalink / raw)
  To: Shenming Lu
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, Eric Auger, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, Lu Baolu,
	yi.l.liu, Christoph Hellwig, Jonathan Cameron, Barry Song,
	wanghaibin.wang, yuzenghui

On Fri, 21 May 2021 14:38:52 +0800
Shenming Lu <lushenming@huawei.com> wrote:

> On 2021/5/19 2:58, Alex Williamson wrote:
> > On Fri, 9 Apr 2021 11:44:14 +0800
> > Shenming Lu <lushenming@huawei.com> wrote:
> >   
> >> VFIO manages the DMA mapping itself. To support IOPF (on-demand paging)
> >> for VFIO (IOMMU capable) devices, we add a VFIO page fault handler to
> >> serve the reported page faults from the IOMMU driver.
> >>
> >> Signed-off-by: Shenming Lu <lushenming@huawei.com>
> >> ---
> >>  drivers/vfio/vfio_iommu_type1.c | 114 ++++++++++++++++++++++++++++++++
> >>  1 file changed, 114 insertions(+)
> >>
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >> index 45cbfd4879a5..ab0ff60ee207 100644
> >> --- a/drivers/vfio/vfio_iommu_type1.c
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >> @@ -101,6 +101,7 @@ struct vfio_dma {
> >>  	struct task_struct	*task;
> >>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
> >>  	unsigned long		*bitmap;
> >> +	unsigned long		*iopf_mapped_bitmap;
> >>  };
> >>  
> >>  struct vfio_batch {
> >> @@ -141,6 +142,16 @@ struct vfio_regions {
> >>  	size_t len;
> >>  };
> >>  
> >> +/* A global IOPF enabled group list */
> >> +static struct rb_root iopf_group_list = RB_ROOT;
> >> +static DEFINE_MUTEX(iopf_group_list_lock);
> >> +
> >> +struct vfio_iopf_group {
> >> +	struct rb_node		node;
> >> +	struct iommu_group	*iommu_group;
> >> +	struct vfio_iommu	*iommu;
> >> +};
> >> +
> >>  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
> >>  					(!list_empty(&iommu->domain_list))
> >>  
> >> @@ -157,6 +168,10 @@ struct vfio_regions {
> >>  #define DIRTY_BITMAP_PAGES_MAX	 ((u64)INT_MAX)
> >>  #define DIRTY_BITMAP_SIZE_MAX	 DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
> >>  
> >> +#define IOPF_MAPPED_BITMAP_GET(dma, i)	\
> >> +			      ((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG]	\
> >> +			       >> ((i) % BITS_PER_LONG)) & 0x1)  
> > 
> > 
> > Can't we just use test_bit()?  
> 
> Yeah, we can use it.
> 
> > 
> >   
> >> +
> >>  #define WAITED 1
> >>  
> >>  static int put_pfn(unsigned long pfn, int prot);
> >> @@ -416,6 +431,34 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
> >>  	return ret;
> >>  }
> >>  
> >> +/*
> >> + * Helper functions for iopf_group_list
> >> + */
> >> +static struct vfio_iopf_group *
> >> +vfio_find_iopf_group(struct iommu_group *iommu_group)
> >> +{
> >> +	struct vfio_iopf_group *iopf_group;
> >> +	struct rb_node *node;
> >> +
> >> +	mutex_lock(&iopf_group_list_lock);
> >> +
> >> +	node = iopf_group_list.rb_node;
> >> +
> >> +	while (node) {
> >> +		iopf_group = rb_entry(node, struct vfio_iopf_group, node);
> >> +
> >> +		if (iommu_group < iopf_group->iommu_group)
> >> +			node = node->rb_left;
> >> +		else if (iommu_group > iopf_group->iommu_group)
> >> +			node = node->rb_right;
> >> +		else
> >> +			break;
> >> +	}
> >> +
> >> +	mutex_unlock(&iopf_group_list_lock);
> >> +	return node ? iopf_group : NULL;
> >> +}  
> > 
> > This looks like a pretty heavy weight operation per DMA fault.
> > 
> > I'm also suspicious of the validity of this iopf_group after we've
> > dropped the locking, the ordering of patches makes this very confusing.  
> 
> My thought was to include the handling of DMA faults completely in the type1
> backend by introducing the vfio_iopf_group struct. But it seems that introducing
> a struct with an unknown lifecycle causes more problems...
> I will use the path from vfio-core as in the v2 for simplicity and validity.
> 
> Sorry for the confusion, I will restructure the series later. :-)
> 
> >   
> >> +
> >>  static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
> >>  {
> >>  	struct mm_struct *mm;
> >> @@ -3106,6 +3149,77 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
> >>  	return -EINVAL;
> >>  }
> >>  
> >> +/* VFIO I/O Page Fault handler */
> >> +static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)  
> >   
> > From the comment, this seems like the IOMMU fault handler (the
> > construction of this series makes this difficult to follow) and
> > eventually it handles more than DMA mapping, for example transferring
> > faults to the device driver.  "dma_map_iopf" seems like a poorly scoped
> > name.  
> 
> Maybe just call it dev_fault_handler?

Better.

> >> +{
> >> +	struct device *dev = (struct device *)data;
> >> +	struct iommu_group *iommu_group;
> >> +	struct vfio_iopf_group *iopf_group;
> >> +	struct vfio_iommu *iommu;
> >> +	struct vfio_dma *dma;
> >> +	dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
> >> +	int access_flags = 0;
> >> +	unsigned long bit_offset, vaddr, pfn;
> >> +	int ret;
> >> +	enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
> >> +	struct iommu_page_response resp = {0};
> >> +
> >> +	if (fault->type != IOMMU_FAULT_PAGE_REQ)
> >> +		return -EOPNOTSUPP;
> >> +
> >> +	iommu_group = iommu_group_get(dev);
> >> +	if (!iommu_group)
> >> +		return -ENODEV;
> >> +
> >> +	iopf_group = vfio_find_iopf_group(iommu_group);
> >> +	iommu_group_put(iommu_group);
> >> +	if (!iopf_group)
> >> +		return -ENODEV;
> >> +
> >> +	iommu = iopf_group->iommu;
> >> +
> >> +	mutex_lock(&iommu->lock);  
> > 
> > Again, I'm dubious of our ability to grab this lock from an object with
> > an unknown lifecycle and races we might have with that group being
> > detached or DMA unmapped.  Also, how effective is enabling IOMMU page
> > faulting if we're serializing all faults within a container context?  
> 
> Did you mean "efficient"?

Yes, that's more appropriate.

> I also worry about this as the mapping and unmapping of the faulting pages
> are all with the same lock...
> Is there a way to parallel them? Or could we have more fine grained lock
> control?

It seems we need it; the current locking is designed for static
mappings by the user, therefore concurrency hasn't been a priority.
This again touches how far we want to extend type1 in the direction
we intend to go with SVA/PASID support in IOASID.  Thanks,

Alex



* Re: [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough
  2021-05-21  6:37   ` Shenming Lu
@ 2021-05-24 22:11     ` Alex Williamson
  2021-05-27 11:25       ` Shenming Lu
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2021-05-24 22:11 UTC (permalink / raw)
  To: Shenming Lu
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, Eric Auger, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, Lu Baolu,
	yi.l.liu, Christoph Hellwig, Jonathan Cameron, Barry Song,
	wanghaibin.wang, yuzenghui

On Fri, 21 May 2021 14:37:21 +0800
Shenming Lu <lushenming@huawei.com> wrote:

> Hi Alex,
> 
> On 2021/5/19 2:57, Alex Williamson wrote:
> > On Fri, 9 Apr 2021 11:44:12 +0800
> > Shenming Lu <lushenming@huawei.com> wrote:
> >   
> >> Hi,
> >>
> >> Requesting for your comments and suggestions. :-)
> >>
> >> The static pinning and mapping problem in VFIO and possible solutions
> >> have been discussed a lot [1, 2]. One of the solutions is to add I/O
> >> Page Fault support for VFIO devices. Different from those relatively
> >> complicated software approaches such as presenting a vIOMMU that provides
> >> the DMA buffer information (might include para-virtualized optimizations),
> >> IOPF mainly depends on the hardware faulting capability, such as the PCIe
> >> PRI extension or Arm SMMU stall model. What's more, the IOPF support in
> >> the IOMMU driver has already been implemented in SVA [3]. So we add IOPF
> >> support for VFIO passthrough based on the IOPF part of SVA in this series.  
> > 
> > The SVA proposals are being reworked to make use of a new IOASID
> > object, it's not clear to me that this shouldn't also make use of that
> > work as it does a significant expansion of the type1 IOMMU with fault
> > handlers that would duplicate new work using that new model.  
> 
> It seems that the IOASID extension for guest SVA would not affect this
> series; will we do any host-guest IOASID translation in the VFIO fault
> handler?

Surely it will, we don't currently have any IOMMU fault handling or
forwarding of IOMMU faults through to the vfio bus driver, both of
those would be included in an IOASID implementation.  I think Jason's
vision is to use IOASID to deprecate type1 for all use cases, so even
if we were to choose to implement IOPF in type1 we should agree on
common interfaces with IOASID.
 
> >> We have measured its performance with UADK [4] (passthrough an accelerator
> >> to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA):
> >>
> >> Run hisi_sec_test...
> >>  - with varying sending times and message lengths
> >>  - with/without IOPF enabled (speed slowdown)
> >>
> >> when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1):
> >>             slowdown (num of faults)
> >>  times      VFIO IOPF      host SVA
> >>  1          63.4% (518)    82.8% (512)
> >>  100        22.9% (1058)   47.9% (1024)
> >>  1000       2.6% (1071)    8.5% (1024)
> >>
> >> when msg_len = 10MB (and PREMAP_LEN = 512):
> >>             slowdown (num of faults)
> >>  times      VFIO IOPF
> >>  1          32.6% (13)
> >>  100        3.5% (26)
> >>  1000       1.6% (26)  
> > 
> > It seems like this is only an example that you can make a benchmark
> > show anything you want.  The best results would be to pre-map
> > everything, which is what we have without this series.  What is an
> > acceptable overhead to incur to avoid page pinning?  What if userspace
> > had more fine grained control over which mappings were available for
> > faulting and which were statically mapped?  I don't really see what
> > sense the pre-mapping range makes.  If I assume the user is QEMU in a
> > non-vIOMMU configuration, pre-mapping the beginning of each RAM section
> > doesn't make any logical sense relative to device DMA.  
> 
> As you said in Patch 4, we can introduce full end-to-end functionality
> before trying to improve performance, and I will drop the pre-mapping patch
> at the current stage...
> 
> Is there a need that userspace wants more fine grained control over which
> mappings are available for faulting? If so, we may evolve the MAP ioctl
> to support for specifying the faulting range.

You're essentially enabling this for a vfio bus driver via patch 7/8,
pinning for selective DMA faulting.  How would a driver in userspace
make equivalent requests?  In the case of performance, the user could
mlock the page but they have no mechanism here to pre-fault it.  Should
they?

> As for the overhead of IOPF, it is unavoidable when enabling on-demand
> paging (and page faults occur almost only on first access), and there
> seems to be considerable room for optimization compared to CPU page
> faulting.

Yes, there's of course going to be overhead in terms of latency for the
page faults.  My point was more that when a host is not under memory
pressure we should trend towards the performance of pinned, static
mappings and we should be able to create arbitrarily large pre-fault
behavior to show that.  But I think what we really want to enable via
IOPF is density, right?  Therefore how many more assigned device guests
can you run on a host with IOPF?  How does the slope, plateau, and
inflection point of their aggregate throughput compare to static
pinning?  VM startup time is probably also a useful metric, ie. trading
device latency for startup latency.  Thanks,

Alex



* Re: [RFC PATCH v3 8/8] vfio: Add nested IOPF support
  2021-05-24 22:11         ` Alex Williamson
@ 2021-05-27 11:03           ` Shenming Lu
  2021-05-27 11:18             ` Lu Baolu
  0 siblings, 1 reply; 40+ messages in thread
From: Shenming Lu @ 2021-05-27 11:03 UTC (permalink / raw)
  To: Alex Williamson, Lu Baolu
  Cc: Eric Auger, Cornelia Huck, Will Deacon, Robin Murphy,
	Joerg Roedel, Jean-Philippe Brucker, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, yi.l.liu,
	Christoph Hellwig, Jonathan Cameron, Barry Song, wanghaibin.wang,
	yuzenghui

On 2021/5/25 6:11, Alex Williamson wrote:
> On Mon, 24 May 2021 21:11:11 +0800
> Shenming Lu <lushenming@huawei.com> wrote:
> 
>> On 2021/5/21 15:59, Shenming Lu wrote:
>>> On 2021/5/19 2:58, Alex Williamson wrote:  
>>>> On Fri, 9 Apr 2021 11:44:20 +0800
>>>> Shenming Lu <lushenming@huawei.com> wrote:
>>>>  
>>>>> To set up nested mode, drivers such as vfio_pci need to register a
>>>>> handler to receive stage/level 1 faults from the IOMMU, but since
>>>>> currently each device can only have one iommu dev fault handler,
>>>>> and if stage 2 IOPF is already enabled (VFIO_IOMMU_ENABLE_IOPF),
>>>>> we choose to update the registered handler (a consolidated one) via
>>>>> flags (set FAULT_REPORT_NESTED_L1), and further deliver the received
>>>>> stage 1 faults in the handler to the guest through a newly added
>>>>> vfio_device_ops callback.  
>>>>
>>>> Are there proposed in-kernel drivers that would use any of these
>>>> symbols?  
>>>
>>> I hope that such as Eric's SMMUv3 Nested Stage Setup series [1] can
>>> use these symbols to consolidate the two page fault handlers into one.
>>>
>>> [1] https://patchwork.kernel.org/project/kvm/cover/20210411114659.15051-1-eric.auger@redhat.com/
>>>   
>>>>  
>>>>> Signed-off-by: Shenming Lu <lushenming@huawei.com>
>>>>> ---
>>>>>  drivers/vfio/vfio.c             | 81 +++++++++++++++++++++++++++++++++
>>>>>  drivers/vfio/vfio_iommu_type1.c | 49 +++++++++++++++++++-
>>>>>  include/linux/vfio.h            | 12 +++++
>>>>>  3 files changed, 141 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>>>> index 44c8dfabf7de..4245f15914bf 100644
>>>>> --- a/drivers/vfio/vfio.c
>>>>> +++ b/drivers/vfio/vfio.c
>>>>> @@ -2356,6 +2356,87 @@ struct iommu_domain *vfio_group_iommu_domain(struct vfio_group *group)
>>>>>  }
>>>>>  EXPORT_SYMBOL_GPL(vfio_group_iommu_domain);
>>>>>  
>>>>> +/*
>>>>> + * Register/Update the VFIO IOPF handler to receive
>>>>> + * nested stage/level 1 faults.
>>>>> + */
>>>>> +int vfio_iommu_dev_fault_handler_register_nested(struct device *dev)
>>>>> +{
>>>>> +	struct vfio_container *container;
>>>>> +	struct vfio_group *group;
>>>>> +	struct vfio_iommu_driver *driver;
>>>>> +	int ret;
>>>>> +
>>>>> +	if (!dev)
>>>>> +		return -EINVAL;
>>>>> +
>>>>> +	group = vfio_group_get_from_dev(dev);
>>>>> +	if (!group)
>>>>> +		return -ENODEV;
>>>>> +
>>>>> +	ret = vfio_group_add_container_user(group);
>>>>> +	if (ret)
>>>>> +		goto out;
>>>>> +
>>>>> +	container = group->container;
>>>>> +	driver = container->iommu_driver;
>>>>> +	if (likely(driver && driver->ops->register_handler))
>>>>> +		ret = driver->ops->register_handler(container->iommu_data, dev);
>>>>> +	else
>>>>> +		ret = -ENOTTY;
>>>>> +
>>>>> +	vfio_group_try_dissolve_container(group);
>>>>> +
>>>>> +out:
>>>>> +	vfio_group_put(group);
>>>>> +	return ret;
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_register_nested);
>>>>> +
>>>>> +int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev)
>>>>> +{
>>>>> +	struct vfio_container *container;
>>>>> +	struct vfio_group *group;
>>>>> +	struct vfio_iommu_driver *driver;
>>>>> +	int ret;
>>>>> +
>>>>> +	if (!dev)
>>>>> +		return -EINVAL;
>>>>> +
>>>>> +	group = vfio_group_get_from_dev(dev);
>>>>> +	if (!group)
>>>>> +		return -ENODEV;
>>>>> +
>>>>> +	ret = vfio_group_add_container_user(group);
>>>>> +	if (ret)
>>>>> +		goto out;
>>>>> +
>>>>> +	container = group->container;
>>>>> +	driver = container->iommu_driver;
>>>>> +	if (likely(driver && driver->ops->unregister_handler))
>>>>> +		ret = driver->ops->unregister_handler(container->iommu_data, dev);
>>>>> +	else
>>>>> +		ret = -ENOTTY;
>>>>> +
>>>>> +	vfio_group_try_dissolve_container(group);
>>>>> +
>>>>> +out:
>>>>> +	vfio_group_put(group);
>>>>> +	return ret;
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_unregister_nested);
>>>>> +
>>>>> +int vfio_transfer_iommu_fault(struct device *dev, struct iommu_fault *fault)
>>>>> +{
>>>>> +	struct vfio_device *device = dev_get_drvdata(dev);
>>>>> +
>>>>> +	if (unlikely(!device->ops->transfer))
>>>>> +		return -EOPNOTSUPP;
>>>>> +
>>>>> +	return device->ops->transfer(device->device_data, fault);
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(vfio_transfer_iommu_fault);
>>>>> +
>>>>>  /**
>>>>>   * Module/class support
>>>>>   */
>>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>>> index ba2b5a1cf6e9..9d1adeddb303 100644
>>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>>> @@ -3821,13 +3821,32 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
>>>>>  	struct vfio_batch batch;
>>>>>  	struct vfio_range *range;
>>>>>  	dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
>>>>> -	int access_flags = 0;
>>>>> +	int access_flags = 0, nested;
>>>>>  	size_t premap_len, map_len, mapped_len = 0;
>>>>>  	unsigned long bit_offset, vaddr, pfn, i, npages;
>>>>>  	int ret;
>>>>>  	enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
>>>>>  	struct iommu_page_response resp = {0};
>>>>>  
>>>>> +	if (vfio_dev_domian_nested(dev, &nested))
>>>>> +		return -ENODEV;
>>>>> +
>>>>> +	/*
>>>>> +	 * When configured in nested mode, further deliver the
>>>>> +	 * stage/level 1 faults to the guest.
>>>>> +	 */
>>>>> +	if (nested) {
>>>>> +		bool l2;
>>>>> +
>>>>> +		if (fault->type == IOMMU_FAULT_PAGE_REQ)
>>>>> +			l2 = fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_L2;
>>>>> +		if (fault->type == IOMMU_FAULT_DMA_UNRECOV)
>>>>> +			l2 = fault->event.flags & IOMMU_FAULT_UNRECOV_L2;
>>>>> +
>>>>> +		if (!l2)
>>>>> +			return vfio_transfer_iommu_fault(dev, fault);
>>>>> +	}
>>>>> +
>>>>>  	if (fault->type != IOMMU_FAULT_PAGE_REQ)
>>>>>  		return -EOPNOTSUPP;
>>>>>  
>>>>> @@ -4201,6 +4220,32 @@ static void vfio_iommu_type1_notify(void *iommu_data,
>>>>>  	wake_up_all(&iommu->vaddr_wait);
>>>>>  }
>>>>>  
>>>>> +static int vfio_iommu_type1_register_handler(void *iommu_data,
>>>>> +					     struct device *dev)
>>>>> +{
>>>>> +	struct vfio_iommu *iommu = iommu_data;
>>>>> +
>>>>> +	if (iommu->iopf_enabled)
>>>>> +		return iommu_update_device_fault_handler(dev, ~0,
>>>>> +						FAULT_REPORT_NESTED_L1);
>>>>> +	else
>>>>> +		return iommu_register_device_fault_handler(dev,
>>>>> +						vfio_iommu_type1_dma_map_iopf,
>>>>> +						FAULT_REPORT_NESTED_L1, dev);
>>>>> +}
>>>>> +
>>>>> +static int vfio_iommu_type1_unregister_handler(void *iommu_data,
>>>>> +					       struct device *dev)
>>>>> +{
>>>>> +	struct vfio_iommu *iommu = iommu_data;
>>>>> +
>>>>> +	if (iommu->iopf_enabled)
>>>>> +		return iommu_update_device_fault_handler(dev,
>>>>> +						~FAULT_REPORT_NESTED_L1, 0);
>>>>> +	else
>>>>> +		return iommu_unregister_device_fault_handler(dev);
>>>>> +}  
>>>>
>>>>
>>>> The path through vfio to register this is pretty ugly, but I don't see
>>>> any reason for the the update interfaces here, the previously
>>>> registered handler just changes its behavior.  
>>>
>>> Yeah, this seems not an elegant way...
>>>
>>> If IOPF(L2) enabled, the fault handler has already been registered, so for
>>> nested mode setup, we only need to change the flags of the handler in the
>>> IOMMU driver to receive L1 faults.
>>> (assume that L1 IOPF is configured after L2 IOPF)
>>>
>>> Currently each device can only have one iommu dev fault handler, and L1
>>> and L2 IOPF are configured separately in nested mode, I am also wondering
>>> that is there a better solution for this.  
> 
> I haven't fully read all the references, but what restricts us to only
> one fault handler per device?  If type1 could register one
> handler and the vfio-pci bus driver another for the other level, would
> we need this path through vfio-core?

If we could register more than one handler per device, things would become
much simpler, and the path through vfio-core would not be needed.

Hi Baolu,
Is there any restriction for having more than one handler per device?

> 
>> Let me simply add, maybe there is another way for this:
>> Would it be better to set host IOPF enabled (L2 faults) in the VFIO_IOMMU_MAP_DMA
>> ioctl (no need to add a new ioctl, and we can specify whether this mapping is IOPF
>> available or statically pinned), and set guest IOPF enabled (L1 faults) in the
>> VFIO_IOMMU_SET_PASID_TABLE (from Eric's series) ioctl?
>> And we have no requirement for the sequence of these two ioctls. The first called
>> one will register the handler, and the later one will just update the handler...
> 
> This is looking more and more like it belongs with the IOASID work.  I
> think Eric has shifted his focus there too.  Thanks,

I will pay more attention to the IOASID work.

Thanks,
Shenming

> 
> Alex
> 
> .
> 


* Re: [RFC PATCH v3 5/8] vfio/type1: VFIO_IOMMU_ENABLE_IOPF
  2021-05-24 22:11       ` Alex Williamson
@ 2021-05-27 11:15         ` Shenming Lu
  0 siblings, 0 replies; 40+ messages in thread
From: Shenming Lu @ 2021-05-27 11:15 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, Eric Auger, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, Lu Baolu,
	yi.l.liu, Christoph Hellwig, Jonathan Cameron, Barry Song,
	wanghaibin.wang, yuzenghui

On 2021/5/25 6:11, Alex Williamson wrote:
> On Fri, 21 May 2021 14:38:25 +0800
> Shenming Lu <lushenming@huawei.com> wrote:
> 
>> On 2021/5/19 2:58, Alex Williamson wrote:
>>> On Fri, 9 Apr 2021 11:44:17 +0800
>>> Shenming Lu <lushenming@huawei.com> wrote:
>>>   
>>>> Since enabling IOPF for devices may lead to a slow ramp up of performance,
>>>> we add an ioctl VFIO_IOMMU_ENABLE_IOPF to make it configurable. And the
>>>> IOPF enabling of a VFIO device includes setting IOMMU_DEV_FEAT_IOPF and
>>>> registering the VFIO IOPF handler.
>>>>
>>>> Note that VFIO_IOMMU_DISABLE_IOPF is not supported since there may be
>>>> inflight page faults when disabling.
>>>>
>>>> Signed-off-by: Shenming Lu <lushenming@huawei.com>
>>>> ---
>>>>  drivers/vfio/vfio_iommu_type1.c | 223 +++++++++++++++++++++++++++++++-
>>>>  include/uapi/linux/vfio.h       |   6 +
>>>>  2 files changed, 226 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>> index 01e296c6dc9e..7df5711e743a 100644
>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>> @@ -71,6 +71,7 @@ struct vfio_iommu {
>>>>  	struct rb_root		dma_list;
>>>>  	struct blocking_notifier_head notifier;
>>>>  	struct mmu_notifier	mn;
>>>> +	struct mm_struct	*mm;  
>>>
>>> We currently have no requirement that a single mm is used for all
>>> container mappings.  Does enabling IOPF impose such a requirement?
>>> Shouldn't MAP/UNMAP enforce that?  
>>
>> Did you mean that there is a possibility that each vfio_dma in a
>> container has a different mm? If so, could we register one MMU
>> notifier for each vfio_dma in the MAP ioctl?
> 
> We don't prevent it currently.  There needs to be some balance,
> typically we'll have one mm, so would it make sense to have potentially
> thousands of mmu notifiers registered against the same mm?

If we have one MMU notifier per vfio_dma, there is no need to search the
iommu->dma_list when receiving an address invalidation event in
mn_invalidate_range().
Or could we divide the vfio_dma ranges into parts with different mm, and
each part would have one MMU notifier?

>  
>>>>  	unsigned int		dma_avail;
>>>>  	unsigned int		vaddr_invalid_count;
>>>>  	uint64_t		pgsize_bitmap;
>>>> @@ -81,6 +82,7 @@ struct vfio_iommu {
>>>>  	bool			dirty_page_tracking;
>>>>  	bool			pinned_page_dirty_scope;
>>>>  	bool			container_open;
>>>> +	bool			iopf_enabled;
>>>>  };
>>>>  
>>>>  struct vfio_domain {
>>>> @@ -461,6 +463,38 @@ vfio_find_iopf_group(struct iommu_group *iommu_group)
>>>>  	return node ? iopf_group : NULL;
>>>>  }
>>>>  
>>>> +static void vfio_link_iopf_group(struct vfio_iopf_group *new)
>>>> +{
>>>> +	struct rb_node **link, *parent = NULL;
>>>> +	struct vfio_iopf_group *iopf_group;
>>>> +
>>>> +	mutex_lock(&iopf_group_list_lock);
>>>> +
>>>> +	link = &iopf_group_list.rb_node;
>>>> +
>>>> +	while (*link) {
>>>> +		parent = *link;
>>>> +		iopf_group = rb_entry(parent, struct vfio_iopf_group, node);
>>>> +
>>>> +		if (new->iommu_group < iopf_group->iommu_group)
>>>> +			link = &(*link)->rb_left;
>>>> +		else
>>>> +			link = &(*link)->rb_right;
>>>> +	}
>>>> +
>>>> +	rb_link_node(&new->node, parent, link);
>>>> +	rb_insert_color(&new->node, &iopf_group_list);
>>>> +
>>>> +	mutex_unlock(&iopf_group_list_lock);
>>>> +}
>>>> +
>>>> +static void vfio_unlink_iopf_group(struct vfio_iopf_group *old)
>>>> +{
>>>> +	mutex_lock(&iopf_group_list_lock);
>>>> +	rb_erase(&old->node, &iopf_group_list);
>>>> +	mutex_unlock(&iopf_group_list_lock);
>>>> +}
>>>> +
>>>>  static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>>>>  {
>>>>  	struct mm_struct *mm;
>>>> @@ -2363,6 +2397,68 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu,
>>>>  	list_splice_tail(iova_copy, iova);
>>>>  }
>>>>  
>>>> +static int vfio_dev_domian_nested(struct device *dev, int *nested)  
>>>
>>>
>>> s/domian/domain/
>>>
>>>   
>>>> +{
>>>> +	struct iommu_domain *domain;
>>>> +
>>>> +	domain = iommu_get_domain_for_dev(dev);
>>>> +	if (!domain)
>>>> +		return -ENODEV;
>>>> +
>>>> +	return iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, nested);  
>>>
>>>
>>> Wouldn't this be easier to use if it returned bool, false on -errno?  
>>
>> Make sense.
>>
>>>
>>>   
>>>> +}
>>>> +
>>>> +static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data);
>>>> +
>>>> +static int dev_enable_iopf(struct device *dev, void *data)
>>>> +{
>>>> +	int *enabled_dev_cnt = data;
>>>> +	int nested;
>>>> +	u32 flags;
>>>> +	int ret;
>>>> +
>>>> +	ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_IOPF);
>>>> +	if (ret)
>>>> +		return ret;
>>>> +
>>>> +	ret = vfio_dev_domian_nested(dev, &nested);
>>>> +	if (ret)
>>>> +		goto out_disable;
>>>> +
>>>> +	if (nested)
>>>> +		flags = FAULT_REPORT_NESTED_L2;
>>>> +	else
>>>> +		flags = FAULT_REPORT_FLAT;
>>>> +
>>>> +	ret = iommu_register_device_fault_handler(dev,
>>>> +				vfio_iommu_type1_dma_map_iopf, flags, dev);
>>>> +	if (ret)
>>>> +		goto out_disable;
>>>> +
>>>> +	(*enabled_dev_cnt)++;
>>>> +	return 0;
>>>> +
>>>> +out_disable:
>>>> +	iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF);
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +static int dev_disable_iopf(struct device *dev, void *data)
>>>> +{
>>>> +	int *enabled_dev_cnt = data;
>>>> +
>>>> +	if (enabled_dev_cnt && *enabled_dev_cnt <= 0)
>>>> +		return -1;  
>>>
>>>
>>> Use an errno.>
>>>   
>>>> +
>>>> +	WARN_ON(iommu_unregister_device_fault_handler(dev));
>>>> +	WARN_ON(iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF));
>>>> +
>>>> +	if (enabled_dev_cnt)
>>>> +		(*enabled_dev_cnt)--;  
>>>
>>>
>>> Why do we need these counters?  
>>
>> We use this counter to record the number of IOPF enabled devices, and if
>> any device fails to be enabled, we will exit the loop (three nested loop)
>> and go to the unwind process, which will disable the first @enabled_dev_cnt
>> devices.
>>
>>>   
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>>  static int vfio_iommu_type1_attach_group(void *iommu_data,
>>>>  					 struct iommu_group *iommu_group)
>>>>  {
>>>> @@ -2376,6 +2472,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>>>  	struct iommu_domain_geometry geo;
>>>>  	LIST_HEAD(iova_copy);
>>>>  	LIST_HEAD(group_resv_regions);
>>>> +	int iopf_enabled_dev_cnt = 0;
>>>> +	struct vfio_iopf_group *iopf_group = NULL;
>>>>  
>>>>  	mutex_lock(&iommu->lock);
>>>>  
>>>> @@ -2453,6 +2551,24 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>>>  	if (ret)
>>>>  		goto out_domain;
>>>>  
>>>> +	if (iommu->iopf_enabled) {
>>>> +		ret = iommu_group_for_each_dev(iommu_group, &iopf_enabled_dev_cnt,
>>>> +					       dev_enable_iopf);
>>>> +		if (ret)
>>>> +			goto out_detach;
>>>> +
>>>> +		iopf_group = kzalloc(sizeof(*iopf_group), GFP_KERNEL);
>>>> +		if (!iopf_group) {
>>>> +			ret = -ENOMEM;
>>>> +			goto out_detach;
>>>> +		}
>>>> +
>>>> +		iopf_group->iommu_group = iommu_group;
>>>> +		iopf_group->iommu = iommu;
>>>> +
>>>> +		vfio_link_iopf_group(iopf_group);  
>>>
>>> This seems backwards, once we enable iopf we can take a fault, so the
>>> structure to lookup the data for the device should be setup first.  I'm
>>> still not convinced this iopf_group rb tree is a good solution to get
>>> from the device to the iommu either.  vfio-core could traverse dev ->
>>> iommu_group -> vfio_group -> container -> iommu_data.  
>>
>> Makes sense.
>>
>>>
>>>   
>>>> +	}
>>>> +
>>>>  	/* Get aperture info */
>>>>  	iommu_domain_get_attr(domain->domain, DOMAIN_ATTR_GEOMETRY, &geo);
>>>>  
>>>> @@ -2534,9 +2650,11 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>>>  	vfio_test_domain_fgsp(domain);
>>>>  
>>>>  	/* replay mappings on new domains */
>>>> -	ret = vfio_iommu_replay(iommu, domain);
>>>> -	if (ret)
>>>> -		goto out_detach;
>>>> +	if (!iommu->iopf_enabled) {
>>>> +		ret = vfio_iommu_replay(iommu, domain);
>>>> +		if (ret)
>>>> +			goto out_detach;
>>>> +	}  
>>>
>>> Test in vfio_iommu_replay()?  
>>
>> Not yet, I will do more tests later.
>>
>>>   
>>>>  
>>>>  	if (resv_msi) {
>>>>  		ret = iommu_get_msi_cookie(domain->domain, resv_msi_base);
>>>> @@ -2567,6 +2685,15 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>>>  	iommu_domain_free(domain->domain);
>>>>  	vfio_iommu_iova_free(&iova_copy);
>>>>  	vfio_iommu_resv_free(&group_resv_regions);
>>>> +	if (iommu->iopf_enabled) {
>>>> +		if (iopf_group) {
>>>> +			vfio_unlink_iopf_group(iopf_group);
>>>> +			kfree(iopf_group);
>>>> +		}
>>>> +
>>>> +		iommu_group_for_each_dev(iommu_group, &iopf_enabled_dev_cnt,
>>>> +					 dev_disable_iopf);  
>>>
>>> Backwards, disable fault reporting before unlinking lookup.  
>>
>> Makes sense.
>>
>>>   
>>>> +	}
>>>>  out_free:
>>>>  	kfree(domain);
>>>>  	kfree(group);
>>>> @@ -2728,6 +2855,19 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
>>>>  		if (!group)
>>>>  			continue;
>>>>  
>>>> +		if (iommu->iopf_enabled) {
>>>> +			struct vfio_iopf_group *iopf_group;
>>>> +
>>>> +			iopf_group = vfio_find_iopf_group(iommu_group);
>>>> +			if (!WARN_ON(!iopf_group)) {
>>>> +				vfio_unlink_iopf_group(iopf_group);
>>>> +				kfree(iopf_group);
>>>> +			}
>>>> +
>>>> +			iommu_group_for_each_dev(iommu_group, NULL,
>>>> +						 dev_disable_iopf);  
>>>
>>>
>>> Backwards.  Also unregistering the fault handler can fail if there are
>>> pending faults.  It appears the fault handler and iopf lookup could also
>>> race with this function, ex. fault handler gets an iopf object before
>>> it's removed from the rb-tree, blocks trying to acquire iommu->lock,
>>> disable_iopf fails, detach proceeds, fault handler has use after free
>>> error.  
>>
>> We have a WARN_ON on failure of the unregistering.
>>
>> Yeah, it seems that using vfio_iopf_group is not a good choice...
>>
>>>
>>>   
>>>> +		}
>>>> +
>>>>  		vfio_iommu_detach_group(domain, group);
>>>>  		update_dirty_scope = !group->pinned_page_dirty_scope;
>>>>  		list_del(&group->next);
>>>> @@ -2846,6 +2986,11 @@ static void vfio_iommu_type1_release(void *iommu_data)
>>>>  
>>>>  	vfio_iommu_iova_free(&iommu->iova_list);
>>>>  
>>>> +	if (iommu->iopf_enabled) {
>>>> +		mmu_notifier_unregister(&iommu->mn, iommu->mm);
>>>> +		mmdrop(iommu->mm);
>>>> +	}
>>>> +
>>>>  	kfree(iommu);
>>>>  }
>>>>  
>>>> @@ -3441,6 +3586,76 @@ static const struct mmu_notifier_ops vfio_iommu_type1_mn_ops = {
>>>>  	.invalidate_range	= mn_invalidate_range,
>>>>  };
>>>>  
>>>> +static int vfio_iommu_type1_enable_iopf(struct vfio_iommu *iommu)
>>>> +{
>>>> +	struct vfio_domain *d;
>>>> +	struct vfio_group *g;
>>>> +	struct vfio_iopf_group *iopf_group;
>>>> +	int enabled_dev_cnt = 0;
>>>> +	int ret;
>>>> +
>>>> +	if (!current->mm)
>>>> +		return -ENODEV;
>>>> +
>>>> +	mutex_lock(&iommu->lock);
>>>> +
>>>> +	mmgrab(current->mm);  
>>>
>>>
>>> As above, this is a new and undocumented requirement.
>>>
>>>   
>>>> +	iommu->mm = current->mm;
>>>> +	iommu->mn.ops = &vfio_iommu_type1_mn_ops;
>>>> +	ret = mmu_notifier_register(&iommu->mn, current->mm);
>>>> +	if (ret)
>>>> +		goto out_drop;
>>>> +
>>>> +	list_for_each_entry(d, &iommu->domain_list, next) {
>>>> +		list_for_each_entry(g, &d->group_list, next) {
>>>> +			ret = iommu_group_for_each_dev(g->iommu_group,
>>>> +					&enabled_dev_cnt, dev_enable_iopf);
>>>> +			if (ret)
>>>> +				goto out_unwind;
>>>> +
>>>> +			iopf_group = kzalloc(sizeof(*iopf_group), GFP_KERNEL);
>>>> +			if (!iopf_group) {
>>>> +				ret = -ENOMEM;
>>>> +				goto out_unwind;
>>>> +			}
>>>> +
>>>> +			iopf_group->iommu_group = g->iommu_group;
>>>> +			iopf_group->iommu = iommu;
>>>> +
>>>> +			vfio_link_iopf_group(iopf_group);
>>>> +		}
>>>> +	}
>>>> +
>>>> +	iommu->iopf_enabled = true;
>>>> +	goto out_unlock;
>>>> +
>>>> +out_unwind:
>>>> +	list_for_each_entry(d, &iommu->domain_list, next) {
>>>> +		list_for_each_entry(g, &d->group_list, next) {
>>>> +			iopf_group = vfio_find_iopf_group(g->iommu_group);
>>>> +			if (iopf_group) {
>>>> +				vfio_unlink_iopf_group(iopf_group);
>>>> +				kfree(iopf_group);
>>>> +			}
>>>> +
>>>> +			if (iommu_group_for_each_dev(g->iommu_group,
>>>> +					&enabled_dev_cnt, dev_disable_iopf))
>>>> +				goto out_unregister;  
>>>
>>>
>>> This seems broken, we break out of the unwind if any group_for_each
>>> fails??  
>>
>> The iommu_group_for_each_dev() function returns a negative value once
>> all @enabled_dev_cnt previously enabled devices have been disabled.
> 
> What's the harm in calling disable IOPF on a device where it's already
> disabled?  This is why I'm not sure the value of keeping a counter.  It
> also presumes that for_each_dev iterates in the same order every time.

In the current implementation (which allows only one iommu dev fault
handler), there is no harm in calling disable IOPF on a device where it
is already disabled. But if someone else had already enabled IOPF on the
device, the IOPF enabling here would fail and we would go to the unwind
process; if we then disabled every device regardless of its state, we
might disable devices that were enabled by others.

And if we could have more than one handler per device in the future, the
iommu interfaces could be redesigned to make this harmless...

Besides, if we can't assume that for_each_dev iterates in the same order
every time, we may need to keep more than just a counter...
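For reference, the counter-driven enable/unwind pattern being discussed
can be modeled in user space like this (dev_enable()/dev_disable() are
hypothetical stand-ins for dev_enable_iopf()/dev_disable_iopf(); this is
a sketch of the idiom, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

#define NDEVS 4

static bool enabled[NDEVS];

/* Stand-in for dev_enable_iopf(): fails on a chosen device. */
static int dev_enable(int dev, int fail_at, int *cnt)
{
	if (dev == fail_at)
		return -1;
	enabled[dev] = true;
	(*cnt)++;
	return 0;
}

/* Stand-in for dev_disable_iopf(): only undoes what we enabled. */
static int dev_disable(int dev, int *cnt)
{
	if (*cnt <= 0)
		return -1;	/* all previously enabled devices are undone */
	enabled[dev] = false;
	(*cnt)--;
	return 0;
}

/* Enable IOPF on all devices; on failure, unwind the first @cnt. */
static int enable_all(int fail_at)
{
	int cnt = 0, i;

	for (i = 0; i < NDEVS; i++)
		if (dev_enable(i, fail_at, &cnt))
			goto unwind;
	return 0;
unwind:
	for (i = 0; i < NDEVS; i++)
		if (dev_disable(i, &cnt))
			break;
	return -1;
}
```

The counter makes the unwind stop after undoing exactly the devices this
path enabled, which is why disabling unconditionally could clobber
devices enabled by someone else.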

>  
>>>> +		}
>>>> +	}
>>>> +
>>>> +out_unregister:
>>>> +	mmu_notifier_unregister(&iommu->mn, current->mm);
>>>> +
>>>> +out_drop:
>>>> +	iommu->mm = NULL;
>>>> +	mmdrop(current->mm);
>>>> +
>>>> +out_unlock:
>>>> +	mutex_unlock(&iommu->lock);
>>>> +	return ret;
>>>> +}  
>>>
>>>
>>> Is there an assumption that the user does this before performing any
>>> mapping?  I don't see where current mappings would be converted which
>>> means we'll have pinned pages leaked and vfio_dma objects without a
>>> pinned page bitmap.  
>>
>> Yeah, we have an assumption that this ioctl (ENABLE_IOPF) must be called
>> before any MAP/UNMAP ioctl...
>>
>> I will document these later. :-)
> 
> 
> That can't be fixed with documentation, if it's a requirement, the code
> needs to enforce it.  But also why is it a requirement?  Theoretically
> we could unpin and de-account existing mappings and allow the mmu
> notifier to handle these as well, right?  Thanks,

Makes sense; it is not a necessary requirement.

Thanks,
Shenming

> 
> Alex
> 
> .
> 

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH v3 2/8] vfio/type1: Add a page fault handler
  2021-05-24 22:11       ` Alex Williamson
@ 2021-05-27 11:16         ` Shenming Lu
  0 siblings, 0 replies; 40+ messages in thread
From: Shenming Lu @ 2021-05-27 11:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, Eric Auger, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, Lu Baolu,
	yi.l.liu, Christoph Hellwig, Jonathan Cameron, Barry Song,
	wanghaibin.wang, yuzenghui

On 2021/5/25 6:11, Alex Williamson wrote:
> On Fri, 21 May 2021 14:38:52 +0800
> Shenming Lu <lushenming@huawei.com> wrote:
> 
>> On 2021/5/19 2:58, Alex Williamson wrote:
>>> On Fri, 9 Apr 2021 11:44:14 +0800
>>> Shenming Lu <lushenming@huawei.com> wrote:
>>>   
>>>> VFIO manages the DMA mapping itself. To support IOPF (on-demand paging)
>>>> for VFIO (IOMMU capable) devices, we add a VFIO page fault handler to
>>>> serve the reported page faults from the IOMMU driver.
>>>>
>>>> Signed-off-by: Shenming Lu <lushenming@huawei.com>
>>>> ---
>>>>  drivers/vfio/vfio_iommu_type1.c | 114 ++++++++++++++++++++++++++++++++
>>>>  1 file changed, 114 insertions(+)
>>>>
>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>> index 45cbfd4879a5..ab0ff60ee207 100644
>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>> @@ -101,6 +101,7 @@ struct vfio_dma {
>>>>  	struct task_struct	*task;
>>>>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
>>>>  	unsigned long		*bitmap;
>>>> +	unsigned long		*iopf_mapped_bitmap;
>>>>  };
>>>>  
>>>>  struct vfio_batch {
>>>> @@ -141,6 +142,16 @@ struct vfio_regions {
>>>>  	size_t len;
>>>>  };
>>>>  
>>>> +/* A global IOPF enabled group list */
>>>> +static struct rb_root iopf_group_list = RB_ROOT;
>>>> +static DEFINE_MUTEX(iopf_group_list_lock);
>>>> +
>>>> +struct vfio_iopf_group {
>>>> +	struct rb_node		node;
>>>> +	struct iommu_group	*iommu_group;
>>>> +	struct vfio_iommu	*iommu;
>>>> +};
>>>> +
>>>>  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
>>>>  					(!list_empty(&iommu->domain_list))
>>>>  
>>>> @@ -157,6 +168,10 @@ struct vfio_regions {
>>>>  #define DIRTY_BITMAP_PAGES_MAX	 ((u64)INT_MAX)
>>>>  #define DIRTY_BITMAP_SIZE_MAX	 DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
>>>>  
>>>> +#define IOPF_MAPPED_BITMAP_GET(dma, i)	\
>>>> +			      ((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG]	\
>>>> +			       >> ((i) % BITS_PER_LONG)) & 0x1)  
>>>
>>>
>>> Can't we just use test_bit()?  
>>
>> Yeah, we can use it.
>>
>>>
>>>   
>>>> +
>>>>  #define WAITED 1
>>>>  
>>>>  static int put_pfn(unsigned long pfn, int prot);
>>>> @@ -416,6 +431,34 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
>>>>  	return ret;
>>>>  }
>>>>  
>>>> +/*
>>>> + * Helper functions for iopf_group_list
>>>> + */
>>>> +static struct vfio_iopf_group *
>>>> +vfio_find_iopf_group(struct iommu_group *iommu_group)
>>>> +{
>>>> +	struct vfio_iopf_group *iopf_group;
>>>> +	struct rb_node *node;
>>>> +
>>>> +	mutex_lock(&iopf_group_list_lock);
>>>> +
>>>> +	node = iopf_group_list.rb_node;
>>>> +
>>>> +	while (node) {
>>>> +		iopf_group = rb_entry(node, struct vfio_iopf_group, node);
>>>> +
>>>> +		if (iommu_group < iopf_group->iommu_group)
>>>> +			node = node->rb_left;
>>>> +		else if (iommu_group > iopf_group->iommu_group)
>>>> +			node = node->rb_right;
>>>> +		else
>>>> +			break;
>>>> +	}
>>>> +
>>>> +	mutex_unlock(&iopf_group_list_lock);
>>>> +	return node ? iopf_group : NULL;
>>>> +}  
>>>
>>> This looks like a pretty heavyweight operation per DMA fault.
>>>
>>> I'm also suspicious of this validity of this iopf_group after we've
>>> dropped the locking, the ordering of patches makes this very confusing.  
>>
>> My thought was to include the handling of DMA faults completely in the type1
>> backend by introducing the vfio_iopf_group struct. But it seems that introducing
>> a struct with an unknown lifecycle causes more problems...
>> I will use the path from vfio-core as in the v2 for simplicity and validity.
>>
>> Sorry for the confusion, I will restructure the series later. :-)
>>
>>>   
>>>> +
>>>>  static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>>>>  {
>>>>  	struct mm_struct *mm;
>>>> @@ -3106,6 +3149,77 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
>>>>  	return -EINVAL;
>>>>  }
>>>>  
>>>> +/* VFIO I/O Page Fault handler */
>>>> +static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)  
>>>   
>>> From the comment, this seems like the IOMMU fault handler (the
>>> construction of this series makes this difficult to follow) and
>>> eventually it handles more than DMA mapping, for example transferring
>>> faults to the device driver.  "dma_map_iopf" seems like a poorly scoped
>>> name.  
>>
>> Maybe just call it dev_fault_handler?
> 
> Better.
> 
>>>> +{
>>>> +	struct device *dev = (struct device *)data;
>>>> +	struct iommu_group *iommu_group;
>>>> +	struct vfio_iopf_group *iopf_group;
>>>> +	struct vfio_iommu *iommu;
>>>> +	struct vfio_dma *dma;
>>>> +	dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
>>>> +	int access_flags = 0;
>>>> +	unsigned long bit_offset, vaddr, pfn;
>>>> +	int ret;
>>>> +	enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
>>>> +	struct iommu_page_response resp = {0};
>>>> +
>>>> +	if (fault->type != IOMMU_FAULT_PAGE_REQ)
>>>> +		return -EOPNOTSUPP;
>>>> +
>>>> +	iommu_group = iommu_group_get(dev);
>>>> +	if (!iommu_group)
>>>> +		return -ENODEV;
>>>> +
>>>> +	iopf_group = vfio_find_iopf_group(iommu_group);
>>>> +	iommu_group_put(iommu_group);
>>>> +	if (!iopf_group)
>>>> +		return -ENODEV;
>>>> +
>>>> +	iommu = iopf_group->iommu;
>>>> +
>>>> +	mutex_lock(&iommu->lock);  
>>>
>>> Again, I'm dubious of our ability to grab this lock from an object with
>>> an unknown lifecycle and races we might have with that group being
>>> detached or DMA unmapped.  Also, how effective is enabling IOMMU page
>>> faulting if we're serializing all faults within a container context?  
>>
>> Did you mean "efficient"?
> 
> Yes, that's more appropriate.
> 
>> I also worry about this, as the mapping and unmapping of the faulting
>> pages are all done under the same lock...
>> Is there a way to parallelize them? Or could we have more fine-grained
>> lock control?
> 
> It seems we need it; the current locking is designed for static
> mappings by the user, therefore concurrency hasn't been a priority.

I will try to implement it. :-)
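As a rough sketch of what finer-grained locking could look like (a
user-space pthreads model; struct dma_range and its field names are
hypothetical, standing in for a per-vfio_dma lock instead of the
container-wide iommu->lock):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical per-mapping state: one lock per DMA range, so faults
 * on different ranges no longer serialize on a container-wide lock. */
struct dma_range {
	pthread_mutex_t lock;
	unsigned long iova, size;
	int faults_handled;
};

static void handle_fault(struct dma_range *dma)
{
	pthread_mutex_lock(&dma->lock);	/* only this range is held */
	dma->faults_handled++;		/* map the faulting page here */
	pthread_mutex_unlock(&dma->lock);
}

static void *fault_thread(void *arg)
{
	struct dma_range *dma = arg;
	int i;

	for (i = 0; i < 1000; i++)
		handle_fault(dma);
	return NULL;
}
```

Faults on distinct ranges proceed in parallel while concurrent faults on
the same range stay correctly serialized; the open question in the
kernel is how such per-range locks would interact with unmap and detach.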

> This again touches how far we want to extend type1 in the direction
> we intend to go with SVA/PASID support in IOASID.  Thanks,

Reply in the cover.

Thanks,
Shenming

> 
> Alex
> 
> .
> 


* Re: [RFC PATCH v3 8/8] vfio: Add nested IOPF support
  2021-05-27 11:03           ` Shenming Lu
@ 2021-05-27 11:18             ` Lu Baolu
  2021-06-01  4:36               ` Shenming Lu
  0 siblings, 1 reply; 40+ messages in thread
From: Lu Baolu @ 2021-05-27 11:18 UTC (permalink / raw)
  To: Shenming Lu, Alex Williamson
  Cc: baolu.lu, Eric Auger, Cornelia Huck, Will Deacon, Robin Murphy,
	Joerg Roedel, Jean-Philippe Brucker, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, yi.l.liu,
	Christoph Hellwig, Jonathan Cameron, Barry Song, wanghaibin.wang,
	yuzenghui

Hi Shenming and Alex,

On 5/27/21 7:03 PM, Shenming Lu wrote:
>> I haven't fully read all the references, but who imposes the fact that
>> there's only one fault handler per device?  If type1 could register one
>> handler and the vfio-pci bus driver another for the other level, would
>> we need this path through vfio-core?
> If we could register more than one handler per device, things would become
> much more simple, and the path through vfio-core would not be needed.
> 
> Hi Baolu,
> Is there any restriction for having more than one handler per device?
> 

Currently, each device can have only one fault handler. But one device
might consume multiple page tables. From that point of view, it's more
reasonable to have one handler per page table.
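One handler per page table could be sketched along these lines (a
user-space model; the PASID-keyed table and all names here are purely
illustrative, not an existing IOMMU API):

```c
#include <assert.h>
#include <stddef.h>

typedef int (*fault_handler_t)(void *fault, void *data);

#define MAX_PASIDS 8

/* Hypothetical per-device table: one handler slot per page table
 * (keyed by PASID) instead of a single handler per device. */
struct dev_fault_table {
	fault_handler_t handler[MAX_PASIDS];
	void *data[MAX_PASIDS];
};

static int register_handler(struct dev_fault_table *t, unsigned int pasid,
			    fault_handler_t h, void *data)
{
	if (pasid >= MAX_PASIDS || t->handler[pasid])
		return -1;	/* slot taken: -EBUSY in kernel terms */
	t->handler[pasid] = h;
	t->data[pasid] = data;
	return 0;
}

static int report_fault(struct dev_fault_table *t, unsigned int pasid,
			void *fault)
{
	if (pasid >= MAX_PASIDS || !t->handler[pasid])
		return -1;	/* no handler for this page table */
	return t->handler[pasid](fault, t->data[pasid]);
}
```

With such a scheme, type1 and a vfio bus driver could each own the
faults of a different page table on the same device without routing
everything through one handler.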

Best regards,
baolu


* Re: [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough
  2021-05-24 22:11     ` Alex Williamson
@ 2021-05-27 11:25       ` Shenming Lu
  0 siblings, 0 replies; 40+ messages in thread
From: Shenming Lu @ 2021-05-27 11:25 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Will Deacon, Robin Murphy, Joerg Roedel,
	Jean-Philippe Brucker, Eric Auger, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, Lu Baolu,
	yi.l.liu, Christoph Hellwig, Jonathan Cameron, Barry Song,
	wanghaibin.wang, yuzenghui

On 2021/5/25 6:11, Alex Williamson wrote:
> On Fri, 21 May 2021 14:37:21 +0800
> Shenming Lu <lushenming@huawei.com> wrote:
> 
>> Hi Alex,
>>
>> On 2021/5/19 2:57, Alex Williamson wrote:
>>> On Fri, 9 Apr 2021 11:44:12 +0800
>>> Shenming Lu <lushenming@huawei.com> wrote:
>>>   
>>>> Hi,
>>>>
>>>> Requesting for your comments and suggestions. :-)
>>>>
>>>> The static pinning and mapping problem in VFIO and possible solutions
>>>> have been discussed a lot [1, 2]. One of the solutions is to add I/O
>>>> Page Fault support for VFIO devices. Different from those relatively
>>>> complicated software approaches such as presenting a vIOMMU that provides
>>>> the DMA buffer information (might include para-virtualized optimizations),
>>>> IOPF mainly depends on the hardware faulting capability, such as the PCIe
>>>> PRI extension or Arm SMMU stall model. What's more, the IOPF support in
>>>> the IOMMU driver has already been implemented in SVA [3]. So we add IOPF
>>>> support for VFIO passthrough based on the IOPF part of SVA in this series.  
>>>
>>> The SVA proposals are being reworked to make use of a new IOASID
>>> object, it's not clear to me that this shouldn't also make use of that
>>> work as it does a significant expansion of the type1 IOMMU with fault
>>> handlers that would duplicate new work using that new model.  
>>
>> It seems that the IOASID extension for guest SVA would not affect this series,
>> will we do any host-guest IOASID translation in the VFIO fault handler?
> 
> Surely it will, we don't currently have any IOMMU fault handling or
> forwarding of IOMMU faults through to the vfio bus driver, both of
> those would be included in an IOASID implementation.  I think Jason's
> vision is to use IOASID to deprecate type1 for all use cases, so even
> if we were to choose to implement IOPF in type1 we should agree on
> common interfaces with IOASID.

Yeah, the guest IOPF (L1) handling may include the host-guest IOASID
translation, which can be placed in the IOASID layer (in fact it could
live in many places, such as the vfio-pci driver, since that driver
doesn't really handle the fault event, it just forwards it to the
vIOMMU).
But the host IOPF (L2) has no relationship with IOASID at all; it needs
knowledge of the vfio_dma ranges.
Could we add the host IOPF support to type1 first (integrated within the
MAP ioctl)?
And we may migrate the generic iommu controls (such as MAP/UNMAP...) from
type1 to IOASID in the future (it seems to be a huge amount of work; I
would be very happy to help with this)... :-)

>  
>>>> We have measured its performance with UADK [4] (passthrough an accelerator
>>>> to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA):
>>>>
>>>> Run hisi_sec_test...
>>>>  - with varying sending times and message lengths
>>>>  - with/without IOPF enabled (speed slowdown)
>>>>
>>>> when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1):
>>>>             slowdown (num of faults)
>>>>  times      VFIO IOPF      host SVA
>>>>  1          63.4% (518)    82.8% (512)
>>>>  100        22.9% (1058)   47.9% (1024)
>>>>  1000       2.6% (1071)    8.5% (1024)
>>>>
>>>> when msg_len = 10MB (and PREMAP_LEN = 512):
>>>>             slowdown (num of faults)
>>>>  times      VFIO IOPF
>>>>  1          32.6% (13)
>>>>  100        3.5% (26)
>>>>  1000       1.6% (26)  
>>>
>>> It seems like this is only an example that you can make a benchmark
>>> show anything you want.  The best results would be to pre-map
>>> everything, which is what we have without this series.  What is an
>>> acceptable overhead to incur to avoid page pinning?  What if userspace
>>> had more fine grained control over which mappings were available for
>>> faulting and which were statically mapped?  I don't really see what
>>> sense the pre-mapping range makes.  If I assume the user is QEMU in a
>>> non-vIOMMU configuration, pre-mapping the beginning of each RAM section
>>> doesn't make any logical sense relative to device DMA.  
>>
>> As you said in Patch 4, we can introduce full end-to-end functionality
>> before trying to improve performance, and I will drop the pre-mapping patch
>> in the current stage...
>>
>> Is there a need that userspace wants more fine grained control over which
>> mappings are available for faulting? If so, we may evolve the MAP ioctl
>> to support for specifying the faulting range.
> 
> You're essentially enabling this for a vfio bus driver via patch 7/8,
> pinning for selective DMA faulting.  How would a driver in userspace
> make equivalent requests?  In the case of performance, the user could
> mlock the page but they have no mechanism here to pre-fault it.  Should
> they?

Makes sense.

It seems that we should additionally iommu_map the IOPF-enabled pages
that are pinned in vfio_iommu_type1_pin_pages(), and then there is no
need to add more tracking structures in Patch 7...

> 
>> As for the overhead of IOPF, it is unavoidable if enabling on-demand paging
>> (and page faults occur almost only when first accessing), and it seems that
>> there is a large optimization space compared to CPU page faulting.
> 
> Yes, there's of course going to be overhead in terms of latency for the
> page faults.  My point was more that when a host is not under memory
> pressure we should trend towards the performance of pinned, static
> mappings and we should be able to create arbitrarily large pre-fault
> behavior to show that.

Makes sense.

> But I think what we really want to enable via IOPF is density, right?

Density? Did you mean the proportion of IOPF-enabled mappings?

> Therefore how many more assigned device guests
> can you run on a host with IOPF?  How does the slope, plateau, and
> inflection point of their aggregate throughput compare to static
> pinning?  VM startup time is probably also a useful metric, i.e. trading
> device latency for startup latency.  Thanks,

Yeah, these are what we have to consider and test later. :-)
And the slope, plateau, and inflection point of the aggregate throughput
may depend on the specific device driver's behavior (such as whether it
reuses the DMA buffer)...

Thanks,
Shenming

> 
> Alex
> 
> .
> 


* Re: [RFC PATCH v3 8/8] vfio: Add nested IOPF support
  2021-05-27 11:18             ` Lu Baolu
@ 2021-06-01  4:36               ` Shenming Lu
  0 siblings, 0 replies; 40+ messages in thread
From: Shenming Lu @ 2021-06-01  4:36 UTC (permalink / raw)
  To: Lu Baolu, Alex Williamson
  Cc: Eric Auger, Cornelia Huck, Will Deacon, Robin Murphy,
	Joerg Roedel, Jean-Philippe Brucker, kvm, linux-kernel,
	linux-arm-kernel, iommu, linux-api, Kevin Tian, yi.l.liu,
	Christoph Hellwig, Jonathan Cameron, Barry Song, wanghaibin.wang,
	yuzenghui

On 2021/5/27 19:18, Lu Baolu wrote:
> Hi Shenming and Alex,
> 
> On 5/27/21 7:03 PM, Shenming Lu wrote:
>>> I haven't fully read all the references, but who imposes the fact that
>>> there's only one fault handler per device?  If type1 could register one
>>> handler and the vfio-pci bus driver another for the other level, would
>>> we need this path through vfio-core?
>> If we could register more than one handler per device, things would become
>> much more simple, and the path through vfio-core would not be needed.
>>
>> Hi Baolu,
>> Is there any restriction for having more than one handler per device?
>>
> 
> Currently, each device could only have one fault handler. But one device
> might consume multiple page tables. From this point of view, it's more
> reasonable to have one handler per page table.

Sounds good to me. I have pointed it out in the IOASID uAPI proposal. :-)

Thanks,
Shenming

> 
> Best regards,
> baolu
> .


end of thread, other threads:[~2021-06-01  4:38 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-09  3:44 [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough Shenming Lu
2021-04-09  3:44 ` [RFC PATCH v3 1/8] iommu: Evolve the device fault reporting framework Shenming Lu
2021-05-18 18:58   ` Alex Williamson
2021-05-21  6:37     ` Shenming Lu
2021-04-09  3:44 ` [RFC PATCH v3 2/8] vfio/type1: Add a page fault handler Shenming Lu
2021-05-18 18:58   ` Alex Williamson
2021-05-21  6:38     ` Shenming Lu
2021-05-24 22:11       ` Alex Williamson
2021-05-27 11:16         ` Shenming Lu
2021-04-09  3:44 ` [RFC PATCH v3 3/8] vfio/type1: Add an MMU notifier to avoid pinning Shenming Lu
2021-05-18 18:58   ` Alex Williamson
2021-05-21  6:37     ` Shenming Lu
2021-04-09  3:44 ` [RFC PATCH v3 4/8] vfio/type1: Pre-map more pages than requested in the IOPF handling Shenming Lu
2021-05-18 18:58   ` Alex Williamson
2021-05-21  6:37     ` Shenming Lu
2021-04-09  3:44 ` [RFC PATCH v3 5/8] vfio/type1: VFIO_IOMMU_ENABLE_IOPF Shenming Lu
2021-05-18 18:58   ` Alex Williamson
2021-05-21  6:38     ` Shenming Lu
2021-05-24 22:11       ` Alex Williamson
2021-05-27 11:15         ` Shenming Lu
2021-04-09  3:44 ` [RFC PATCH v3 6/8] vfio/type1: No need to statically pin and map if IOPF enabled Shenming Lu
2021-05-18 18:58   ` Alex Williamson
2021-05-21  6:39     ` Shenming Lu
2021-04-09  3:44 ` [RFC PATCH v3 7/8] vfio/type1: Add selective DMA faulting support Shenming Lu
2021-05-18 18:58   ` Alex Williamson
2021-05-21  6:39     ` Shenming Lu
2021-04-09  3:44 ` [RFC PATCH v3 8/8] vfio: Add nested IOPF support Shenming Lu
2021-05-18 18:58   ` Alex Williamson
2021-05-21  7:59     ` Shenming Lu
2021-05-24 13:11       ` Shenming Lu
2021-05-24 22:11         ` Alex Williamson
2021-05-27 11:03           ` Shenming Lu
2021-05-27 11:18             ` Lu Baolu
2021-06-01  4:36               ` Shenming Lu
2021-04-26  1:41 ` [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough Shenming Lu
2021-05-11 11:30   ` Shenming Lu
2021-05-18 18:57 ` Alex Williamson
2021-05-21  6:37   ` Shenming Lu
2021-05-24 22:11     ` Alex Williamson
2021-05-27 11:25       ` Shenming Lu
