iommu.lists.linux-foundation.org archive mirror
* [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space
@ 2023-05-30  5:37 Lu Baolu
  2023-05-30  5:37 ` [RFC PATCHES 01/17] iommu: Move iommu fault data to linux/iommu.h Lu Baolu
                   ` (19 more replies)
  0 siblings, 20 replies; 37+ messages in thread
From: Lu Baolu @ 2023-05-30  5:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Nicolin Chen, Yi Liu,
	Jacob Pan
  Cc: iommu, linux-kselftest, virtualization, linux-kernel, Lu Baolu

Hi folks,

This series implements the functionality of delivering IO page faults to
user space through the IOMMUFD framework. The use case is nested
translation, where modern IOMMU hardware supports two-stage translation
tables. The second-stage translation table is managed by the host VMM,
while the first-stage translation table is owned by user space. Hence,
any IO page fault that occurs on the first-stage page table should be
delivered to user space and handled there. User space should then report
the page fault handling result back to the device through the IOMMUFD
response uAPI.

User space indicates its capability of handling IO page faults by setting
the user HWPT allocation flag IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE. IOMMUFD
will then set up its infrastructure for page fault delivery. Together
with the iopf-capable flag, user space should also provide an eventfd
on which it will listen for any bottom-up page fault messages.

On successful allocation of an iopf-capable HWPT, a fault fd will be
returned. User space can read fault messages from it once the eventfd
is signaled.

Besides the overall design, I'd like to hear comments on the following
design points:

- The IOMMUFD fault message format. It is very similar to the one in
  uapi/linux/iommu.h, which has been discussed before and is partially
  used by the IOMMU SVA implementation. I'd like to get more comments on
  the format when it comes to IOMMUFD.

- The timeout value for pending page fault messages. Ideally we should
  derive the timeout value from the device configuration, but I failed
  to find any statement in the PCI specification (version 6.x). A
  default of 100 milliseconds is used in the implementation, but it
  leaves room to grow the code toward a per-device setting.

This series is for review purposes only. I used the IOMMUFD selftest to
verify hwpt allocation, attach/detach and replace, but I haven't had a
chance to run it on real hardware yet. I will do more testing in
subsequent versions once I am confident that I am heading in the right
direction.

This series is based on the latest implementation of the nested
translation under discussion. The whole series and related patches are
available on github:

https://github.com/LuBaolu/intel-iommu/commits/iommufd-io-pgfault-delivery-v1

Best regards,
baolu

Lu Baolu (17):
  iommu: Move iommu fault data to linux/iommu.h
  iommu: Support asynchronous I/O page fault response
  iommu: Add helper to set iopf handler for domain
  iommu: Pass device parameter to iopf handler
  iommu: Split IO page fault handling from SVA
  iommu: Add iommu page fault cookie helpers
  iommufd: Add iommu page fault data
  iommufd: IO page fault delivery initialization and release
  iommufd: Add iommufd hwpt iopf handler
  iommufd: Add IOMMU_HWPT_ALLOC_FLAGS_USER_PASID_TABLE for hwpt_alloc
  iommufd: Deliver fault messages to user space
  iommufd: Add io page fault response support
  iommufd: Add a timer for each iommufd fault data
  iommufd: Drain all pending faults when destroying hwpt
  iommufd: Allow new hwpt_alloc flags
  iommufd/selftest: Add IOPF feature for mock devices
  iommufd/selftest: Cover iopf-capable nested hwpt

 include/linux/iommu.h                         | 175 +++++++++-
 drivers/iommu/{iommu-sva.h => io-pgfault.h}   |  25 +-
 drivers/iommu/iommu-priv.h                    |   3 +
 drivers/iommu/iommufd/iommufd_private.h       |  32 ++
 include/uapi/linux/iommu.h                    | 161 ---------
 include/uapi/linux/iommufd.h                  |  73 +++-
 tools/testing/selftests/iommu/iommufd_utils.h |  20 +-
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |   2 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |   2 +-
 drivers/iommu/intel/iommu.c                   |   2 +-
 drivers/iommu/intel/svm.c                     |   2 +-
 drivers/iommu/io-pgfault.c                    |   7 +-
 drivers/iommu/iommu-sva.c                     |   4 +-
 drivers/iommu/iommu.c                         |  50 ++-
 drivers/iommu/iommufd/device.c                |  64 +++-
 drivers/iommu/iommufd/hw_pagetable.c          | 318 +++++++++++++++++-
 drivers/iommu/iommufd/main.c                  |   3 +
 drivers/iommu/iommufd/selftest.c              |  71 ++++
 tools/testing/selftests/iommu/iommufd.c       |  17 +-
 MAINTAINERS                                   |   1 -
 drivers/iommu/Kconfig                         |   4 +
 drivers/iommu/Makefile                        |   3 +-
 drivers/iommu/intel/Kconfig                   |   1 +
 23 files changed, 837 insertions(+), 203 deletions(-)
 rename drivers/iommu/{iommu-sva.h => io-pgfault.h} (71%)
 delete mode 100644 include/uapi/linux/iommu.h

-- 
2.34.1



* [RFC PATCHES 01/17] iommu: Move iommu fault data to linux/iommu.h
  2023-05-30  5:37 [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Lu Baolu
@ 2023-05-30  5:37 ` Lu Baolu
  2023-05-30  5:37 ` [RFC PATCHES 02/17] iommu: Support asynchronous I/O page fault response Lu Baolu
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Lu Baolu @ 2023-05-30  5:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Nicolin Chen, Yi Liu,
	Jacob Pan
  Cc: iommu, linux-kselftest, virtualization, linux-kernel, Lu Baolu

The iommu fault data is currently defined in uapi/linux/iommu.h, but is
only used inside the iommu subsystem. Move it to linux/iommu.h, where it
will be more accessible to kernel drivers.

With this done, uapi/linux/iommu.h becomes empty and can be removed from
the tree. And we can further discuss how to define the iommu fault data
that iommufd could use to route the faults to user space and handle the
fault response if needed.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 include/linux/iommu.h      | 151 +++++++++++++++++++++++++++++++++-
 include/uapi/linux/iommu.h | 161 -------------------------------------
 MAINTAINERS                |   1 -
 3 files changed, 150 insertions(+), 163 deletions(-)
 delete mode 100644 include/uapi/linux/iommu.h

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 8ff1bb3a4e1a..d6a93de7d1dd 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -13,7 +13,6 @@
 #include <linux/errno.h>
 #include <linux/err.h>
 #include <linux/of.h>
-#include <uapi/linux/iommu.h>
 #include <uapi/linux/iommufd.h>
 
 #define IOMMU_READ	(1 << 0)
@@ -43,6 +42,156 @@ struct iommu_sva;
 struct iommu_fault_event;
 struct iommu_dma_cookie;
 
+#define IOMMU_FAULT_PERM_READ	(1 << 0) /* read */
+#define IOMMU_FAULT_PERM_WRITE	(1 << 1) /* write */
+#define IOMMU_FAULT_PERM_EXEC	(1 << 2) /* exec */
+#define IOMMU_FAULT_PERM_PRIV	(1 << 3) /* privileged */
+
+/* Generic fault types, can be expanded IRQ remapping fault */
+enum iommu_fault_type {
+	IOMMU_FAULT_DMA_UNRECOV = 1,	/* unrecoverable fault */
+	IOMMU_FAULT_PAGE_REQ,		/* page request fault */
+};
+
+enum iommu_fault_reason {
+	IOMMU_FAULT_REASON_UNKNOWN = 0,
+
+	/* Could not access the PASID table (fetch caused external abort) */
+	IOMMU_FAULT_REASON_PASID_FETCH,
+
+	/* PASID entry is invalid or has configuration errors */
+	IOMMU_FAULT_REASON_BAD_PASID_ENTRY,
+
+	/*
+	 * PASID is out of range (e.g. exceeds the maximum PASID
+	 * supported by the IOMMU) or disabled.
+	 */
+	IOMMU_FAULT_REASON_PASID_INVALID,
+
+	/*
+	 * An external abort occurred fetching (or updating) a translation
+	 * table descriptor
+	 */
+	IOMMU_FAULT_REASON_WALK_EABT,
+
+	/*
+	 * Could not access the page table entry (Bad address),
+	 * actual translation fault
+	 */
+	IOMMU_FAULT_REASON_PTE_FETCH,
+
+	/* Protection flag check failed */
+	IOMMU_FAULT_REASON_PERMISSION,
+
+	/* access flag check failed */
+	IOMMU_FAULT_REASON_ACCESS,
+
+	/* Output address of a translation stage caused Address Size fault */
+	IOMMU_FAULT_REASON_OOR_ADDRESS,
+};
+
+/**
+ * struct iommu_fault_unrecoverable - Unrecoverable fault data
+ * @reason: reason of the fault, from &enum iommu_fault_reason
+ * @flags: parameters of this fault (IOMMU_FAULT_UNRECOV_* values)
+ * @pasid: Process Address Space ID
+ * @perm: requested permission access using by the incoming transaction
+ *        (IOMMU_FAULT_PERM_* values)
+ * @addr: offending page address
+ * @fetch_addr: address that caused a fetch abort, if any
+ */
+struct iommu_fault_unrecoverable {
+	__u32	reason;
+#define IOMMU_FAULT_UNRECOV_PASID_VALID		(1 << 0)
+#define IOMMU_FAULT_UNRECOV_ADDR_VALID		(1 << 1)
+#define IOMMU_FAULT_UNRECOV_FETCH_ADDR_VALID	(1 << 2)
+	__u32	flags;
+	__u32	pasid;
+	__u32	perm;
+	__u64	addr;
+	__u64	fetch_addr;
+};
+
+/**
+ * struct iommu_fault_page_request - Page Request data
+ * @flags: encodes whether the corresponding fields are valid and whether this
+ *         is the last page in group (IOMMU_FAULT_PAGE_REQUEST_* values).
+ *         When IOMMU_FAULT_PAGE_RESPONSE_NEEDS_PASID is set, the page response
+ *         must have the same PASID value as the page request. When it is clear,
+ *         the page response should not have a PASID.
+ * @pasid: Process Address Space ID
+ * @grpid: Page Request Group Index
+ * @perm: requested page permissions (IOMMU_FAULT_PERM_* values)
+ * @addr: page address
+ * @private_data: device-specific private information
+ */
+struct iommu_fault_page_request {
+#define IOMMU_FAULT_PAGE_REQUEST_PASID_VALID	(1 << 0)
+#define IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE	(1 << 1)
+#define IOMMU_FAULT_PAGE_REQUEST_PRIV_DATA	(1 << 2)
+#define IOMMU_FAULT_PAGE_RESPONSE_NEEDS_PASID	(1 << 3)
+	__u32	flags;
+	__u32	pasid;
+	__u32	grpid;
+	__u32	perm;
+	__u64	addr;
+	__u64	private_data[2];
+};
+
+/**
+ * struct iommu_fault - Generic fault data
+ * @type: fault type from &enum iommu_fault_type
+ * @padding: reserved for future use (should be zero)
+ * @event: fault event, when @type is %IOMMU_FAULT_DMA_UNRECOV
+ * @prm: Page Request message, when @type is %IOMMU_FAULT_PAGE_REQ
+ * @padding2: sets the fault size to allow for future extensions
+ */
+struct iommu_fault {
+	__u32	type;
+	__u32	padding;
+	union {
+		struct iommu_fault_unrecoverable event;
+		struct iommu_fault_page_request prm;
+		__u8 padding2[56];
+	};
+};
+
+/**
+ * enum iommu_page_response_code - Return status of fault handlers
+ * @IOMMU_PAGE_RESP_SUCCESS: Fault has been handled and the page tables
+ *	populated, retry the access. This is "Success" in PCI PRI.
+ * @IOMMU_PAGE_RESP_FAILURE: General error. Drop all subsequent faults from
+ *	this device if possible. This is "Response Failure" in PCI PRI.
+ * @IOMMU_PAGE_RESP_INVALID: Could not handle this fault, don't retry the
+ *	access. This is "Invalid Request" in PCI PRI.
+ */
+enum iommu_page_response_code {
+	IOMMU_PAGE_RESP_SUCCESS = 0,
+	IOMMU_PAGE_RESP_INVALID,
+	IOMMU_PAGE_RESP_FAILURE,
+};
+
+/**
+ * struct iommu_page_response - Generic page response information
+ * @argsz: User filled size of this data
+ * @version: API version of this structure
+ * @flags: encodes whether the corresponding fields are valid
+ *         (IOMMU_FAULT_PAGE_RESPONSE_* values)
+ * @pasid: Process Address Space ID
+ * @grpid: Page Request Group Index
+ * @code: response code from &enum iommu_page_response_code
+ */
+struct iommu_page_response {
+	__u32	argsz;
+#define IOMMU_PAGE_RESP_VERSION_1	1
+	__u32	version;
+#define IOMMU_PAGE_RESP_PASID_VALID	(1 << 0)
+	__u32	flags;
+	__u32	pasid;
+	__u32	grpid;
+	__u32	code;
+};
+
 /* iommu fault flags */
 #define IOMMU_FAULT_READ	0x0
 #define IOMMU_FAULT_WRITE	0x1
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
deleted file mode 100644
index 65d8b0234f69..000000000000
--- a/include/uapi/linux/iommu.h
+++ /dev/null
@@ -1,161 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
-/*
- * IOMMU user API definitions
- */
-
-#ifndef _UAPI_IOMMU_H
-#define _UAPI_IOMMU_H
-
-#include <linux/types.h>
-
-#define IOMMU_FAULT_PERM_READ	(1 << 0) /* read */
-#define IOMMU_FAULT_PERM_WRITE	(1 << 1) /* write */
-#define IOMMU_FAULT_PERM_EXEC	(1 << 2) /* exec */
-#define IOMMU_FAULT_PERM_PRIV	(1 << 3) /* privileged */
-
-/* Generic fault types, can be expanded IRQ remapping fault */
-enum iommu_fault_type {
-	IOMMU_FAULT_DMA_UNRECOV = 1,	/* unrecoverable fault */
-	IOMMU_FAULT_PAGE_REQ,		/* page request fault */
-};
-
-enum iommu_fault_reason {
-	IOMMU_FAULT_REASON_UNKNOWN = 0,
-
-	/* Could not access the PASID table (fetch caused external abort) */
-	IOMMU_FAULT_REASON_PASID_FETCH,
-
-	/* PASID entry is invalid or has configuration errors */
-	IOMMU_FAULT_REASON_BAD_PASID_ENTRY,
-
-	/*
-	 * PASID is out of range (e.g. exceeds the maximum PASID
-	 * supported by the IOMMU) or disabled.
-	 */
-	IOMMU_FAULT_REASON_PASID_INVALID,
-
-	/*
-	 * An external abort occurred fetching (or updating) a translation
-	 * table descriptor
-	 */
-	IOMMU_FAULT_REASON_WALK_EABT,
-
-	/*
-	 * Could not access the page table entry (Bad address),
-	 * actual translation fault
-	 */
-	IOMMU_FAULT_REASON_PTE_FETCH,
-
-	/* Protection flag check failed */
-	IOMMU_FAULT_REASON_PERMISSION,
-
-	/* access flag check failed */
-	IOMMU_FAULT_REASON_ACCESS,
-
-	/* Output address of a translation stage caused Address Size fault */
-	IOMMU_FAULT_REASON_OOR_ADDRESS,
-};
-
-/**
- * struct iommu_fault_unrecoverable - Unrecoverable fault data
- * @reason: reason of the fault, from &enum iommu_fault_reason
- * @flags: parameters of this fault (IOMMU_FAULT_UNRECOV_* values)
- * @pasid: Process Address Space ID
- * @perm: requested permission access using by the incoming transaction
- *        (IOMMU_FAULT_PERM_* values)
- * @addr: offending page address
- * @fetch_addr: address that caused a fetch abort, if any
- */
-struct iommu_fault_unrecoverable {
-	__u32	reason;
-#define IOMMU_FAULT_UNRECOV_PASID_VALID		(1 << 0)
-#define IOMMU_FAULT_UNRECOV_ADDR_VALID		(1 << 1)
-#define IOMMU_FAULT_UNRECOV_FETCH_ADDR_VALID	(1 << 2)
-	__u32	flags;
-	__u32	pasid;
-	__u32	perm;
-	__u64	addr;
-	__u64	fetch_addr;
-};
-
-/**
- * struct iommu_fault_page_request - Page Request data
- * @flags: encodes whether the corresponding fields are valid and whether this
- *         is the last page in group (IOMMU_FAULT_PAGE_REQUEST_* values).
- *         When IOMMU_FAULT_PAGE_RESPONSE_NEEDS_PASID is set, the page response
- *         must have the same PASID value as the page request. When it is clear,
- *         the page response should not have a PASID.
- * @pasid: Process Address Space ID
- * @grpid: Page Request Group Index
- * @perm: requested page permissions (IOMMU_FAULT_PERM_* values)
- * @addr: page address
- * @private_data: device-specific private information
- */
-struct iommu_fault_page_request {
-#define IOMMU_FAULT_PAGE_REQUEST_PASID_VALID	(1 << 0)
-#define IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE	(1 << 1)
-#define IOMMU_FAULT_PAGE_REQUEST_PRIV_DATA	(1 << 2)
-#define IOMMU_FAULT_PAGE_RESPONSE_NEEDS_PASID	(1 << 3)
-	__u32	flags;
-	__u32	pasid;
-	__u32	grpid;
-	__u32	perm;
-	__u64	addr;
-	__u64	private_data[2];
-};
-
-/**
- * struct iommu_fault - Generic fault data
- * @type: fault type from &enum iommu_fault_type
- * @padding: reserved for future use (should be zero)
- * @event: fault event, when @type is %IOMMU_FAULT_DMA_UNRECOV
- * @prm: Page Request message, when @type is %IOMMU_FAULT_PAGE_REQ
- * @padding2: sets the fault size to allow for future extensions
- */
-struct iommu_fault {
-	__u32	type;
-	__u32	padding;
-	union {
-		struct iommu_fault_unrecoverable event;
-		struct iommu_fault_page_request prm;
-		__u8 padding2[56];
-	};
-};
-
-/**
- * enum iommu_page_response_code - Return status of fault handlers
- * @IOMMU_PAGE_RESP_SUCCESS: Fault has been handled and the page tables
- *	populated, retry the access. This is "Success" in PCI PRI.
- * @IOMMU_PAGE_RESP_FAILURE: General error. Drop all subsequent faults from
- *	this device if possible. This is "Response Failure" in PCI PRI.
- * @IOMMU_PAGE_RESP_INVALID: Could not handle this fault, don't retry the
- *	access. This is "Invalid Request" in PCI PRI.
- */
-enum iommu_page_response_code {
-	IOMMU_PAGE_RESP_SUCCESS = 0,
-	IOMMU_PAGE_RESP_INVALID,
-	IOMMU_PAGE_RESP_FAILURE,
-};
-
-/**
- * struct iommu_page_response - Generic page response information
- * @argsz: User filled size of this data
- * @version: API version of this structure
- * @flags: encodes whether the corresponding fields are valid
- *         (IOMMU_FAULT_PAGE_RESPONSE_* values)
- * @pasid: Process Address Space ID
- * @grpid: Page Request Group Index
- * @code: response code from &enum iommu_page_response_code
- */
-struct iommu_page_response {
-	__u32	argsz;
-#define IOMMU_PAGE_RESP_VERSION_1	1
-	__u32	version;
-#define IOMMU_PAGE_RESP_PASID_VALID	(1 << 0)
-	__u32	flags;
-	__u32	pasid;
-	__u32	grpid;
-	__u32	code;
-};
-
-#endif /* _UAPI_IOMMU_H */
diff --git a/MAINTAINERS b/MAINTAINERS
index 7e0b87d5aa2e..5f0bb02cfbb3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10837,7 +10837,6 @@ F:	drivers/iommu/
 F:	include/linux/iommu.h
 F:	include/linux/iova.h
 F:	include/linux/of_iommu.h
-F:	include/uapi/linux/iommu.h
 
 IOSYS-MAP HELPERS
 M:	Thomas Zimmermann <tzimmermann@suse.de>
-- 
2.34.1



* [RFC PATCHES 02/17] iommu: Support asynchronous I/O page fault response
  2023-05-30  5:37 [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Lu Baolu
  2023-05-30  5:37 ` [RFC PATCHES 01/17] iommu: Move iommu fault data to linux/iommu.h Lu Baolu
@ 2023-05-30  5:37 ` Lu Baolu
  2023-05-30  5:37 ` [RFC PATCHES 03/17] iommu: Add helper to set iopf handler for domain Lu Baolu
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Lu Baolu @ 2023-05-30  5:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Nicolin Chen, Yi Liu,
	Jacob Pan
  Cc: iommu, linux-kselftest, virtualization, linux-kernel, Lu Baolu

Add a new page response code, IOMMU_PAGE_RESP_ASYNC, to indicate that the
domain's page fault handler doesn't respond to the hardware immediately,
but does so asynchronously.

The use case for this response code is nested translation, where the
first-stage page table is owned by the VM guest: any page fault on it
should be propagated to the VM guest, and the fault will be responded to
from a different thread context later.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 include/linux/iommu.h      | 2 ++
 drivers/iommu/io-pgfault.c | 3 ++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index d6a93de7d1dd..fce7ad81206f 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -164,11 +164,13 @@ struct iommu_fault {
  *	this device if possible. This is "Response Failure" in PCI PRI.
  * @IOMMU_PAGE_RESP_INVALID: Could not handle this fault, don't retry the
  *	access. This is "Invalid Request" in PCI PRI.
+ * @IOMMU_PAGE_RESP_ASYNC: Will respond later by calling iommu_page_response().
  */
 enum iommu_page_response_code {
 	IOMMU_PAGE_RESP_SUCCESS = 0,
 	IOMMU_PAGE_RESP_INVALID,
 	IOMMU_PAGE_RESP_FAILURE,
+	IOMMU_PAGE_RESP_ASYNC,
 };
 
 /**
diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c
index e5b8b9110c13..83f8055a0e09 100644
--- a/drivers/iommu/io-pgfault.c
+++ b/drivers/iommu/io-pgfault.c
@@ -96,7 +96,8 @@ static void iopf_handler(struct work_struct *work)
 			kfree(iopf);
 	}
 
-	iopf_complete_group(group->dev, &group->last_fault, status);
+	if (status != IOMMU_PAGE_RESP_ASYNC)
+		iopf_complete_group(group->dev, &group->last_fault, status);
 	kfree(group);
 }
 
-- 
2.34.1



* [RFC PATCHES 03/17] iommu: Add helper to set iopf handler for domain
  2023-05-30  5:37 [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Lu Baolu
  2023-05-30  5:37 ` [RFC PATCHES 01/17] iommu: Move iommu fault data to linux/iommu.h Lu Baolu
  2023-05-30  5:37 ` [RFC PATCHES 02/17] iommu: Support asynchronous I/O page fault response Lu Baolu
@ 2023-05-30  5:37 ` Lu Baolu
  2023-05-30  5:37 ` [RFC PATCHES 04/17] iommu: Pass device parameter to iopf handler Lu Baolu
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Lu Baolu @ 2023-05-30  5:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Nicolin Chen, Yi Liu,
	Jacob Pan
  Cc: iommu, linux-kselftest, virtualization, linux-kernel, Lu Baolu

Add a helper so that callers don't open code the assignments everywhere.
No intentional functionality change.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 include/linux/iommu.h | 10 ++++++++++
 drivers/iommu/iommu.c |  3 +--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index fce7ad81206f..f554328528bc 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -273,6 +273,16 @@ static inline bool iommu_is_dma_domain(struct iommu_domain *domain)
 	return domain->type & __IOMMU_DOMAIN_DMA_API;
 }
 
+static inline void
+iommu_domain_set_iopf_handler(struct iommu_domain *domain,
+		enum iommu_page_response_code (*handler)(struct iommu_fault *fault,
+							 void *data),
+		void *data)
+{
+	domain->iopf_handler = handler;
+	domain->fault_data = data;
+}
+
 enum iommu_cap {
 	IOMMU_CAP_CACHE_COHERENCY,	/* IOMMU_CACHE is supported */
 	IOMMU_CAP_NOEXEC,		/* IOMMU_NOEXEC flag */
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 13a2e0e26884..fd65ed1d3642 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3419,8 +3419,7 @@ struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
 	domain->type = IOMMU_DOMAIN_SVA;
 	mmgrab(mm);
 	domain->mm = mm;
-	domain->iopf_handler = iommu_sva_handle_iopf;
-	domain->fault_data = mm;
+	iommu_domain_set_iopf_handler(domain, iommu_sva_handle_iopf, mm);
 
 	return domain;
 }
-- 
2.34.1



* [RFC PATCHES 04/17] iommu: Pass device parameter to iopf handler
  2023-05-30  5:37 [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Lu Baolu
                   ` (2 preceding siblings ...)
  2023-05-30  5:37 ` [RFC PATCHES 03/17] iommu: Add helper to set iopf handler for domain Lu Baolu
@ 2023-05-30  5:37 ` Lu Baolu
  2023-05-30  5:37 ` [RFC PATCHES 05/17] iommu: Split IO page fault handling from SVA Lu Baolu
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Lu Baolu @ 2023-05-30  5:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Nicolin Chen, Yi Liu,
	Jacob Pan
  Cc: iommu, linux-kselftest, virtualization, linux-kernel, Lu Baolu

Pass the device to the iopf handler so that IOMMUFD can route the IO
page fault to user space with the device ID, which was generated when
user space bound the device to an IOAS.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 include/linux/iommu.h      | 2 ++
 drivers/iommu/iommu-sva.h  | 5 +++--
 drivers/iommu/io-pgfault.c | 2 +-
 drivers/iommu/iommu-sva.c  | 2 +-
 4 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index f554328528bc..f69ac54dc583 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -254,6 +254,7 @@ struct iommu_domain {
 	struct iommu_domain_geometry geometry;
 	struct iommu_dma_cookie *iova_cookie;
 	enum iommu_page_response_code (*iopf_handler)(struct iommu_fault *fault,
+						      struct device *dev,
 						      void *data);
 	void *fault_data;
 	union {
@@ -276,6 +277,7 @@ static inline bool iommu_is_dma_domain(struct iommu_domain *domain)
 static inline void
 iommu_domain_set_iopf_handler(struct iommu_domain *domain,
 		enum iommu_page_response_code (*handler)(struct iommu_fault *fault,
+							 struct device *dev,
 							 void *data),
 		void *data)
 {
diff --git a/drivers/iommu/iommu-sva.h b/drivers/iommu/iommu-sva.h
index 54946b5a7caf..5333d6a26047 100644
--- a/drivers/iommu/iommu-sva.h
+++ b/drivers/iommu/iommu-sva.h
@@ -23,7 +23,8 @@ struct iopf_queue *iopf_queue_alloc(const char *name);
 void iopf_queue_free(struct iopf_queue *queue);
 int iopf_queue_discard_partial(struct iopf_queue *queue);
 enum iommu_page_response_code
-iommu_sva_handle_iopf(struct iommu_fault *fault, void *data);
+iommu_sva_handle_iopf(struct iommu_fault *fault,
+		      struct device *dev, void *data);
 
 #else /* CONFIG_IOMMU_SVA */
 static inline int iommu_queue_iopf(struct iommu_fault *fault, void *cookie)
@@ -63,7 +64,7 @@ static inline int iopf_queue_discard_partial(struct iopf_queue *queue)
 }
 
 static inline enum iommu_page_response_code
-iommu_sva_handle_iopf(struct iommu_fault *fault, void *data)
+iommu_sva_handle_iopf(struct iommu_fault *fault, struct device *dev, void *data)
 {
 	return IOMMU_PAGE_RESP_INVALID;
 }
diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c
index 83f8055a0e09..dedc2ea70970 100644
--- a/drivers/iommu/io-pgfault.c
+++ b/drivers/iommu/io-pgfault.c
@@ -88,7 +88,7 @@ static void iopf_handler(struct work_struct *work)
 		 * faults in the group if there is an error.
 		 */
 		if (status == IOMMU_PAGE_RESP_SUCCESS)
-			status = domain->iopf_handler(&iopf->fault,
+			status = domain->iopf_handler(&iopf->fault, group->dev,
 						      domain->fault_data);
 
 		if (!(iopf->fault.prm.flags &
diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
index 9821bc44f5ac..02574a49275a 100644
--- a/drivers/iommu/iommu-sva.c
+++ b/drivers/iommu/iommu-sva.c
@@ -157,7 +157,7 @@ EXPORT_SYMBOL_GPL(iommu_sva_get_pasid);
  * I/O page fault handler for SVA
  */
 enum iommu_page_response_code
-iommu_sva_handle_iopf(struct iommu_fault *fault, void *data)
+iommu_sva_handle_iopf(struct iommu_fault *fault, struct device *dev, void *data)
 {
 	vm_fault_t ret;
 	struct vm_area_struct *vma;
-- 
2.34.1



* [RFC PATCHES 05/17] iommu: Split IO page fault handling from SVA
  2023-05-30  5:37 [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Lu Baolu
                   ` (3 preceding siblings ...)
  2023-05-30  5:37 ` [RFC PATCHES 04/17] iommu: Pass device parameter to iopf handler Lu Baolu
@ 2023-05-30  5:37 ` Lu Baolu
  2023-05-30  5:37 ` [RFC PATCHES 06/17] iommu: Add iommu page fault cookie helpers Lu Baolu
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Lu Baolu @ 2023-05-30  5:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Nicolin Chen, Yi Liu,
	Jacob Pan
  Cc: iommu, linux-kselftest, virtualization, linux-kernel, Lu Baolu

The current IO page fault handling framework is tightly coupled with the
SVA implementation, as SVA has been the only use case that requires IO
page fault handling. However, with the introduction of nested
translation, the first-level page table is now managed by userspace.
This means that any IO page fault generated on this first-level page
table should be routed to userspace and handled there.

To support this, we need to split the IO page fault handling framework
from the SVA implementation, and make it generic for all use cases.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 include/linux/iommu.h                         |  8 ++++++
 drivers/iommu/{iommu-sva.h => io-pgfault.h}   | 26 +++++--------------
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |  2 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |  2 +-
 drivers/iommu/intel/iommu.c                   |  2 +-
 drivers/iommu/intel/svm.c                     |  2 +-
 drivers/iommu/io-pgfault.c                    |  2 +-
 drivers/iommu/iommu-sva.c                     |  2 +-
 drivers/iommu/iommu.c                         |  2 +-
 drivers/iommu/Kconfig                         |  4 +++
 drivers/iommu/Makefile                        |  3 ++-
 drivers/iommu/intel/Kconfig                   |  1 +
 12 files changed, 29 insertions(+), 27 deletions(-)
 rename drivers/iommu/{iommu-sva.h => io-pgfault.h} (69%)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index f69ac54dc583..c201704f9aea 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1399,6 +1399,9 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev,
 					struct mm_struct *mm);
 void iommu_sva_unbind_device(struct iommu_sva *handle);
 u32 iommu_sva_get_pasid(struct iommu_sva *handle);
+enum iommu_page_response_code
+iommu_sva_handle_iopf(struct iommu_fault *fault,
+		      struct device *dev, void *data);
 #else
 static inline struct iommu_sva *
 iommu_sva_bind_device(struct device *dev, struct mm_struct *mm)
@@ -1417,6 +1420,11 @@ static inline u32 iommu_sva_get_pasid(struct iommu_sva *handle)
 static inline void mm_pasid_init(struct mm_struct *mm) {}
 static inline bool mm_valid_pasid(struct mm_struct *mm) { return false; }
 static inline void mm_pasid_drop(struct mm_struct *mm) {}
+static inline enum iommu_page_response_code
+iommu_sva_handle_iopf(struct iommu_fault *fault, struct device *dev, void *data)
+{
+	return IOMMU_PAGE_RESP_INVALID;
+}
 #endif /* CONFIG_IOMMU_SVA */
 
 #endif /* __LINUX_IOMMU_H */
diff --git a/drivers/iommu/iommu-sva.h b/drivers/iommu/io-pgfault.h
similarity index 69%
rename from drivers/iommu/iommu-sva.h
rename to drivers/iommu/io-pgfault.h
index 5333d6a26047..587844e36554 100644
--- a/drivers/iommu/iommu-sva.h
+++ b/drivers/iommu/io-pgfault.h
@@ -1,18 +1,15 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
- * SVA library for IOMMU drivers
+ * I/O page fault helpers for IOMMU drivers
  */
-#ifndef _IOMMU_SVA_H
-#define _IOMMU_SVA_H
+#ifndef _IOMMU_PGFAULT_H
+#define _IOMMU_PGFAULT_H
 
-#include <linux/mm_types.h>
-
-/* I/O Page fault */
 struct device;
 struct iommu_fault;
 struct iopf_queue;
 
-#ifdef CONFIG_IOMMU_SVA
+#ifdef CONFIG_IOMMU_PGFAULT
 int iommu_queue_iopf(struct iommu_fault *fault, void *cookie);
 
 int iopf_queue_add_device(struct iopf_queue *queue, struct device *dev);
@@ -22,11 +19,8 @@ int iopf_queue_flush_dev(struct device *dev);
 struct iopf_queue *iopf_queue_alloc(const char *name);
 void iopf_queue_free(struct iopf_queue *queue);
 int iopf_queue_discard_partial(struct iopf_queue *queue);
-enum iommu_page_response_code
-iommu_sva_handle_iopf(struct iommu_fault *fault,
-		      struct device *dev, void *data);
 
-#else /* CONFIG_IOMMU_SVA */
+#else /* CONFIG_IOMMU_PGFAULT */
 static inline int iommu_queue_iopf(struct iommu_fault *fault, void *cookie)
 {
 	return -ENODEV;
@@ -62,11 +56,5 @@ static inline int iopf_queue_discard_partial(struct iopf_queue *queue)
 {
 	return -ENODEV;
 }
-
-static inline enum iommu_page_response_code
-iommu_sva_handle_iopf(struct iommu_fault *fault, struct device *dev, void *data)
-{
-	return IOMMU_PAGE_RESP_INVALID;
-}
-#endif /* CONFIG_IOMMU_SVA */
-#endif /* _IOMMU_SVA_H */
+#endif /* CONFIG_IOMMU_PGFAULT */
+#endif /* _IOMMU_PGFAULT_H */
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index a5a63b1c947e..a6401500585b 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -10,7 +10,7 @@
 #include <linux/slab.h>
 
 #include "arm-smmu-v3.h"
-#include "../../iommu-sva.h"
+#include "../../io-pgfault.h"
 #include "../../io-pgtable-arm.h"
 
 struct arm_smmu_mmu_notifier {
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 8ec4ee5270b1..021e72eade5f 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -30,7 +30,7 @@
 
 #include "arm-smmu-v3.h"
 #include "../../dma-iommu.h"
-#include "../../iommu-sva.h"
+#include "../../io-pgfault.h"
 
 static bool disable_bypass = true;
 module_param(disable_bypass, bool, 0444);
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 42288bd449a0..7473531c7568 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -26,7 +26,7 @@
 #include "iommu.h"
 #include "../dma-iommu.h"
 #include "../irq_remapping.h"
-#include "../iommu-sva.h"
+#include "../io-pgfault.h"
 #include "pasid.h"
 #include "cap_audit.h"
 #include "perfmon.h"
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index e95b339e9cdc..243edc81db75 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -22,7 +22,7 @@
 #include "iommu.h"
 #include "pasid.h"
 #include "perf.h"
-#include "../iommu-sva.h"
+#include "../io-pgfault.h"
 #include "trace.h"
 
 static irqreturn_t prq_event_thread(int irq, void *d);
diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c
index dedc2ea70970..7e735369a041 100644
--- a/drivers/iommu/io-pgfault.c
+++ b/drivers/iommu/io-pgfault.c
@@ -11,7 +11,7 @@
 #include <linux/slab.h>
 #include <linux/workqueue.h>
 
-#include "iommu-sva.h"
+#include "io-pgfault.h"
 
 /**
  * struct iopf_queue - IO Page Fault queue
diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
index 02574a49275a..585ee56e29d9 100644
--- a/drivers/iommu/iommu-sva.c
+++ b/drivers/iommu/iommu-sva.c
@@ -7,7 +7,7 @@
 #include <linux/sched/mm.h>
 #include <linux/iommu.h>
 
-#include "iommu-sva.h"
+#include "io-pgfault.h"
 
 static DEFINE_MUTEX(iommu_sva_lock);
 static DEFINE_IDA(iommu_global_pasid_ida);
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index fd65ed1d3642..cace57c066f4 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -36,7 +36,7 @@
 #include "dma-iommu.h"
 #include "iommu-priv.h"
 
-#include "iommu-sva.h"
+#include "io-pgfault.h"
 
 #include "iommu-priv.h"
 
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index db98c3f86e8c..92ecaf21b355 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -157,6 +157,9 @@ config IOMMU_DMA
 config IOMMU_SVA
 	bool
 
+config IOMMU_PGFAULT
+	bool
+
 config FSL_PAMU
 	bool "Freescale IOMMU support"
 	depends on PCI
@@ -402,6 +405,7 @@ config ARM_SMMU_V3_SVA
 	bool "Shared Virtual Addressing support for the ARM SMMUv3"
 	depends on ARM_SMMU_V3
 	select IOMMU_SVA
+	select IOMMU_PGFAULT
 	select MMU_NOTIFIER
 	help
 	  Support for sharing process address spaces with devices using the
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 769e43d780ce..ff5c69c7cb02 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -27,6 +27,7 @@ obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
 obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
 obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
 obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
-obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o io-pgfault.o
+obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
+obj-$(CONFIG_IOMMU_PGFAULT) += io-pgfault.o
 obj-$(CONFIG_SPRD_IOMMU) += sprd-iommu.o
 obj-$(CONFIG_APPLE_DART) += apple-dart.o
diff --git a/drivers/iommu/intel/Kconfig b/drivers/iommu/intel/Kconfig
index 2e56bd79f589..0c2d9202f8ff 100644
--- a/drivers/iommu/intel/Kconfig
+++ b/drivers/iommu/intel/Kconfig
@@ -15,6 +15,7 @@ config INTEL_IOMMU
 	select DMA_OPS
 	select IOMMU_API
 	select IOMMU_IOVA
+	select IOMMU_PGFAULT
 	select NEED_DMA_MAP_STATE
 	select DMAR_TABLE
 	select SWIOTLB
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCHES 06/17] iommu: Add iommu page fault cookie helpers
  2023-05-30  5:37 [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Lu Baolu
                   ` (4 preceding siblings ...)
  2023-05-30  5:37 ` [RFC PATCHES 05/17] iommu: Split IO page fault handling from SVA Lu Baolu
@ 2023-05-30  5:37 ` Lu Baolu
  2023-05-30  5:37 ` [RFC PATCHES 07/17] iommufd: Add iommu page fault data Lu Baolu
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Lu Baolu @ 2023-05-30  5:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Nicolin Chen, Yi Liu,
	Jacob Pan
  Cc: iommu, linux-kselftest, virtualization, linux-kernel, Lu Baolu

Add an xarray in iommu_fault_param as a placeholder for a per-{device,
pasid} fault cookie. The iommufd will use it to store the mapping between
the device object ID and the device pointer. This allows the iommufd to
quickly retrieve the device object ID for a given {device, pasid} pair in
the hot path of IO page fault delivery.

Otherwise, the iommufd would have to maintain its own data structures to
map {device, pasid} pairs to object IDs, and then look up the object ID on
the critical path, which is not performance friendly.

The iommufd is supposed to set the cookie when a fault-capable domain is
attached to the physical device or pasid, and to clear the cookie when the
domain is removed.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 include/linux/iommu.h      |  2 ++
 drivers/iommu/iommu-priv.h |  3 +++
 drivers/iommu/iommu.c      | 45 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 50 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index c201704f9aea..9b0058ac971c 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -617,12 +617,14 @@ struct iommu_fault_event {
  * @data: handler private data
  * @faults: holds the pending faults which needs response
  * @lock: protect pending faults list
+ * @pasid_cookie: per-pasid fault cookie
  */
 struct iommu_fault_param {
 	iommu_dev_fault_handler_t handler;
 	void *data;
 	struct list_head faults;
 	struct mutex lock;
+	struct xarray pasid_cookie;
 };
 
 /**
diff --git a/drivers/iommu/iommu-priv.h b/drivers/iommu/iommu-priv.h
index a6e694f59f64..17ab989702a0 100644
--- a/drivers/iommu/iommu-priv.h
+++ b/drivers/iommu/iommu-priv.h
@@ -17,5 +17,8 @@ static inline const struct iommu_ops *dev_iommu_ops(struct device *dev)
 
 int iommu_group_replace_domain(struct iommu_group *group,
 			       struct iommu_domain *new_domain);
+void *iommu_set_device_fault_cookie(struct device *dev, ioasid_t pasid,
+				    void *cookie);
+void *iommu_get_device_fault_cookie(struct device *dev, ioasid_t pasid);
 
 #endif /* __LINUX_IOMMU_PRIV_H */
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index cace57c066f4..2f81be7f3a90 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1270,6 +1270,7 @@ int iommu_register_device_fault_handler(struct device *dev,
 	param->fault_param->data = data;
 	mutex_init(&param->fault_param->lock);
 	INIT_LIST_HEAD(&param->fault_param->faults);
+	xa_init(&param->fault_param->pasid_cookie);
 
 done_unlock:
 	mutex_unlock(&param->lock);
@@ -1435,6 +1436,50 @@ int iommu_page_response(struct device *dev,
 }
 EXPORT_SYMBOL_GPL(iommu_page_response);
 
+/**
+ * iommu_set_device_fault_cookie - Set a fault cookie for a {device, pasid} pair
+ * @dev: the device to set the cookie for
+ * @pasid: the pasid on this device
+ * @cookie: the opaque data
+ *
+ * Return the old cookie on success, or ERR_PTR(err#) on failure.
+ */
+void *iommu_set_device_fault_cookie(struct device *dev, ioasid_t pasid,
+				    void *cookie)
+{
+	struct iommu_fault_param *fault_param;
+	void *curr;
+
+	if (!dev->iommu || !dev->iommu->fault_param)
+		return ERR_PTR(-ENODEV);
+
+	fault_param = dev->iommu->fault_param;
+	curr = xa_store(&fault_param->pasid_cookie, pasid, cookie, GFP_KERNEL);
+
+	return xa_is_err(curr) ? ERR_PTR(xa_err(curr)) : curr;
+}
+EXPORT_SYMBOL_NS_GPL(iommu_set_device_fault_cookie, IOMMUFD_INTERNAL);
+
+/**
+ * iommu_get_device_fault_cookie - Get the fault cookie for a {device, pasid} pair
+ * @dev: the device to get the cookie for
+ * @pasid: the pasid on this device
+ *
+ * Return the cookie on success, or ERR_PTR(err#) on failure.
+ */
+void *iommu_get_device_fault_cookie(struct device *dev, ioasid_t pasid)
+{
+	struct iommu_fault_param *fault_param;
+
+	if (!dev->iommu || !dev->iommu->fault_param)
+		return ERR_PTR(-ENODEV);
+
+	fault_param = dev->iommu->fault_param;
+
+	return xa_load(&fault_param->pasid_cookie, pasid);
+}
+EXPORT_SYMBOL_NS_GPL(iommu_get_device_fault_cookie, IOMMUFD_INTERNAL);
+
 /**
  * iommu_group_id - Return ID for a group
  * @group: the group to ID
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCHES 07/17] iommufd: Add iommu page fault data
  2023-05-30  5:37 [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Lu Baolu
                   ` (5 preceding siblings ...)
  2023-05-30  5:37 ` [RFC PATCHES 06/17] iommu: Add iommu page fault cookie helpers Lu Baolu
@ 2023-05-30  5:37 ` Lu Baolu
  2023-05-30  5:37 ` [RFC PATCHES 08/17] iommufd: IO page fault delivery initialization and release Lu Baolu
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Lu Baolu @ 2023-05-30  5:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Nicolin Chen, Yi Liu,
	Jacob Pan
  Cc: iommu, linux-kselftest, virtualization, linux-kernel, Lu Baolu

Allow user space to handle IO page faults generated by the IOMMU hardware
when walking a HWPT managed by user space. One example use case is
nested translation, where the first-stage page table is managed by
user space.

When allocating a user HWPT, the user can opt in to a flag named
IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE, which indicates that the user is
capable of handling IO page faults generated for this HWPT. The user
also needs to allocate an eventfd and pass it in the event_fd field
of the iommu_hwpt_alloc data.

On a successful return of hwpt allocation, the user can listen to the
event fd and retrieve the page faults by reading from the fd returned
at out_fault_fd. The page fault data is encoded in the format defined
by struct iommu_hwpt_pgfault.

struct iommu_hwpt_pgfault largely mirrors struct iommu_fault, with new
members such as the fault data size and the ID of the device object
from which the page fault originated.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 include/uapi/linux/iommufd.h | 44 +++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index e10e6f74cdf4..2c7c44c00da2 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -444,7 +444,11 @@ struct iommu_hwpt_arm_smmuv3 {
 /**
  * struct iommu_hwpt_alloc - ioctl(IOMMU_HWPT_ALLOC)
  * @size: sizeof(struct iommu_hwpt_alloc)
- * @flags: Must be 0
+ * @flags: Combination of IOMMU_HWPT_ALLOC_FLAGS_ flags
+ *  - IOPF_CAPABLE: User is capable of handling IO page faults. @event_fd
+ *    must be valid once this flag is set. On successful return, user can
+ *    listen to @event_fd and retrieve faults by reading @out_fault_fd.
+ *    The fault data is encoded in the format defined by iommu_hwpt_pgfault.
  * @dev_id: The device to allocate this HWPT for
  * @pt_id: The IOAS to connect this HWPT to
  * @out_hwpt_id: The ID of the new HWPT
@@ -482,6 +486,7 @@ struct iommu_hwpt_arm_smmuv3 {
 struct iommu_hwpt_alloc {
 	__u32 size;
 	__u32 flags;
+#define IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE		(1 << 0)
 	__u32 dev_id;
 	__u32 pt_id;
 	__u32 out_hwpt_id;
@@ -489,6 +494,8 @@ struct iommu_hwpt_alloc {
 	__u32 hwpt_type;
 	__u32 data_len;
 	__aligned_u64 data_uptr;
+	__u32 event_fd;
+	__u32 out_fault_fd;
 };
 #define IOMMU_HWPT_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_ALLOC)
 
@@ -705,6 +712,41 @@ struct iommu_hwpt_invalidate {
 };
 #define IOMMU_HWPT_INVALIDATE _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_INVALIDATE)
 
+/**
+ * struct iommu_hwpt_pgfault - iommu page fault data
+ * @size: sizeof(struct iommu_hwpt_pgfault)
+ * @flags: Combination of IOMMU_PGFAULT_FLAGS_ flags.
+ *  - PASID_VALID: @pasid field is valid
+ *  - LAST_PAGE: the last page fault in a group
+ *  - PRIV_DATA: @private_data field is valid
+ *  - RESP_NEEDS_PASID: the page response must have the same
+ *                      PASID value as the page request.
+ * @dev_id: ID of the device the fault originated from
+ * @pasid: Process Address Space ID
+ * @grpid: Page Request Group Index
+ * @perm: requested page permissions (IOMMU_PGFAULT_PERM_* values)
+ * @addr: page address
+ * @private_data: device-specific private information
+ */
+struct iommu_hwpt_pgfault {
+	__u32 size;
+	__u32 flags;
+#define IOMMU_PGFAULT_FLAGS_PASID_VALID		(1 << 0)
+#define IOMMU_PGFAULT_FLAGS_LAST_PAGE		(1 << 1)
+#define IOMMU_PGFAULT_FLAGS_PRIV_DATA		(1 << 2)
+#define IOMMU_PGFAULT_FLAGS_RESP_NEEDS_PASID	(1 << 3)
+	__u32 dev_id;
+	__u32 pasid;
+	__u32 grpid;
+	__u32 perm;
+#define IOMMU_PGFAULT_PERM_READ			(1 << 0)
+#define IOMMU_PGFAULT_PERM_WRITE		(1 << 1)
+#define IOMMU_PGFAULT_PERM_EXEC			(1 << 2)
+#define IOMMU_PGFAULT_PERM_PRIV			(1 << 3)
+	__u64 addr;
+	__u64 private_data[2];
+};
+
 /**
  * struct iommu_device_set_data - ioctl(IOMMU_DEVICE_SET_DATA)
  * @size: sizeof(struct iommu_device_set_data)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCHES 08/17] iommufd: IO page fault delivery initialization and release
  2023-05-30  5:37 [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Lu Baolu
                   ` (6 preceding siblings ...)
  2023-05-30  5:37 ` [RFC PATCHES 07/17] iommufd: Add iommu page fault data Lu Baolu
@ 2023-05-30  5:37 ` Lu Baolu
  2023-05-30  5:37 ` [RFC PATCHES 09/17] iommufd: Add iommufd hwpt iopf handler Lu Baolu
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Lu Baolu @ 2023-05-30  5:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Nicolin Chen, Yi Liu,
	Jacob Pan
  Cc: iommu, linux-kselftest, virtualization, linux-kernel, Lu Baolu

Add some housekeeping code for IO page fault delivery. Add a fault field
to the iommufd_hw_pagetable structure to store pending IO page faults and
other related data.

The fault field is allocated when an IOPF-capable user HWPT (indicated by
IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE being set in the allocation user data)
is allocated, and it exists until the HWPT is destroyed. This also
implies that whether a HWPT is IOPF capable can be determined by checking
the fault field.

When an IOPF-capable HWPT is attached to a device (or, in the future, to a
PASID of a device), a fault cookie is allocated and set on the device.
The cookie is cleared and freed when the HWPT is detached from the device.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/iommufd/iommufd_private.h | 12 +++++
 drivers/iommu/iommufd/device.c          | 61 +++++++++++++++++++++++--
 drivers/iommu/iommufd/hw_pagetable.c    | 55 ++++++++++++++++++++++
 3 files changed, 125 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index e951815f5707..5ff139acc5c0 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -236,6 +236,13 @@ int iommufd_option_rlimit_mode(struct iommu_option *cmd,
 
 int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd);
 
+struct hw_pgtable_fault {
+	struct mutex mutex;
+	struct list_head deliver;
+	struct list_head response;
+	struct eventfd_ctx *trigger;
+};
+
 /*
  * A HW pagetable is called an iommu_domain inside the kernel. This user object
  * allows directly creating and inspecting the domains. Domains that have kernel
@@ -252,6 +259,7 @@ struct iommufd_hw_pagetable {
 	bool msi_cookie : 1;
 	/* Head at iommufd_ioas::hwpt_list */
 	struct list_head hwpt_item;
+	struct hw_pgtable_fault *fault;
 };
 
 struct iommufd_hw_pagetable *
@@ -314,6 +322,10 @@ struct iommufd_device {
 	bool has_user_data;
 };
 
+struct iommufd_fault_cookie {
+	struct iommufd_device *idev;
+};
+
 static inline struct iommufd_device *
 iommufd_get_device(struct iommufd_ucmd *ucmd, u32 id)
 {
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 29b212714e2c..3408f1fc3e9f 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -374,6 +374,44 @@ static int iommufd_group_setup_msi(struct iommufd_group *igroup,
 	return 0;
 }
 
+static int iommufd_device_set_fault_cookie(struct iommufd_hw_pagetable *hwpt,
+					   struct iommufd_device *idev,
+					   ioasid_t pasid)
+{
+	struct iommufd_fault_cookie *fcookie, *curr;
+
+	if (!hwpt->fault)
+		return 0;
+
+	fcookie = kzalloc(sizeof(*fcookie), GFP_KERNEL);
+	if (!fcookie)
+		return -ENOMEM;
+	fcookie->idev = idev;
+
+	curr = iommu_set_device_fault_cookie(idev->dev, pasid, fcookie);
+	if (IS_ERR(curr)) {
+		kfree(fcookie);
+		return PTR_ERR(curr);
+	}
+	kfree(curr);
+
+	return 0;
+}
+
+static void iommufd_device_unset_fault_cookie(struct iommufd_hw_pagetable *hwpt,
+					      struct iommufd_device *idev,
+					      ioasid_t pasid)
+{
+	struct iommufd_fault_cookie *curr;
+
+	if (!hwpt->fault)
+		return;
+
+	curr = iommu_set_device_fault_cookie(idev->dev, pasid, NULL);
+	WARN_ON(IS_ERR(curr));
+	kfree(curr);
+}
+
 int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
 				struct iommufd_device *idev)
 {
@@ -398,6 +436,10 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
 	if (rc)
 		goto err_unlock;
 
+	rc = iommufd_device_set_fault_cookie(hwpt, idev, 0);
+	if (rc)
+		goto err_unresv;
+
 	/*
 	 * Only attach to the group once for the first device that is in the
 	 * group. All the other devices will follow this attachment. The user
@@ -408,17 +450,21 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
 	if (list_empty(&idev->igroup->device_list)) {
 		rc = iommufd_group_setup_msi(idev->igroup, hwpt);
 		if (rc)
-			goto err_unresv;
+			goto err_unset;
 
 		rc = iommu_attach_group(hwpt->domain, idev->igroup->group);
 		if (rc)
-			goto err_unresv;
+			goto err_unset;
 		idev->igroup->hwpt = hwpt;
 	}
+
 	refcount_inc(&hwpt->obj.users);
 	list_add_tail(&idev->group_item, &idev->igroup->device_list);
 	mutex_unlock(&idev->igroup->lock);
 	return 0;
+
+err_unset:
+	iommufd_device_unset_fault_cookie(hwpt, idev, 0);
 err_unresv:
 	iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
 err_unlock:
@@ -433,6 +479,7 @@ iommufd_hw_pagetable_detach(struct iommufd_device *idev)
 
 	mutex_lock(&idev->igroup->lock);
 	list_del(&idev->group_item);
+	iommufd_device_unset_fault_cookie(hwpt, idev, 0);
 	if (list_empty(&idev->igroup->device_list)) {
 		iommu_detach_group(hwpt->domain, idev->igroup->group);
 		idev->igroup->hwpt = NULL;
@@ -502,9 +549,14 @@ iommufd_device_do_replace(struct iommufd_device *idev,
 	if (rc)
 		goto err_unresv;
 
+	iommufd_device_unset_fault_cookie(old_hwpt, idev, 0);
+	rc = iommufd_device_set_fault_cookie(hwpt, idev, 0);
+	if (rc)
+		goto err_unresv;
+
 	rc = iommu_group_replace_domain(igroup->group, hwpt->domain);
 	if (rc)
-		goto err_unresv;
+		goto err_replace;
 
 	if (hwpt->ioas != old_hwpt->ioas) {
 		list_for_each_entry(cur, &igroup->device_list, group_item)
@@ -526,6 +578,9 @@ iommufd_device_do_replace(struct iommufd_device *idev,
 
 	/* Caller must destroy old_hwpt */
 	return old_hwpt;
+err_replace:
+	iommufd_device_unset_fault_cookie(hwpt, idev, 0);
+	iommufd_device_set_fault_cookie(old_hwpt, idev, 0);
 err_unresv:
 	list_for_each_entry(cur, &igroup->device_list, group_item)
 		iopt_remove_reserved_iova(&hwpt->ioas->iopt, cur->dev);
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 47ec7ddd5f5d..d6d550c3d0cc 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -3,12 +3,16 @@
  * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
  */
 #include <linux/iommu.h>
+#include <linux/eventfd.h>
 #include <uapi/linux/iommufd.h>
 
 #include "../iommu-priv.h"
 #include "iommufd_private.h"
 #include "iommufd_test.h"
 
+static struct hw_pgtable_fault *hw_pagetable_fault_alloc(int eventfd);
+static void hw_pagetable_fault_free(struct hw_pgtable_fault *fault);
+
 void iommufd_hw_pagetable_destroy(struct iommufd_object *obj)
 {
 	struct iommufd_hw_pagetable *hwpt =
@@ -27,6 +31,9 @@ void iommufd_hw_pagetable_destroy(struct iommufd_object *obj)
 
 	if (hwpt->parent)
 		refcount_dec(&hwpt->parent->obj.users);
+
+	if (hwpt->fault)
+		hw_pagetable_fault_free(hwpt->fault);
 	refcount_dec(&hwpt->ioas->obj.users);
 }
 
@@ -255,6 +262,11 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
 		goto out_put_pt;
 	}
 
+	if (!parent && (cmd->flags & IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE)) {
+		rc = -EINVAL;
+		goto out_put_pt;
+	}
+
 	if (klen) {
 		if (!cmd->data_len) {
 			rc = -EINVAL;
@@ -282,6 +294,14 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
 		goto out_unlock;
 	}
 
+	if (cmd->flags & IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE) {
+		hwpt->fault = hw_pagetable_fault_alloc(cmd->event_fd);
+		if (IS_ERR(hwpt->fault)) {
+			rc = PTR_ERR(hwpt->fault);
+			goto out_hwpt;
+		}
+	}
+
 	cmd->out_hwpt_id = hwpt->obj.id;
 	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
 	if (rc)
@@ -346,3 +366,38 @@ int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd)
 	iommufd_put_object(&hwpt->obj);
 	return rc;
 }
+
+static struct hw_pgtable_fault *hw_pagetable_fault_alloc(int eventfd)
+{
+	struct hw_pgtable_fault *fault;
+	int rc;
+
+	fault = kzalloc(sizeof(*fault), GFP_KERNEL);
+	if (!fault)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&fault->deliver);
+	INIT_LIST_HEAD(&fault->response);
+	mutex_init(&fault->mutex);
+
+	fault->trigger = eventfd_ctx_fdget(eventfd);
+	if (IS_ERR(fault->trigger)) {
+		rc = PTR_ERR(fault->trigger);
+		goto out_free;
+	}
+
+	return fault;
+
+out_free:
+	kfree(fault);
+	return ERR_PTR(rc);
+}
+
+static void hw_pagetable_fault_free(struct hw_pgtable_fault *fault)
+{
+	WARN_ON(!list_empty(&fault->deliver));
+	WARN_ON(!list_empty(&fault->response));
+
+	eventfd_ctx_put(fault->trigger);
+	kfree(fault);
+}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCHES 09/17] iommufd: Add iommufd hwpt iopf handler
  2023-05-30  5:37 [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Lu Baolu
                   ` (7 preceding siblings ...)
  2023-05-30  5:37 ` [RFC PATCHES 08/17] iommufd: IO page fault delivery initialization and release Lu Baolu
@ 2023-05-30  5:37 ` Lu Baolu
  2023-05-30  5:37 ` [RFC PATCHES 10/17] iommufd: Add IOMMU_HWPT_ALLOC_FLAGS_USER_PASID_TABLE for hwpt_alloc Lu Baolu
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Lu Baolu @ 2023-05-30  5:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Nicolin Chen, Yi Liu,
	Jacob Pan
  Cc: iommu, linux-kselftest, virtualization, linux-kernel, Lu Baolu

The IOPF handler is responsible for delivering I/O page faults to user
space. When an I/O page fault occurs, the fault is placed in the fault
pending list of the hardware page table (HWPT). The HWPT then generates
a fault event, which is used to notify user space of the fault. User
space can then fetch the fault information from the HWPT and handle the
fault accordingly.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/iommufd/iommufd_private.h |  8 ++++
 drivers/iommu/iommufd/hw_pagetable.c    | 50 +++++++++++++++++++++++++
 2 files changed, 58 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 5ff139acc5c0..8ff7721ea922 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -243,6 +243,14 @@ struct hw_pgtable_fault {
 	struct eventfd_ctx *trigger;
 };
 
+struct iommufd_fault {
+	struct device *dev;
+	ioasid_t pasid;
+	struct iommu_hwpt_pgfault fault;
+	/* List head at hw_pgtable_fault:deliver or response */
+	struct list_head item;
+};
+
 /*
  * A HW pagetable is called an iommu_domain inside the kernel. This user object
  * allows directly creating and inspecting the domains. Domains that have kernel
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index d6d550c3d0cc..4d07c7c0073e 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -12,6 +12,9 @@
 
 static struct hw_pgtable_fault *hw_pagetable_fault_alloc(int eventfd);
 static void hw_pagetable_fault_free(struct hw_pgtable_fault *fault);
+static enum iommu_page_response_code
+iommufd_hw_pagetable_iopf_handler(struct iommu_fault *fault,
+				  struct device *dev, void *data);
 
 void iommufd_hw_pagetable_destroy(struct iommufd_object *obj)
 {
@@ -300,6 +303,10 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
 			rc = PTR_ERR(hwpt->fault);
 			goto out_hwpt;
 		}
+
+		iommu_domain_set_iopf_handler(hwpt->domain,
+					      iommufd_hw_pagetable_iopf_handler,
+					      hwpt);
 	}
 
 	cmd->out_hwpt_id = hwpt->obj.id;
@@ -367,6 +374,49 @@ int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd)
 	return rc;
 }
 
+static void iommufd_compose_fault_message(struct iommu_fault *fault,
+					  struct iommu_hwpt_pgfault *hwpt_fault,
+					  unsigned int dev_id)
+{
+	hwpt_fault->size = sizeof(*hwpt_fault);
+	hwpt_fault->flags = fault->prm.flags;
+	hwpt_fault->dev_id = dev_id;
+	hwpt_fault->pasid = fault->prm.pasid;
+	hwpt_fault->grpid = fault->prm.grpid;
+	hwpt_fault->perm = fault->prm.perm;
+	hwpt_fault->addr = fault->prm.addr;
+	hwpt_fault->private_data[0] = fault->prm.private_data[0];
+	hwpt_fault->private_data[1] = fault->prm.private_data[1];
+}
+
+static enum iommu_page_response_code
+iommufd_hw_pagetable_iopf_handler(struct iommu_fault *fault,
+				  struct device *dev, void *data)
+{
+	struct iommufd_hw_pagetable *hwpt = data;
+	struct iommufd_fault_cookie *cookie;
+	struct iommufd_fault *ifault;
+
+	ifault = kzalloc(sizeof(*ifault), GFP_KERNEL);
+	if (!ifault)
+		return IOMMU_PAGE_RESP_FAILURE;
+
+	cookie = iommu_get_device_fault_cookie(dev, fault->prm.pasid);
+	if (!cookie)
+		return IOMMU_PAGE_RESP_FAILURE;
+
+	iommufd_compose_fault_message(fault, &ifault->fault, cookie->idev->obj.id);
+	ifault->dev = dev;
+	ifault->pasid = fault->prm.pasid;
+
+	mutex_lock(&hwpt->fault->mutex);
+	list_add_tail(&ifault->item, &hwpt->fault->deliver);
+	eventfd_signal(hwpt->fault->trigger, 1);
+	mutex_unlock(&hwpt->fault->mutex);
+
+	return IOMMU_PAGE_RESP_ASYNC;
+}
+
 static struct hw_pgtable_fault *hw_pagetable_fault_alloc(int eventfd)
 {
 	struct hw_pgtable_fault *fault;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCHES 10/17] iommufd: Add IOMMU_HWPT_ALLOC_FLAGS_USER_PASID_TABLE for hwpt_alloc
  2023-05-30  5:37 [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Lu Baolu
                   ` (8 preceding siblings ...)
  2023-05-30  5:37 ` [RFC PATCHES 09/17] iommufd: Add iommufd hwpt iopf handler Lu Baolu
@ 2023-05-30  5:37 ` Lu Baolu
  2023-05-30  5:37 ` [RFC PATCHES 11/17] iommufd: Deliver fault messages to user space Lu Baolu
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Lu Baolu @ 2023-05-30  5:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Nicolin Chen, Yi Liu,
	Jacob Pan
  Cc: iommu, linux-kselftest, virtualization, linux-kernel, Lu Baolu

This flag indicates that the architecture supports assigning the whole
PASID table of a device to userspace. When this flag is set, the host
kernel does not need to be involved in attaching HWPTs to, or detaching
them from, any PASID of the device. For such architectures, the fault
cookie is always saved as {device, 0}.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/iommufd/iommufd_private.h | 1 +
 include/uapi/linux/iommufd.h            | 3 +++
 drivers/iommu/iommufd/hw_pagetable.c    | 6 +++++-
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 8ff7721ea922..67e5aa0f996e 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -241,6 +241,7 @@ struct hw_pgtable_fault {
 	struct list_head deliver;
 	struct list_head response;
 	struct eventfd_ctx *trigger;
+	bool user_pasid_table;
 };
 
 struct iommufd_fault {
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 2c7c44c00da2..63863e21d043 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -449,6 +449,8 @@ struct iommu_hwpt_arm_smmuv3 {
  *    must be valid once this flag is set. On successful return, user can
  *    listen to @event_fd and retrieve faults by reading @out_fault_fd.
  *    The fault data is encoded in the format defined by iommu_hwpt_pgfault.
+ *  - USER_PASID_TABLE: The architecture supports assigning the whole pasid
+ *    table of a device to user.
  * @dev_id: The device to allocate this HWPT for
  * @pt_id: The IOAS to connect this HWPT to
  * @out_hwpt_id: The ID of the new HWPT
@@ -487,6 +489,7 @@ struct iommu_hwpt_alloc {
 	__u32 size;
 	__u32 flags;
 #define IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE		(1 << 0)
+#define IOMMU_HWPT_ALLOC_FLAGS_USER_PASID_TABLE		(1 << 1)
 	__u32 dev_id;
 	__u32 pt_id;
 	__u32 out_hwpt_id;
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 4d07c7c0073e..ca3e4d92f2aa 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -304,6 +304,9 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
 			goto out_hwpt;
 		}
 
+		if (cmd->flags & IOMMU_HWPT_ALLOC_FLAGS_USER_PASID_TABLE)
+			hwpt->fault->user_pasid_table = true;
+
 		iommu_domain_set_iopf_handler(hwpt->domain,
 					      iommufd_hw_pagetable_iopf_handler,
 					      hwpt);
@@ -401,7 +404,8 @@ iommufd_hw_pagetable_iopf_handler(struct iommu_fault *fault,
 	if (!ifault)
 		return IOMMU_PAGE_RESP_FAILURE;
 
-	cookie = iommu_get_device_fault_cookie(dev, fault->prm.pasid);
+	cookie = iommu_get_device_fault_cookie(dev,
+			hwpt->fault->user_pasid_table ?  0 : fault->prm.pasid);
 	if (!cookie)
 		return IOMMU_PAGE_RESP_FAILURE;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCHES 11/17] iommufd: Deliver fault messages to user space
  2023-05-30  5:37 [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Lu Baolu
                   ` (9 preceding siblings ...)
  2023-05-30  5:37 ` [RFC PATCHES 10/17] iommufd: Add IOMMU_HWPT_ALLOC_FLAGS_USER_PASID_TABLE for hwpt_alloc Lu Baolu
@ 2023-05-30  5:37 ` Lu Baolu
  2023-05-30  5:37 ` [RFC PATCHES 12/17] iommufd: Add io page fault response support Lu Baolu
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Lu Baolu @ 2023-05-30  5:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Nicolin Chen, Yi Liu,
	Jacob Pan
  Cc: iommu, linux-kselftest, virtualization, linux-kernel, Lu Baolu

Provide a read-only file interface that allows user space to obtain fault
messages by sequentially reading the file. User space can determine whether
all fault messages have been read by comparing the size of the supplied
read buffer with the length of data actually returned. Once a fault has
been read by the user, it is moved from the pending list to the
waiting-for-response list.
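
As a hedged sketch, user space might consume this read interface roughly as
below. The record layout is a simplified stand-in, not the real uAPI struct
(which is struct iommu_hwpt_pgfault in this series); only the whole-record
contract described above is modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the fault record returned by the fault fd;
 * field names and layout are illustrative, not the real uAPI. */
struct fault_rec {
	unsigned int flags;
	unsigned int pasid;
	unsigned int grpid;
	unsigned long long addr;
};

/*
 * The kernel returns only whole records and rejects read lengths that
 * are not a multiple of the record size, so the number of delivered
 * faults is nread / sizeof(record).  "All pending messages consumed"
 * is detected by nread being smaller than the requested length.
 */
static size_t split_fault_records(const char *buf, size_t nread,
				  struct fault_rec *out, size_t max)
{
	size_t i, n = nread / sizeof(struct fault_rec);

	if (n > max)
		n = max;
	for (i = 0; i < n; i++)
		memcpy(&out[i], buf + i * sizeof(*out), sizeof(*out));
	return n;
}
```

In a real consumer, buf/nread would come from read(fault_fd, buf, count)
after the associated eventfd signals.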

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/iommufd/iommufd_private.h |  2 +
 drivers/iommu/iommufd/hw_pagetable.c    | 66 +++++++++++++++++++++++++
 2 files changed, 68 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 67e5aa0f996e..6da0ba9421d0 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -242,6 +242,8 @@ struct hw_pgtable_fault {
 	struct list_head response;
 	struct eventfd_ctx *trigger;
 	bool user_pasid_table;
+	struct file *fault_file;
+	int fault_fd;
 };
 
 struct iommufd_fault {
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index ca3e4d92f2aa..09377a98069b 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -4,6 +4,8 @@
  */
 #include <linux/iommu.h>
 #include <linux/eventfd.h>
+#include <linux/file.h>
+#include <linux/anon_inodes.h>
 #include <uapi/linux/iommufd.h>
 
 #include "../iommu-priv.h"
@@ -310,6 +312,8 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
 		iommu_domain_set_iopf_handler(hwpt->domain,
 					      iommufd_hw_pagetable_iopf_handler,
 					      hwpt);
+
+		cmd->out_fault_fd = hwpt->fault->fault_fd;
 	}
 
 	cmd->out_hwpt_id = hwpt->obj.id;
@@ -421,6 +425,62 @@ iommufd_hw_pagetable_iopf_handler(struct iommu_fault *fault,
 	return IOMMU_PAGE_RESP_ASYNC;
 }
 
+static ssize_t hwpt_fault_fops_read(struct file *filep, char __user *buf,
+				    size_t count, loff_t *ppos)
+{
+	struct hw_pgtable_fault *fault = filep->private_data;
+	size_t fault_size = sizeof(struct iommu_fault);
+	struct iommufd_fault *ifault;
+	size_t done = 0;
+
+	if (ppos || count % fault_size)
+		return -ESPIPE;
+
+	mutex_lock(&fault->mutex);
+	while (!list_empty(&fault->deliver) && count > done) {
+		ifault = list_first_entry(&fault->deliver, struct iommufd_fault, item);
+		if (copy_to_user(buf + done, &ifault->fault, fault_size))
+			break;
+		done += fault_size;
+		list_del_init(&ifault->item);
+		if (ifault->fault.flags & IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE)
+			list_add_tail(&ifault->item, &fault->response);
+		else
+			kfree(ifault);
+	}
+	mutex_unlock(&fault->mutex);
+
+	return done;
+}
+
+static const struct file_operations hwpt_fault_fops = {
+	.owner		= THIS_MODULE,
+	.read		= hwpt_fault_fops_read,
+};
+
+static int hw_pagetable_get_fault_fd(struct hw_pgtable_fault *fault)
+{
+	struct file *filep;
+	int fdno;
+
+	fdno = get_unused_fd_flags(O_CLOEXEC);
+	if (fdno < 0)
+		return fdno;
+
+	filep = anon_inode_getfile("[iommufd-pgfault]", &hwpt_fault_fops,
+				   fault, O_RDONLY);
+	if (IS_ERR(filep)) {
+		put_unused_fd(fdno);
+		return PTR_ERR(filep);
+	}
+
+	fd_install(fdno, filep);
+	fault->fault_file = filep;
+	fault->fault_fd = fdno;
+
+	return 0;
+}
+
 static struct hw_pgtable_fault *hw_pagetable_fault_alloc(int eventfd)
 {
 	struct hw_pgtable_fault *fault;
@@ -440,8 +500,14 @@ static struct hw_pgtable_fault *hw_pagetable_fault_alloc(int eventfd)
 		goto out_free;
 	}
 
+	rc = hw_pagetable_get_fault_fd(fault);
+	if (rc)
+		goto out_put_eventfd;
+
 	return fault;
 
+out_put_eventfd:
+	eventfd_ctx_put(fault->trigger);
 out_free:
 	kfree(fault);
 	return ERR_PTR(rc);
-- 
2.34.1



* [RFC PATCHES 12/17] iommufd: Add io page fault response support
  2023-05-30  5:37 [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Lu Baolu
                   ` (10 preceding siblings ...)
  2023-05-30  5:37 ` [RFC PATCHES 11/17] iommufd: Deliver fault messages to user space Lu Baolu
@ 2023-05-30  5:37 ` Lu Baolu
  2023-05-30  5:37 ` [RFC PATCHES 13/17] iommufd: Add a timer for each iommufd fault data Lu Baolu
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Lu Baolu @ 2023-05-30  5:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Nicolin Chen, Yi Liu,
	Jacob Pan
  Cc: iommu, linux-kselftest, virtualization, linux-kernel, Lu Baolu

Extend the IOMMUFD framework to provide a user space API for responding
to page faults.
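
As a hedged illustration of the response path, the helper below fills a
response from a previously read fault. The structures and flag names are
simplified stand-ins for the real uAPI (struct iommu_hwpt_page_response and
the IOMMU_PGFAULT_FLAGS_* bits); the point is only that the grpid, and the
pasid when valid, must echo the fault so the kernel can match it on the
response list.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the uAPI definitions. */
#define PGFAULT_PASID_VALID	(1u << 0)
#define PAGE_RESP_SUCCESS	0u

struct pgfault {
	unsigned int flags;
	unsigned int pasid;
	unsigned int grpid;
};

struct page_response {
	unsigned int flags;
	unsigned int pasid;
	unsigned int grpid;
	unsigned int code;
};

/* Echo the identifying fields of the fault into the response. */
static void build_response(const struct pgfault *f, unsigned int code,
			   struct page_response *r)
{
	memset(r, 0, sizeof(*r));
	r->grpid = f->grpid;
	r->code = code;
	if (f->flags & PGFAULT_PASID_VALID) {
		r->flags |= PGFAULT_PASID_VALID;
		r->pasid = f->pasid;
	}
}
```

The filled structure would then be handed to the kernel, together with the
hwpt and device IDs, via the IOMMU_PAGE_RESPONSE ioctl.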

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/iommufd/iommufd_private.h |  1 +
 include/uapi/linux/iommufd.h            | 23 +++++++++++
 drivers/iommu/iommufd/hw_pagetable.c    | 54 +++++++++++++++++++++++++
 drivers/iommu/iommufd/main.c            |  3 ++
 4 files changed, 81 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 6da0ba9421d0..0985e83a611f 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -288,6 +288,7 @@ void iommufd_hw_pagetable_destroy(struct iommufd_object *obj);
 void iommufd_hw_pagetable_abort(struct iommufd_object *obj);
 int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd);
 int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd);
+int iommufd_hwpt_page_response(struct iommufd_ucmd *ucmd);
 
 static inline void iommufd_hw_pagetable_put(struct iommufd_ctx *ictx,
 					    struct iommufd_hw_pagetable *hwpt)
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 63863e21d043..65bb856dd8fb 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -50,6 +50,7 @@ enum {
 	IOMMUFD_CMD_HWPT_INVALIDATE,
 	IOMMUFD_CMD_DEVICE_SET_DATA,
 	IOMMUFD_CMD_DEVICE_UNSET_DATA,
+	IOMMUFD_CMD_PAGE_RESPONSE,
 };
 
 /**
@@ -779,4 +780,26 @@ struct iommu_device_unset_data {
 	__u32 dev_id;
 };
 #define IOMMU_DEVICE_UNSET_DATA _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DEVICE_UNSET_DATA)
+
+/**
+ * struct iommu_hwpt_page_response - ioctl(IOMMUFD_CMD_PAGE_RESPONSE)
+ * @size: sizeof(struct iommu_hwpt_page_response)
+ * @flags: encodes whether the corresponding fields are valid
+ *         (IOMMU_PGFAULT_FLAGS_* values)
+ * @hwpt_id: hwpt ID of target hardware page table for the response
+ * @dev_id: device ID of target device for the response
+ * @pasid: Process Address Space ID
+ * @grpid: Page Request Group Index
+ * @code: response code from &enum iommu_page_response_code
+ */
+struct iommu_hwpt_page_response {
+	__u32 size;
+	__u32 flags;
+	__u32 hwpt_id;
+	__u32 dev_id;
+	__u32 pasid;
+	__u32 grpid;
+	__u32 code;
+};
+#define IOMMU_PAGE_RESPONSE _IO(IOMMUFD_TYPE, IOMMUFD_CMD_PAGE_RESPONSE)
 #endif
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 09377a98069b..c1f3ebdce796 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -521,3 +521,57 @@ static void hw_pagetable_fault_free(struct hw_pgtable_fault *fault)
 	eventfd_ctx_put(fault->trigger);
 	kfree(fault);
 }
+
+int iommufd_hwpt_page_response(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_hwpt_page_response *cmd = ucmd->cmd;
+	struct iommu_page_response resp = {};
+	struct iommufd_fault *curr, *next;
+	struct iommufd_hw_pagetable *hwpt;
+	struct iommufd_device *idev;
+	int rc = -EINVAL;
+
+	hwpt = iommufd_get_hwpt(ucmd, cmd->hwpt_id);
+	if (IS_ERR(hwpt))
+		return rc;
+
+	if (!hwpt->parent || !hwpt->fault)
+		goto out_put_hwpt;
+
+	idev = iommufd_get_device(ucmd, cmd->dev_id);
+	if (IS_ERR(idev))
+		goto out_put_hwpt;
+
+	mutex_lock(&hwpt->fault->mutex);
+	list_for_each_entry_safe(curr, next, &hwpt->fault->response, item) {
+		if (curr->dev != idev->dev || curr->fault.grpid != cmd->grpid)
+			continue;
+
+		if ((cmd->flags & IOMMU_PGFAULT_FLAGS_PASID_VALID) &&
+		    cmd->pasid != curr->fault.pasid)
+			break;
+
+		if ((curr->fault.flags & IOMMU_PGFAULT_FLAGS_RESP_NEEDS_PASID) &&
+		    !(cmd->flags & IOMMU_PGFAULT_FLAGS_PASID_VALID))
+			break;
+
+		resp.version = IOMMU_PAGE_RESP_VERSION_1;
+		resp.pasid = cmd->pasid;
+		resp.grpid = cmd->grpid;
+		resp.code = cmd->code;
+		if (curr->fault.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID)
+			resp.flags = IOMMU_PAGE_RESP_PASID_VALID;
+
+		rc = iommu_page_response(idev->dev, &resp);
+		list_del_init(&curr->item);
+		kfree(curr);
+		break;
+	}
+	mutex_unlock(&hwpt->fault->mutex);
+
+	iommufd_put_object(&idev->obj);
+out_put_hwpt:
+	iommufd_put_object(&hwpt->obj);
+
+	return rc;
+}
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index bb39dc6f3b27..0c8988808f43 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -279,6 +279,7 @@ union ucmd_buffer {
 	struct iommu_ioas_unmap unmap;
 	struct iommu_option option;
 	struct iommu_vfio_ioas vfio_ioas;
+	struct iommu_hwpt_page_response resp;
 #ifdef CONFIG_IOMMUFD_TEST
 	struct iommu_test_cmd test;
 #endif
@@ -335,6 +336,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 struct iommu_device_set_data, data_len),
 	IOCTL_OP(IOMMU_DEVICE_UNSET_DATA, iommufd_device_unset_data,
 		 struct iommu_device_unset_data, dev_id),
+	IOCTL_OP(IOMMU_PAGE_RESPONSE, iommufd_hwpt_page_response, struct iommu_hwpt_page_response,
+		 code),
 #ifdef CONFIG_IOMMUFD_TEST
 	IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last),
 #endif
-- 
2.34.1



* [RFC PATCHES 13/17] iommufd: Add a timer for each iommufd fault data
  2023-05-30  5:37 [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Lu Baolu
                   ` (11 preceding siblings ...)
  2023-05-30  5:37 ` [RFC PATCHES 12/17] iommufd: Add io page fault response support Lu Baolu
@ 2023-05-30  5:37 ` Lu Baolu
  2023-05-30  5:37 ` [RFC PATCHES 14/17] iommufd: Drain all pending faults when destroying hwpt Lu Baolu
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Lu Baolu @ 2023-05-30  5:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Nicolin Chen, Yi Liu,
	Jacob Pan
  Cc: iommu, linux-kselftest, virtualization, linux-kernel, Lu Baolu

User space may fail to read or respond to pending faults. Because the
per-fault iommufd data can be accessed from two different contexts (user
space reading/responding and the timer expiring), add a reference counter
to each iommufd fault data and free the data only after all references
have been released.

The page fault response timeout value is device-specific and indicates how
long the bus/device will wait for a response to a page fault request. The
timeout value is added to the per-device fault cookie. Ideally, it should
be calculated from the platform configuration (PCI, ACPI, device tree,
etc.). This patch defines a default value of 1 second for cases where no
platform opt-in is available. This default is a rough estimate and subject
to change based on real use cases.
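
The two-owner lifecycle described above can be modeled in a few lines of
plain C (a sketch only; the kernel uses refcount_t and kfree(), and the
names here are made up). Each fault starts with two references, one for the
read/respond path and one for the timer, and whichever side drops the last
reference frees the data.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy model of the per-fault reference counting; "freed" stands in
 * for kfree() so the lifecycle can be observed. */
struct fault_obj {
	atomic_int users;
	int freed;
};

static void fault_init(struct fault_obj *f)
{
	atomic_store(&f->users, 2);	/* reader + timer */
	f->freed = 0;
}

static void fault_put(struct fault_obj *f)
{
	/* fetch_sub returns the old value: 1 means this was the last ref */
	if (atomic_fetch_sub(&f->users, 1) == 1)
		f->freed = 1;		/* kfree() in the kernel */
}
```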

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/iommufd/iommufd_private.h |  8 +++
 drivers/iommu/iommufd/device.c          |  3 +
 drivers/iommu/iommufd/hw_pagetable.c    | 80 +++++++++++++++++++++++--
 3 files changed, 87 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 0985e83a611f..f5b8a53044c4 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -249,9 +249,12 @@ struct hw_pgtable_fault {
 struct iommufd_fault {
 	struct device *dev;
 	ioasid_t pasid;
+	struct iommufd_hw_pagetable *hwpt;
 	struct iommu_hwpt_pgfault fault;
 	/* List head at hw_pgtable_fault:deliver or response */
 	struct list_head item;
+	struct timer_list timer;
+	refcount_t users;
 };
 
 /*
@@ -336,6 +339,11 @@ struct iommufd_device {
 
 struct iommufd_fault_cookie {
 	struct iommufd_device *idev;
+	/*
+	 * The maximum number of milliseconds that a device will wait for a
+	 * response to a page fault request.
+	 */
+	unsigned long timeout;
 };
 
 static inline struct iommufd_device *
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 3408f1fc3e9f..6ad46638f4e1 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -374,6 +374,8 @@ static int iommufd_group_setup_msi(struct iommufd_group *igroup,
 	return 0;
 }
 
+#define IOMMUFD_DEFAULT_IOPF_TIMEOUT	1000
+
 static int iommufd_device_set_fault_cookie(struct iommufd_hw_pagetable *hwpt,
 					   struct iommufd_device *idev,
 					   ioasid_t pasid)
@@ -387,6 +389,7 @@ static int iommufd_device_set_fault_cookie(struct iommufd_hw_pagetable *hwpt,
 	if (!fcookie)
 		return -ENOMEM;
 	fcookie->idev = idev;
+	fcookie->timeout = IOMMUFD_DEFAULT_IOPF_TIMEOUT;
 
 	curr = iommu_set_device_fault_cookie(idev->dev, pasid, fcookie);
 	if (IS_ERR(curr)) {
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index c1f3ebdce796..8c441fd72e1f 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -6,6 +6,7 @@
 #include <linux/eventfd.h>
 #include <linux/file.h>
 #include <linux/anon_inodes.h>
+#include <linux/timer.h>
 #include <uapi/linux/iommufd.h>
 
 #include "../iommu-priv.h"
@@ -396,6 +397,60 @@ static void iommufd_compose_fault_message(struct iommu_fault *fault,
 	hwpt_fault->private_data[1] = fault->prm.private_data[1];
 }
 
+static void drain_iopf_fault(struct iommufd_fault *ifault)
+{
+	struct iommu_page_response resp = {
+		.version	= IOMMU_PAGE_RESP_VERSION_1,
+		.pasid		= ifault->fault.pasid,
+		.grpid		= ifault->fault.grpid,
+		.code		= IOMMU_PAGE_RESP_FAILURE,
+	};
+
+	if (!(ifault->fault.flags & IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE))
+		return;
+
+	if ((ifault->fault.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) &&
+	    (ifault->fault.flags & IOMMU_FAULT_PAGE_RESPONSE_NEEDS_PASID))
+		resp.flags = IOMMU_PAGE_RESP_PASID_VALID;
+
+	iommu_page_response(ifault->dev, &resp);
+}
+
+static void iommufd_put_fault(struct iommufd_fault *ifault)
+{
+	if (!ifault)
+		return;
+
+	if (refcount_dec_and_test(&ifault->users))
+		kfree(ifault);
+}
+
+static int iommufd_fault_timer_teardown(struct iommufd_fault *ifault)
+{
+	int rc;
+
+	rc = timer_delete(&ifault->timer);
+	if (rc)
+		iommufd_put_fault(ifault);
+
+	return rc;
+}
+
+static void iopf_timer_func(struct timer_list *t)
+{
+	struct iommufd_fault *ifault = from_timer(ifault, t, timer);
+	struct hw_pgtable_fault *fault = ifault->hwpt->fault;
+
+	mutex_lock(&fault->mutex);
+	if (!list_empty(&ifault->item)) {
+		list_del_init(&ifault->item);
+		drain_iopf_fault(ifault);
+	}
+	mutex_unlock(&fault->mutex);
+
+	iommufd_put_fault(ifault);
+}
+
 static enum iommu_page_response_code
 iommufd_hw_pagetable_iopf_handler(struct iommu_fault *fault,
 				  struct device *dev, void *data)
@@ -416,6 +471,10 @@ iommufd_hw_pagetable_iopf_handler(struct iommu_fault *fault,
 	iommufd_compose_fault_message(fault, &ifault->fault, cookie->idev->obj.id);
 	ifault->dev = dev;
 	ifault->pasid = fault->prm.pasid;
+	ifault->hwpt = hwpt;
+	refcount_set(&ifault->users, 2);
+	timer_setup(&ifault->timer, iopf_timer_func, 0);
+	mod_timer(&ifault->timer, jiffies + msecs_to_jiffies(cookie->timeout));
 
 	mutex_lock(&hwpt->fault->mutex);
 	list_add_tail(&ifault->item, &hwpt->fault->deliver);
@@ -443,10 +502,12 @@ static ssize_t hwpt_fault_fops_read(struct file *filep, char __user *buf,
 			break;
 		done += fault_size;
 		list_del_init(&ifault->item);
-		if (ifault->fault.flags & IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE)
+		if (ifault->fault.flags & IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE) {
 			list_add_tail(&ifault->item, &fault->response);
-		else
-			kfree(ifault);
+		} else {
+			iommufd_fault_timer_teardown(ifault);
+			iommufd_put_fault(ifault);
+		}
 	}
 	mutex_unlock(&fault->mutex);
 
@@ -526,6 +587,7 @@ int iommufd_hwpt_page_response(struct iommufd_ucmd *ucmd)
 {
 	struct iommu_hwpt_page_response *cmd = ucmd->cmd;
 	struct iommu_page_response resp = {};
+	struct iommufd_fault *ifault = NULL;
 	struct iommufd_fault *curr, *next;
 	struct iommufd_hw_pagetable *hwpt;
 	struct iommufd_device *idev;
@@ -547,6 +609,7 @@ int iommufd_hwpt_page_response(struct iommufd_ucmd *ucmd)
 		if (curr->dev != idev->dev || curr->fault.grpid != cmd->grpid)
 			continue;
 
+		ifault = curr;
 		if ((cmd->flags & IOMMU_PGFAULT_FLAGS_PASID_VALID) &&
 		    cmd->pasid != curr->fault.pasid)
 			break;
@@ -555,6 +618,15 @@ int iommufd_hwpt_page_response(struct iommufd_ucmd *ucmd)
 		    !(cmd->flags & IOMMU_PGFAULT_FLAGS_PASID_VALID))
 			break;
 
+		/*
+		 * The timer has expired if it was not pending. Leave the
+		 * response to the timer function.
+		 */
+		if (!iommufd_fault_timer_teardown(curr)) {
+			rc = -ETIMEDOUT;
+			break;
+		}
+
 		resp.version = IOMMU_PAGE_RESP_VERSION_1;
 		resp.pasid = cmd->pasid;
 		resp.grpid = cmd->grpid;
@@ -564,11 +636,11 @@ int iommufd_hwpt_page_response(struct iommufd_ucmd *ucmd)
 
 		rc = iommu_page_response(idev->dev, &resp);
 		list_del_init(&curr->item);
-		kfree(curr);
 		break;
 	}
 	mutex_unlock(&hwpt->fault->mutex);
 
+	iommufd_put_fault(ifault);
 	iommufd_put_object(&idev->obj);
 out_put_hwpt:
 	iommufd_put_object(&hwpt->obj);
-- 
2.34.1



* [RFC PATCHES 14/17] iommufd: Drain all pending faults when destroying hwpt
  2023-05-30  5:37 [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Lu Baolu
                   ` (12 preceding siblings ...)
  2023-05-30  5:37 ` [RFC PATCHES 13/17] iommufd: Add a timer for each iommufd fault data Lu Baolu
@ 2023-05-30  5:37 ` Lu Baolu
  2023-05-30  5:37 ` [RFC PATCHES 15/17] iommufd: Allow new hwpt_alloc flags Lu Baolu
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Lu Baolu @ 2023-05-30  5:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Nicolin Chen, Yi Liu,
	Jacob Pan
  Cc: iommu, linux-kselftest, virtualization, linux-kernel, Lu Baolu

When a HWPT is unexpectedly destroyed, drain all faults on the pending
lists. This is safe because the iommu domain has been released, so no new
IO page faults can arrive.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/iommufd/hw_pagetable.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 8c441fd72e1f..7f18e6bd76ec 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -574,11 +574,26 @@ static struct hw_pgtable_fault *hw_pagetable_fault_alloc(int eventfd)
 	return ERR_PTR(rc);
 }
 
+static void iommufd_fault_list_destroy(struct hw_pgtable_fault *fault,
+				       struct list_head *list)
+{
+	struct iommufd_fault *ifault;
+
+	mutex_lock(&fault->mutex);
+	while (!list_empty(list)) {
+		ifault = list_first_entry(list, struct iommufd_fault, item);
+		if (iommufd_fault_timer_teardown(ifault))
+			drain_iopf_fault(ifault);
+		list_del_init(&ifault->item);
+		iommufd_put_fault(ifault);
+	}
+	mutex_unlock(&fault->mutex);
+}
+
 static void hw_pagetable_fault_free(struct hw_pgtable_fault *fault)
 {
-	WARN_ON(!list_empty(&fault->deliver));
-	WARN_ON(!list_empty(&fault->response));
-
+	iommufd_fault_list_destroy(fault, &fault->deliver);
+	iommufd_fault_list_destroy(fault, &fault->response);
 	eventfd_ctx_put(fault->trigger);
 	kfree(fault);
 }
-- 
2.34.1



* [RFC PATCHES 15/17] iommufd: Allow new hwpt_alloc flags
  2023-05-30  5:37 [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Lu Baolu
                   ` (13 preceding siblings ...)
  2023-05-30  5:37 ` [RFC PATCHES 14/17] iommufd: Drain all pending faults when destroying hwpt Lu Baolu
@ 2023-05-30  5:37 ` Lu Baolu
  2023-05-30  5:37 ` [RFC PATCHES 16/17] iommufd/selftest: Add IOPF feature for mock devices Lu Baolu
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Lu Baolu @ 2023-05-30  5:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Nicolin Chen, Yi Liu,
	Jacob Pan
  Cc: iommu, linux-kselftest, virtualization, linux-kernel, Lu Baolu

All the pieces for IO page fault delivery are now in place. Allow user
space to opt in to the IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE and
IOMMU_HWPT_ALLOC_FLAGS_USER_PASID_TABLE flags.
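
The validation in the hunk below follows the usual uAPI idiom of masking
against the set of known flags so that unknown bits fail cleanly. A minimal
sketch of the idiom, with stand-in names:

```c
#include <assert.h>

#define FLAG_IOPF_CAPABLE	(1u << 0)
#define FLAG_USER_PASID_TABLE	(1u << 1)
#define FLAGS_ALL		(FLAG_IOPF_CAPABLE | FLAG_USER_PASID_TABLE)

/* Reject any flag bit outside the supported mask so that newer user
 * space fails cleanly on kernels that don't know the bit. */
static int check_flags(unsigned int flags)
{
	return (flags & ~FLAGS_ALL) ? -1 : 0;
}
```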

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 include/uapi/linux/iommufd.h         | 3 +++
 drivers/iommu/iommufd/hw_pagetable.c | 2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 65bb856dd8fb..908d12219727 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -491,6 +491,9 @@ struct iommu_hwpt_alloc {
 	__u32 flags;
 #define IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE		(1 << 0)
 #define IOMMU_HWPT_ALLOC_FLAGS_USER_PASID_TABLE		(1 << 1)
+#define IOMMU_HWPT_ALLOC_FLAGS_ALL				\
+		(IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE |		\
+		 IOMMU_HWPT_ALLOC_FLAGS_USER_PASID_TABLE)
 	__u32 dev_id;
 	__u32 pt_id;
 	__u32 out_hwpt_id;
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 7f18e6bd76ec..7be6bf26290f 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -208,7 +208,7 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
 	int klen = 0;
 	int rc = 0;
 
-	if (cmd->flags || cmd->__reserved)
+	if ((cmd->flags & ~IOMMU_HWPT_ALLOC_FLAGS_ALL) || cmd->__reserved)
 		return -EOPNOTSUPP;
 
 	idev = iommufd_get_device(ucmd, cmd->dev_id);
-- 
2.34.1



* [RFC PATCHES 16/17] iommufd/selftest: Add IOPF feature for mock devices
  2023-05-30  5:37 [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Lu Baolu
                   ` (14 preceding siblings ...)
  2023-05-30  5:37 ` [RFC PATCHES 15/17] iommufd: Allow new hwpt_alloc flags Lu Baolu
@ 2023-05-30  5:37 ` Lu Baolu
  2023-05-30  5:37 ` [RFC PATCHES 17/17] iommufd/selftest: Cover iopf-capable nested hwpt Lu Baolu
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Lu Baolu @ 2023-05-30  5:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Nicolin Chen, Yi Liu,
	Jacob Pan
  Cc: iommu, linux-kselftest, virtualization, linux-kernel, Lu Baolu

Add the IOPF feature for mock devices so that the delivery of IO page
faults through the IOMMU can be tested with the selftest infrastructure.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/iommufd/selftest.c | 71 ++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)

diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index debf2d588990..d3d3342e95b6 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -14,6 +14,7 @@
 #include "io_pagetable.h"
 #include "iommufd_private.h"
 #include "iommufd_test.h"
+#include "../io-pgfault.h"
 
 static DECLARE_FAULT_ATTR(fail_iommufd);
 static struct dentry *dbgfs_root;
@@ -96,6 +97,8 @@ enum selftest_obj_type {
 struct mock_dev {
 	struct device dev;
 	u32 dev_data;
+	unsigned char iopfq_name[16];
+	struct iopf_queue *iopf_queue;
 };
 
 struct selftest_obj {
@@ -360,6 +363,64 @@ static int mock_domain_user_data_len(u32 hwpt_type)
 	return sizeof(struct iommu_hwpt_selftest);
 };
 
+static int mock_dev_enable_feat(struct device *dev, enum iommu_dev_features feat)
+{
+	struct mock_dev *mdev = container_of(dev, struct mock_dev, dev);
+	struct iommu_group *group;
+	int rc;
+
+	if (feat != IOMMU_DEV_FEAT_IOPF)
+		return -EOPNOTSUPP;
+
+	group = iommu_group_get(dev);
+	if (!group)
+		return -ENODEV;
+
+	/* Allocate the iopf queue: */
+	snprintf(mdev->iopfq_name, sizeof(mdev->iopfq_name),
+		 "mock%d-iopfq", iommu_group_id(group));
+	mdev->iopf_queue = iopf_queue_alloc(mdev->iopfq_name);
+	if (!mdev->iopf_queue) {
+		rc = -ENOMEM;
+		goto err_put_group;
+	}
+
+	/* Register I/O page fault: */
+	rc = iopf_queue_add_device(mdev->iopf_queue, &mdev->dev);
+	if (rc)
+		goto err_free_queue;
+	rc = iommu_register_device_fault_handler(&mdev->dev, iommu_queue_iopf,
+						 &mdev->dev);
+	if (rc)
+		goto err_remove_device;
+
+	iommu_group_put(group);
+
+	return 0;
+
+err_remove_device:
+	iopf_queue_remove_device(mdev->iopf_queue, &mdev->dev);
+err_free_queue:
+	iopf_queue_free(mdev->iopf_queue);
+err_put_group:
+	iommu_group_put(group);
+	return rc;
+}
+
+static int mock_dev_disable_feat(struct device *dev, enum iommu_dev_features feat)
+{
+	struct mock_dev *mdev = container_of(dev, struct mock_dev, dev);
+
+	if (feat != IOMMU_DEV_FEAT_IOPF)
+		return -EOPNOTSUPP;
+
+	iommu_unregister_device_fault_handler(dev);
+	iopf_queue_remove_device(mdev->iopf_queue, dev);
+	iopf_queue_free(mdev->iopf_queue);
+
+	return 0;
+}
+
 static const struct iommu_ops mock_ops = {
 	.owner = THIS_MODULE,
 	.pgsize_bitmap = MOCK_IO_PAGE_SIZE,
@@ -373,6 +434,8 @@ static const struct iommu_ops mock_ops = {
 	.set_dev_user_data = mock_domain_set_dev_user_data,
 	.unset_dev_user_data = mock_domain_unset_dev_user_data,
 	.dev_user_data_len = sizeof(struct iommu_test_device_data),
+	.dev_enable_feat = mock_dev_enable_feat,
+	.dev_disable_feat = mock_dev_disable_feat,
 	.default_domain_ops =
 		&(struct iommu_domain_ops){
 			.free = mock_domain_free,
@@ -494,9 +557,16 @@ static struct mock_dev *mock_dev_create(void)
 	rc = iommu_group_add_device(iommu_group, &mdev->dev);
 	if (rc)
 		goto err_del;
+
+	rc = iommu_dev_enable_feature(&mdev->dev, IOMMU_DEV_FEAT_IOPF);
+	if (rc)
+		goto err_remove;
+
 	iommu_group_put(iommu_group);
 	return mdev;
 
+err_remove:
+	iommu_group_remove_device(&mdev->dev);
 err_del:
 	device_del(&mdev->dev);
 err_dev_iommu:
@@ -511,6 +581,7 @@ static struct mock_dev *mock_dev_create(void)
 
 static void mock_dev_destroy(struct mock_dev *mdev)
 {
+	iommu_dev_disable_feature(&mdev->dev, IOMMU_DEV_FEAT_IOPF);
 	iommu_group_remove_device(&mdev->dev);
 	device_del(&mdev->dev);
 	kfree(mdev->dev.iommu);
-- 
2.34.1



* [RFC PATCHES 17/17] iommufd/selftest: Cover iopf-capable nested hwpt
  2023-05-30  5:37 [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Lu Baolu
                   ` (15 preceding siblings ...)
  2023-05-30  5:37 ` [RFC PATCHES 16/17] iommufd/selftest: Add IOPF feature for mock devices Lu Baolu
@ 2023-05-30  5:37 ` Lu Baolu
  2023-05-30 18:50 ` [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Nicolin Chen
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Lu Baolu @ 2023-05-30  5:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Nicolin Chen, Yi Liu,
	Jacob Pan
  Cc: iommu, linux-kselftest, virtualization, linux-kernel, Lu Baolu

The coverage includes operations to allocate, destroy, and replace an
iopf-capable nested HWPT.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 tools/testing/selftests/iommu/iommufd_utils.h | 20 ++++++++++++++++---
 tools/testing/selftests/iommu/iommufd.c       | 17 +++++++++++++++-
 2 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index 487d45c29c6d..613ee7ef8af8 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -137,7 +137,8 @@ static int _test_cmd_hwpt_alloc(int fd, __u32 device_id, __u32 pt_id,
 	})
 
 static int _test_cmd_hwpt_alloc_nested(int fd, __u32 device_id, __u32 parent_id,
-				       __u32 *hwpt_id)
+				       __u32 event_fd, __u32 *hwpt_id,
+				       __u32 *out_fault_fd)
 {
 	struct iommu_hwpt_selftest data = {
 		.flags = IOMMU_TEST_FLAG_NESTED,
@@ -153,21 +154,34 @@ static int _test_cmd_hwpt_alloc_nested(int fd, __u32 device_id, __u32 parent_id,
 	};
 	int ret;
 
+	if (out_fault_fd) {
+		cmd.event_fd = event_fd;
+		cmd.flags |= (IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE |
+			      IOMMU_HWPT_ALLOC_FLAGS_USER_PASID_TABLE);
+	}
+
 	ret = ioctl(fd, IOMMU_HWPT_ALLOC, &cmd);
 	if (ret)
 		return ret;
 	if (hwpt_id)
 		*hwpt_id = cmd.out_hwpt_id;
+	if (out_fault_fd)
+		*out_fault_fd = cmd.out_fault_fd;
 	return 0;
 }
 
 #define test_cmd_hwpt_alloc_nested(device_id, parent_id, hwpt_id)     \
 	ASSERT_EQ(0, _test_cmd_hwpt_alloc_nested(self->fd, device_id, \
-						 parent_id, hwpt_id))
+						 parent_id, 0, hwpt_id, NULL))
+#define test_cmd_hwpt_alloc_iopf(device_id, parent_id, event_fd,	\
+				 hwpt_id, out_fault_fd)			\
+	ASSERT_EQ(0, _test_cmd_hwpt_alloc_nested(self->fd, device_id,	\
+						 parent_id, event_fd,	\
+						 hwpt_id, out_fault_fd))
 #define test_err_cmd_hwpt_alloc_nested(_errno, device_id, parent_id, hwpt_id) \
 	EXPECT_ERRNO(_errno,                                                  \
 		     _test_cmd_hwpt_alloc_nested(self->fd, device_id,         \
-						 parent_id, hwpt_id))
+						 parent_id, 0, hwpt_id, NULL))
 
 static int _test_cmd_hwpt_invalidate(int fd, __u32 hwpt_id)
 {
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 2987e8603418..6bf99172a8e9 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -294,7 +294,9 @@ TEST_F(iommufd_ioas, nested_hwpt_alloc)
 {
 	uint32_t nested_hwpt_id[2] = {};
 	uint32_t parent_hwpt_id = 0;
+	uint32_t event_fd, fault_fd;
 	uint32_t test_hwpt_id = 0;
+	uint32_t iopf_hwpt_id = 0;
 
 	if (self->device_id) {
 		/* Negative tests */
@@ -316,6 +318,12 @@ TEST_F(iommufd_ioas, nested_hwpt_alloc)
 		test_cmd_hwpt_check_iotlb(nested_hwpt_id[1],
 					  IOMMU_TEST_IOTLB_DEFAULT);
 
+		/* Allocate and destroy iopf capable nested hwpt */
+		event_fd = eventfd(0, EFD_CLOEXEC);
+		ASSERT_NE(-1, event_fd);
+		test_cmd_hwpt_alloc_iopf(self->device_id, parent_hwpt_id,
+					 event_fd, &iopf_hwpt_id, &fault_fd);
+
 		/* Negative test: a nested hwpt on top of a nested hwpt */
 		test_err_cmd_hwpt_alloc_nested(EINVAL, self->device_id,
 					       nested_hwpt_id[0],
@@ -344,9 +352,16 @@ TEST_F(iommufd_ioas, nested_hwpt_alloc)
 			     _test_ioctl_destroy(self->fd, nested_hwpt_id[1]));
 		test_ioctl_destroy(nested_hwpt_id[0]);
 
+		/* Switch from nested_hwpt_id[1] to iopf hwpt */
+		test_cmd_mock_domain_replace(self->stdev_id, iopf_hwpt_id);
+		EXPECT_ERRNO(EBUSY,
+			     _test_ioctl_destroy(self->fd, iopf_hwpt_id));
+		test_ioctl_destroy(nested_hwpt_id[1]);
+
 		/* Detach from nested_hwpt_id[1] and destroy it */
 		test_cmd_mock_domain_replace(self->stdev_id, parent_hwpt_id);
-		test_ioctl_destroy(nested_hwpt_id[1]);
+		test_ioctl_destroy(iopf_hwpt_id);
+		close(event_fd);
 
 		/* Detach from the parent hw_pagetable and destroy it */
 		test_cmd_mock_domain_replace(self->stdev_id, self->ioas_id);
-- 
2.34.1



* Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space
  2023-05-30  5:37 [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Lu Baolu
                   ` (16 preceding siblings ...)
  2023-05-30  5:37 ` [RFC PATCHES 17/17] iommufd/selftest: Cover iopf-capable nested hwpt Lu Baolu
@ 2023-05-30 18:50 ` Nicolin Chen
  2023-05-31  2:10   ` Baolu Lu
  2023-06-25  6:30   ` Baolu Lu
  2023-05-31  0:33 ` Jason Gunthorpe
  2023-06-16 11:32 ` Jean-Philippe Brucker
  19 siblings, 2 replies; 37+ messages in thread
From: Nicolin Chen @ 2023-05-30 18:50 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Yi Liu, Jacob Pan, iommu,
	linux-kselftest, virtualization, linux-kernel

Hi Baolu,

On Tue, May 30, 2023 at 01:37:07PM +0800, Lu Baolu wrote:
 
> This series implements the functionality of delivering IO page faults to
> user space through the IOMMUFD framework. The use case is nested
> translation, where modern IOMMU hardware supports two-stage translation
> tables. The second-stage translation table is managed by the host VMM
> while the first-stage translation table is owned by the user space.
> Hence, any IO page fault that occurs on the first-stage page table
> should be delivered to the user space and handled there. The user space
> should respond with the page fault handling result to the device top-down
> through the IOMMUFD response uAPI.
> 
> User space indicates its capability of handling IO page faults by setting
> a user HWPT allocation flag IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE. IOMMUFD
> will then set up its infrastructure for page fault delivery. Together
> with the iopf-capable flag, user space should also provide an eventfd
> where it will listen for any bottom-up page fault messages.
> 
> On a successful return of the allocation of iopf-capable HWPT, a fault
> fd will be returned. User space can open and read fault messages from it
> once the eventfd is signaled.

I think that, whether the guest has an IOPF capability or not,
the host should always forward any stage-1 fault/error back to
the guest. Yet, the implementation of this series builds on
the IOPF framework, which doesn't report IOMMU_FAULT_DMA_UNRECOV.

And I have my doubts about using the IOPF framework with that
IOMMU_PAGE_RESP_ASYNC flag: the point of the IOPF framework is
its bottom-half workqueue, because a page response could take
a long cycle. But adding that flag feels like we don't really
need the bottom-half workqueue, i.e. it defeats the point of using
the IOPF framework, IMHO.

Combining the two facts above, I wonder if we really need to
go through the IOPF framework; can't we just register a user
fault handler in the iommufd directly upon a valid event_fd?

Thanks
Nicolin


* Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space
  2023-05-30  5:37 [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Lu Baolu
                   ` (17 preceding siblings ...)
  2023-05-30 18:50 ` [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Nicolin Chen
@ 2023-05-31  0:33 ` Jason Gunthorpe
  2023-05-31  3:17   ` Baolu Lu
  2023-06-23  6:18   ` Baolu Lu
  2023-06-16 11:32 ` Jean-Philippe Brucker
  19 siblings, 2 replies; 37+ messages in thread
From: Jason Gunthorpe @ 2023-05-31  0:33 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Kevin Tian, Joerg Roedel, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Nicolin Chen, Yi Liu, Jacob Pan, iommu,
	linux-kselftest, virtualization, linux-kernel

On Tue, May 30, 2023 at 01:37:07PM +0800, Lu Baolu wrote:
> Hi folks,
> 
> This series implements the functionality of delivering IO page faults to
> user space through the IOMMUFD framework. The use case is nested
> translation, where modern IOMMU hardware supports two-stage translation
> tables. The second-stage translation table is managed by the host VMM
> while the first-stage translation table is owned by the user space.
> Hence, any IO page fault that occurs on the first-stage page table
> should be delivered to the user space and handled there. The user space
> should respond with the page fault handling result to the device top-down
> through the IOMMUFD response uAPI.
> 
> User space indicates its capability of handling IO page faults by setting
> a user HWPT allocation flag IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE. IOMMUFD
> will then set up its infrastructure for page fault delivery. Together
> with the iopf-capable flag, user space should also provide an eventfd
> where it will listen for any bottom-up page fault messages.
> 
> On a successful return of the allocation of iopf-capable HWPT, a fault
> fd will be returned. User space can open and read fault messages from it
> once the eventfd is signaled.

This is a performance path so we really need to think about this more,
polling on an eventfd and then reading a different fd is not a good
design.

What I would like is to have a design from the start that fits into
io_uring, so we can have pre-posted 'recvs' in io_uring that just get
completed at high speed when PRIs come in.

This suggests that the PRI should be delivered via read() on a single
FD and pollability on the single FD without any eventfd.

> Besides the overall design, I'd like to hear comments about below
> designs:
> 
> - The IOMMUFD fault message format. It is very similar to that in
>   uapi/linux/iommu which has been discussed before and partially used by
>   the IOMMU SVA implementation. I'd like to get more comments on the
>   format when it comes to IOMMUFD.

We have to have the same discussion as always: does a generic fault
message format make any sense here?

PRI seems more likely that it would, but it needs a big, careful
cross-vendor checkout.

Jason


* Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space
  2023-05-30 18:50 ` [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Nicolin Chen
@ 2023-05-31  2:10   ` Baolu Lu
  2023-05-31  4:12     ` Nicolin Chen
  2023-06-25  6:30   ` Baolu Lu
  1 sibling, 1 reply; 37+ messages in thread
From: Baolu Lu @ 2023-05-31  2:10 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: baolu.lu, Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Yi Liu, Jacob Pan, iommu,
	linux-kselftest, virtualization, linux-kernel

On 5/31/23 2:50 AM, Nicolin Chen wrote:
> Hi Baolu,

Hi Nicolin,

> 
> On Tue, May 30, 2023 at 01:37:07PM +0800, Lu Baolu wrote:
>   
>> This series implements the functionality of delivering IO page faults to
>> user space through the IOMMUFD framework. The use case is nested
>> translation, where modern IOMMU hardware supports two-stage translation
>> tables. The second-stage translation table is managed by the host VMM
>> while the first-stage translation table is owned by the user space.
>> Hence, any IO page fault that occurs on the first-stage page table
>> should be delivered to the user space and handled there. The user space
>> should respond with the page fault handling result to the device top-down
>> through the IOMMUFD response uAPI.
>>
>> User space indicates its capability of handling IO page faults by setting
>> a user HWPT allocation flag IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE. IOMMUFD
>> will then set up its infrastructure for page fault delivery. Together
>> with the iopf-capable flag, user space should also provide an eventfd
>> where it will listen for any bottom-up page fault messages.
>>
>> On a successful return of the allocation of iopf-capable HWPT, a fault
>> fd will be returned. User space can open and read fault messages from it
>> once the eventfd is signaled.
> 
> I think that, whether the guest has an IOPF capability or not,
> the host should always forward any stage-1 fault/error back to
> the guest. Yet, the implementation of this series builds on
> the IOPF framework, which doesn't report IOMMU_FAULT_DMA_UNRECOV.

I agree with you that DMA unrecoverable faults on stage-1 hwpt should
also be reported to user space. However, I have some concerns about how
this will be implemented.

In the shadow page table case, we don't report DMA unrecoverable faults.
This could lead to confusion for users, as they may expect to receive
DMA unrecoverable faults regardless of whether hardware nested
translation is used.

I would suggest that we report DMA unrecoverable faults in all cases,
regardless of whether hardware nested translation is used. This would
make it easier for users to understand the behavior of their systems.

> 
> And I have my doubts about using the IOPF framework with that
> IOMMU_PAGE_RESP_ASYNC flag: the point of the IOPF framework is
> its bottom-half workqueue, because a page response could take
> a long cycle. But adding that flag feels like we don't really
> need the bottom-half workqueue, i.e. it defeats the point of using
> the IOPF framework, IMHO.
> 
> Combining the two facts above, I wonder if we really need to
> go through the IOPF framework; can't we just register a user
> fault handler in the iommufd directly upon a valid event_fd?

I agree with you that the existing IOPF framework is not ideal for
IOMMUFD. Adding the ASYNC flag conflicts with the IOPF workqueue.
This could lead to performance issues.

I can improve the IOPF framework to make it more friendly to IOMMUFD.
One way to do this would be to not use the workqueue for the IOMMUFD case.

Have I covered all your concerns?

Best regards,
baolu


* Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space
  2023-05-31  0:33 ` Jason Gunthorpe
@ 2023-05-31  3:17   ` Baolu Lu
  2023-06-23  6:18   ` Baolu Lu
  1 sibling, 0 replies; 37+ messages in thread
From: Baolu Lu @ 2023-05-31  3:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: baolu.lu, Kevin Tian, Joerg Roedel, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Nicolin Chen, Yi Liu, Jacob Pan, iommu,
	linux-kselftest, virtualization, linux-kernel

On 5/31/23 8:33 AM, Jason Gunthorpe wrote:
> On Tue, May 30, 2023 at 01:37:07PM +0800, Lu Baolu wrote:
>> Hi folks,
>>
>> This series implements the functionality of delivering IO page faults to
>> user space through the IOMMUFD framework. The use case is nested
>> translation, where modern IOMMU hardware supports two-stage translation
>> tables. The second-stage translation table is managed by the host VMM
>> while the first-stage translation table is owned by the user space.
>> Hence, any IO page fault that occurs on the first-stage page table
>> should be delivered to the user space and handled there. The user space
>> should respond with the page fault handling result to the device top-down
>> through the IOMMUFD response uAPI.
>>
>> User space indicates its capability of handling IO page faults by setting
>> a user HWPT allocation flag IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE. IOMMUFD
>> will then set up its infrastructure for page fault delivery. Together
>> with the iopf-capable flag, user space should also provide an eventfd
>> where it will listen for any bottom-up page fault messages.
>>
>> On a successful return of the allocation of iopf-capable HWPT, a fault
>> fd will be returned. User space can open and read fault messages from it
>> once the eventfd is signaled.
> 
> This is a performance path so we really need to think about this more,
> polling on an eventfd and then reading a different fd is not a good
> design.
> 
> What I would like is to have a design from the start that fits into
> io_uring, so we can have pre-posted 'recvs' in io_uring that just get
> completed at high speed when PRIs come in.
> 
> This suggests that the PRI should be delivered via read() on a single
> FD and pollability on the single FD without any eventfd.

Good suggestion. I will head in this direction.

>> Besides the overall design, I'd like to hear comments about below
>> designs:
>>
>> - The IOMMUFD fault message format. It is very similar to that in
>>    uapi/linux/iommu which has been discussed before and partially used by
>>    the IOMMU SVA implementation. I'd like to get more comments on the
>>    format when it comes to IOMMUFD.
> 
> We have to have the same discussion as always: does a generic fault
> message format make any sense here?
> 
> PRI seems more likely that it would, but it needs a big, careful
> cross-vendor checkout.

Yeah, good point.

As far as I can see, there are at least three types of IOPF hardware
implementations.

- PCI/PRI: Vendors might have their own additions. For example, VT-d 3.0
   allows root-complex integrated endpoints to carry device specific
   private data in their page requests. This has been removed from the
   spec since v4.0.

- DMA stalls.

- Device-specific (non-PRI, not through IOMMU).

Does IOMMUFD want to support the last case?

Best regards,
baolu


* Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space
  2023-05-31  2:10   ` Baolu Lu
@ 2023-05-31  4:12     ` Nicolin Chen
  0 siblings, 0 replies; 37+ messages in thread
From: Nicolin Chen @ 2023-05-31  4:12 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Yi Liu, Jacob Pan, iommu,
	linux-kselftest, virtualization, linux-kernel

On Wed, May 31, 2023 at 10:10:15AM +0800, Baolu Lu wrote:

> I agree with you that the existing IOPF framework is not ideal for
> IOMMUFD. Adding the ASYNC flag conflicts with the IOPF workqueue.
> This could lead to performance issues.
> 
> I can improve the IOPF framework to make it more friendly to IOMMUFD.
> One way to do this would be to not use the workqueue for the IOMMUFD case.
> 
> Have I covered all your concerns?

Yea. My concern was mainly about the fault reporting for non-PRI cases.
Though I am still on the fence about using the IOPF framework, let's
first see what the improved design would look like.

Thanks
Nic


* Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space
  2023-05-30  5:37 [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Lu Baolu
                   ` (18 preceding siblings ...)
  2023-05-31  0:33 ` Jason Gunthorpe
@ 2023-06-16 11:32 ` Jean-Philippe Brucker
  2023-06-19  3:35   ` Baolu Lu
  2023-06-19 12:58   ` Jason Gunthorpe
  19 siblings, 2 replies; 37+ messages in thread
From: Jean-Philippe Brucker @ 2023-06-16 11:32 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Nicolin Chen, Yi Liu, Jacob Pan, iommu,
	linux-kselftest, virtualization, linux-kernel

Hi Baolu,

On Tue, May 30, 2023 at 01:37:07PM +0800, Lu Baolu wrote:
> - The timeout value for the pending page fault messages. Ideally we
>   should determine the timeout value from the device configuration, but
>   I failed to find any statement in the PCI specification (version 6.x).
>   A default of 100 milliseconds is selected in the implementation, but it
>   leaves room to grow the code for a per-device setting.

If it helps we had some discussions about this timeout [1]. It's useful to
print out a warning for debugging, but I don't think completing the fault
on timeout is correct; we should leave the fault pending. Given that the
PCI spec does not indicate a timeout, the guest can wait as long as it
wants to complete the fault (and 100ms may even be reasonable on an
emulator, who knows how many layers and context switches the fault has to
go through).


Another outstanding issue was what to do for PASID stop. When the guest
device driver stops using a PASID it issues a PASID stop request to the
device (a device-specific mechanism). If the device is not using PRI stop
markers it waits for pending PRs to complete and we're fine. Otherwise it
sends a stop marker which is flushed to the PRI queue, but does not wait
for pending PRs.

Handling stop markers is annoying. If the device issues one, then the PRI
queue contains stale faults, a stop marker, followed by valid faults for
the next address space bound to this PASID. The next address space will
get all the spurious faults because the fault handler doesn't know that
there is a stop marker coming. Linux is probably alright with spurious
faults, though maybe not in all cases, and other guests may not support
them at all.

We might need to revisit supporting stop markers: request that each device
driver declares whether their device uses stop markers on unbind() ("This
mechanism must indicate that a Stop Marker Message will be generated."
says the spec, but doesn't say if the function always uses one or the
other mechanism so it's per-unbind). Then we still have to synchronize
unbind() with the fault handler to deal with the pending stop marker,
which might have already gone through or be generated later.

Currently we ignore all that and just flush the PRI queue, followed by the
IOPF queue, to get rid of any stale fault before reassigning the PASID. A
guest however would also need to first flush the HW PRI queue, but doesn't
have a direct way to do that. If we want to support guests that don't deal
with stop markers, the host needs to flush the PRI queue when a PASID is
detached. I guess on Intel detaching the PASID goes through the host which
can flush the host queue. On Arm we'll probably need to flush the queue
when receiving a PASID cache invalidation, which the guest issues after
clearing a PASID table entry.

Thanks,
Jean

[1] https://lore.kernel.org/linux-iommu/20180423153622.GC38106@ostrya.localdomain/
    Also unregistration, not sure if relevant here
    https://lore.kernel.org/linux-iommu/20190605154553.0d00ad8d@jacob-builder/


* Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space
  2023-06-16 11:32 ` Jean-Philippe Brucker
@ 2023-06-19  3:35   ` Baolu Lu
  2023-06-26  9:51     ` Jean-Philippe Brucker
  2023-06-19 12:58   ` Jason Gunthorpe
  1 sibling, 1 reply; 37+ messages in thread
From: Baolu Lu @ 2023-06-19  3:35 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: baolu.lu, Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Nicolin Chen, Yi Liu, Jacob Pan, iommu,
	linux-kselftest, virtualization, linux-kernel

On 6/16/23 7:32 PM, Jean-Philippe Brucker wrote:
> Hi Baolu,

Hi Jean,

Thank you for the informational reply.

> 
> On Tue, May 30, 2023 at 01:37:07PM +0800, Lu Baolu wrote:
>> - The timeout value for the pending page fault messages. Ideally we
>>    should determine the timeout value from the device configuration, but
>>    I failed to find any statement in the PCI specification (version 6.x).
>>    A default of 100 milliseconds is selected in the implementation, but it
>>    leaves room to grow the code for a per-device setting.
> 
> If it helps we had some discussions about this timeout [1]. It's useful to
> print out a warning for debugging, but I don't think completing the fault
> on timeout is correct; we should leave the fault pending. Given that the
> PCI spec does not indicate a timeout, the guest can wait as long as it
> wants to complete the fault (and 100ms may even be reasonable on an
> emulator, who knows how many layers and context switches the fault has to
> go through).

When I was designing this, I was also hesitant about whether to use a
timer. Even worse, I didn't see any description of timeout in the PCI
spec.

I agree with you that a better approach might be to ensure that devices
respect the number of in-flight PPRs that are allocated to them. We need
to design a queue that is large enough to prevent a device from flooding
it with page requests.

> 
> Another outstanding issue was what to do for PASID stop. When the guest
> device driver stops using a PASID it issues a PASID stop request to the
> device (a device-specific mechanism). If the device is not using PRI stop
> markers it waits for pending PRs to complete and we're fine. Otherwise it
> sends a stop marker which is flushed to the PRI queue, but does not wait
> for pending PRs.
> 
> Handling stop markers is annoying. If the device issues one, then the PRI
> queue contains stale faults, a stop marker, followed by valid faults for
> the next address space bound to this PASID. The next address space will
> get all the spurious faults because the fault handler doesn't know that
> there is a stop marker coming. Linux is probably alright with spurious
> faults, though maybe not in all cases, and other guests may not support
> them at all.
> 
> We might need to revisit supporting stop markers: request that each device
> driver declares whether their device uses stop markers on unbind() ("This
> mechanism must indicate that a Stop Marker Message will be generated."
> says the spec, but doesn't say if the function always uses one or the
> other mechanism so it's per-unbind). Then we still have to synchronize
> unbind() with the fault handler to deal with the pending stop marker,
> which might have already gone through or be generated later.

I don't quite follow here. Once a PASID is unbound from the device, the
device driver should be free to release the PASID. The PASID could then
be used for any other purpose. The device driver has no idea when the
pending page requests are flushed after unbind(), so it cannot decide
how long the PASID should be withheld from reuse. Therefore, I understand
that a successful return from the unbind() function denotes that all
pending page requests have been flushed and the PASID is available for
other use.

> 
> Currently we ignore all that and just flush the PRI queue, followed by the
> IOPF queue, to get rid of any stale fault before reassigning the PASID. A
> guest however would also need to first flush the HW PRI queue, but doesn't
> have a direct way to do that. If we want to support guests that don't deal
> with stop markers, the host needs to flush the PRI queue when a PASID is
> detached. I guess on Intel detaching the PASID goes through the host which
> can flush the host queue. On Arm we'll probably need to flush the queue
> when receiving a PASID cache invalidation, which the guest issues after
> clearing a PASID table entry.

The Intel VT-d driver follows the steps below to drain pending page requests
when a PASID is unbound from a device.

- Tear down the device's pasid table entry for the stopped pasid.
   This ensures that ATS/PRI will stop putting more page requests for the
   pasid in VT-d PRQ.
- Sync with the PRQ handling thread until all related page requests in
   PRQ have been delivered.
- Flush the iopf queue with iopf_queue_flush_dev().
- Follow the steps defined in VT-d spec section 7.10 to drain all page
   requests and responses between VT-d and the endpoint device.

> 
> Thanks,
> Jean
> 
> [1] https://lore.kernel.org/linux-iommu/20180423153622.GC38106@ostrya.localdomain/
>      Also unregistration, not sure if relevant here
>      https://lore.kernel.org/linux-iommu/20190605154553.0d00ad8d@jacob-builder/
> 

Best regards,
baolu


* Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space
  2023-06-16 11:32 ` Jean-Philippe Brucker
  2023-06-19  3:35   ` Baolu Lu
@ 2023-06-19 12:58   ` Jason Gunthorpe
  1 sibling, 0 replies; 37+ messages in thread
From: Jason Gunthorpe @ 2023-06-19 12:58 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Lu Baolu, Kevin Tian, Joerg Roedel, Will Deacon, Robin Murphy,
	Nicolin Chen, Yi Liu, Jacob Pan, iommu, linux-kselftest,
	virtualization, linux-kernel

On Fri, Jun 16, 2023 at 12:32:32PM +0100, Jean-Philippe Brucker wrote:

> We might need to revisit supporting stop markers: request that each device
> driver declares whether their device uses stop markers on unbind() ("This
> mechanism must indicate that a Stop Marker Message will be generated."
> says the spec, but doesn't say if the function always uses one or the
> other mechanism so it's per-unbind). Then we still have to synchronize
> unbind() with the fault handler to deal with the pending stop marker,
> which might have already gone through or be generated later.

An explicit API to wait for the stop marker makes sense, with the
expectation that well-behaved devices will generate it and well-behaved
drivers will wait for it.

Things like VFIO should have a way to barrier/drain the PRI queue
after issuing FLR, i.e. the VMM processing FLR should also barrier the
real HW queues and flush them to VM visibility.

> with stop markers, the host needs to flush the PRI queue when a PASID is
> detached. I guess on Intel detaching the PASID goes through the host which
> can flush the host queue. On Arm we'll probably need to flush the queue
> when receiving a PASID cache invalidation, which the guest issues after
> clearing a PASID table entry.

We are trying to get ARM to a point where invalidations don't need to
be trapped. It would be good not to rely on that anywhere.

Jason


* Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space
  2023-05-31  0:33 ` Jason Gunthorpe
  2023-05-31  3:17   ` Baolu Lu
@ 2023-06-23  6:18   ` Baolu Lu
  2023-06-23 13:50     ` Jason Gunthorpe
  1 sibling, 1 reply; 37+ messages in thread
From: Baolu Lu @ 2023-06-23  6:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: baolu.lu, Kevin Tian, Joerg Roedel, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Nicolin Chen, Yi Liu, Jacob Pan, iommu,
	linux-kselftest, virtualization, linux-kernel

On 5/31/23 8:33 AM, Jason Gunthorpe wrote:
> On Tue, May 30, 2023 at 01:37:07PM +0800, Lu Baolu wrote:
>> Hi folks,
>>
>> This series implements the functionality of delivering IO page faults to
>> user space through the IOMMUFD framework. The use case is nested
>> translation, where modern IOMMU hardware supports two-stage translation
>> tables. The second-stage translation table is managed by the host VMM
>> while the first-stage translation table is owned by the user space.
>> Hence, any IO page fault that occurs on the first-stage page table
>> should be delivered to the user space and handled there. The user space
>> should respond with the page fault handling result to the device top-down
>> through the IOMMUFD response uAPI.
>>
>> User space indicates its capability of handling IO page faults by setting
>> a user HWPT allocation flag IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE. IOMMUFD
>> will then set up its infrastructure for page fault delivery. Together
>> with the iopf-capable flag, user space should also provide an eventfd
>> where it will listen for any bottom-up page fault messages.
>>
>> On a successful return of the allocation of iopf-capable HWPT, a fault
>> fd will be returned. User space can open and read fault messages from it
>> once the eventfd is signaled.
> This is a performance path so we really need to think about this more,
> polling on an eventfd and then reading a different fd is not a good
> design.
> 
> What I would like is to have a design from the start that fits into
> io_uring, so we can have pre-posted 'recvs' in io_uring that just get
> completed at high speed when PRIs come in.
> 
> This suggests that the PRI should be delivered via read() on a single
> FD and pollability on the single FD without any eventfd.

I will remove the eventfd and provide a single FD for both read() and
write(). User space reads the FD to retrieve fault messages and writes
to the FD to respond to the faults. User space could leverage io_uring
for asynchronous I/O. A sample userspace design could look like this:

[pseudo code for discussion only]

	struct io_uring ring;

	io_uring_queue_init(IOPF_ENTRIES, &ring, 0);

	while (1) {
		struct io_uring_sqe *sqe;
		struct io_uring_cqe *cqe;
		void *buf = malloc(IOPF_SIZE);

		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_read(sqe, iopf_fd, buf, IOPF_SIZE, 0);
		io_uring_sqe_set_data(sqe, buf);
		io_uring_submit(&ring);

		// Wait for a completion
		while (io_uring_wait_cqe(&ring, &cqe) == 0) {
			// Check if the read completed
			if (cqe->res < 0)
				break;

			if (page_fault_read_completion(cqe)) {
				// Get the fault data
				void *data = io_uring_cqe_get_data(cqe);
				size_t size = cqe->res;

				// Handle the page fault
				handle_page_fault(data);

				// Respond to the fault
				void *resp = malloc(IOPF_RESPONSE_SIZE);

				sqe = io_uring_get_sqe(&ring);
				io_uring_prep_write(sqe, iopf_fd, resp,
						    IOPF_RESPONSE_SIZE, 0);
				io_uring_submit(&ring);
			}

			// Mark the cqe as seen
			io_uring_cqe_seen(&ring, cqe);
		}
	}

Did I understand you correctly?

Best regards,
baolu


* Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space
  2023-06-23  6:18   ` Baolu Lu
@ 2023-06-23 13:50     ` Jason Gunthorpe
  0 siblings, 0 replies; 37+ messages in thread
From: Jason Gunthorpe @ 2023-06-23 13:50 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Kevin Tian, Joerg Roedel, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Nicolin Chen, Yi Liu, Jacob Pan, iommu,
	linux-kselftest, virtualization, linux-kernel

On Fri, Jun 23, 2023 at 02:18:38PM +0800, Baolu Lu wrote:

> 	struct io_uring ring;
> 
> 	io_uring_queue_init(IOPF_ENTRIES, &ring, 0);
> 
> 	while (1) {
> 		struct io_uring_sqe *sqe;
> 		struct io_uring_cqe *cqe;
> 		void *buf = malloc(IOPF_SIZE);
> 
> 		sqe = io_uring_get_sqe(&ring);
> 		io_uring_prep_read(sqe, iopf_fd, buf, IOPF_SIZE, 0);
> 		io_uring_sqe_set_data(sqe, buf);
> 		io_uring_submit(&ring);
> 
> 		// Wait for a completion
> 		while (io_uring_wait_cqe(&ring, &cqe) == 0) {
> 			// Check if the read completed
> 			if (cqe->res < 0)
> 				break;
> 
> 			if (page_fault_read_completion(cqe)) {
> 				// Get the fault data
> 				void *data = io_uring_cqe_get_data(cqe);
> 				size_t size = cqe->res;
> 
> 				// Handle the page fault
> 				handle_page_fault(data);
> 
> 				// Respond to the fault
> 				void *resp = malloc(IOPF_RESPONSE_SIZE);
> 
> 				sqe = io_uring_get_sqe(&ring);
> 				io_uring_prep_write(sqe, iopf_fd, resp,
> 						    IOPF_RESPONSE_SIZE, 0);
> 				io_uring_submit(&ring);
> 			}
> 
> 			// Mark the cqe as seen
> 			io_uring_cqe_seen(&ring, cqe);
> 		}
> 	}
> 
> Did I understand you correctly?

Yes, basically this is the right idea. There are more complex ways to
use io_uring that would be faster still.

And the kernel side can have support to speed it up as well.

I'm wondering if we should be pushing invalidations on io_uring as
well?

Jason


* Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space
  2023-05-30 18:50 ` [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Nicolin Chen
  2023-05-31  2:10   ` Baolu Lu
@ 2023-06-25  6:30   ` Baolu Lu
  2023-06-25 19:21     ` Nicolin Chen
  2023-06-26 18:33     ` Jason Gunthorpe
  1 sibling, 2 replies; 37+ messages in thread
From: Baolu Lu @ 2023-06-25  6:30 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: baolu.lu, Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Yi Liu, Jacob Pan, iommu,
	linux-kselftest, virtualization, linux-kernel

On 2023/5/31 2:50, Nicolin Chen wrote:
> Hi Baolu,
> 
> On Tue, May 30, 2023 at 01:37:07PM +0800, Lu Baolu wrote:
>   
>> This series implements the functionality of delivering IO page faults to
>> user space through the IOMMUFD framework. The use case is nested
>> translation, where modern IOMMU hardware supports two-stage translation
>> tables. The second-stage translation table is managed by the host VMM
>> while the first-stage translation table is owned by the user space.
>> Hence, any IO page fault that occurs on the first-stage page table
>> should be delivered to the user space and handled there. The user space
>> should respond with the page fault handling result to the device top-down
>> through the IOMMUFD response uAPI.
>>
>> User space indicates its capability of handling IO page faults by setting
>> a user HWPT allocation flag IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE. IOMMUFD
>> will then set up its infrastructure for page fault delivery. Together
>> with the iopf-capable flag, user space should also provide an eventfd
>> where it will listen for any bottom-up page fault messages.
>>
>> On a successful return of the allocation of iopf-capable HWPT, a fault
>> fd will be returned. User space can open and read fault messages from it
>> once the eventfd is signaled.
> 
> I think that, whether the guest has an IOPF capability or not,
> the host should always forward any stage-1 fault/error back to
> the guest. Yet, the implementation of this series builds with
> the IOPF framework that doesn't report IOMMU_FAULT_DMA_UNRECOV.
> 
> And I have my doubts about using the IOPF framework with that
> IOMMU_PAGE_RESP_ASYNC flag: using the IOPF framework is for
> its bottom half workqueue, because a page response could take
> a long cycle. But adding that flag feels like we don't really
> need the bottom half workqueue, i.e. losing the point of using
> the IOPF framework, IMHO.
> 
> Combining the two facts above, I wonder if we really need to
> go through the IOPF framework; can't we just register a user
> fault handler in the iommufd directly upon a valid event_fd?

Agreed. We should avoid the workqueue in the SVA IOPF framework. Perhaps
we could go ahead with the code below? It will be registered to the
device with iommu_register_device_fault_handler() in the
IOMMU_DEV_FEAT_IOPF enabling path, and un-registered in the disable
path, of course.

static int io_pgfault_handler(struct iommu_fault *fault, void *cookie)
{
         ioasid_t pasid = fault->prm.pasid;
         struct device *dev = cookie;
         struct iommu_domain *domain;

         if (fault->type != IOMMU_FAULT_PAGE_REQ)
                 return -EOPNOTSUPP;

         if (fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID)
                 domain = iommu_get_domain_for_dev_pasid(dev, pasid, 0);
         else
                 domain = iommu_get_domain_for_dev(dev);

         if (!domain || !domain->iopf_handler)
                 return -ENODEV;

         if (domain->type == IOMMU_DOMAIN_SVA)
                 return iommu_queue_iopf(fault, cookie);

         return domain->iopf_handler(fault, dev, domain->fault_data);
}

Best regards,
baolu

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space
  2023-06-25  6:30   ` Baolu Lu
@ 2023-06-25 19:21     ` Nicolin Chen
  2023-06-26  3:10       ` Baolu Lu
  2023-06-26 18:33     ` Jason Gunthorpe
  1 sibling, 1 reply; 37+ messages in thread
From: Nicolin Chen @ 2023-06-25 19:21 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Yi Liu, Jacob Pan, iommu,
	linux-kselftest, virtualization, linux-kernel

On Sun, Jun 25, 2023 at 02:30:46PM +0800, Baolu Lu wrote:
> 
> On 2023/5/31 2:50, Nicolin Chen wrote:
> > Hi Baolu,
> > 
> > On Tue, May 30, 2023 at 01:37:07PM +0800, Lu Baolu wrote:
> > 
> > > This series implements the functionality of delivering IO page faults to
> > > user space through the IOMMUFD framework. The use case is nested
> > > translation, where modern IOMMU hardware supports two-stage translation
> > > tables. The second-stage translation table is managed by the host VMM
> > > while the first-stage translation table is owned by the user space.
> > > Hence, any IO page fault that occurs on the first-stage page table
> > > should be delivered to the user space and handled there. The user space
> > > should respond with the page fault handling result to the device top-down
> > > through the IOMMUFD response uAPI.
> > > 
> > > User space indicates its capability of handling IO page faults by setting
> > > a user HWPT allocation flag IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE. IOMMUFD
> > > will then setup its infrastructure for page fault delivery. Together
> > > with the iopf-capable flag, user space should also provide an eventfd
> > > where it will listen on any down-top page fault messages.
> > > 
> > > On a successful return of the allocation of iopf-capable HWPT, a fault
> > > fd will be returned. User space can open and read fault messages from it
> > > once the eventfd is signaled.
> > 
> > I think that, whether the guest has an IOPF capability or not,
> > the host should always forward any stage-1 fault/error back to
> > the guest. Yet, the implementation of this series builds with
> > the IOPF framework that doesn't report IOMMU_FAULT_DMA_UNRECOV.
> > 
> > And I have my doubts about using the IOPF framework with that
> > IOMMU_PAGE_RESP_ASYNC flag: using the IOPF framework is for
> > its bottom half workqueue, because a page response could take
> > a long cycle. But adding that flag feels like we don't really
> > need the bottom half workqueue, i.e. losing the point of using
> > the IOPF framework, IMHO.
> > 
> > Combining the two facts above, I wonder if we really need to
> > go through the IOPF framework; can't we just register a user
> > fault handler in the iommufd directly upon a valid event_fd?
> 
> Agreed. We should avoid the workqueue in the SVA IOPF framework. Perhaps
> we could go ahead with the code below? It will be registered to the
> device with iommu_register_device_fault_handler() in the
> IOMMU_DEV_FEAT_IOPF enabling path, and un-registered in the disable
> path, of course.

Well, for a virtualization use case, I still think it should
be registered in iommufd. Having a device without an IOPF/PRI
capability, a guest OS should receive some faults too, if that
device causes a translation failure.

And for a vSVA use case, the IOMMU_DEV_FEAT_IOPF feature only
gets enabled in the guest VM right? How could the host enable
the IOMMU_DEV_FEAT_IOPF to trigger this handler?

Thanks
Nic

> static int io_pgfault_handler(struct iommu_fault *fault, void *cookie)
> {
>         ioasid_t pasid = fault->prm.pasid;
>         struct device *dev = cookie;
>         struct iommu_domain *domain;
> 
>         if (fault->type != IOMMU_FAULT_PAGE_REQ)
>                 return -EOPNOTSUPP;
> 
>         if (fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID)
>                 domain = iommu_get_domain_for_dev_pasid(dev, pasid, 0);
>         else
>                 domain = iommu_get_domain_for_dev(dev);
> 
>         if (!domain || !domain->iopf_handler)
>                 return -ENODEV;
> 
>         if (domain->type == IOMMU_DOMAIN_SVA)
>                 return iommu_queue_iopf(fault, cookie);
> 
>         return domain->iopf_handler(fault, dev, domain->fault_data);
> }
> 
> Best regards,
> baolu

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space
  2023-06-25 19:21     ` Nicolin Chen
@ 2023-06-26  3:10       ` Baolu Lu
  2023-06-26 18:02         ` Nicolin Chen
  0 siblings, 1 reply; 37+ messages in thread
From: Baolu Lu @ 2023-06-26  3:10 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: baolu.lu, Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Yi Liu, Jacob Pan, iommu,
	linux-kselftest, virtualization, linux-kernel

On 6/26/23 3:21 AM, Nicolin Chen wrote:
> On Sun, Jun 25, 2023 at 02:30:46PM +0800, Baolu Lu wrote:
>>
>> On 2023/5/31 2:50, Nicolin Chen wrote:
>>> Hi Baolu,
>>>
>>> On Tue, May 30, 2023 at 01:37:07PM +0800, Lu Baolu wrote:
>>>
>>>> This series implements the functionality of delivering IO page faults to
>>>> user space through the IOMMUFD framework. The use case is nested
>>>> translation, where modern IOMMU hardware supports two-stage translation
>>>> tables. The second-stage translation table is managed by the host VMM
>>>> while the first-stage translation table is owned by the user space.
>>>> Hence, any IO page fault that occurs on the first-stage page table
>>>> should be delivered to the user space and handled there. The user space
>>>> should respond with the page fault handling result to the device top-down
>>>> through the IOMMUFD response uAPI.
>>>>
>>>> User space indicates its capability of handling IO page faults by setting
>>>> a user HWPT allocation flag IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE. IOMMUFD
>>>> will then setup its infrastructure for page fault delivery. Together
>>>> with the iopf-capable flag, user space should also provide an eventfd
>>>> where it will listen on any down-top page fault messages.
>>>>
>>>> On a successful return of the allocation of iopf-capable HWPT, a fault
>>>> fd will be returned. User space can open and read fault messages from it
>>>> once the eventfd is signaled.
>>> I think that, whether the guest has an IOPF capability or not,
>>> the host should always forward any stage-1 fault/error back to
>>> the guest. Yet, the implementation of this series builds with
>>> the IOPF framework that doesn't report IOMMU_FAULT_DMA_UNRECOV.
>>>
>>> And I have my doubts about using the IOPF framework with that
>>> IOMMU_PAGE_RESP_ASYNC flag: using the IOPF framework is for
>>> its bottom half workqueue, because a page response could take
>>> a long cycle. But adding that flag feels like we don't really
>>> need the bottom half workqueue, i.e. losing the point of using
>>> the IOPF framework, IMHO.
>>>
>>> Combining the two facts above, I wonder if we really need to
>>> go through the IOPF framework; can't we just register a user
>>> fault handler in the iommufd directly upon a valid event_fd?
>> Agreed. We should avoid the workqueue in the SVA IOPF framework. Perhaps
>> we could go ahead with the code below? It will be registered to the
>> device with iommu_register_device_fault_handler() in the
>> IOMMU_DEV_FEAT_IOPF enabling path, and un-registered in the disable
>> path, of course.
> Well, for a virtualization use case, I still think it should
> be registered in iommufd.

Emm.. you suggest iommufd calls iommu_register_device_fault_handler() to
register its own page fault handler, right?

I have a different opinion, iommu_register_device_fault_handler() is
called to register a fault handler for a device. It should be called
or initiated by a device driver. The iommufd only needs to install a
per-domain io page fault handler.

I am considering a use case on Intel platform. Perhaps it's similar
on other platforms. An SIOV-capable device can support host SVA and
assigning mediated devices to user space at the same time. Both host
SVA and mediated devices require IOPF. So there will be multiple places
where a page fault handler needs to be registered.

> Having a device without an IOPF/PRI
> capability, a guest OS should receive some faults too, if that
> device causes a translation failure.

Yes. DMA faults are also a consideration. But I would like to have it
supported in a separate series. As I explained in the previous reply,
we also need to consider the software nested translation case.

> 
> And for a vSVA use case, the IOMMU_DEV_FEAT_IOPF feature only
> gets enabled in the guest VM right? How could the host enable
> the IOMMU_DEV_FEAT_IOPF to trigger this handler?

As mentioned above, this should be initiated by the kernel device
driver, vfio or possible mediated device driver.

Best regards,
baolu

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space
  2023-06-19  3:35   ` Baolu Lu
@ 2023-06-26  9:51     ` Jean-Philippe Brucker
  0 siblings, 0 replies; 37+ messages in thread
From: Jean-Philippe Brucker @ 2023-06-26  9:51 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Nicolin Chen, Yi Liu, Jacob Pan, iommu,
	linux-kselftest, virtualization, linux-kernel

On Mon, Jun 19, 2023 at 11:35:50AM +0800, Baolu Lu wrote:
> > Another outstanding issue was what to do for PASID stop. When the guest
> > device driver stops using a PASID it issues a PASID stop request to the
> > device (a device-specific mechanism). If the device is not using PRI stop
> > markers it waits for pending PRs to complete and we're fine. Otherwise it
> > sends a stop marker which is flushed to the PRI queue, but does not wait
> > for pending PRs.
> > 
> > Handling stop markers is annoying. If the device issues one, then the PRI
> > queue contains stale faults, a stop marker, followed by valid faults for
> > the next address space bound to this PASID. The next address space will
> > get all the spurious faults because the fault handler doesn't know that
> > there is a stop marker coming. Linux is probably alright with spurious
> > faults, though maybe not in all cases, and other guests may not support
> > them at all.
> > 
> > We might need to revisit supporting stop markers: request that each device
> > driver declares whether their device uses stop markers on unbind() ("This
> > mechanism must indicate that a Stop Marker Message will be generated."
> > says the spec, but doesn't say if the function always uses one or the
> > other mechanism so it's per-unbind). Then we still have to synchronize
> > unbind() with the fault handler to deal with the pending stop marker,
> > which might have already gone through or be generated later.
> 
> I don't quite follow here. Once a PASID is unbound from the device, the
> device driver should be free to release the PASID. The PASID could then
> be used for any other purpose. The device driver has no idea when the
> pending page requests are flushed after unbind(), so it cannot decide
> how long should the PASID be delayed for reuse. Therefore, I understand
> that a successful return from the unbind() function denotes that all
> pending page requests have been flushed and the PASID is viable for
> other use.

Yes that's the contract for unbind() at the moment

> 
> > 
> > Currently we ignore all that and just flush the PRI queue, followed by the
> > IOPF queue, to get rid of any stale fault before reassigning the PASID. A
> > guest however would also need to first flush the HW PRI queue, but doesn't
> > have a direct way to do that. If we want to support guests that don't deal
> > with stop markers, the host needs to flush the PRI queue when a PASID is
> > detached. I guess on Intel detaching the PASID goes through the host which
> > can flush the host queue. On Arm we'll probably need to flush the queue
> > when receiving a PASID cache invalidation, which the guest issues after
> > clearing a PASID table entry.
> 
> The Intel VT-d driver follows below steps to drain pending page requests
> when a PASID is unbound from a device.
> 
> - Tear down the device's pasid table entry for the stopped pasid.
>   This ensures that ATS/PRI will stop putting more page requests for the
>   pasid in VT-d PRQ.

Oh that's interesting, I didn't know about the implicit TLB invalidations
on page requests for VT-d.

For Arm SMMU, clearing the PASID table entry does cause ATS Translation
Requests to return with Completer Abort, but does not affect PRI. The SMMU
pushes page requests directly into the PRI queue without reading any table
(unless the queue overflows).

We're counting on the device driver to perform the PASID stop request
before calling unbind(), described in PCIe 6.20.1 (Managing PASID Usage)
and 10.4.1.2 (Managing PASID Usage on PRG Requests). This ensures that
when unbind() is called, no more page request for the PASID is pushed into
the PRI queue. But some may still be in the queue if the device uses stop
markers.

> - Sync with the PRQ handling thread until all related page requests in
>   PRQ have been delivered.

This is what I'm concerned about. For VT-d this happens in the host which
is in charge of modifying the PASID table. For SMMU, the guest writes the
PASID table. It flushes its virtual PRI queue, but not the physical queue
that is managed by the host.

One synchronization point where the host could flush the physical PRI
queue is the PASID config invalidation (CMD_CFGI_CD). As Jason pointed out
the host may not be able to observe those if a command queue is assigned
directly to the guest (a theoretical SMMU extension), though in that case
the guest may also have direct access to a PRI queue (like the AMD vIOMMU
extension) and be able to flush the queue directly.

But we can just wait for PRI implementations and see what the drivers
need. Maybe no device will implement stop markers.

Thanks,
Jean

> - Flush the iopf queue with iopf_queue_flush_dev().
> - Follow the steps defined in VT-d spec section 7.10 to drain all page
>   requests and responses between VT-d and the endpoint device.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space
  2023-06-26  3:10       ` Baolu Lu
@ 2023-06-26 18:02         ` Nicolin Chen
  0 siblings, 0 replies; 37+ messages in thread
From: Nicolin Chen @ 2023-06-26 18:02 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Jason Gunthorpe, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Yi Liu, Jacob Pan, iommu,
	linux-kselftest, virtualization, linux-kernel

On Mon, Jun 26, 2023 at 11:10:22AM +0800, Baolu Lu wrote:

> > > > I think that, whether the guest has an IOPF capability or not,
> > > > the host should always forward any stage-1 fault/error back to
> > > > the guest. Yet, the implementation of this series builds with
> > > > the IOPF framework that doesn't report IOMMU_FAULT_DMA_UNRECOV.
> > > > 
> > > > And I have my doubts about using the IOPF framework with that
> > > > IOMMU_PAGE_RESP_ASYNC flag: using the IOPF framework is for
> > > > its bottom half workqueue, because a page response could take
> > > > a long cycle. But adding that flag feels like we don't really
> > > > need the bottom half workqueue, i.e. losing the point of using
> > > > the IOPF framework, IMHO.
> > > > 
> > > > Combining the two facts above, I wonder if we really need to
> > > > go through the IOPF framework; can't we just register a user
> > > > fault handler in the iommufd directly upon a valid event_fd?
> > > Agreed. We should avoid the workqueue in the SVA IOPF framework. Perhaps
> > > we could go ahead with the code below? It will be registered to the
> > > device with iommu_register_device_fault_handler() in the
> > > IOMMU_DEV_FEAT_IOPF enabling path, and un-registered in the disable
> > > path, of course.
> > Well, for a virtualization use case, I still think it should
> > be registered in iommufd.
> 
> Emm.. you suggest iommufd calls iommu_register_device_fault_handler() to
> register its own page fault handler, right?
> 
> I have a different opinion, iommu_register_device_fault_handler() is
> called to register a fault handler for a device. It should be called
> or initiated by a device driver. The iommufd only needs to install a
> per-domain io page fault handler.
> 
> I am considering a use case on Intel platform. Perhaps it's similar
> on other platforms. An SIOV-capable device can support host SVA and
> assigning mediated devices to user space at the same time. Both host
> SVA and mediated devices require IOPF. So there will be multiple places
> where a page fault handler needs to be registered.

Okay, the narrative makes sense to me. I was more thinking of
the nesting case. The iommu_register_device_fault_handler() is
registered per device, as its name implies, while the handler
probably should differ slightly depending on the attaching domain.

It seems that your io_pgfault_handler() in the previous email
can potentially handle this, i.e. an IOMMU_DOMAIN_NESTED could
set domain->iopf_handler to forward DMA faults to user space.
We just need to make sure this pathway would be unconditional
at the handler registration and fault->type.

> > Having a device without an IOPF/PRI
> > capability, a guest OS should receive some faults too, if that
> > device causes a translation failure.
> 
> Yes. DMA faults are also a consideration. But I would like to have it
> supported in a separate series. As I explained in the previous reply,
> we also need to consider the software nested translation case.

I see.

Thanks
Nic

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space
  2023-06-25  6:30   ` Baolu Lu
  2023-06-25 19:21     ` Nicolin Chen
@ 2023-06-26 18:33     ` Jason Gunthorpe
  2023-06-28  2:00       ` Baolu Lu
  1 sibling, 1 reply; 37+ messages in thread
From: Jason Gunthorpe @ 2023-06-26 18:33 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Nicolin Chen, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Yi Liu, Jacob Pan, iommu,
	linux-kselftest, virtualization, linux-kernel

On Sun, Jun 25, 2023 at 02:30:46PM +0800, Baolu Lu wrote:

> Agreed. We should avoid the workqueue in the SVA IOPF framework. Perhaps
> we could go ahead with the code below? It will be registered to the
> device with iommu_register_device_fault_handler() in the
> IOMMU_DEV_FEAT_IOPF enabling path, and un-registered in the disable
> path, of course.

This maze needs to be undone as well.

It makes no sense that all the drivers are calling 

 iommu_register_device_fault_handler(dev, iommu_queue_iopf, dev);

The driver should RX a PRI fault and deliver it to some core code
function, this looks like a good start:

> static int io_pgfault_handler(struct iommu_fault *fault, void *cookie)
> {
>         ioasid_t pasid = fault->prm.pasid;
>         struct device *dev = cookie;
>         struct iommu_domain *domain;
> 
>         if (fault->type != IOMMU_FAULT_PAGE_REQ)
>                 return -EOPNOTSUPP;
> 
>         if (fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID)
>                 domain = iommu_get_domain_for_dev_pasid(dev, pasid, 0);
>         else
>                 domain = iommu_get_domain_for_dev(dev);
> 
>         if (!domain || !domain->iopf_handler)
>                 return -ENODEV;
> 
>         if (domain->type == IOMMU_DOMAIN_SVA)
>                 return iommu_queue_iopf(fault, cookie);
> 
>         return domain->iopf_handler(fault, dev, domain->fault_data);

Then we find the domain that owns the translation and invoke its
domain->ops->iopf_handler()

If the driver created a SVA domain then the op should point to some
generic 'handle sva fault' function. There shouldn't be weird SVA
stuff in the core code.

The weird SVA stuff is really just a generic per-device workqueue
dispatcher, so if we think that is valuable then it should be
integrated into the iommu_domain (domain->ops->use_iopf_workqueue =
true for instance). Then it could route the fault through the
workqueue and still invoke domain->ops->iopf_handler.

The word "SVA" should not appear in any of this.

Not sure what iommu_register_device_fault_handler() has to do with all
of this.. Setting up the dev_iommu stuff to allow for the workqueue
should happen dynamically during domain attach, ideally in the core
code before calling to the driver.

Also, I can understand there is a need to turn on PRI support really
early, and it can make sense to have some IOMMU_DEV_FEAT_IOPF/SVA to
ask to turn it on.. But that should really only be needed if the HW
cannot turn it on dynamically during domain attach of a PRI enabled
domain.

It needs cleaning up..

Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space
  2023-06-26 18:33     ` Jason Gunthorpe
@ 2023-06-28  2:00       ` Baolu Lu
  2023-06-28 12:49         ` Jason Gunthorpe
  0 siblings, 1 reply; 37+ messages in thread
From: Baolu Lu @ 2023-06-28  2:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: baolu.lu, Nicolin Chen, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Yi Liu, Jacob Pan, iommu,
	linux-kselftest, virtualization, linux-kernel

On 2023/6/27 2:33, Jason Gunthorpe wrote:
> On Sun, Jun 25, 2023 at 02:30:46PM +0800, Baolu Lu wrote:
> 
>> Agreed. We should avoid the workqueue in the SVA IOPF framework. Perhaps
>> we could go ahead with the code below? It will be registered to the
>> device with iommu_register_device_fault_handler() in the
>> IOMMU_DEV_FEAT_IOPF enabling path, and un-registered in the disable
>> path, of course.
> 
> This maze needs to be undone as well.
> 
> It makes no sense that all the drivers are calling
> 
>   iommu_register_device_fault_handler(dev, iommu_queue_iopf, dev);
> 
> The driver should RX a PRI fault and deliver it to some core code
> function, this looks like a good start:
> 
>> static int io_pgfault_handler(struct iommu_fault *fault, void *cookie)
>> {
>>          ioasid_t pasid = fault->prm.pasid;
>>          struct device *dev = cookie;
>>          struct iommu_domain *domain;
>>
>>          if (fault->type != IOMMU_FAULT_PAGE_REQ)
>>                  return -EOPNOTSUPP;
>>
>>          if (fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID)
>>                  domain = iommu_get_domain_for_dev_pasid(dev, pasid, 0);
>>          else
>>                  domain = iommu_get_domain_for_dev(dev);
>>
>>          if (!domain || !domain->iopf_handler)
>>                  return -ENODEV;
>>
>>          if (domain->type == IOMMU_DOMAIN_SVA)
>>                  return iommu_queue_iopf(fault, cookie);
>>
>>          return domain->iopf_handler(fault, dev, domain->fault_data);
> 
> Then we find the domain that owns the translation and invoke its
> domain->ops->iopf_handler()

Agreed. The iommu_register_device_fault_handler() should only be called
by device drivers that want to handle the DMA faults and IO page
faults themselves in some special way.

By default, the faults should be dispatched to domain->iopf_handler in a
generic core code.

> 
> If the driver created a SVA domain then the op should point to some
> generic 'handle sva fault' function. There shouldn't be weird SVA
> stuff in the core code.
> 
> The weird SVA stuff is really just a generic per-device workqueue
> dispatcher, so if we think that is valuable then it should be
> integrated into the iommu_domain (domain->ops->use_iopf_workqueue =
> true for instance). Then it could route the fault through the
> workqueue and still invoke domain->ops->iopf_handler.
> 
> The word "SVA" should not appear in any of this.

Yes. We should make it generic. The domain->use_iopf_workqueue flag
denotes that the page faults of a fault group should be put together and
then be handled and responded to in a workqueue. Otherwise, the page fault
is dispatched to domain->iopf_handler directly.

> 
> Not sure what iommu_register_device_fault_handler() has to do with all
> of this.. Setting up the dev_iommu stuff to allow for the workqueue
> should happen dynamically during domain attach, ideally in the core
> code before calling to the driver.

There are two pointers under struct dev_iommu for fault handling.

/**
  * struct dev_iommu - Collection of per-device IOMMU data
  *
  * @fault_param: IOMMU detected device fault reporting data
  * @iopf_param:  I/O Page Fault queue and data

[...]

struct dev_iommu {
         struct mutex lock;
         struct iommu_fault_param        *fault_param;
         struct iopf_device_param        *iopf_param;

My understanding is that @fault_param is a placeholder for generic
things, while @iopf_param is workqueue-specific. Perhaps we could make
@fault_param static and initialize it during iommu device_probe, as
IOMMU fault is generic on every device managed by an IOMMU.

@iopf_param could be allocated on demand (perhaps renamed to something
more meaningful?). The allocation happens before a domain with the
use_iopf_workqueue flag set is attached to a device. iopf_param stays
alive until device_release.

> 
> Also, I can understand there is a need to turn on PRI support really
> early, and it can make sense to have some IOMMU_DEV_FEAT_IOPF/SVA to
> ask to turn it on.. But that should really only be needed if the HW
> cannot turn it on dynamically during domain attach of a PRI enabled
> domain.
> 
> It needs cleaning up..

Yes. I can put this and other cleanup things that we've discussed in a
preparation series and send it out for review after the next rc1 is
released.

> 
> Jason

Best regards,
baolu

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space
  2023-06-28  2:00       ` Baolu Lu
@ 2023-06-28 12:49         ` Jason Gunthorpe
  2023-06-29  1:07           ` Baolu Lu
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Gunthorpe @ 2023-06-28 12:49 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Nicolin Chen, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Yi Liu, Jacob Pan, iommu,
	linux-kselftest, virtualization, linux-kernel

On Wed, Jun 28, 2023 at 10:00:56AM +0800, Baolu Lu wrote:
> > If the driver created a SVA domain then the op should point to some
> > generic 'handle sva fault' function. There shouldn't be weird SVA
> > stuff in the core code.
> > 
> > The weird SVA stuff is really just a generic per-device workqueue
> > dispatcher, so if we think that is valuable then it should be
> > integrated into the iommu_domain (domain->ops->use_iopf_workqueue =
> > true for instance). Then it could route the fault through the
> > workqueue and still invoke domain->ops->iopf_handler.
> > 
> > The word "SVA" should not appear in any of this.
> 
> Yes. We should make it generic. The domain->use_iopf_workqueue flag
> denotes that the page faults of a fault group should be put together and
> then be handled and responded to in a workqueue. Otherwise, the page fault
> is dispatched to domain->iopf_handler directly.

It might be better to have iopf_handler and
iopf_handler_work function pointers to distinguish the two cases.

> > Not sure what iommu_register_device_fault_handler() has to do with all
> > of this.. Setting up the dev_iommu stuff to allow for the workqueue
> > should happen dynamically during domain attach, ideally in the core
> > code before calling to the driver.
> 
> There are two pointers under struct dev_iommu for fault handling.
> 
> /**
>  * struct dev_iommu - Collection of per-device IOMMU data
>  *
>  * @fault_param: IOMMU detected device fault reporting data
>  * @iopf_param:  I/O Page Fault queue and data
> 
> [...]
> 
> struct dev_iommu {
>         struct mutex lock;
>         struct iommu_fault_param        *fault_param;
>         struct iopf_device_param        *iopf_param;
> 
> My understanding is that @fault_param is a place holder for generic
> things, while @iopf_param is workqueue specific.

Well, lets look

struct iommu_fault_param {
	iommu_dev_fault_handler_t handler;
	void *data;

These two make no sense now. handler is always iommu_queue_iopf. Given
our domain-centric design we want the function pointer in the domain,
not in the device. So delete it.

	struct list_head faults;
	struct mutex lock;

Queue of unhandled/unacked faults? Seems sort of reasonable

> @iopf_param could be allocated on demand. (perhaps renaming it to a more
> meaningful one?) It happens before a domain with use_iopf_workqueue flag
> set attaches to a device. iopf_param keeps alive until device_release.

Yes

Do this for the iommu_fault_param as well, in fact, probably just put
the two things together in one allocation and allocate it when we attach a
PRI-using domain. I don't think we need to micro-optimize further..
 
Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space
  2023-06-28 12:49         ` Jason Gunthorpe
@ 2023-06-29  1:07           ` Baolu Lu
  0 siblings, 0 replies; 37+ messages in thread
From: Baolu Lu @ 2023-06-29  1:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: baolu.lu, Nicolin Chen, Kevin Tian, Joerg Roedel, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Yi Liu, Jacob Pan, iommu,
	linux-kselftest, virtualization, linux-kernel

On 2023/6/28 20:49, Jason Gunthorpe wrote:
> On Wed, Jun 28, 2023 at 10:00:56AM +0800, Baolu Lu wrote:
>>> If the driver created a SVA domain then the op should point to some
>>> generic 'handle sva fault' function. There shouldn't be weird SVA
>>> stuff in the core code.
>>>
>>> The weird SVA stuff is really just a generic per-device workqueue
>>> dispatcher, so if we think that is valuable then it should be
>>> integrated into the iommu_domain (domain->ops->use_iopf_workqueue =
>>> true for instance). Then it could route the fault through the
>>> workqueue and still invoke domain->ops->iopf_handler.
>>>
>>> The word "SVA" should not appear in any of this.
>>
>> Yes. We should make it generic. The domain->use_iopf_workqueue flag
>> denotes that the page faults of a fault group should be put together
>> and then handled and responded to in a workqueue. Otherwise, the page
>> fault is dispatched to domain->iopf_handler directly.
> 
> It might be better to have separate iopf_handler and
> iopf_handler_work function pointers to distinguish the two cases.

Both are okay. Let's choose one when we have the code.
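A compilable userspace sketch of that dispatch, with stub types
standing in for the kernel's (use_iopf_workqueue, iopf_handler, and
the inline "workqueue" are assumptions drawn from this thread, not
existing kernel API):

```c
#include <stdbool.h>

struct iommu_fault { int pasid; unsigned long addr; };
struct iommu_domain;

struct iommu_domain_ops {
	/* proposed in this thread: drivers opt in to workqueue dispatch */
	bool use_iopf_workqueue;
	int (*iopf_handler)(struct iommu_domain *domain,
			    struct iommu_fault *fault);
};

struct iommu_domain {
	const struct iommu_domain_ops *ops;
};

/* Stand-in for deferring the fault group to a per-device workqueue. */
static int queue_fault_to_workqueue(struct iommu_domain *domain,
				    struct iommu_fault *fault)
{
	/*
	 * In the kernel this would schedule work that later invokes
	 * domain->ops->iopf_handler(); here we just call it inline.
	 */
	return domain->ops->iopf_handler(domain, fault);
}

/* Core-code dispatch: no "SVA" anywhere, just the domain op. */
static int iommu_report_fault(struct iommu_domain *domain,
			      struct iommu_fault *fault)
{
	if (domain->ops->use_iopf_workqueue)
		return queue_fault_to_workqueue(domain, fault);
	return domain->ops->iopf_handler(domain, fault);
}
```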

> 
>>> Not sure what iommu_register_device_fault_handler() has to do with all
>>> of this.. Setting up the dev_iommu stuff to allow for the workqueue
>>> should happen dynamically during domain attach, ideally in the core
>>> code before calling to the driver.
>>
>> There are two pointers under struct dev_iommu for fault handling.
>>
>> /**
>>   * struct dev_iommu - Collection of per-device IOMMU data
>>   *
>>   * @fault_param: IOMMU detected device fault reporting data
>>   * @iopf_param:  I/O Page Fault queue and data
>>
>> [...]
>>
>> struct dev_iommu {
>>          struct mutex lock;
>>          struct iommu_fault_param        *fault_param;
>>          struct iopf_device_param        *iopf_param;
>>
>> My understanding is that @fault_param is a placeholder for generic
>> things, while @iopf_param is workqueue-specific.
> 
> Well, let's look
> 
> struct iommu_fault_param {
> 	iommu_dev_fault_handler_t handler;
> 	void *data;
> 
> These two make no sense now: handler is always iommu_queue_iopf. Given
> our domain-centric design we want the function pointer in the domain,
> not in the device, so delete them.

Agreed.

> 
> 	struct list_head faults;
> 	struct mutex lock;
> 
> Queue of unhandled/unacked faults? Seems sort of reasonable

It's the list of faults awaiting a response.

>> @iopf_param could be allocated on demand (perhaps renaming it to
>> something more meaningful?). That happens before a domain with the
>> use_iopf_workqueue flag set attaches to a device. iopf_param stays
>> alive until device_release.
> 
> Yes
> 
> Do this for the iommu_fault_param as well. In fact, probably just put
> the two things together in one allocation, and allocate when we attach
> a PRI-using domain. I don't think we need to micro-optimize further.

Yeah, let me try this.

Best regards,
baolu



end of thread, other threads:[~2023-06-29  1:07 UTC | newest]

Thread overview: 37+ messages
2023-05-30  5:37 [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Lu Baolu
2023-05-30  5:37 ` [RFC PATCHES 01/17] iommu: Move iommu fault data to linux/iommu.h Lu Baolu
2023-05-30  5:37 ` [RFC PATCHES 02/17] iommu: Support asynchronous I/O page fault response Lu Baolu
2023-05-30  5:37 ` [RFC PATCHES 03/17] iommu: Add helper to set iopf handler for domain Lu Baolu
2023-05-30  5:37 ` [RFC PATCHES 04/17] iommu: Pass device parameter to iopf handler Lu Baolu
2023-05-30  5:37 ` [RFC PATCHES 05/17] iommu: Split IO page fault handling from SVA Lu Baolu
2023-05-30  5:37 ` [RFC PATCHES 06/17] iommu: Add iommu page fault cookie helpers Lu Baolu
2023-05-30  5:37 ` [RFC PATCHES 07/17] iommufd: Add iommu page fault data Lu Baolu
2023-05-30  5:37 ` [RFC PATCHES 08/17] iommufd: IO page fault delivery initialization and release Lu Baolu
2023-05-30  5:37 ` [RFC PATCHES 09/17] iommufd: Add iommufd hwpt iopf handler Lu Baolu
2023-05-30  5:37 ` [RFC PATCHES 10/17] iommufd: Add IOMMU_HWPT_ALLOC_FLAGS_USER_PASID_TABLE for hwpt_alloc Lu Baolu
2023-05-30  5:37 ` [RFC PATCHES 11/17] iommufd: Deliver fault messages to user space Lu Baolu
2023-05-30  5:37 ` [RFC PATCHES 12/17] iommufd: Add io page fault response support Lu Baolu
2023-05-30  5:37 ` [RFC PATCHES 13/17] iommufd: Add a timer for each iommufd fault data Lu Baolu
2023-05-30  5:37 ` [RFC PATCHES 14/17] iommufd: Drain all pending faults when destroying hwpt Lu Baolu
2023-05-30  5:37 ` [RFC PATCHES 15/17] iommufd: Allow new hwpt_alloc flags Lu Baolu
2023-05-30  5:37 ` [RFC PATCHES 16/17] iommufd/selftest: Add IOPF feature for mock devices Lu Baolu
2023-05-30  5:37 ` [RFC PATCHES 17/17] iommufd/selftest: Cover iopf-capable nested hwpt Lu Baolu
2023-05-30 18:50 ` [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space Nicolin Chen
2023-05-31  2:10   ` Baolu Lu
2023-05-31  4:12     ` Nicolin Chen
2023-06-25  6:30   ` Baolu Lu
2023-06-25 19:21     ` Nicolin Chen
2023-06-26  3:10       ` Baolu Lu
2023-06-26 18:02         ` Nicolin Chen
2023-06-26 18:33     ` Jason Gunthorpe
2023-06-28  2:00       ` Baolu Lu
2023-06-28 12:49         ` Jason Gunthorpe
2023-06-29  1:07           ` Baolu Lu
2023-05-31  0:33 ` Jason Gunthorpe
2023-05-31  3:17   ` Baolu Lu
2023-06-23  6:18   ` Baolu Lu
2023-06-23 13:50     ` Jason Gunthorpe
2023-06-16 11:32 ` Jean-Philippe Brucker
2023-06-19  3:35   ` Baolu Lu
2023-06-26  9:51     ` Jean-Philippe Brucker
2023-06-19 12:58   ` Jason Gunthorpe
